This is a detailed walkthrough of scraping WeChat articles with Python.
It relies on the third-party libraries beautifulsoup4 and requests to crawl articles published by a WeChat Official Account (公众号).
Step 1: Get the Official Account's history-message link
To crawl an Official Account's articles, you first need the link to its latest or history message page. You can grab it manually from the WeChat Official Accounts Platform, or obtain it through a third-party API.
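A minimal sketch of requesting that page, assuming you already know the target account's __biz value (it appears in the URL of any of the account's articles). Note that mp.weixin.qq.com normally requires session credentials captured from a logged-in WeChat client; without them the request may only return a verification page. The cookie name and value below are placeholders, not guaranteed to match your environment:

import requests

biz = "MzUxNjQzMTEzMg=="  # example __biz; replace with the target account's value
history_url = (
    "https://mp.weixin.qq.com/mp/profile_ext"
    "?action=home&__biz=" + biz + "&scene=124#wechat_redirect"
)
headers = {"User-Agent": "Mozilla/5.0"}  # browser-like User-Agent
# placeholder session cookie captured from a logged-in WeChat client
cookies = {"wap_sid2": "<session cookie value>"}
response = requests.get(history_url, headers=headers, cookies=cookies)
print(response.status_code)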
Step 2: Collect the link of every article
The history page lists the articles the account has already published. Fetch it, parse it with beautifulsoup4, and collect all of the article links:
import requests
from bs4 import BeautifulSoup
url = "历史消息链接"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
items = soup.find_all(class_="news-list-item")
urls = []
for item in items:
urls.append(item.find("a")["href"])
Step 3: Crawl the content of each article
Once you have each article's link, fetch its HTML source with requests and parse the page with beautifulsoup4 to extract the title, author, publish time, read count, like count, comment count, and so on. (Note that counters such as reads and likes are normally injected by JavaScript on the live page, so they may be missing from a plain GET response.) In addition, because most Official Account article bodies are obfuscated, a regular expression can be used to extract the sg parameter needed to decode the content:
import re
import hashlib  # used by decode_article below
import requests
from bs4 import BeautifulSoup

url = "<article link>"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
# article title
title = soup.find("h2", class_="rich_media_title").get_text().strip()
# author and publish time share one meta block; split them on whitespace
meta_content = soup.find("div", class_="rich_media_meta_list").get_text().strip()
result = re.match(r"(.*)\s+(.*)", meta_content)
author = result.group(1)
publish_time = result.group(2)
# read count, like count, comment count
read_num = soup.find("span", class_="read_num").get_text().strip()
like_num = soup.find("span", class_="like_num").get_text().strip()
comment_num = soup.find("span", class_="comment_num").get_text().strip()
# article content (decoded via the sg parameter when present)
content = soup.find("div", class_="rich_media_content").get_text()
result = re.search(r"var\s*sg\s*:\s*'(.*?)';", response.text)
if result:
    sg_data = result.group(1)
    content = decode_article(sg_data, content)
print(title, author, publish_time, read_num, like_num, comment_num, content)
The decode_article function used above can be implemented as follows:
import hashlib

def decode_article(sg_data, encrypted_data):
    """
    Decode the article content.
    :param sg_data: the sg parameter extracted from the page source
    :param encrypted_data: the obfuscated article content
    :return: the decoded article content
    """
    # derive a 32-character key from the MD5 hex digest of sg
    key = hashlib.md5(sg_data.encode("utf-8")).hexdigest()
    dec = ''
    for i in range(len(encrypted_data)):
        key_c = key[i % 32]  # cycle through the key
        dec_c = chr(ord(key_c) ^ ord(encrypted_data[i]))  # XOR character-wise
        dec += dec_c
    return dec
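Because XOR is its own inverse, applying decode_article twice with the same sg value returns the original string. A quick sanity check of the scheme (the sg value here is made up purely for the demonstration):

sample = "hello, wechat"
sg = "demo-sg-value"  # hypothetical sg parameter
scrambled = decode_article(sg, sample)
assert decode_article(sg, scrambled) == sample  # round-trip recovers the input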
Example 1: crawl the 10 latest articles of the "Python之禅" Official Account
The code is as follows:
import requests
from bs4 import BeautifulSoup
import re
import hashlib

def decode_article(sg_data, encrypted_data):
    """
    Decode the article content.
    :param sg_data: the sg parameter extracted from the page source
    :param encrypted_data: the obfuscated article content
    :return: the decoded article content
    """
    key = hashlib.md5(sg_data.encode("utf-8")).hexdigest()
    dec = ''
    for i in range(len(encrypted_data)):
        key_c = key[i % 32]
        dec_c = chr(ord(key_c) ^ ord(encrypted_data[i]))
        dec += dec_c
    return dec

# history page of the account (see step 1)
url = "https://mp.weixin.qq.com/mp/profile_ext?action=home&__biz=MzUxNjQzMTEzMg==&scene=124&#wechat_redirect"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
items = soup.find_all(class_="weui-msg")
urls = []
for item in items:
    urls.append(item.find("a")["href"])

url_prefix = "https://mp.weixin.qq.com"
articles = []
for article_url in urls[:10]:  # at most the 10 latest articles
    article = {}
    url = url_prefix + article_url
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    article["title"] = soup.find("h2", class_="rich_media_title").get_text().strip()
    meta_content = soup.find("div", class_="rich_media_meta_list").get_text().strip()
    result = re.match(r"(.*)\s+(.*)", meta_content)
    article["author"] = result.group(1)
    article["pub_time"] = result.group(2)
    article["read_num"] = soup.find("span", class_="read_num").get_text().strip()
    article["like_num"] = soup.find("span", class_="like_num").get_text().strip()
    article["comment_num"] = soup.find("span", class_="comment_num").get_text().strip()
    content = soup.find("div", class_="rich_media_content").get_text()
    result = re.search(r"var\s*sg\s*:\s*'(.*?)';", response.text)
    if result:
        sg_data = result.group(1)
        content = decode_article(sg_data, content)
    article["content"] = content.strip()
    articles.append(article)

for article in articles:
    print(article)
Example 2: crawl all articles of the "骚操作 Python 课" Official Account
The code is as follows:
import requests
from bs4 import BeautifulSoup
import re
import hashlib

def decode_article(sg_data, encrypted_data):
    """
    Decode the article content.
    :param sg_data: the sg parameter extracted from the page source
    :param encrypted_data: the obfuscated article content
    :return: the decoded article content
    """
    key = hashlib.md5(sg_data.encode("utf-8")).hexdigest()
    dec = ''
    for i in range(len(encrypted_data)):
        key_c = key[i % 32]
        dec_c = chr(ord(key_c) ^ ord(encrypted_data[i]))
        dec += dec_c
    return dec

# history page of the account (see step 1)
url = "https://mp.weixin.qq.com/mp/profile_ext?action=home&__biz=MzA3OTk1MjU0Mw==&scene=124&#wechat_redirect"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
items = soup.find_all(class_="weui-msg")
urls = []
for item in items:
    urls.append(item.find("a")["href"])

url_prefix = "https://mp.weixin.qq.com"
articles = []
for article_url in urls:  # every article on the history page
    article = {}
    url = url_prefix + article_url
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    article["title"] = soup.find("h2", class_="rich_media_title").get_text().strip()
    meta_content = soup.find("div", class_="rich_media_meta_list").get_text().strip()
    result = re.match(r"(.*)\s+(.*)", meta_content)
    article["author"] = result.group(1)
    article["pub_time"] = result.group(2)
    article["read_num"] = soup.find("span", class_="read_num").get_text().strip()
    article["like_num"] = soup.find("span", class_="like_num").get_text().strip()
    article["comment_num"] = soup.find("span", class_="comment_num").get_text().strip()
    content = soup.find("div", class_="rich_media_content").get_text()
    result = re.search(r"var\s*sg\s*:\s*'(.*?)';", response.text)
    if result:
        sg_data = result.group(1)
        content = decode_article(sg_data, content)
    article["content"] = content.strip()
    articles.append(article)

for article in articles:
    print(article)
That is the complete walkthrough of scraping WeChat articles with Python. I hope it helps.