A Detailed Guide to Writing a Web Crawler in Python: Scraping Sina Weibo Comments

This guide explains how to scrape Sina Weibo comments with Python. The complete walkthrough follows.

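1. Obtaining the Cookie and User-Agent

First, obtain the Cookie and User-Agent for Sina Weibo. Log in to your Sina Weibo account in a browser, press F12 to open the developer tools, and enter the following in the Console: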

console.log(document.cookie);

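This prints the Cookie needed for the HTTP requests. Note that document.cookie cannot see cookies marked HttpOnly, so if requests made with this value are rejected, copy the Cookie request header from the Network panel instead. Next, in the Network panel, check Preserve log, refresh the page, select the first request, and copy the User-Agent from its request headers. Paste both values into the Python code.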
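As a rough sketch of how the copied values might be handled in Python, the snippet below turns a raw cookie string into a dictionary and assembles the two headers used later in this guide. The helper names parse_cookie_string and build_headers, and the sample cookie value, are hypothetical and exist only for illustration.

```python
def parse_cookie_string(raw_cookie):
    """Turn 'k1=v1; k2=v2' (the format printed by document.cookie) into a dict."""
    cookies = {}
    for pair in raw_cookie.split(';'):
        pair = pair.strip()
        if not pair or '=' not in pair:
            continue  # skip empty or malformed fragments
        name, value = pair.split('=', 1)  # a value may itself contain '='
        cookies[name.strip()] = value.strip()
    return cookies


def build_headers(user_agent, raw_cookie):
    """Assemble the two request headers used throughout this guide."""
    return {'User-Agent': user_agent, 'Cookie': raw_cookie.strip()}


# Placeholder values only; real ones come from your own browser session.
example_cookie = 'name1=value1; name2=a=b'
print(parse_cookie_string(example_cookie))
print(build_headers('Mozilla/5.0 (example)', example_cookie))
```

Either form should work with requests: the header dictionary can be passed as headers=..., or the parsed dictionary as cookies=....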

2. Getting the ID of Each Post

This guide uses the requests library to send HTTP requests and lxml to parse the returned HTML, so install both first:

pip install requests lxml

Next, we need the ID of each post. We start by defining a function that returns the list of post IDs found on a page:

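import requests
from lxml import etree

def get_weibo_ids(url):
    """Return the IDs (mid attributes) of the posts on one Weibo page."""
    # Replace both placeholders with the values copied from your browser.
    headers = {
        'User-Agent': 'YOUR USER AGENT',
        'Cookie': 'YOUR COOKIE',
    }
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()  # fail early on HTTP errors
    html = etree.HTML(response.text)
    weibo_ids = []
    # Each div with class "card-wrap" is treated as one post.
    for weibo_element in html.xpath('//div[@class="card-wrap"]'):
        mids = weibo_element.xpath('./@mid')
        if mids:  # skip cards that carry no mid attribute
            weibo_ids.append(mids[0])
    return weibo_ids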

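This function fetches the HTML of the Sina Weibo page at the given URL, parses it with lxml's etree module, and uses XPath to read the ID of each post. The XPath expression '//div[@class="card-wrap"]' selects the post elements, and each matching element corresponds to one post. Because the function parses a single page, each call to get_weibo_ids returns only the IDs of the posts on that page.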
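The extraction step can be tried offline against a small hand-written fragment. The sample markup and the helper name extract_mids below are invented for illustration; they only mimic the card-wrap and mid structure described above, and real Weibo pages may differ.

```python
from lxml import etree

# Invented sample markup: two cards carry a mid attribute, one does not.
SAMPLE_HTML = '''
<div id="feed">
  <div class="card-wrap" mid="1001"><p>first post</p></div>
  <div class="card-wrap"><p>a card without a mid attribute</p></div>
  <div class="card-wrap" mid="1002"><p>second post</p></div>
</div>
'''


def extract_mids(html_text):
    """Return the mid attribute of every card-wrap div that has one."""
    tree = etree.HTML(html_text)
    # Selecting the attribute directly skips cards that lack it,
    # so no index lookup can fail on an empty result.
    return tree.xpath('//div[@class="card-wrap"]/@mid')


print(extract_mids(SAMPLE_HTML))  # → ['1001', '1002']
```

Selecting the attribute in the XPath expression itself is one way to avoid an IndexError when a card has no mid attribute.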

Next, we can call get_weibo_ids, for example to get the IDs on the first page:

url = 'https://weibo.com/someone/profile?is_hot=1'
weibo_ids = get_weibo_ids(url)
print(weibo_ids)

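Change the url value to the profile page of the Sina Weibo user you want to scrape; the is_hot parameter restricts the results to hot posts. Run the code and the list of post IDs is printed.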
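If you plan to scrape several users, the profile URL can be built rather than typed by hand. This is a minimal sketch that assumes the URL pattern from the example above holds for other users; the helper name build_profile_url is hypothetical.

```python
from urllib.parse import quote, urlencode


def build_profile_url(user_path, is_hot=1):
    """Build a profile URL following the pattern used in this guide."""
    query = urlencode({'is_hot': is_hot})
    # quote() escapes characters that are not safe inside a URL path segment.
    return 'https://weibo.com/{}/profile?{}'.format(quote(user_path, safe=''), query)


print(build_profile_url('someone'))  # → https://weibo.com/someone/profile?is_hot=1
```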

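3. Getting the Comments on Each Post

Once get_weibo_ids has returned the list of post IDs, we need a function that fetches the comments on each post. The code is as follows: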

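def get_weibo_comments(weibo_id):
    """Return the comment texts of one post, fetched page by page."""
    # Replace both placeholders with the values copied from your browser.
    headers = {
        'User-Agent': 'YOUR USER AGENT',
        'Cookie': 'YOUR COOKIE',
    }
    weibo_comments = []
    page = 1
    while True:
        url = 'https://weibo.com/aj/v6/comment/big?ajwvr=6&id={}&from=singleWeiBo&page={}'.format(weibo_id, page)
        response = requests.get(url, headers=headers, timeout=10)
        # The endpoint answers with JSON whose data.html field holds an HTML fragment.
        html = response.json().get('data', {}).get('html')
        if not html:
            break
        tree = etree.HTML(html)
        comments_elements = tree.xpath("//div[@class='WB_text']")
        if not comments_elements:
            break  # no comments on this page, so stop instead of looping forever
        for comment_element in comments_elements:
            # Keep only elements that contain a link with a suda-data attribute.
            if comment_element.xpath(".//a[@suda-data]"):
                weibo_comments.append(comment_element.xpath('string(.)').strip())
        page += 1
    return weibo_comments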

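For each post ID, the while loop fetches all of that post's comments. Comments are served by an AJAX endpoint, so requesting it with the requests library is enough. The endpoint takes the post ID and the comment page number, so the loop must increment the page number on every request, and it stops once a page comes back with no HTML.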
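The paging logic can be checked without touching the network by separating it from the HTTP call. In this sketch, collect_pages, fake_fetch, and the max_pages safety cap are all hypothetical additions for illustration; the stub stands in for the real request and returns canned data.

```python
def collect_pages(fetch_page, max_pages=50):
    """Call fetch_page(1), fetch_page(2), ... until a page comes back empty."""
    items = []
    for page in range(1, max_pages + 1):  # the cap guards against endless loops
        batch = fetch_page(page)
        if not batch:
            break
        items.extend(batch)
    return items


# Canned data standing in for the comment endpoint.
FAKE_PAGES = {1: ['comment A', 'comment B'], 2: ['comment C']}


def fake_fetch(page):
    """Return the canned comments for a page, or an empty list past the end."""
    return FAKE_PAGES.get(page, [])


print(collect_pages(fake_fetch))  # → ['comment A', 'comment B', 'comment C']
```

Passing the fetch function in as an argument keeps the loop easy to test and lets a page cap be enforced in one place.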

Next, we can call get_weibo_comments to fetch the comments on a single post:

weibo_id = '1234567890'
weibo_comments = get_weibo_comments(weibo_id)
print(weibo_comments)

Replace the weibo_id value with the ID of the post you want to scrape, then run the code to see the fetched comments.
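Steps 2 and 3 can be chained so that every ID from a page is passed to the comment function. The following is a sketch under stated assumptions: crawl_comments and its delay_seconds pause are hypothetical additions, not part of the original code, and a stub replaces get_weibo_comments so the example runs offline.

```python
import time


def crawl_comments(weibo_ids, get_comments, delay_seconds=1.0):
    """Fetch comments for each ID, pausing between posts and skipping failures."""
    results = {}
    for index, weibo_id in enumerate(weibo_ids):
        if index > 0:
            time.sleep(delay_seconds)  # pause between posts to limit request rate
        try:
            results[weibo_id] = get_comments(weibo_id)
        except Exception as exc:  # one failing post should not stop the crawl
            print('skipping {}: {}'.format(weibo_id, exc))
            results[weibo_id] = []
    return results


def stub_get_comments(weibo_id):
    """Stand-in for get_weibo_comments that needs no network access."""
    if weibo_id == 'bad':
        raise ValueError('simulated failure')
    return ['comment on ' + weibo_id]


print(crawl_comments(['111', 'bad'], stub_get_comments, delay_seconds=0))
```

In real use you would pass get_weibo_comments in place of the stub and the list returned by get_weibo_ids as the first argument.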

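Summary

That concludes this walkthrough of scraping Sina Weibo comments with a Python crawler. It has three steps: obtaining the Cookie and User-Agent, collecting the ID of each post, and fetching the comments on each post. Each step comes with example code that you can follow.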

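Unless otherwise noted, all articles on this site are original. If you repost this article, please credit the source: A Detailed Guide to Writing a Web Crawler in Python: Scraping Sina Weibo Comments - Python技术站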
