Python Web Crawling Framework Scrapy: A Basic Usage Tutorial

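Introduction

Scrapy is a powerful, flexible, and efficient open-source web crawling framework for Python. It is used to extract data from websites and handles multi-level crawls (following links from page to page) quickly and reliably. Scrapy itself does not execute JavaScript, so pages that are rendered dynamically in the browser generally need an additional rendering tool. Scrapy also provides conveniences such as structured item containers, fast HTML/XML parsing, and asynchronous, concurrent request handling, all of which simplify the process of scraping a site.

This tutorial covers basic Scrapy usage: creating a spider, adding crawl rules, processing scraped data, and storing data. It uses Python 3.x and Scrapy 2.x.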

Creating a Spider

Scrapy spiders are created with the command-line tool. Run the following command to create a spider named "example_spider":

scrapy genspider example_spider example.com

"example.com" is the target site we want to crawl.

Once the spider has been created, write its code in the generated "example_spider.py" file.

Adding Crawl Rules

A spider can use rules to decide which URLs to crawl and how to handle them. For example:

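from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

# The rules attribute only takes effect on CrawlSpider, not on the plain scrapy.Spider.
class ExampleSpider(CrawlSpider):
    name = 'example_spider'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    rules = (
        # Follow category pages without parsing them.
        Rule(LinkExtractor(allow=r'category\.php'), follow=True),
        # Parse article pages and do not follow the links found on them.
        Rule(LinkExtractor(allow=r'article\.php'), callback='parse_item', follow=False),
    )

    def parse_item(self, response):
        # Extraction logic for an article page goes here.
        pass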

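The code above defines two rules. The first matches links whose URL contains "category.php" and follows them. The second matches links whose URL contains "article.php"; each of those pages is passed to the "parse_item" function for processing, and the links on it are not followed.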
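To make the matching behavior of the two rules more concrete, here is a minimal sketch using only the standard `re` module. It assumes that the `allow` patterns are applied as regular-expression searches against the full URL and that, when more than one rule matches, the first rule in the tuple wins. The `classify` function and the sample URLs are hypothetical and exist only for illustration; they are not part of Scrapy.

```python
import re

# The same patterns as in the two rules above.
FOLLOW_ONLY = re.compile(r'category\.php')
PARSE_ITEM = re.compile(r'article\.php')


def classify(url):
    """Return how the two rules would treat a URL (illustrative helper)."""
    # Rules are tried in order, so the category rule is checked first.
    if FOLLOW_ONLY.search(url):
        return 'follow'
    if PARSE_ITEM.search(url):
        return 'parse_item'
    return 'ignored'


if __name__ == '__main__':
    for url in (
        'http://www.example.com/category.php?id=3',
        'http://www.example.com/article.php?id=42',
        'http://www.example.com/about.html',
    ):
        print(url, classify(url))
```

Because the dot is escaped in each pattern, a URL such as "http://www.example.com/categoryXphp" would not match the first rule.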

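Processing Scraped Data

In Scrapy, data is obtained by parsing the HTTP response. Typically, XPath or CSS selectors are used to extract the required information from HTML or XML documents.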

import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example_spider'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    def parse(self, response):
        # Each matching <div> is treated as one item.
        for sel in response.xpath('//div[@class="product-info"]'):
            item = {}
            # XPaths without a leading slash are relative to the current <div>.
            item['title'] = sel.xpath('h2/a/text()').get()
            item['link'] = sel.xpath('h2/a/@href').get()
            # First text node directly inside the <div>.
            item['description'] = sel.xpath('text()').get()
            yield item

The code above uses XPath selectors to extract the "title", "link", and "description" fields from the HTML document.
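The loop above can be tried without running a crawl. The sketch below swaps in the standard library's `xml.etree.ElementTree`, which supports only a small subset of XPath and requires well-formed markup; it is not Scrapy's selector API and is used here only so the example runs with no extra packages. The markup, the product names, and the `extract_items` function are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup standing in for a downloaded page.
PAGE = """
<html><body>
  <div class="product-info">Compact travel kettle<h2><a href="/p/1">Kettle</a></h2></div>
  <div class="product-info">Insulated steel mug<h2><a href="/p/2">Mug</a></h2></div>
</body></html>
"""


def extract_items(markup):
    """Collect title, link, and description from each product-info <div>."""
    root = ET.fromstring(markup)
    items = []
    # Same idea as //div[@class="product-info"] in the spider.
    for div in root.findall(".//div[@class='product-info']"):
        # Relative path, evaluated from the current <div>.
        link = div.find("h2/a")
        items.append({
            "title": link.text if link is not None else None,
            "link": link.get("href") if link is not None else None,
            # div.text is the text that appears before the first child element.
            "description": (div.text or "").strip() or None,
        })
    return items


if __name__ == "__main__":
    for item in extract_items(PAGE):
        print(item)
```

As in the spider, every path inside the loop is evaluated relative to the current `<div>`, so each item only picks up its own title and link.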

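Data Storage

Scrapy's built-in feed exports support several output formats, including JSON, JSON Lines, CSV, and XML (for example, scrapy crawl example_spider -o items.json). The data can also be written to a JSON file by hand inside the spider: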

import json

import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example_spider'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Items are collected across all responses and written out in closed().
        self.items = []

    def parse(self, response):
        for sel in response.xpath('//div[@class="product-info"]'):
            item = {}
            item['title'] = sel.xpath('h2/a/text()').get()
            item['link'] = sel.xpath('h2/a/@href').get()
            item['description'] = sel.xpath('text()').get()
            self.items.append(item)
            yield item

    def closed(self, reason):
        # Called once when the spider finishes.
        with open('items.json', 'w', encoding='utf-8') as f:
            json.dump(self.items, f, ensure_ascii=False)

The code above stores the extracted data in the "items.json" file.
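Writing the file from inside the spider is one option. Another common pattern is to move storage into an item pipeline, a plain Python class whose `open_spider`, `process_item`, and `close_spider` methods Scrapy calls during a crawl. The following is a minimal sketch of that idea, not a definitive implementation: the class name `JsonLinesPipeline`, the default file name `items.jl`, and the one-JSON-object-per-line format are choices made here for illustration.

```python
import json


class JsonLinesPipeline:
    """Write each scraped item as one JSON object per line."""

    def __init__(self, path="items.jl"):
        # Hypothetical default output path.
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called when the spider starts: open the output file once.
        self.file = open(self.path, "w", encoding="utf-8")

    def close_spider(self, spider):
        # Called when the spider finishes: flush and close the file.
        self.file.close()

    def process_item(self, item, spider):
        # Serialize the item and pass it on to any later pipelines.
        line = json.dumps(dict(item), ensure_ascii=False)
        self.file.write(line + "\n")
        return item
```

To use a pipeline like this in a project, it would be registered in settings.py through the ITEM_PIPELINES setting, for example ITEM_PIPELINES = {"myproject.pipelines.JsonLinesPipeline": 300}, where "myproject" is a placeholder for the actual project package. Writing one line per item means the output does not have to be held in memory until the crawl ends.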

Examples

Two example Scrapy spiders follow.

Example 1: Extracting jokes from the Qiushibaike website

import scrapy

class QiuBaiSpider(scrapy.Spider):
    name = 'qiubai'
    start_urls = ['https://www.qiushibaike.com/text/']

    def parse(self, response):
        # One div.article element per joke.
        for article in response.css('div.article'):
            yield {
                # default='' avoids calling strip() on None when no author is found.
                'author': article.css('h2::text').get(default='').strip(),
                'content': article.css('div.content span::text').getall(),
                'url': response.urljoin(article.css('a.contentHerf::attr(href)').get())
            }

        # Follow the link in the last pagination entry, if there is one.
        next_url = response.css('ul.pagination li:last-child a::attr(href)').get()
        if next_url is not None:
            yield response.follow(next_url, callback=self.parse)

This spider extracts the author, content, and URL of each joke from the Qiushibaike website and supports pagination.
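The spider relies on `response.urljoin` and `response.follow` to turn the `href` values found in the page into absolute URLs. As a rough illustration of that resolution step, the sketch below uses the standard library's `urllib.parse.urljoin`, assuming the page URL is the base. The `resolve_next_page` function and the "/text/page/2/" path are hypothetical and are not taken from the site.

```python
from urllib.parse import urljoin


def resolve_next_page(page_url, href):
    """Resolve a possibly relative href against the URL of the current page."""
    if href is None:
        # No pagination link was found, so there is nothing to follow.
        return None
    return urljoin(page_url, href)


if __name__ == "__main__":
    base = "https://www.qiushibaike.com/text/"
    print(resolve_next_page(base, "/text/page/2/"))
    print(resolve_next_page(base, None))
```

This mirrors the `if next_url is not None` check in the spider: a request is only scheduled when the selector actually returns a link.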

Example 2: Extracting technology news from the Tencent News website

import scrapy

class TencentSpider(scrapy.Spider):
    name = 'tencent'
    start_urls = ['https://tech.qq.com/']

    def parse(self, response):
        # One .item element per news entry.
        for article in response.css('div.flashPicContainer > .item'):
            yield {
                # default='' avoids calling strip() on None when a field is missing.
                'title': article.css('a::text').get(default='').strip(),
                'url': response.urljoin(article.css('a::attr(href)').get()),
                'datetime': article.css('span.time::text').get(default='').strip()
            }

        # Follow the "next page" link, if there is one.
        next_url = response.css('div.mod_pages a.pgNext::attr(href)').get()
        if next_url is not None:
            yield response.follow(next_url, callback=self.parse)

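This spider extracts the title, URL, and time of each technology news item from the Tencent News website and supports pagination.

Summary

Scrapy is a flexible, efficient crawling framework that supports multi-level page crawling and multiple data storage formats. This tutorial covered the basics of using Scrapy: creating a spider, adding crawl rules, processing scraped data, and storing data.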

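Unless otherwise noted, articles on this site are original. If you repost this article, please credit the source: Python Web Crawling Framework Scrapy: A Basic Usage Tutorial - Python技术站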
