An Introduction to CrawlSpider in the Scrapy Framework

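Scrapy is an efficient Python crawling framework. It uses asynchronous I/O for network communication, which makes it well suited to large-scale crawls. CrawlSpider is a spider class provided by Scrapy that discovers and follows links according to a set of rules, which makes crawling and processing a whole site convenient. Links can be selected flexibly: by regular expressions applied to the URL, or by restricting extraction to regions of the page chosen with XPath or CSS selectors. When using CrawlSpider, we usually define a few rules that tell the spider which links to follow and how to handle the responses.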

Steps for using CrawlSpider

1. Create a CrawlSpider

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MyCrawlerSpider(CrawlSpider):
    name = 'my_crawler'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    rules = (
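        # Follow pagination links such as /page/2/ without parsing them.
        Rule(LinkExtractor(allow=r'/page/\d+/'), follow=True),
        # Send post pages to parse_item; links found on them are not followed.
        Rule(LinkExtractor(allow=r'/post/'), callback='parse_item'),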
    )

    def parse_item(self, response):
        pass

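In this code we first import the CrawlSpider and Rule classes, along with LinkExtractor, which extracts URLs from a page. We then define the MyCrawlerSpider class and set name, allowed_domains, and start_urls, which give the spider its basic configuration. Next come two rules. The first extracts URLs such as http://www.example.com/page/2/ so the crawl can continue to the next page; follow=True tells the spider to keep following links found on those pages. The second handles URLs such as http://www.example.com/post/xxx.html and names the callback that processes the response for each of them.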
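To make the two allow patterns concrete, here is a small sketch in plain Python, without Scrapy, of which URLs each pattern would select. LinkExtractor tests an allow pattern with a regular-expression search against the absolute URL, so the pattern may match anywhere in it. The sample URLs and the classify helper are made up for illustration.

```python
import re

# The same patterns used in the rules above, written as raw strings.
PAGE_PATTERN = re.compile(r'/page/\d+/')
POST_PATTERN = re.compile(r'/post/')

# Hypothetical URLs, for illustration only.
urls = [
    'http://www.example.com/page/2/',
    'http://www.example.com/post/hello.html',
    'http://www.example.com/about/',
]

def classify(url):
    """Return which rule's pattern a URL matches, or None if neither does."""
    if PAGE_PATTERN.search(url):
        return 'page'   # would be followed (follow=True)
    if POST_PATTERN.search(url):
        return 'post'   # would be sent to parse_item
    return None         # would be ignored by both rules

for url in urls:
    print(url, '->', classify(url))
```

Checking patterns this way before a crawl is a cheap way to catch a rule that matches too much or nothing at all.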

2. Define the callback

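In a CrawlSpider we usually define a method such as parse_item() to process responses. (Avoid naming the callback parse, because CrawlSpider uses that method itself to implement its rule logic.) The callback receives the response object, from which we extract and process the data and build an Item to return. For example: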

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from myproject.items import MyItem

class MyCrawlerSpider(CrawlSpider):
    name = 'my_crawler'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    rules = (
        Rule(LinkExtractor(allow=r'/page/\d+/'), follow=True),
        Rule(LinkExtractor(allow='/post/'), callback='parse_item'),
    )

    def parse_item(self, response):
        item = MyItem()
        item['url'] = response.url
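        # default='' avoids calling .strip() on None when the selector matches nothing.
        item['title'] = response.xpath('//title/text()').extract_first(default='').strip()
        item['content'] = response.css('div.content').extract_first(default='').strip()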
        return item
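
One detail worth guarding against in a callback: extract_first() (and its newer alias get()) returns None when the selector matches nothing, so calling .strip() on the result directly raises AttributeError. A minimal helper for this case might look like the sketch below; the name clean_text is hypothetical and not part of Scrapy.

```python
def clean_text(value, default=''):
    """Strip surrounding whitespace, falling back to `default` for None."""
    if value is None:
        return default
    return value.strip()

print(repr(clean_text('  Example Domain \n')))
print(repr(clean_text(None)))
```

Inside parse_item() such a helper would wrap each extracted value, for instance clean_text(response.xpath('//title/text()').extract_first()).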

3. Run the spider

With the CrawlSpider and its callback in place, we need to start the spider and store the data in a file. From a terminal in the Scrapy project directory, run:

scrapy crawl my_crawler -o items.json

This command starts the spider and writes the scraped items to items.json in JSON format.
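
Once the crawl finishes, the exported file can be inspected with the standard json module. The sketch below parses a hand-written sample string in the shape a JSON feed export takes (a single array with one object per item). The field values are invented for illustration; in practice you would load items.json from disk instead.

```python
import json

# Hand-written sample in the shape of a JSON feed export; values are illustrative.
sample = """[
{"url": "http://www.example.com/post/1.html", "title": "First post", "content": "Hello"},
{"url": "http://www.example.com/post/2.html", "title": "Second post", "content": "World"}
]"""

# With a real crawl:
#     with open('items.json', encoding='utf-8') as f:
#         items = json.load(f)
items = json.loads(sample)
titles = [item["title"] for item in items]
print(len(items), titles)
```

Note that -o appends to an existing file, which leaves invalid JSON behind if the file already holds an earlier export. Deleting the file first, or using -O in recent Scrapy versions to overwrite it, avoids this.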

Example 1: Extracting URLs with regular expressions

CrawlSpider and LinkExtractor can select URLs with regular expressions. The following example crawls the movie details from the Douban Movie Top 250 list.

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from myproject.items import DoubanMovieItem

class DoubanMovieSpider(CrawlSpider):
    name = 'douban_movie_spider'
    allowed_domains = ['movie.douban.com']
    start_urls = ['https://movie.douban.com/top250']

    rules = (
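        # Detail pages: parse each movie with parse_item.
        Rule(LinkExtractor(allow=r'/subject/\d+/'), callback='parse_item'),
        # Pagination links: "?" is escaped so it matches the literal query-string character.
        Rule(LinkExtractor(allow=r'\?start=\d+&filter='), follow=True),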
    )

    def parse_item(self, response):
        item = DoubanMovieItem()
        item['url'] = response.url
        item['title'] = response.xpath('//h1/span/text()').extract_first(default='').strip()
        item['director'] = response.xpath('//div[@id="info"]/span[1]/span[@class="attrs"]/a/text()'
                                            '|//div[@id="info"]/span[1]/span[2]/a/text()').extract()
        item['scriptwriter'] = response.xpath('//div[@id="info"]/span[2]/span[@class="attrs"]/a/text()'
                                            '|//div[@id="info"]/span[2]/span[2]/a/text()').extract()
        item['actor'] = response.xpath('//div[@id="info"]/span[3]/span[@class="attrs"]/a/text()'
                                            '|//div[@id="info"]/span[3]/span[2]/a/text()').extract()
        item['type'] = response.xpath('//div[@id="info"]/span[@property="v:genre"]/text()'
                                            '|//div[@id="info"]/span[@class="pl"][contains(text(),"类型:")]/following-sibling::span/text()').extract()
        item['producer_country'] = response.xpath('//div[@id="info"]/span[@property="v:initialReleaseDate"]/following-sibling::text()'
                                            '|//div[@id="info"]/span[@class="pl"][contains(text(),"制片国家/地区:")]/following-sibling::text()').extract()
        item['language'] = response.xpath('//div[@id="info"]/span[@property="v:language"]/following-sibling::text()'
                                            '|//div[@id="info"]/span[@class="pl"][contains(text(),"语言:")]/following-sibling::text()').extract()
        item['release_date'] = response.xpath('//div[@id="info"]/span[@property="v:initialReleaseDate"]/text()'
                                            '|//div[@id="info"]/span[@class="pl"][contains(text(),"上映日期:")]/following-sibling::span/text()').extract()
        item['run_time'] = response.xpath('//div[@id="info"]/span[@property="v:runtime"]/text()'
                                            '|//div[@id="info"]/span[@class="pl"][contains(text(),"片长:")]/following-sibling::text()').extract_first()
        item['rate'] = response.xpath('//strong[@class="ll rating_num"]/text()').extract_first()
        item['brief'] = response.css('span.all.hidden').extract_first()
        return item

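In this example, two LinkExtractor instances extract the movie detail URLs and the next-page URLs respectively. The parse_item() callback then uses XPath and CSS selectors to extract each movie's details.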
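
The pagination pattern deserves care. In a regular expression ? is a quantifier, so a pattern that begins with a bare ? does not compile, and the literal question mark of a query string has to be escaped. The short check below uses only the re module; the sample URL assumes that pagination links take the form ?start=25&filter=.

```python
import re

# Assumed shape of a pagination link, for illustration.
url = 'https://movie.douban.com/top250?start=25&filter='

# A leading, unescaped "?" has nothing to repeat, so compiling fails.
try:
    re.compile('?start=\\d+&filter=')
    compiled_ok = True
except re.error:
    compiled_ok = False

# Escaping the question mark matches the literal character in the query string.
escaped = re.compile(r'\?start=\d+&filter=')
matched = escaped.search(url) is not None

print(compiled_ok, matched)
```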

Example 2: Handling multiple kinds of responses with the callback parameter

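Sometimes we need to handle several kinds of responses and yield multiple Item objects. Giving each rule its own callback does this. The following example crawls quotes.toscrape.com, the demo site used in the official Scrapy tutorial, and collects the quotes on its pages together with their authors.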

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from myproject.items import QuoteItem

class QuotesSpider(CrawlSpider):
    name = "quotes"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ["http://quotes.toscrape.com/page/1/"]

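    # Rules are tried in order; each link is handled by the first rule that matches it.
    rules = (
        # Author pages: parse them, but do not follow their links.
        Rule(LinkExtractor(allow=r"/author/[\w-]+"), callback="parse_author"),
        # Listing and tag pages: parse the quotes and keep following links.
        Rule(LinkExtractor(allow=r"/page/\d+/"), callback="parse_quotes", follow=True),
        Rule(LinkExtractor(allow=r"/tag/[\w-]+/"), callback="parse_quotes", follow=True),
    )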

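    def parse_author(self, response):
        author_name = response.xpath("//h3/text()").get()
        # Evaluate the XPath outside the f-string: backslash-escaped quotes inside
        # an f-string expression are a SyntaxError before Python 3.12.
        # get() returns None if the page has no matching span.
        quote_text = response.xpath('//span[@class="text"]/text()').get()
        yield QuoteItem(
            text=f"{author_name} says {quote_text}",
            author=author_name,
        )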

    def parse_quotes(self, response):
        for quote in response.xpath('//div[@class="quote"]'):
            yield QuoteItem(
                text=quote.xpath(".//span[@class=\"text\"]/text()").get(),
                author=quote.xpath(".//span/small/text()").get(),
            )

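In this example the rules route different page types to different callbacks. Author pages go to parse_author(), which yields an item built from the author's name. Pages that list quotes go to parse_quotes(), which yields one item per quote, storing the quote text and the author's name in separate fields.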
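
When several rules could match the same link, CrawlSpider uses the first matching rule in the order the rules are defined, so rule order matters. The following plain-Python sketch imitates that dispatch with the re module. The (pattern, callback name) pairs and the pick_callback helper are illustrative stand-ins, not Scrapy API.

```python
import re

# (pattern, callback name) pairs in priority order; stand-ins for Rule objects.
RULES = [
    (re.compile(r'/author/[\w-]+'), 'parse_author'),
    (re.compile(r'/page/\d+/'), 'parse_quotes'),
    (re.compile(r'/'), 'catch_all'),
]

def pick_callback(url):
    """Return the callback name of the first rule whose pattern matches."""
    for pattern, callback in RULES:
        if pattern.search(url):
            return callback
    return None

print(pick_callback('http://quotes.toscrape.com/author/Albert-Einstein'))
print(pick_callback('http://quotes.toscrape.com/page/2/'))
print(pick_callback('http://quotes.toscrape.com/login'))
```

Because a pattern such as / matches every URL, a catch-all entry is only reached by links that no earlier entry claimed, which is why specific rules should come first.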
