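This Python crawler tutorial explains how to scrape a novel website with the Scrapy framework, with code examples. The process covers creating a Scrapy project, writing the spider, parsing HTML pages, and extracting and storing the data. Each step is explained in detail below.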

1. Create the Scrapy project

First, create a Scrapy project. In a terminal, change to the directory where the project should live and run:

scrapy startproject novel

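This creates a project folder named novel containing the files Scrapy needs.

2. Define the spider

Next, define a spider, which crawls pages according to the rules we write in its code.

For example, we can define a spider called NovelSpider to crawl the novel list page of a novel website. From inside the project folder, run: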

scrapy genspider NovelSpider novel.com

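This creates a file named NovelSpider.py in the novel/spiders directory, which is where the spider's logic goes. The generated file sets the spider's name attribute to NovelSpider; the code below changes it to "novel", which is the name later passed to scrapy crawl.

3. Write the spider code

In NovelSpider.py we write the code that defines the spider's behavior: setting request headers, parsing the page, and extracting data.

Specifically, we need to do the following: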

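Set a browser-like User-Agent header, which makes the crawler less likely to be flagged by anti-scraping checks:

# Add to settings.py
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'

# Add to NovelSpider.py
import scrapy

class NovelSpider(scrapy.Spider):
    name = "novel"
    allowed_domains = ["novel.com"]
    start_urls = ["http://www.novel.com/"]

    # No headers attribute is needed here: Scrapy's built-in
    # UserAgentMiddleware applies the USER_AGENT setting to every request.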
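
Beyond the user agent, other default headers and a download delay can also be set project-wide. The following is a minimal sketch of extra settings.py entries, not something this project is confirmed to use. DEFAULT_REQUEST_HEADERS and DOWNLOAD_DELAY are standard Scrapy settings, while the header values and the one-second delay are illustrative assumptions.

```python
# settings.py (illustrative additions; the values are assumptions)

# Headers added to every request unless the request sets its own
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
}

# Wait roughly one second between requests to the same site
DOWNLOAD_DELAY = 1
```

The User-Agent header is deliberately left out of DEFAULT_REQUEST_HEADERS here so that the USER_AGENT setting stays the single place where it is defined.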

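Parse the page and extract the data:

# Add to items.py
import scrapy

class NovelItem(scrapy.Item):
    title = scrapy.Field()
    author = scrapy.Field()
    intro = scrapy.Field()

# Add to NovelSpider.py
from novel.items import NovelItem

class NovelSpider(scrapy.Spider):
    # ...

    def parse(self, response):
        # default='' keeps .strip() from failing when a node is missing
        for novel in response.xpath('//ul[@class="novels-list"]/li'):
            item = NovelItem()
            item['title'] = novel.xpath('h4/a/text()').extract_first(default='').strip()
            item['author'] = novel.xpath('p[@class="author"]/a/text()').extract_first(default='').strip()
            item['intro'] = novel.xpath('p[@class="intro"]/text()').extract_first(default='').strip()
            yield item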

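This code parses the page with XPath and uses extract_first() to take the first matching result of each expression, which gives us the novel's title, author, and introduction. Note that extract_first() returns None when nothing matches, unless a default is supplied.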
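
Scraped text often carries stray line breaks and padding, and a missing node yields None rather than a string. A minimal sketch of a normalizing helper follows. clean_text is a hypothetical name, not part of Scrapy, and it could wrap each extract_first() result before the value is stored on the item.

```python
import re

def clean_text(value):
    """Return value stripped, with whitespace runs collapsed; '' for None."""
    if value is None:
        return ''
    # Collapse any run of whitespace (spaces, tabs, newlines) to one space
    return re.sub(r'\s+', ' ', value).strip()

# Example inputs are made up for illustration
print(repr(clean_text(None)))
print(repr(clean_text('  line one\n   line two  ')))
```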

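4. Store the data

By default the scraped items are not saved anywhere: Scrapy only hands them to whatever item pipelines are enabled. To persist them to a local file or a database, we need to define a custom item pipeline.

The database connection details can go in the settings file. Here I use a MySQL database to store the scraped data: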

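# Add to settings.py
MYSQL_HOST = 'localhost'
MYSQL_PORT = 3306
MYSQL_DBNAME = 'novel_db'
MYSQL_USER = 'root'
MYSQL_PASSWD = '123456'

# Enable the pipeline; the number sets its order among pipelines
ITEM_PIPELINES = {
    'novel.pipelines.NovelPipeline': 300,
}

# Add to pipelines.py
import pymysql
import pymysql.cursors

class NovelPipeline(object):
    def __init__(self, settings):
        # Read the connection parameters defined in settings.py
        self.connection = pymysql.connect(
            host=settings.get('MYSQL_HOST'),
            port=settings.getint('MYSQL_PORT'),
            user=settings.get('MYSQL_USER'),
            password=settings.get('MYSQL_PASSWD'),
            db=settings.get('MYSQL_DBNAME'),
            charset='utf8mb4',
            cursorclass=pymysql.cursors.DictCursor
        )
        self.cursor = self.connection.cursor()

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this to build the pipeline; crawler.settings
        # exposes the values from settings.py
        return cls(crawler.settings)

    def process_item(self, item, spider):
        # Assumes the novel table already exists with these three columns
        sql = "INSERT INTO novel (title, author, intro) VALUES (%s, %s, %s)"
        self.cursor.execute(sql, (item['title'], item['author'], item['intro']))
        self.connection.commit()
        return item

    def close_spider(self, spider):
        self.connection.close()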

In this code, __init__() opens the database connection, process_item() inserts each scraped item into the database, and close_spider() closes the connection when the spider finishes.
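
To try the same open, insert, and close pattern without a running MySQL server, the sketch below swaps in Python's built-in sqlite3 module. This is a substitute for illustration, not the MySQL setup used in this article. The class name SQLiteNovelPipeline and the file name novels.db are hypothetical, and the table is created on first use so the INSERT has somewhere to go.

```python
import sqlite3

class SQLiteNovelPipeline:
    """Hypothetical stand-in for NovelPipeline that writes to SQLite."""

    def __init__(self, db_path='novels.db'):
        # db_path is an assumed default; ':memory:' also works for quick tests
        self.db_path = db_path

    def open_spider(self, spider):
        # Called once when the spider starts: connect and ensure the table exists
        self.connection = sqlite3.connect(self.db_path)
        self.connection.execute(
            "CREATE TABLE IF NOT EXISTS novel (title TEXT, author TEXT, intro TEXT)"
        )

    def process_item(self, item, spider):
        # Parameterized query: sqlite3 uses ? placeholders instead of %s
        self.connection.execute(
            "INSERT INTO novel (title, author, intro) VALUES (?, ?, ?)",
            (item['title'], item['author'], item['intro']),
        )
        self.connection.commit()
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes
        self.connection.close()
```

Like any pipeline, it only runs if it is listed in ITEM_PIPELINES in settings.py, for example under the path novel.pipelines.SQLiteNovelPipeline if the class is placed in pipelines.py.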

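5. Run the spider

The Scrapy spider is now complete, so the next step is to run it against the novel website.

Start the spider by running the following command in a terminal: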

scrapy crawl novel

Output:

2018-10-17 09:59:03 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: novel)
2018-10-17 09:59:03 [scrapy.utils.log] INFO: Versions: lxml 4.2.5.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.5.1, w3lib 1.19.0, Twisted 18.7.0, Python 3.6.5 |Anaconda, Inc.| (default, Apr 29 2018, 16:14:56) [MSC v.1900 64 bit (AMD64)], pyOpenSSL 18.0.0 (OpenSSL 1.1.0i  14 Aug 2018), cryptography 2.3.1, Platform Windows-10-10.0.17134-SP0
2018-10-17 09:59:03 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'novel', 'NEWSPIDER_MODULE': 'novel.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['novel.spiders']}
2018-10-17 09:59:03 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2018-10-17 09:59:03 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-10-17 09:59:03 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-10-17 09:59:03 [scrapy.middleware] INFO: Enabled item pipelines:
['novel.pipelines.NovelPipeline']
2018-10-17 09:59:03 [scrapy.core.engine] INFO: Spider opened
2018-10-17 09:59:04 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.novel.com/robots.txt> (referer: None)
2018-10-17 09:59:04 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.novel.com/> (referer: None)
2018-10-17 09:59:04 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.novel.com/>
{'title': '神兵传奇', 'author': '梁羽生', 'intro': '江湖上流传着许多关于兵器的神话传说，有传说…'}
2018-10-17 09:59:04 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.novel.com/>
{'title': '天龙八部', 'author': '金庸', 'intro': '蒙古可汗儿完颜洪熙在围攻襄阳的战争中被华山大小姐殷…'}
2018-10-17 09:59:04 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.novel.com/>
{'title': '笑傲江湖', 'author': '金庸', 'intro': '于湘西遇见了一个名叫程灵素，他被这个秀丽的女子所吸引…'}
2018-10-17 09:59:04 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.novel.com/>
{'title': '鹿鼎记', 'author': '金庸', 'intro': '明朝天启元年，陕西自来水集团总经理陆游炜来到了李鸿…'}
2018-10-17 09:59:04 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.novel.com/>
{'title': '射雕英雄传', 'author': '金庸', 'intro': '南宋年间，金国入侵中国领土。英雄人物郭靖和黄蓉联手终于…'}
2018-10-17 09:59:04 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.novel.com/>
{'title': '倚天屠龙记', 'author': '金庸', 'intro': '天龙寺遗失了《九阳真经》，僧人们认为失落已久的真经正是一…'}
2018-10-17 09:59:04 [scrapy.core.engine] INFO: Closing spider (finished)
2018-10-17 09:59:04 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 423,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 29831,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'dupefilter/filtered': 6,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2018, 10, 17, 1, 59, 4, 752210),
 'item_scraped_count': 6,
 'log_count/DEBUG': 9,
 'log_count/INFO': 7,
 'log_count/WARNING': 1,
 'request_depth_max': 1,
 'response_received_count': 2,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'start_time': datetime.datetime(2018, 10, 17, 1, 59, 3, 921501)}
2018-10-17 09:59:04 [scrapy.core.engine] INFO: Spider closed (finished)

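The output above shows that the spider scraped six novel records with NovelPipeline enabled, so each item was passed to the pipeline that writes it to the MySQL database.

Example 1: Scraping the Douban Top 250 movies

Taking the Douban Top 250 movie list as an example, the spider code is as follows: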

import scrapy

class DoubanSpider(scrapy.Spider):
    name = "douban"
    allowed_domains = ["douban.com"]
    start_urls = ["https://movie.douban.com/top250"]

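    def parse(self, response):
        # Each div.hd block holds one movie's title link
        for movie in response.xpath('//div[@class="hd"]'):
            # default='' avoids calling .strip() on None when nothing matches
            title = movie.xpath('a/span/text()').extract_first(default='').strip()
            link = movie.xpath('a/@href').extract_first(default='').strip()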

            yield {
                'title': title,
                'link': link
            }
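
The spider above only requests the first page of the list. As a sketch, assuming the list is split into pages of 25 entries selected by a start query parameter (an assumption about the site's URL scheme that this article does not confirm), the full set of page URLs could be generated as follows and assigned to start_urls. build_page_urls is a hypothetical helper name.

```python
def build_page_urls(base_url, total=250, page_size=25):
    """Return one URL per page, using a `start` offset query parameter."""
    return [f"{base_url}?start={offset}" for offset in range(0, total, page_size)]

# Assumed pagination scheme: ?start=0, ?start=25, and so on up to ?start=225
start_urls = build_page_urls("https://movie.douban.com/top250")
```

If the site uses a different pagination scheme, following the "next page" link from within parse() would be the more robust choice.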

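Example 2: Scraping Weibo trending topics

We can also scrape the trending topics on Weibo. The spider below collects the title and link of each topic: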

import scrapy

class WeiboSpider(scrapy.Spider):
    name = "weibo"
    allowed_domains = ["weibo.com"]
    start_urls = ["https://s.weibo.com/top/summary?cate=realtimehot"]

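    def parse(self, response):
        # Each td.td-02 cell holds one trending topic's link
        for topic in response.xpath('//td[@class="td-02"]'):
            # default='' avoids calling .strip() on None when nothing matches
            title = topic.xpath('a/text()').extract_first(default='').strip()
            link = topic.xpath('a/@href').extract_first(default='').strip()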

            yield {
                'title': title,
                'link': link
            }
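
The href values on a page are often site-relative paths rather than full URLs. Assuming that is the case here (the sample href below is made up for illustration), a minimal sketch of turning them into absolute links with the standard library looks like this. absolute_link is a hypothetical helper name.

```python
from urllib.parse import urljoin

# URL of the page being parsed (the spider's start URL)
page_url = "https://s.weibo.com/top/summary?cate=realtimehot"

def absolute_link(href, base=page_url):
    """Resolve a possibly relative href against the page URL."""
    return urljoin(base, href)

# "/weibo?q=example" is a made-up href used only for illustration
print(absolute_link("/weibo?q=example"))
```

Inside a Scrapy callback, response.urljoin(link) performs the same resolution against the URL of the response being parsed.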

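That completes the walkthrough of scraping novels with the Scrapy framework: creating the project, writing the spider, and running it, along with two further examples. I hope it proves useful in your own development work.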

Unless otherwise stated, articles on this site are original. If you repost this article, please credit the source: Python爬虫教程使用Scrapy框架爬取小说代码示例 - Python技术站
