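Crawling websites with Scrapy: worked examples and the steps for building a web crawler (spider)

Building a web crawler (spider) with Scrapy involves the following steps: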

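Step 1: Create a Scrapy project

Create a Scrapy project with the command-line tool:

```bash
scrapy startproject <project_name>
```

This creates a default Scrapy project. The project directory contains a configuration file named scrapy.cfg and a folder named <project_name>. That folder holds items.py, middlewares.py, pipelines.py, and settings.py, along with a spiders folder for the spider modules.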
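
As a quick sanity check after running the command, the layout can be verified from Python. The following is only a sketch: the helper name missing_project_files is hypothetical, and the file list simply mirrors the files named in the paragraph above.

```python
from pathlib import Path

# Files named in the paragraph above as living inside the inner package folder.
EXPECTED_FILES = ["items.py", "middlewares.py", "pipelines.py", "settings.py"]


def missing_project_files(project_root, project_name):
    """Return the expected paths (relative to project_root) that do not exist."""
    root = Path(project_root)
    expected = [root / "scrapy.cfg"]
    expected += [root / project_name / name for name in EXPECTED_FILES]
    return [str(path.relative_to(root)) for path in expected if not path.exists()]
```

Calling missing_project_files(".", "<project_name>") from the project directory should return an empty list if the project was generated as described.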

Step 2: Define an Item

An Item is a data container provided by Scrapy for holding scraped data. Define Items in <project_name>/items.py, for example:

```python
import scrapy


class MyItem(scrapy.Item):
    title = scrapy.Field()
    content = scrapy.Field()
    url = scrapy.Field()
```

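Step 3: Define a Spider

A Spider holds the main crawling logic in Scrapy. It defines which pages to fetch, how to extract data from them, and how to follow links to further pages. Define Spiders in the <project_name>/spiders folder, for example: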

```python
import scrapy

# Replace myproject with your own project name.
from myproject.items import MyItem


class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['http://example.com']

    def parse(self, response):
        # Build one item per article block on the page.
        for article in response.css('.article'):
            item = MyItem()
            item['title'] = article.css('.title a::text').get()
            item['content'] = article.css('.content::text').get()
            item['url'] = response.urljoin(article.css('.title a::attr(href)').get())
            yield item

        # Follow the "next page" link, if there is one.
        next_page = response.css('.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
```

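In this Spider, name is the Spider's name and start_urls is the list of starting URLs. The parse method uses CSS selectors to extract the required data from each article on the page and yields a MyItem instance for each one. It then looks for a next-page link and, if one exists, yields a request for it with parse as the callback, so the crawl continues page by page.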

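Step 4: Run the crawler

Start the crawler from the command line, inside the project directory:

```bash
scrapy crawl myspider
```

This starts the crawler, which fetches the pages listed in start_urls, fills the defined Item with the extracted data, and writes log output. Unless an output file or pipeline is configured, the scraped items appear only in the log.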
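
Crawl behaviour is usually tuned in settings.py. The excerpt below is a sketch of a few commonly used settings, not a recommended configuration; the output file name items.jsonl is an arbitrary choice for illustration.

```python
# settings.py (excerpt)

# Respect the target site's robots.txt rules.
ROBOTSTXT_OBEY = True

# Wait roughly one second between consecutive requests to the same domain.
DOWNLOAD_DELAY = 1

# Write every scraped item to a JSON Lines file.
FEEDS = {
    "items.jsonl": {"format": "jsonlines", "encoding": "utf8"},
}
```

With a FEEDS entry in place, running scrapy crawl myspider saves the items to the file as well as logging them.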

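Example 1: Crawling the Douban Movie Top 250

This example crawls the Douban Movie Top 250 list and walks through the basic steps of crawling a site with Scrapy.

Create the Scrapy project:

```bash
scrapy startproject doubanmovie
```

Define the Item

Define the MovieItem class in doubanmovie/items.py: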

```python
import scrapy


class MovieItem(scrapy.Item):
    title = scrapy.Field()
    year = scrapy.Field()
    score = scrapy.Field()
```

Define the Spider

Create movie.py in the doubanmovie/spiders folder:

```python
import scrapy
from doubanmovie.items import MovieItem


class MovieSpider(scrapy.Spider):
    name = 'movie'
    start_urls = ['https://movie.douban.com/top250']

    def parse(self, response):
        # Each movie entry on the list page is selected with the "item" class.
        for movie in response.css('.item'):
            item = MovieItem()
            item['title'] = movie.css('.title::text').get()
            # Raw string so that \d reaches the regex engine unchanged.
            item['year'] = movie.css('.bd span::text').re_first(r'\d{4}')
            item['score'] = movie.css('.rating_num::text').get()
            yield item

        # Follow the "next page" link until the last page is reached.
        next_page = response.css('.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
```
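
Two details of this Spider can be reproduced with the standard library alone: re_first returns the first regular-expression match in the selected text, and response.follow resolves a relative link against the current page URL. The sketch below imitates both; the sample text and the href value are hypothetical stand-ins for what the page might contain.

```python
import re
from urllib.parse import urljoin

# Hypothetical text node containing a release year.
info_text = "1994 / USA / Crime Drama"
match = re.search(r"\d{4}", info_text)
year = match.group() if match else None

# Hypothetical href of the "next page" link: a query-only relative URL.
base_url = "https://movie.douban.com/top250"
next_href = "?start=25&filter="
next_url = urljoin(base_url, next_href)
```

Note that the year stays a string, just as re_first returns it; convert it with int() if a numeric value is needed.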

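Run the crawler

```bash
scrapy crawl movie -o movies.csv
```

Inspect the results

The crawler writes its output to movies.csv. Open the file to see the title, release year, and score of each movie in the Douban Top 250.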
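
The exported file can also be read back with Python's csv module. This is a minimal sketch that assumes the CSV header uses the item field names title, year, and score; load_movies is a hypothetical helper, and the in-memory sample stands in for the real movies.csv.

```python
import csv
import io


def load_movies(csv_file):
    """Read exported rows and convert the score column to float."""
    rows = []
    for row in csv.DictReader(csv_file):
        row["score"] = float(row["score"])
        rows.append(row)
    return rows


# Hypothetical sample standing in for the real movies.csv.
sample = io.StringIO("title,year,score\nMovie A,1994,9.7\nMovie B,1993,9.6\n")
movies = load_movies(sample)
best = max(movies, key=lambda m: m["score"])
```

For the real file, pass an open file object instead, for example open("movies.csv", newline="", encoding="utf-8").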

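Example 2: Crawling Zhihu user information

This example crawls Zhihu user information and shows how a Scrapy spider can log in before scraping.

Create the Scrapy project:

```bash
scrapy startproject zhihuuser
```

Define the Item

Define the ZhihuUserItem class in zhihuuser/items.py: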

```python
import scrapy


class ZhihuUserItem(scrapy.Item):
    name = scrapy.Field()
    gender = scrapy.Field()
    headline = scrapy.Field()
    location = scrapy.Field()
    business = scrapy.Field()
    employment = scrapy.Field()
    education = scrapy.Field()
    followees = scrapy.Field()
    followers = scrapy.Field()
```

Define the Spider

Create user.py in the zhihuuser/spiders folder:

```python
import json

import scrapy
from zhihuuser.items import ZhihuUserItem


class UserSpider(scrapy.Spider):
    name = 'user'
    start_urls = ['https://www.zhihu.com']

    def start_requests(self):
        # Log in first; replace the placeholders with your own credentials.
        return [scrapy.FormRequest("https://www.zhihu.com/api/v3/oauth/sign_in",
                                   formdata={"client_id": "<your_client_id>",
                                             "grant_type": "password",
                                             "username": "<your_username>",
                                             "password": "<your_password>",
                                             "source": "com.zhihu.web"},
                                   callback=self.after_login)]

    def after_login(self, response):
        # Assumes the login response is JSON with a 'cookie' field.
        cookies = json.loads(response.text)['cookie']
        for url in self.start_urls:
            yield scrapy.Request(url, cookies=cookies, callback=self.parse)

    def parse(self, response):
        # Collect user links and request each user's data from the members API.
        for user in response.css('.UserLink-link'):
            url_token = user.css('::attr(href)').re_first(r'/people/(.*)')
            if url_token is not None:
                yield response.follow(f'/api/v4/members/{url_token}', self.parse_user)

        # Follow the pagination link, if there is one.
        next_page = response.css('.Button-next::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

    def parse_user(self, response):
        # The members API is expected to return JSON.
        data = json.loads(response.text)
        item = ZhihuUserItem()
        item['name'] = data['name']
        item['gender'] = data['gender']
        item['headline'] = data['headline']
        item['location'] = data['location']['name'] if 'location' in data else None
        item['business'] = data['business']['name'] if 'business' in data else None
        item['employment'] = data['employment']['name'] if 'employment' in data else None
        item['education'] = data['education']['name'] if 'education' in data else None
        item['followees'] = data['following_count']
        item['followers'] = data['follower_count']
        yield item
```

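This Spider first logs in to Zhihu by submitting a form request and then fetches the start page. It parses the user list, follows its pagination, and finally requests each user's profile data and extracts the required fields.

Note: replace <your_client_id>, <your_username>, and <your_password> with your own client ID, username, and password.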
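
The key checks in parse_user (for example 'location' in data) still fail when the key is present but its value is null or has no name. A more defensive lookup could look like the sketch below. It assumes, as the Spider does, that these fields are nested objects with a name key; nested_name and the sample payload are hypothetical.

```python
import json


def nested_name(data, key):
    """Return data[key]['name'], or None if any part is missing."""
    value = data.get(key)
    if isinstance(value, dict):
        return value.get("name")
    return None


# Hypothetical API payload; the field names follow the Spider.
payload = json.loads(
    '{"name": "Alice", "location": {"name": "Beijing"}, "business": null}'
)

location = nested_name(payload, "location")    # nested object with a name
business = nested_name(payload, "business")    # key present, value null
education = nested_name(payload, "education")  # key absent
```

In parse_user, a line such as item['location'] = nested_name(data, 'location') would then replace the conditional expression.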

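Run the crawler

```bash
scrapy crawl user -o users.csv
```

Inspect the results

The crawler writes its output to users.csv. Open the file to see each Zhihu user's name, basic profile information, and follower counts.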
