Crawler Learning: Crawling Maoyan Movies with Scrapy

Steps

1. Generate the project (run the following three commands in a cmd or shell window)

scrapy startproject moviesinfo
cd moviesinfo
scrapy genspider maoyanm maoyan.com

The generated file structure is as follows:

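A newly generated project (named moviesinfo here) roughly looks like this:

moviesinfo/
├── scrapy.cfg
└── moviesinfo/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        ├── __init__.py
        └── maoyanm.py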

2. Edit the relevant files

maoyanm.py

# -*- coding: utf-8 -*-
import scrapy
from moviesinfo.items import MoviesinfoItem


class MaoyanmSpider(scrapy.Spider):
    name = 'maoyanm'
    allowed_domains = ['maoyan.com']
    
    start_urls = ['https://maoyan.com/films?showType=3&offset={}'.format((n-1)*30) for n in range(1,500)]

    def parse(self, response):
        urls = response.xpath('//dd/div[2]/a/@href').extract()
        for url in urls:
            yield scrapy.Request('https://maoyan.com'+url, callback=self.parseContent)
            #print('https://maoyan.com'+url)
    
    def parseContent(self, response):
        # pull the Chinese title, English title, genre, runtime and release date from the detail page
        names = response.xpath('/html/body/div[3]/div/div[2]/div[1]/h3/text()').extract()
        ennames = response.xpath('//div[@class="ename ellipsis"]/text()').extract()
        movietype = response.xpath('//li[@class="ellipsis"][1]/text()').extract()
        movietime = response.xpath('//li[@class="ellipsis"][2]/text()').extract()
        releasetime = response.xpath('//li[@class="ellipsis"][3]/text()').extract()
        print(str(names[0]) + str(ennames[0]), movietype, movietime, releasetime)
        # instantiate the item and fill in its fields
        movieItem = MoviesinfoItem()

        movieItem['name'] = str(names[0]) + ' ' + str(ennames[0])
        movieItem['movietype'] = movietype[0]
        movieItem['movietime'] = movietime[0].replace('\n', '').replace(' ', '')
        movieItem['releasetime'] = releasetime[0]

        yield movieItem
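
The XPath expressions above can be sanity-checked interactively with scrapy shell before running the full crawl (Maoyan may refuse requests that do not carry a browser-like User-Agent, so an empty result does not necessarily mean the XPath is wrong):

scrapy shell "https://maoyan.com/films?showType=3&offset=0"
>>> response.xpath('//dd/div[2]/a/@href').extract()
>>> response.xpath('//div[@class="ename ellipsis"]/text()').extract()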

items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class MoviesinfoItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()
    movietype = scrapy.Field()
    movietime = scrapy.Field()
    releasetime = scrapy.Field()

pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import json

class MoviesinfoPipeline(object):
    def open_spider(self,spider):
        self.f = open('movies.json','a',encoding='utf-8')
        
    def close_spider(self,spider):
        self.f.close()
        
    def process_item(self, item, spider):
        data = json.dumps(dict(item),ensure_ascii=False)+'\n'
        self.f.write(data)

        return item
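
Note: if all that is needed is the JSON file, Scrapy's built-in feed export can also write the items out without a custom pipeline; the pipeline above is kept for practice and for finer control over the output:

scrapy crawl maoyanm -o movies.jl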

settings.py

ITEM_PIPELINES = {
    'moviesinfo.pipelines.MoviesinfoPipeline': 300,
}  # find this block in settings.py and uncomment it

 

Change the User-Agent (optional)

Install fake_useragent (run the following command in a cmd or shell window)

pip install fake_useragent

middlewares.py

# add the following code
import random
from fake_useragent import UserAgent


class RandomUserAgentMiddleware(object):
    # randomly rotate the User-Agent for each request
    def __init__(self, crawler):
        super(RandomUserAgentMiddleware, self).__init__()
        self.ua = UserAgent()
        # which kind of UA to generate, e.g. "random", "chrome", "firefox"
        self.ua_type = crawler.settings.get("RANDOM_UA_TYPE", "random")

    @classmethod
    def from_crawler(cls, crawler):
        # give the middleware access to the crawler settings
        return cls(crawler)

    def process_request(self, request, spider):
        def get_ua():
            return getattr(self.ua, self.ua_type)

        request.headers.setdefault('User-Agent', get_ua())
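
For the random User-Agent middleware to actually be used, it also has to be registered in settings.py. A minimal sketch, assuming the project is named moviesinfo as above (the built-in UserAgentMiddleware is disabled so it does not override the random one):

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'moviesinfo.middlewares.RandomUserAgentMiddleware': 543,
}
RANDOM_UA_TYPE = 'random'  # passed to fake_useragent: 'random', 'chrome', 'firefox', ...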

 

3. Run the crawler (run the following command in a cmd or shell window)

scrapy crawl maoyanm

Wait for it to finish.........

 

P.S. The crawl did not cover all the pages as expected; it turned out that after a certain number of pages the site stops returning listings. The next step is to learn some anti-scraping countermeasures to work around this, or to pick sites with weaker anti-scraping mechanisms to crawl.
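
As a first, hedged step against this, slowing the crawler down in settings.py sometimes helps avoid being cut off early; the values below are only example starting points, not tuned for Maoyan:

DOWNLOAD_DELAY = 1                  # wait about 1 second between requests
CONCURRENT_REQUESTS_PER_DOMAIN = 4  # fewer parallel requests per domain
AUTOTHROTTLE_ENABLED = True         # let Scrapy adapt the delay to server responses
RETRY_TIMES = 3                     # retry failed requests a few times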

 

References:

https://www.cnblogs.com/zhaopanpan/articles/9339784.html

https://www.bilibili.com/video/av19057145

https://www.bilibili.com/video/av27782740

https://www.bilibili.com/video/av30272877

 

Scrapy User-Agent rotation:

https://blog.csdn.net/sinat_41701878/article/details/80295600

https://blog.csdn.net/dta0502/article/details/82666421

https://blog.csdn.net/weixin_42260204/article/details/81087402

https://www.cnblogs.com/cnkai/p/7401343.html