scrapy crawl itcast -o teachers.json: a crawler example

April 11, 2023, 3:31 AM • Crawlers

spider.py configuration

# -*- coding: utf-8 -*-
import scrapy
from itTeachers.items import ItteachersItem


class ItcastSpider(scrapy.Spider):
    name = 'itcast'
    allowed_domains = ['itcast.cn']
    start_urls = ['http://www.itcast.cn/channel/teacher.shtml#']

    def parse(self, response):
        # Optional: save the raw page for inspection.
        # response.body is bytes, so the file must be opened in "wb" mode.
        # with open("teacher.html", "wb") as f:
        #     f.write(response.body)

        items = []

        # Each teacher's details sit in a <div class="li_txt"> block
        teacher_list = response.xpath('//div[@class="li_txt"]')
        for each in teacher_list:
            # Wrap the extracted data in an ItteachersItem object
            item = ItteachersItem()

            # xpath(...).extract() returns a list of strings; extract_first()
            # takes the first match and falls back to '' when nothing matches,
            # which avoids an IndexError on blocks with a missing tag
            item['name'] = each.xpath('h3/text()').extract_first(default='')
            item['title'] = each.xpath('h4/text()').extract_first(default='')
            item['info'] = each.xpath('p/text()').extract_first(default='')

            items.append(item)

        # Return the collected items directly
        return items
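The extraction pattern in parse() (select every teacher block, then run relative queries for h3, h4, and p inside each block) can be tried without a network request. The following is a minimal sketch that swaps Scrapy's selectors for the standard library's xml.etree.ElementTree, which supports only a subset of XPath and requires well-formed markup. The SAMPLE fragment, the teacher names, and the parse_teachers helper are hypothetical and are not taken from the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment imitating the structure the spider expects.
SAMPLE = """
<html><body>
  <div class="li_txt"><h3>Teacher A</h3><h4>Senior Lecturer</h4><p>Ten years of Java.</p></div>
  <div class="li_txt"><h3>Teacher B</h3><h4>Lecturer</h4><p>Python and data.</p></div>
</body></html>
"""


def parse_teachers(markup):
    """Return one dict per <div class="li_txt"> block (hypothetical helper)."""
    root = ET.fromstring(markup.strip())
    teachers = []
    # Same idea as response.xpath('//div[@class="li_txt"]') in the spider
    for block in root.findall(".//div[@class='li_txt']"):
        teachers.append({
            # Relative lookups, like each.xpath('h3/text()') in the spider;
            # default="" plays the role of extract_first(default='')
            "name": block.findtext("h3", default="").strip(),
            "title": block.findtext("h4", default="").strip(),
            "info": block.findtext("p", default="").strip(),
        })
    return teachers


teachers = parse_teachers(SAMPLE)
for teacher in teachers:
    print(teacher)
```

In the real spider the same queries run against the downloaded page through response.xpath, which also tolerates the malformed HTML that ElementTree would reject.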

items.py configuration

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class ItteachersItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    name = scrapy.Field()
    title = scrapy.Field()
    info = scrapy.Field()
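With both files in place, running scrapy crawl itcast -o teachers.json from the project directory writes the returned items to teachers.json as a JSON array of objects with the keys name, title, and info. As far as I know, Scrapy's JSON exporter escapes non-ASCII characters as \uXXXX sequences unless FEED_EXPORT_ENCODING = 'utf-8' is set in settings.py, so Chinese text may look unreadable in the file even though the data is intact. The sketch below illustrates the two forms with the standard json module only. The record is hypothetical sample data, not output from the real site.

```python
import json

# Hypothetical record shaped like ItteachersItem's three fields.
records = [{"name": "张老师", "title": "高级讲师", "info": "示例简介"}]

# Default behaviour of json.dumps: non-ASCII characters become \uXXXX escapes.
escaped = json.dumps(records)

# With ensure_ascii=False the characters are written as-is.
readable = json.dumps(records, ensure_ascii=False)

print(escaped)
print(readable)

# Both forms decode back to exactly the same data.
same_data = json.loads(escaped) == json.loads(readable) == records
print(same_data)
```

Either form loads correctly with json.load; the encoding setting only affects how readable the file is in a text editor.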

