Web Scraping Study Notes: 8684 Bus Routes

April 13, 2023, 1:22 AM • Web Scraping

SHOW ME THE CODE!!!

The first step is to analyze the page structure; the details of that step are omitted here.
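Since the analysis step is omitted, here is one possible way to get a first look at a page: count the `class` values of its `div` elements and see which containers repeat. This is only a sketch using the standard library; `DivClassCounter` and `SAMPLE_HTML` are hypothetical names, and the fragment is a stand-in for a downloaded page, not the site's real markup.

```python
from collections import Counter
from html.parser import HTMLParser


class DivClassCounter(HTMLParser):
    """Count the class attribute of every <div> start tag in a page."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == 'div':
            cls = dict(attrs).get('class')
            if cls:
                self.counts[cls] += 1


# Hypothetical fragment standing in for a downloaded page.
SAMPLE_HTML = """
<div class="bus-layer depth w120">
  <div class="pl10"><span class="kt">A</span><div class="list"></div></div>
  <div class="pl10"><span class="kt">B</span><div class="list"></div></div>
</div>
"""

counter = DivClassCounter()
counter.feed(SAMPLE_HTML)
div_classes = dict(counter.counts)
print(div_classes)
```

Class values that repeat, such as `pl10` in this fragment, are good candidates for the containers to loop over.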

# -*- coding: utf-8 -*-
"""
Created on Fri Dec 10 16:25:59 2021
@author: Hider
"""

# Web scraping practice: 8684 bus routes
# Site: https://www.8684.cn/
# Data available: bus stops, metro stations, traffic violations, news, and more

'''
--------- Page analysis ----------
Guangzhou bus home page: https://guangzhou.8684.cn/
div class="bus-layer depth w120"
the 3rd div class="pl10"

Urban numbered routes: https://guangzhou.8684.cn/line1
div class="list clearfix"
<a> tags: href, title

Guangzhou bus route 1: https://guangzhou.8684.cn/x_322e21c5
'''
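
The selectors in these notes can be tried on a small inline fragment before touching the live site. The sketch below assumes a route list shaped like the notes describe (`div class="list clearfix"` containing `a` tags with `href` and `title`). `SAMPLE_HTML`, the link titles, and the second link are hypothetical, and the parser is the built-in `html.parser` so the example has no `lxml` dependency.

```python
from bs4 import BeautifulSoup

BASE_URL = 'https://guangzhou.8684.cn'

# Hypothetical fragment shaped like the notes; the real page may differ.
SAMPLE_HTML = """
<div class="list clearfix">
  <a href="/x_322e21c5" title="Route 1 (sample title)">1</a>
  <a href="/x_example" title="Route 2 (sample title)">2</a>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')

# A class_ string containing spaces matches the exact class attribute value.
container = soup.find('div', class_='list clearfix')

# Collect (link text, title attribute, absolute URL) for every route link.
routes = [
    (a.get_text(), a.get('title'), BASE_URL + a.get('href'))
    for a in container.find_all('a')
]
print(routes)
```

If `find` returns `None` on the real page, the class string no longer matches exactly and the selector needs to be revisited.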

On to the code!

import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import random
import time

def get_ua():
    user_agents = [
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36 OPR/26.0.1656.60',
        'Opera/8.0 (Windows NT 5.1; U; en)',
        'Mozilla/5.0 (Windows NT 5.1; U; en; rv:1.8.1) Gecko/20061208 Firefox/2.0.0 Opera 9.50',
        'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en) Opera 9.50',
        'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0',
        'Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10',
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2 ',
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36',
        'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
        'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.16 (KHTML, like Gecko) Chrome/10.0.648.133 Safari/534.16',
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36',
        'Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko',
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/2.0 Safari/536.11',
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER',
        'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)',
        'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 SE 2.X MetaSr 1.0',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; SE 2.X MetaSr 1.0) ',
    ]
    user_agent = random.choice(user_agents) # pick one User-Agent at random
    return user_agent

# Request the city home page
url = 'https://guangzhou.8684.cn/'
response = requests.get(url=url, headers={'User-Agent':get_ua()}, timeout=10)

# Parse the response
soup = BeautifulSoup(response.text, 'lxml')
soup_bus_layer = soup.find('div', class_='bus-layer depth w120')

# Collect the route categories
dict_result = {}
soup_bus_list = soup_bus_layer.find_all('div', class_='pl10')
for soup_bus in soup_bus_list:
    name = soup_bus.find('span', class_='kt').get_text()
    # print(name)
    # '线路分类' ("route categories") is the label text matched on the page
    if '线路分类' in name:
        soup_a_list = soup_bus.find('div', class_='list')
        for soup_a in soup_a_list.find_all('a'):
            text = soup_a.get_text()
            href = soup_a.get('href')
            dict_result[text] = 'https://guangzhou.8684.cn' + href

print(dict_result)

# Visit each route category
bus = []

for key, value in dict_result.items():
    print('Key is:', key)
    print('Value is:', value)
    response = requests.get(url=value, headers={'User-Agent':get_ua()}, timeout=10)

    # Parse the response
    soup = BeautifulSoup(response.text, 'lxml')
    # Individual routes in this category
    soup_bus_list = soup.find('div', class_='list clearfix')
    for soup_a in soup_bus_list.find_all('a'):
        text = soup_a.get_text()
        href = soup_a.get('href')
        title = soup_a.get('title')
        bus.append([key, value, title, text, 'https://guangzhou.8684.cn' + href])

# print(bus)

# Route details and stops
final_bus_result = []
# bus_test = bus[0:10]
# Visit every route page
for index, i in enumerate(bus, start=1):
    print(f'Scraping {i[2]}...')
    if index % 100 == 0:
        print('Taking a short break... ZzzZ')
        time.sleep(random.randint(5, 10)) # pause for a random 5 to 10 seconds
    print(index)
    url = i[4]
    response = requests.get(url=url, headers={'User-Agent':get_ua()}, timeout=10)
    # Parse the response
    soup = BeautifulSoup(response.text, 'lxml')
    soup_bus_run = soup.find('ul', class_='bus-desc')
    bus_desc_items = soup_bus_run.find_all('li')
    # Operating hours
    bus_run_time = bus_desc_items[0].get_text()
    # Reference fare
    bus_price = bus_desc_items[1].get_text()
    # Bus company: prefer the link text; fall back to the whole <li> when there is no <a>
    try:
        bus_company = bus_desc_items[2].find('a').get_text()
    except AttributeError:
        bus_company = bus_desc_items[2].get_text()
    # Last updated
    bus_update_time = bus_desc_items[3].get_text() # could be improved: keep only the text and drop the nested div
    # Stop information
    soup_bus_station = soup.find_all('div', class_='bus-lzlist mb15')[0]

    bus_station = {}
    for soup_bus in soup_bus_station.find_all('li'):
        text = soup_bus.get_text()
        href = soup_bus.find('a').get('href')
        bus_station[text] = 'https://guangzhou.8684.cn' + href
    final_bus_result.append([i[0], i[1], i[2], i[3], url, bus_run_time, bus_price, bus_company, bus_update_time, bus_station])



df = pd.DataFrame(final_bus_result, columns=['Route category', 'Route category URL', 'Route', 'Route name', 'Route URL', 'Operating hours', 'Reference fare', 'Bus company', 'Last updated', 'Stations'])

df.to_csv(r'C:\Users\Hider\Desktop\bus.csv', index=False, encoding='utf-8-sig')
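
One thing to watch in the script above: the last column of each row is a Python dict of stops, and writing it to CSV stores its string representation, which is awkward to load back. A possible workaround, sketched here under assumptions, is to serialize that dict to JSON before saving. The stop names and URLs below are hypothetical placeholders, and the example uses only the standard library with an in-memory file instead of the desktop path.

```python
import csv
import io
import json

# Hypothetical stop dict shaped like bus_station in the script.
stations = {
    'Station A': 'https://guangzhou.8684.cn/z_example_a',
    'Station B': 'https://guangzhou.8684.cn/z_example_b',
}

# ensure_ascii=False keeps non-ASCII stop names readable in the file.
row = ['Route 1', json.dumps(stations, ensure_ascii=False)]

# Write the row to an in-memory CSV, then read it back.
buffer = io.StringIO()
csv.writer(buffer).writerow(row)
buffer.seek(0)
restored_row = next(csv.reader(buffer))

# The JSON cell round-trips to the original dict.
restored_stations = json.loads(restored_row[1])
print(restored_stations == stations)
```

In the script, the same idea would mean applying `json.dumps` to the station column before calling `to_csv`, and `json.loads` after reading the file back.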

Reference link: "Hands-on tutorial, let's get started!" (手把手教学，正式开始！)

Unless otherwise noted, articles on this site are original. If you republish this post, please credit the source: 爬虫学习笔记：8684公交路线 - Python技术站
