# -*- coding: UTF-8 -*-
"""Scraper for mdl.com product pages.

Collects /product/item/NNNNNN sub-URLs from two locally saved listing
pages, then fetches each product page and appends
title&introduce&effect&crowd records to a UTF-8 text file.

NOTE(review): the reload(sys)/sys.setdefaultencoding('utf8') hack from the
original was removed; it was only needed because the fields were forced
through str(). Joining the unicode strings directly is correct on both
Python 2 and 3.
"""
import re
import codecs

import requests
from bs4 import BeautifulSoup

BASE_URL = 'http://mdl.com'
OUTPUT_PATH = r'E:\note\mei_infov3.txt'
# Item links look like /product/item/293410 (six digits); compiled once.
ITEM_URL_RE = re.compile(r'/product/item/\d{6}')


def _get_soup(url):
    """Fetch *url*, decode the response as UTF-8 and parse it with lxml."""
    web_data = requests.get(url)
    web_data.encoding = 'utf-8'
    return BeautifulSoup(web_data.text, 'lxml')


def mei_url():
    """Return the parsed product-listing page as a BeautifulSoup object."""
    return _get_soup(BASE_URL + '/product')


def mei_info(sub_url='/product/item/293410'):
    """Scrape one product page and append its fields to OUTPUT_PATH.

    :param sub_url: site-relative product path, e.g. '/product/item/293410'.
    Side effect: appends 'title&introduce&effect&crowd\\n$' to OUTPUT_PATH.
    Raises IndexError if the page layout changed and a selector matches
    nothing (same behavior as the original ``[0]`` indexing).
    """
    soup = _get_soup(BASE_URL + sub_url)

    # Shared selector prefixes -- the page nests everything under these.
    container = '#main > div.boundary > div > div.container__main > '
    body_sel = (container +
                'div.section.section-intro.clearfix > div > '
                'div.section-intro__item__body.rich-text')

    title = soup.select(
        container + 'div.section.section-info.clearfix > h2')[0].get_text()
    bodies = soup.select(body_sel)
    introduce = bodies[0].get_text()
    effect = soup.select(body_sel + ' > span')[0].get_text()
    # presumably bodies[1] is a section we don't need; index 2 is the
    # target-crowd text -- TODO confirm against a live page.
    crowd = bodies[2].get_text()

    print(title)
    with codecs.open(OUTPUT_PATH, 'a+', 'utf8') as out:
        # Join the unicode fields directly: no str() coercion, so no
        # setdefaultencoding hack is required.
        out.write('&'.join([title, introduce, effect, crowd]))
        out.write('\n')
        out.write('$')


def _extract_item_urls(path):
    """Return every /product/item/NNNNNN sub-URL found in a saved HTML file."""
    with open(path) as f:  # close the handle (original leaked it)
        soup = BeautifulSoup(f, 'lxml')
    return ITEM_URL_RE.findall(str(soup))


if __name__ == '__main__':
    # items = mei_url()  # live listing fetch kept for reference
    url_list = (_extract_item_urls(r'E:\note\mei.htm') +
                _extract_item_urls(r'E:\note\mei2.htm'))
    print(len(url_list))
    for sub_url in url_list:
        mei_info(sub_url)
# Site attribution footer (scraped along with the article; kept as a comment
# so the file remains valid Python):
# 本站文章如无特殊说明,均为本站原创,如若转载,请注明出处:python爬虫之BeautifulSoup - Python技术站