Building a Web Crawler in Python with BeautifulSoup

A Guide to Building a Web Crawler in Python with BeautifulSoup

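Preparation

Before building a crawler with Python and BeautifulSoup, a little setup is needed: you need a Python interpreter and the BeautifulSoup library. The examples in this guide also use the requests library to download pages.

If you have not installed Python yet, download the installer for your platform from the official site at https://www.python.org/downloads/ and run it.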

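Once Python is installed, install BeautifulSoup, together with requests, which the examples below import. Both can be installed with pip.

pip install requests beautifulsoup4

This installs the latest released versions of both libraries.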

With the setup done, you can start using BeautifulSoup to scrape data from websites.
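Before pointing the crawler at a real site, it can help to confirm that the installation works by parsing a small inline HTML string. The sketch below is only an offline check: the HTML fragment and the example.com link are made up for illustration, and the class names simply mirror the ones used later in this guide.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment, invented for this check (not taken from any real site).
html = """
<div class="item">
  <a href="https://example.com/movie/1"><span class="title">Example Movie</span></a>
  <span class="rating_num">9.0</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# select_one returns the first element matching a CSS selector, or None if nothing matches.
item = soup.select_one("div.item")
title = item.select_one(".title").get_text(strip=True)
link = item.select_one("a")["href"]
rating = item.select_one(".rating_num").get_text(strip=True)

print(title, link, rating)
```

If this runs without an ImportError and prints the three values, BeautifulSoup is installed and its CSS selector support is working.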

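Scraping Website Data

First, decide which website and which data you want to scrape. This guide uses the Douban Movie Top 250 list as an example.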

import requests
from bs4 import BeautifulSoup

url = 'https://movie.douban.com/top250'
# Send a browser-like User-Agent instead of the default one used by requests.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}

response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()  # fail early on an HTTP error instead of parsing an error page
soup = BeautifulSoup(response.text, 'html.parser')

movies = []
# Each film on the page is wrapped in its own <div class="item"> element.
for movie in soup.select('div.item'):
    title = movie.select_one('.title').get_text()
    link = movie.select_one('a')['href']
    rating = movie.select_one('.rating_num').get_text()
    movies.append({'title': title, 'link': link, 'rating': rating})

for movie in movies:
    print(movie['title'], movie['link'], movie['rating'])

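Running this code prints the title, link, and rating of each film on the first page of the Douban Movie Top 250 list. The list is spread over several pages, so a single request does not return all 250 entries.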
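To cover the whole list, the same parsing loop would have to run once per page. The sketch below only builds the page URLs. It assumes the list is paginated with a `start` query parameter in steps of 25, which is how the site appeared to work at the time of writing and may change. The names `page_urls`, `PAGE_SIZE`, and `TOTAL` are illustrative and not part of any library.

```python
from urllib.parse import urlencode

BASE_URL = "https://movie.douban.com/top250"
PAGE_SIZE = 25   # assumption: 25 films per page
TOTAL = 250      # assumption: 250 films in the full list


def page_urls(base_url=BASE_URL, page_size=PAGE_SIZE, total=TOTAL):
    """Build one URL per page, using a 'start' offset of 0, 25, 50, ..."""
    return [
        f"{base_url}?{urlencode({'start': start})}"
        for start in range(0, total, page_size)
    ]


urls = page_urls()
print(len(urls))
print(urls[0])
print(urls[-1])
```

Each URL could then be passed to `requests.get` in place of the single `url` used above, ideally with a short pause between requests.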

Downloading Images

Another example is downloading the images on a given website. This guide uses Unsplash as an example.

import os

import requests
from bs4 import BeautifulSoup

url = 'https://unsplash.com/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}

response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()  # fail early on an HTTP error
soup = BeautifulSoup(response.text, 'html.parser')

images = []
for image in soup.select('img[srcset]'):
    # srcset looks like "url1 100w, url2 200w, ..."; keep the first URL unchanged.
    candidates = image['srcset'].split()
    if candidates:
        images.append(candidates[0])

# Create the output folder if it does not exist yet.
os.makedirs('unsplash', exist_ok=True)

for i, image_url in enumerate(images[:10]):
    try:
        image_response = requests.get(image_url, headers=headers, timeout=10)
        image_response.raise_for_status()
    except requests.RequestException:
        # Skip any image that fails to download.
        continue
    with open('unsplash/{}.jpg'.format(i), 'wb') as f:
        f.write(image_response.content)

Running this code creates an unsplash folder in the current directory and saves up to 10 of the images found in the Unsplash home page HTML into it.
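The `srcset` attribute usually lists several versions of the same image at different widths, so the script has to pick one URL out of that string. The sketch below shows one way to split a `srcset` value and choose the widest candidate. It handles only width descriptors such as `800w`, the helper names `parse_srcset` and `largest_image` are hypothetical, and the sample string uses example.com rather than real Unsplash URLs.

```python
import re


def parse_srcset(srcset):
    """Return (url, width) pairs from a srcset string that uses 'w' descriptors."""
    candidates = []
    # Candidates are separated by a comma followed by whitespace.
    for part in re.split(r",\s+", srcset.strip()):
        pieces = part.split()
        if not pieces:
            continue
        url = pieces[0]
        width = 0
        # A width descriptor looks like "640w"; any other descriptor is ignored here.
        if len(pieces) > 1 and pieces[1].endswith("w") and pieces[1][:-1].isdigit():
            width = int(pieces[1][:-1])
        candidates.append((url, width))
    return candidates


def largest_image(srcset):
    """Return the URL with the greatest declared width, or None for an empty srcset."""
    candidates = parse_srcset(srcset)
    if not candidates:
        return None
    return max(candidates, key=lambda candidate: candidate[1])[0]


# Made-up sample value for illustration.
sample = "https://example.com/a.jpg?w=100 100w, https://example.com/a.jpg?w=800 800w"
print(parse_srcset(sample))
print(largest_image(sample))
```

In the download script, `largest_image(image['srcset'])` could take the place of the first-URL selection when the highest-resolution version is wanted.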

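Summary

That covers the basics of building a crawler with Python and BeautifulSoup. First set up the Python interpreter and the BeautifulSoup library, then decide which website and data to scrape, and finally write a script that extracts the data you need. When writing a crawler, take care to write well-behaved code and to send appropriate request headers, which reduces the chance of being blocked by the site.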
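As one illustration of well-behaved crawling, the sketch below filters URLs against robots.txt rules with the standard library's `urllib.robotparser` and pauses between requests. The rules, the example.com URLs, the `example-crawler` user agent name, and the helper functions are all hypothetical. A real crawler would load the target site's actual robots.txt instead of the inline rules used here.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, written inline so the example runs offline.
ROBOTS_LINES = [
    "User-agent: *",
    "Disallow: /private/",
]


def build_parser(lines):
    """Create a RobotFileParser from robots.txt lines."""
    parser = RobotFileParser()
    parser.parse(lines)
    return parser


def allowed_urls(parser, urls, user_agent="example-crawler"):
    """Keep only the URLs that the rules allow this user agent to fetch."""
    return [url for url in urls if parser.can_fetch(user_agent, url)]


def polite_iter(urls, delay_seconds=1.0):
    """Yield URLs one at a time, sleeping between consecutive items."""
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)
        yield url


parser = build_parser(ROBOTS_LINES)
candidates = [
    "https://example.com/public/page",
    "https://example.com/private/page",
]
to_fetch = allowed_urls(parser, candidates)
print(to_fetch)
```

Looping over `polite_iter(to_fetch)` and calling `requests.get` inside the loop would then space the requests out by `delay_seconds`.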

Unless otherwise noted, articles on this site are original work. If you repost this article, please credit the source: python利用beautifulSoup实现爬虫 - Python技术站
