Challenges in Scraping Web Page Images with Python

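1. Locating resources

To scrape images from a web page, you first need to find the URLs of the image resources. There are two common approaches:

1.1 Reading image links directly from the page source

In the page source, images are usually referenced through <img> tags, and the image path is the value of the tag's src attribute. Fetching the page with the requests library and parsing the HTML yields the path of every image that appears as an <img> tag in the static markup. Note that src values are often relative paths, so they must be resolved against the page URL before they can be downloaded.

Example code: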

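import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = "https://www.example.com/"
# A timeout keeps the request from hanging indefinitely
response = requests.get(url, timeout=10)
response.raise_for_status()
soup = BeautifulSoup(response.text, "html.parser")

# Only <img> tags that actually carry a src attribute
for img in soup.find_all("img", src=True):
    # Resolve relative paths against the page URL
    print(urljoin(url, img["src"]))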
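
Some pages keep the real image URL in a lazy-loading attribute rather than in src, and many src values are relative. The following is a minimal sketch of how both cases might be handled. It uses the standard library's html.parser so that it runs without network access. The names ImgCollector and collect_img_urls, and the choice of the data-src attribute, are assumptions made for illustration and are not part of the example above.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImgCollector(HTMLParser):
    """Collects absolute image URLs from <img> tags (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr_map = dict(attrs)
        # Assumption: lazy-loading pages keep the real URL in data-src
        src = attr_map.get("data-src") or attr_map.get("src")
        if src:
            # Resolve relative paths against the page URL
            self.urls.append(urljoin(self.base_url, src))


def collect_img_urls(html, base_url):
    parser = ImgCollector(base_url)
    parser.feed(html)
    return parser.urls


sample = (
    '<img src="/img/a.jpg">'
    '<img data-src="b.png" src="placeholder.gif">'
    '<img alt="no source">'
)
print(collect_img_urls(sample, "https://www.example.com/gallery/"))
```

Tags without any usable source attribute are skipped, so the result contains only URLs that can be passed on to a download step.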

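1.2 Parsing JS code to obtain resource links

Not every image is referenced by an <img> tag. Image URLs can also appear in the scripts and style declarations embedded in the page, for example in a background-image: url(...) declaration. A regular expression, or a dedicated JS parsing library, can pull the required paths out of that text.

Example code: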

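import re
import requests

url = "https://www.example.com/"
html = requests.get(url, timeout=10).text
# Allow optional whitespace after the colon and inside the parentheses
img_list = re.findall(r"background-image:\s*url\(\s*(.*?)\s*\)", html)

for img in img_list:
    # Strip the optional quotes around the URL
    print(img.strip("'\""))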
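
A url(...) expression may be written with double quotes, single quotes, or no quotes at all, and it also appears in properties other than background-image. As a rough sketch of a more tolerant pattern, the helper below matches any url(...) expression and returns the value without its quotes. The name extract_css_urls and the sample text are hypothetical.

```python
import re

# Matches url(...) with optional whitespace and optional matching quotes
CSS_URL_RE = re.compile(r"""url\(\s*(['"]?)(.*?)\1\s*\)""")


def extract_css_urls(text):
    """Return the values of all url(...) expressions, without quotes."""
    return [match.group(2) for match in CSS_URL_RE.finditer(text)]


sample = """
.banner { background-image: url("/img/banner.jpg"); }
.icon   { background-image:url( 'icon.png' ) }
.plain  { background: url(https://www.example.com/bg.gif) no-repeat; }
"""
for found in extract_css_urls(sample):
    print(found)
```

This only finds URLs that are written out literally in the page text. URLs that a script assembles at run time will not be matched by any regular expression over the source.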

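2. Downloading resources

Once the resources have been located, they need to be downloaded and saved locally.

2.1 Downloading directly with the requests library

When downloading with requests, treat the image as binary data: read the body from response.content (bytes, not response.text) and open the output file in "wb" mode. Also give the saved file a correct name, for example the last segment of the URL path.

Example code: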

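import os
import requests
from urllib.parse import urlparse

url = "https://www.example.com/img/example.jpg"
response = requests.get(url, timeout=10)

if response.status_code == 200:
    # Take the file name from the URL path, here "example.jpg"
    file_name = os.path.basename(urlparse(url).path)
    # "wb" writes the raw bytes without any text decoding
    with open(file_name, "wb") as f:
        f.write(response.content)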
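
Choosing the file name is easy to get wrong when the URL carries a query string or ends in a slash. One possible approach, sketched below, takes the last segment of the URL path and falls back to a default when there is none. The function name filename_from_url and the default "image.bin" are assumptions for illustration.

```python
import posixpath
from urllib.parse import unquote, urlparse


def filename_from_url(url, default="image.bin"):
    """Derive a local file name from the last segment of the URL path."""
    # urlparse separates the path from the query string and fragment
    path = urlparse(url).path
    # unquote turns escapes such as %20 back into plain characters
    name = posixpath.basename(unquote(path))
    return name or default


print(filename_from_url("https://www.example.com/img/example.jpg?size=large"))
print(filename_from_url("https://www.example.com/"))
```

The sketch does not guard against name collisions or unusual names, so a real scraper would probably add its own checks before writing to disk.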

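2.2 Downloading with the urllib library

The standard library's urllib can download resources as well, with no third-party dependency.

Example code: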

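import urllib.request

url = "https://www.example.com/img/example.jpg"

# urlopen raises urllib.error.HTTPError for 4xx/5xx responses,
# and the with statement closes the connection when done
with urllib.request.urlopen(url, timeout=10) as response:
    if response.status == 200:
        with open("example.jpg", "wb") as f:
            f.write(response.read())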
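
Calling response.read() loads the whole image into memory before it is written. For large files it may be preferable to copy the data in chunks. The sketch below does this with shutil.copyfileobj, which works on any binary file-like object, including the one returned by urllib.request.urlopen. Here an io.BytesIO object stands in for the response so that the example runs offline. The name save_stream and the chunk size are assumptions.

```python
import io
import os
import shutil
import tempfile


def save_stream(source, path, chunk_size=64 * 1024):
    """Copy a binary file-like object to path in chunks; return the file size."""
    with open(path, "wb") as f:
        shutil.copyfileobj(source, f, chunk_size)
    return os.path.getsize(path)


# io.BytesIO stands in for the object returned by urllib.request.urlopen
fake_response = io.BytesIO(b"\xff\xd8\xff" + b"\x00" * 1000)
target = os.path.join(tempfile.mkdtemp(), "example.jpg")
print(save_stream(fake_response, target))
```

With a real download, the response object from urlopen would be passed as source in place of fake_response.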

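3. Error handling

Network requests inevitably fail in various ways, such as the server refusing access or the connection timing out. For the code to be stable and reliable, these errors must be handled promptly.

3.1 Adding a retry mechanism

Some errors are transient and sporadic, such as a dropped connection or a busy server. These can be handled by retrying the request, for example with the third-party retrying library.

Example code: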

import requests
from retrying import retry

# Try at most 3 times, waiting 2 seconds between attempts
@retry(stop_max_attempt_number=3, wait_fixed=2000)
def download_img(url, file_name):
    response = requests.get(url, timeout=10)
    if response.status_code != 200:
        # Any exception raised here triggers another attempt
        raise Exception("response status code is not 200")

    with open(file_name, "wb") as f:
        f.write(response.content)

url = "https://www.example.com/img/example.jpg"
file_name = "example.jpg"

try:
    download_img(url, file_name)
except Exception as e:
    # Reached only after all 3 attempts have failed
    print("error: ", e)
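
To make the retry idea concrete without a third-party dependency, here is a minimal sketch of a retry decorator that doubles the waiting time after each failure. It is not how the retrying library is implemented. The name retry_with_backoff and its parameters are hypothetical, and the flaky function only imitates a transient network error.

```python
import functools
import time


def retry_with_backoff(max_attempts=3, base_delay=1.0, exceptions=(Exception,)):
    """Retry the wrapped function, doubling the delay after each failure."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == max_attempts:
                        raise  # out of attempts: let the caller see the error
                    time.sleep(base_delay * 2 ** (attempt - 1))
        return wrapper
    return decorator


calls = []


@retry_with_backoff(max_attempts=3, base_delay=0.01)
def flaky():
    # Fails twice, then succeeds, to imitate a transient network error
    calls.append(1)
    if len(calls) < 3:
        raise ConnectionError("temporary failure")
    return "ok"


print(flaky(), len(calls))
```

Restricting the exceptions argument to network-related exception types keeps programming errors from being retried by accident.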

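3.2 Handling errors with exception catching

Errors that retrying cannot fix should be caught and handled as exceptions. For example, a 404 status code means the requested resource does not exist, so requesting it again will not help.

Example code: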

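import requests

url = "https://www.example.com/img/example.jpg"
file_name = "example.jpg"

try:
    response = requests.get(url, timeout=10)
    # Raises requests.exceptions.HTTPError for 4xx/5xx status codes
    response.raise_for_status()
    with open(file_name, "wb") as f:
        f.write(response.content)
except requests.exceptions.HTTPError as e:
    print("request failed, response status: %d" % e.response.status_code)
except requests.exceptions.RequestException as e:
    # Connection errors, timeouts, and other request failures
    print("request failed:", e)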
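
Whether a failed request is worth retrying depends largely on its status code. The sketch below shows one way to separate the two cases discussed in this section. The set of retryable codes is an assumption based on common practice, not a fixed rule, and the name classify_status is hypothetical.

```python
# Assumption: these codes usually indicate a temporary condition
RETRYABLE_STATUS = {408, 429, 500, 502, 503, 504}


def classify_status(status_code):
    """Return 'ok', 'retry', or 'fail' for an HTTP status code."""
    if 200 <= status_code < 300:
        return "ok"
    if status_code in RETRYABLE_STATUS:
        return "retry"
    # Everything else, such as 404, will not be fixed by trying again
    return "fail"


for code in (200, 404, 503):
    print(code, classify_status(code))
```

A download function could raise only for the "retry" case inside a retried call, and report the "fail" case to the caller immediately.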

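4. Summary

This article covered the main steps of scraping images from web pages: locating resources, downloading them, and handling errors. With these, readers should have the basic methods for scraping web page images in Python and know how to use the common modules involved.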

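Unless otherwise noted, articles on this site are original. If you repost this article, please credit the source: Challenges in Scraping Web Page Images with Python - Python技术站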
