Simple Python Crawler Code for Downloading Images

This guide walks through simple Python crawler code for downloading images, with worked examples:

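What is a Python crawler?

A Python crawler is a program written in Python that automatically fetches data from websites. It is a powerful tool for collecting large amounts of data quickly.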

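How do you download images with a Python crawler?

Crawling images works much like crawling ordinary text. The difference lies in how the data is downloaded and handled: an image is binary data, so it has to be fetched as raw bytes and written to a file in binary mode.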
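Because image data is binary, it can be sanity-checked before it is written to disk. The helper below is a minimal sketch and is not part of the original scripts; the name looks_like_image and the short list of file signatures are assumptions chosen for illustration.

```python
# Leading "magic" bytes of a few common image formats.
IMAGE_SIGNATURES = (
    b'\x89PNG\r\n\x1a\n',  # PNG
    b'\xff\xd8\xff',       # JPEG
    b'GIF87a',             # GIF (1987 version)
    b'GIF89a',             # GIF (1989 version)
)


def looks_like_image(data):
    """Return True if data starts with one of the known image signatures."""
    # bytes.startswith accepts a tuple of candidate prefixes.
    return data.startswith(IMAGE_SIGNATURES)
```

A download function could call looks_like_image(response.content) and skip the file when a server answers with an HTML error page instead of an image.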

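Step 1: Identify the page that contains the images

First, decide which page holds the images you want. You can find the image addresses by viewing the page source or by using the browser's developer tools.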

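Step 2: Analyze the page structure

Once you know the page, analyze its structure to find the HTML elements that contain the images, and from that decide how to extract them. In most cases images appear on the page as img tags, with the image address in the src attribute.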
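To make the analysis step concrete, the sketch below lists the src value of every img tag in a small HTML fragment. It uses the standard library's html.parser so it runs without extra packages, which differs from the BeautifulSoup approach used in the article's scripts. The class name ImgSrcCollector and the sample HTML are made up for this illustration.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag that has one."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; a value may be None.
        if tag == 'img':
            src = dict(attrs).get('src')
            if src:
                self.sources.append(src)


# Hypothetical page fragment used only for this illustration.
sample_html = '''
<div class="gallery">
  <img src="https://www.example.com/images/a.jpg" alt="A">
  <img src="/images/b.png">
  <img alt="no source">
</div>
'''

collector = ImgSrcCollector()
collector.feed(sample_html)
print(collector.sources)
# ['https://www.example.com/images/a.jpg', '/images/b.png']
```

The second entry is a relative address, and the third tag has no src at all. Both cases are common on real pages and need handling before download.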

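Step 3: Write the crawler

With the first two steps done, you can write the crawler. The following simple example finds the img tags on a given page, downloads the images whose src is an absolute URL, and saves them locally: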

import requests
from bs4 import BeautifulSoup
import os

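def download_image(url, save_dir):
    """Download a single image and save it under save_dir."""
    try:
        response = requests.get(url, timeout=5)
        if response.status_code == 200:
            # Use the last path segment as the file name, without any query string.
            file_name = url.split('?')[0].split('/')[-1]
            file_path = os.path.join(save_dir, file_name)
            # Image data is binary, so write the raw bytes in 'wb' mode.
            with open(file_path, 'wb') as f:
                f.write(response.content)
            print('Downloaded image: %s' % url)
        else:
            print('Failed to download image: %s' % url)
    except Exception as e:
        print('Error: %s' % e)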

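def crawl_images(url, save_dir):
    """Download every image on the page at url that has an absolute src."""
    try:
        response = requests.get(url, timeout=5)
        if response.status_code == 200:
            soup = BeautifulSoup(response.content, 'html.parser')
            for img in soup.find_all('img'):
                img_url = img.get('src')
                # Skip tags without src and relative addresses; a substring test
                # such as 'http' in img_url would also match paths like /http-logo.png.
                if img_url and img_url.startswith(('http://', 'https://')):
                    download_image(img_url, save_dir)
        else:
            print('Failed to crawl images from: %s' % url)
    except Exception as e:
        print('Error: %s' % e)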

url = 'https://www.example.com/images/'
save_dir = './images/'
# Create the target directory (and any missing parents) if it does not exist yet.
os.makedirs(save_dir, exist_ok=True)
crawl_images(url, save_dir)

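This program downloads the images found on the page at the given URL and saves them locally. It uses the Requests library to fetch pages and files, and the BeautifulSoup library to parse the HTML. The download_image function downloads a single image, and the crawl_images function finds the img tags on one page and downloads each image. It does not follow links, so it covers a single page rather than a whole site.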
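The script above only handles src values that are already absolute URLs, but many pages use relative or protocol-relative addresses. One possible way to cover those cases is urljoin from the standard library, sketched below. The helper name resolve_image_url and the sample addresses are assumptions for illustration, not part of the original program.

```python
from urllib.parse import urljoin


def resolve_image_url(page_url, src):
    """Turn an <img> src value into an absolute URL, or return None to skip it."""
    if not src or src.startswith('data:'):
        # Missing src, or an inline data URI with nothing to download.
        return None
    return urljoin(page_url, src)


page = 'https://www.example.com/images/'
print(resolve_image_url(page, 'cat.jpg'))
# https://www.example.com/images/cat.jpg
print(resolve_image_url(page, '/static/dog.png'))
# https://www.example.com/static/dog.png
print(resolve_image_url(page, '//cdn.example.com/bird.gif'))
# https://cdn.example.com/bird.gif
```

Inside crawl_images, the result of resolve_image_url(url, img.get('src')) could be passed to download_image whenever it is not None.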

Example 1: Downloading the daily image from cn.bing.com

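import requests
import os
from urllib.parse import parse_qs, urlparse

def download_image(url, save_dir):
    """Download a single image and save it under save_dir."""
    try:
        response = requests.get(url, timeout=5)
        if response.status_code == 200:
            # Bing image URLs typically look like /th?id=<name>.jpg&..., so take
            # the file name from the "id" parameter and fall back to the path.
            parsed = urlparse(url)
            ids = parse_qs(parsed.query).get('id')
            file_name = ids[0] if ids else os.path.basename(parsed.path)
            file_path = os.path.join(save_dir, file_name)
            # Image data is binary, so write the raw bytes in 'wb' mode.
            with open(file_path, 'wb') as f:
                f.write(response.content)
            print('Downloaded image: %s' % url)
        else:
            print('Failed to download image: %s' % url)
    except Exception as e:
        print('Error: %s' % e)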

def crawl_images(url, save_dir):
    try:
        response = requests.get(url, timeout=5)
        if response.status_code == 200:
            # The endpoint returns JSON; images[0]['url'] is a path relative to cn.bing.com.
            image_url = 'https://cn.bing.com' + response.json()['images'][0]['url']
            download_image(image_url, save_dir)
        else:
            print('Failed to crawl images from: %s' % url)
    except Exception as e:
        print('Error: %s' % e)

url = 'https://cn.bing.com/HPImageArchive.aspx?format=js&idx=0&n=1'
save_dir = './images/'
# Create the target directory (and any missing parents) if it does not exist yet.
os.makedirs(save_dir, exist_ok=True)
crawl_images(url, save_dir)

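This program downloads Bing's daily image and saves it to the specified folder. Unlike the first script, it does not parse HTML: the HPImageArchive.aspx endpoint returns JSON, and the image path is read from the first entry of its images list.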
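The sketch below shows the JSON handling offline, using a hand-written sample instead of a live request. Only the images[0]['url'] field comes from the script above. The sample value, the helper name image_url_and_name, and the assumption that the file name sits in an id query parameter are illustrative.

```python
import json
from urllib.parse import parse_qs, urljoin, urlparse

# Hand-written sample shaped like the response; the value is made up.
sample_response = json.loads('''
{"images": [{"url": "/th?id=OHR.Example_1920x1080.jpg&pid=hp"}]}
''')


def image_url_and_name(data, base='https://cn.bing.com'):
    """Return (absolute image URL, file name) for the first image entry."""
    relative_url = data['images'][0]['url']
    absolute_url = urljoin(base, relative_url)
    parsed = urlparse(absolute_url)
    # Prefer the "id" query parameter as the file name when it is present,
    # otherwise fall back to the last segment of the path.
    ids = parse_qs(parsed.query).get('id')
    file_name = ids[0] if ids else parsed.path.rsplit('/', 1)[-1]
    return absolute_url, file_name


image_url, file_name = image_url_and_name(sample_response)
print(image_url)
print(file_name)
```

Taking the name from the query parameter avoids file names that contain ? or &, which some file systems reject.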

Example 2: Downloading high-resolution wallpapers from Unsplash

import requests
from bs4 import BeautifulSoup
import os

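def download_image(url, save_dir):
    """Download a single image and save it under save_dir."""
    try:
        response = requests.get(url, timeout=5)
        if response.status_code == 200:
            # Use the last path segment as the file name, without the query string.
            file_name = url.split('?')[0].split('/')[-1]
            file_path = os.path.join(save_dir, file_name)
            # Image data is binary, so write the raw bytes in 'wb' mode.
            with open(file_path, 'wb') as f:
                f.write(response.content)
            print('Downloaded image: %s' % url)
        else:
            print('Failed to download image: %s' % url)
    except Exception as e:
        print('Error: %s' % e)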

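def crawl_images(url, save_dir):
    """Download every photo referenced by an <img> tag on the page at url."""
    try:
        response = requests.get(url, timeout=5)
        if response.status_code == 200:
            soup = BeautifulSoup(response.content, 'html.parser')
            for img in soup.find_all('img'):
                img_url = img.get('src')
                # Skip tags without src (img_url is None) and images that are not photos.
                if img_url and 'photo-' in img_url:
                    # Add a scheme only to protocol-relative URLs (//host/path);
                    # prefixing an absolute URL with 'https:' would break it.
                    if img_url.startswith('//'):
                        img_url = 'https:' + img_url
                    # Replace the original query string with fixed size and compression options.
                    img_url = img_url.split('?')[0] + '?auto=compress&cs=tinysrgb&h=750&w=1260'
                    download_image(img_url, save_dir)
        else:
            print('Failed to crawl images from: %s' % url)
    except Exception as e:
        print('Error: %s' % e)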

url = 'https://unsplash.com/s/photos/wallpapers'
save_dir = './wallpapers/'
# Create the target directory (and any missing parents) if it does not exist yet.
os.makedirs(save_dir, exist_ok=True)
crawl_images(url, save_dir)

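This program downloads wallpapers from Unsplash and saves them to the specified folder. Note that the image URLs on this site need processing before download: the script keeps only the part of each URL before the ? and appends its own query string, which requests a compressed image at a fixed width and height.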
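The URL processing can also be done with urllib.parse instead of string concatenation, which handles protocol-relative addresses and existing query strings in one place. This is a minimal sketch: the helper name with_size_params and the sample URLs are hypothetical, and the parameter values are simply the ones used in the script above.

```python
from urllib.parse import urlencode, urlsplit, urlunsplit


def with_size_params(img_url, width=1260, height=750):
    """Replace the query string of img_url with fixed size and compression options."""
    parts = urlsplit(img_url)
    query = urlencode({'auto': 'compress', 'cs': 'tinysrgb',
                       'h': height, 'w': width})
    # Protocol-relative URLs (//host/path) have an empty scheme; default to https.
    scheme = parts.scheme or 'https'
    return urlunsplit((scheme, parts.netloc, parts.path, query, ''))


# Hypothetical photo address used only for this illustration.
print(with_size_params('https://images.unsplash.com/photo-123?ixid=abc&w=500'))
# https://images.unsplash.com/photo-123?auto=compress&cs=tinysrgb&h=750&w=1260
```

Passing different width and height values changes the requested size without touching the rest of the URL.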

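Unless otherwise stated, articles on this site are original. If you repost this article, please credit the source: Simple Python Crawler Code for Downloading Images - Python技术站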