How to Batch-Download Images with a Java Crawler

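Prerequisites
Before starting, set up the following tools:
JDK: a JDK must be installed. This article uses JDK 11, the latest version at the time of writing.
IntelliJ IDEA: the official IntelliJ IDEA release serves as the development tool.
Crawling the site
First we need a suitable site to crawl images from. This article uses Huaban (花瓣网) as the example, since it hosts many high-quality images available for download: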
http://huaban.com/

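To collect the images on Huaban, we first need to know their URLs. Chrome's developer tools reveal the pattern behind them:

Locating an image takes two pieces of information: the link of the page the image sits on, and the image's file name.
The page link can be found on Huaban's pin pages.
The file name can be found in the page's HTML source.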
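
To make the rule above concrete, here is a minimal sketch of pulling image URLs out of a page's source and taking the file name from each one. The class name ImageUrlExtractor, its method names, and the HTML fragment in main are made up for illustration and are not taken from the real site; the sketch also assumes the images appear as plain img tags with a .jpg src attribute.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
final class ImageUrlExtractor {
    // Matches the src attribute of <img> tags that point at .jpg files.
    private static final Pattern IMG_SRC =
            Pattern.compile("<img[^>]*\\ssrc=\"([^\"]+\\.jpg)\"");

    /** Returns every .jpg URL found in an img src attribute, in document order. */
    static List<String> extractImageUrls(String html) {
        List<String> urls = new ArrayList<>();
        Matcher m = IMG_SRC.matcher(html);
        while (m.find()) {
            urls.add(m.group(1));
        }
        return urls;
    }

    /** Returns the part of the URL after the last slash. */
    static String fileNameOf(String imageUrl) {
        return imageUrl.substring(imageUrl.lastIndexOf('/') + 1);
    }

    public static void main(String[] args) {
        // Assumed page fragment, not copied from the real site.
        String html = "<div><img class=\"pin\" src=\"http://img.hb.aicdn.com/abc123.jpg\">"
                + "<img src=\"http://img.hb.aicdn.com/def456.jpg\"></div>";
        for (String url : extractImageUrls(html)) {
            System.out.println(url + " -> " + fileNameOf(url));
        }
    }
}
```

Regular expressions are enough for a quick experiment like this, but they break easily when the markup changes, so the pattern should be checked against the actual page source before relying on it.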
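Writing the code
The main logic of the code is as follows:
Step 1: get the links to the pin pages on Huaban
Step 2: iterate over the pin pages and collect the URL of every image on each page
Step 3: iterate over the image URLs and download each image to local disk

The following Java code is an example, for reference only: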

import java.io.*;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HuaBanCrawler {
    public static void main(String[] args) {
        String baseUrl = "http://huaban.com/";
        String prefixUrl = "http://img.hb.aicdn.com/";
        // Dots are escaped so that they match a literal "." rather than any character
        String regex = "<a href=\"(http://huaban\\.com/p/[0-9]+/)\"";
        String imgUrlRegex = "<img src=\"(http://img\\.hb\\.aicdn\\.com/[^\">]*\\.jpg)\"";
        // Compile the patterns once instead of on every loop iteration
        Pattern pattern = Pattern.compile(regex);
        Pattern imgUrlPattern = Pattern.compile(imgUrlRegex);

        // Create the directory the images are saved to
        String dirName = "huaban";
        File dir = new File(dirName);
        if (!dir.exists() && !dir.mkdirs()) {
            System.err.println("Could not create directory: " + dirName);
            return;
        }

        try (BufferedReader reader = new BufferedReader(new InputStreamReader(System.in))) {
            System.out.print("Enter the number of pages to crawl: ");
            String str = reader.readLine();
            int pageCount;
            try {
                pageCount = Integer.parseInt(str == null ? "" : str.trim());
            } catch (NumberFormatException e) {
                System.err.println("The page count must be a whole number.");
                return;
            }

            for (int i = 1; i <= pageCount; i++) {
                String url = baseUrl + "popular/?iiz8rkh5&max=" + i + "&limit=20&wfl=1";
                String content = getContent(url);

                System.out.println("Crawling page " + i);

                // Find every pin page linked from the listing page
                Matcher matcher = pattern.matcher(content);
                while (matcher.find()) {
                    String articleUrl = matcher.group(1);
                    String articleContent = getContent(articleUrl);

                    // Find every image on the pin page
                    Matcher imgUrlMatcher = imgUrlPattern.matcher(articleContent);
                    while (imgUrlMatcher.find()) {
                        String imgUrl = imgUrlMatcher.group(1);
                        String fileName = imgUrl.substring(imgUrl.lastIndexOf("/") + 1);
                        // Drop the thumbnail size suffix so the plain .jpg name remains
                        fileName = fileName.replaceAll("jpg_[0-9]*x[0-9]*\\.jpg", "jpg");

                        // Download the image to local disk
                        saveImage(prefixUrl + fileName, dirName + File.separator + fileName);
                    }
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    /**
     * Fetches the content of a web page.
     * @param urlStr address of the page
     * @return the page content as a string (empty if the request fails)
     */
    public static String getContent(String urlStr) {
        StringBuilder content = new StringBuilder();
        // try-with-resources closes the reader and the underlying stream
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new URL(urlStr).openStream(), StandardCharsets.UTF_8))) {
            String temp;
            while ((temp = in.readLine()) != null) {
                content.append(temp);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        return content.toString();
    }

    /**
     * Saves an image to local disk.
     * @param imgUrl address of the image
     * @param fileName local file name to save the image under
     */
    public static void saveImage(String imgUrl, String fileName) {
        System.out.println("Downloading image: " + imgUrl);

        try (BufferedInputStream in = new BufferedInputStream(new URL(imgUrl).openStream());
             FileOutputStream out = new FileOutputStream(fileName)) {

            byte[] buf = new byte[1024];
            int length;
            while ((length = in.read(buf, 0, buf.length)) != -1) {
                out.write(buf, 0, length);
            }
            System.out.println("Image downloaded: " + imgUrl);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
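
As an alternative to the manual buffer loop in saveImage, the download step can also be sketched with java.nio's Files.copy. This is only one possible variant, not a replacement the article prescribes; the class name ImageSaver and the method save are made-up names, and the sketch assumes the URL can be opened without extra request headers.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, for illustration only.
final class ImageSaver {
    /**
     * Streams the resource at imgUrl into target, creating missing parent
     * directories and overwriting an existing file.
     * Returns the number of bytes written.
     */
    static long save(String imgUrl, Path target) throws IOException {
        Path parent = target.toAbsolutePath().getParent();
        if (parent != null) {
            Files.createDirectories(parent);
        }
        try (InputStream in = URI.create(imgUrl).toURL().openStream()) {
            return Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }

    public static void main(String[] args) throws IOException {
        // Expects two arguments: the image URL and the output file.
        if (args.length != 2) {
            System.err.println("Usage: java ImageSaver.java <image-url> <output-file>");
            return;
        }
        long bytes = save(args[0], Paths.get(args[1]));
        System.out.println("Saved " + bytes + " bytes to " + args[1]);
    }
}
```

Unlike saveImage, this version lets the IOException propagate, so the caller can decide whether to retry, skip the image, or stop the crawl.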

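Examples
Here are two further examples of batch-downloading images with a Java crawler:

Example 1: crawling Tmall product images
The goal is to crawl all the images on a Tmall product page, taking this product as the example: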
https://detail.tmall.com/item.htm?spm=a1z10.1-b-static.w5003-20132274562.2.130d14e0S4elA9&id=641203769634&sku_properties=5919063:6536025

Chrome's developer tools show the pattern behind the image URLs on that page:

URL format of images stored under the tbimg directory:
https://img.alicdn.com/bao/uploaded/i4//O1CN01jAfoz91TVcyymlrT9_!!2712471981-2-hcitemgroup.png_360x360.jpg

URL format of images stored under the desc directory:
https://img.alicdn.com/bao/uploaded/i2/2206699943073/O1CN01JsrL3p1Fm9CQWJqTZ_!!0-item_pic.jpg

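Based on these patterns, we can write Java code that crawls the images on a Tmall product page.
The regular expressions in the sample code are tailored to the target site, so adjust them as needed for a different site.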
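
The two URL formats above could be handled roughly as follows. This is a sketch under assumptions: the class AlicdnImageUrls and its methods are made-up names, the page HTML is assumed to contain absolute https://img.alicdn.com/ URLs (images injected by JavaScript would not show up in the raw source), and treating the URL without the _360x360.jpg suffix as the original image is a guess based on the format shown, not confirmed behavior.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
final class AlicdnImageUrls {
    // An absolute img.alicdn.com URL ending in .jpg or .png.
    private static final Pattern IMAGE_URL =
            Pattern.compile("https://img\\.alicdn\\.com/[^\"'\\s<>]+\\.(?:jpg|png)");
    // A trailing thumbnail size such as _360x360.jpg.
    private static final Pattern SIZE_SUFFIX = Pattern.compile("_\\d+x\\d+\\.jpg$");

    /** Collects the distinct image URLs in the HTML, in order of first appearance. */
    static Set<String> extract(String html) {
        Set<String> urls = new LinkedHashSet<>();
        Matcher m = IMAGE_URL.matcher(html);
        while (m.find()) {
            urls.add(m.group());
        }
        return urls;
    }

    /** Removes a trailing size suffix; URLs without one are returned unchanged. */
    static String stripSizeSuffix(String url) {
        return SIZE_SUFFIX.matcher(url).replaceFirst("");
    }

    public static void main(String[] args) {
        // Assumed page fragment built from the two URL formats shown above.
        String html = "<img src=\"https://img.alicdn.com/bao/uploaded/i4//O1CN01jAfoz91TVcyymlrT9_!!2712471981-2-hcitemgroup.png_360x360.jpg\">"
                + "<img src=\"https://img.alicdn.com/bao/uploaded/i2/2206699943073/O1CN01JsrL3p1Fm9CQWJqTZ_!!0-item_pic.jpg\">";
        for (String url : extract(html)) {
            System.out.println(stripSizeSuffix(url));
        }
    }
}
```

Collecting the URLs in a LinkedHashSet keeps the same image from being downloaded twice when a page references it more than once.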

Example 2: crawling high-quality images from Unsplash
The goal is to fetch high-quality images from Unsplash, a professional photo site:
https://unsplash.com/

The image URL format and how to call it can be found on Unsplash's API page:

Image URL format:
https://source.unsplash.com/random/${width}x${height}
Usage:
Example 1: get a random 1024x768 image
https://source.unsplash.com/random/1024x768

Based on this format, we can write Java code that fetches the target images by calling the API.
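
A minimal sketch of that call might look like the following. The class UnsplashRandom, its method names, the timeouts, and the output file name are assumptions made for illustration. The sketch also assumes the endpoint is still served in the form shown above, which is worth verifying before use.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, for illustration only.
final class UnsplashRandom {
    private static final String BASE = "https://source.unsplash.com/random/";

    /** Builds the random-image URL for the given size, e.g. .../random/1024x768. */
    static String buildUrl(int width, int height) {
        if (width <= 0 || height <= 0) {
            throw new IllegalArgumentException("width and height must be positive");
        }
        return BASE + width + "x" + height;
    }

    /** Downloads the resource at url into target, failing on any non-200 status. */
    static void download(String url, Path target) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setConnectTimeout(10_000); // assumed timeout values, in milliseconds
        conn.setReadTimeout(30_000);
        try {
            int status = conn.getResponseCode();
            if (status != HttpURLConnection.HTTP_OK) {
                throw new IOException("Unexpected HTTP status " + status + " for " + url);
            }
            try (InputStream in = conn.getInputStream()) {
                Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
            }
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        String url = buildUrl(1024, 768);
        download(url, Paths.get("random-1024x768.jpg"));
        System.out.println("Saved " + url);
    }
}
```

Checking the status code before copying avoids saving an HTML error page under a .jpg name when the request fails.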

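Unless otherwise noted, articles on this site are original. If you repost this article, please credit the source: How to Batch-Download Images with a Java Crawler - Python技术站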
