Implementing Web Scraping with a Java Crawler

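A Java crawler imitates a browser to visit web pages automatically and extract the information it needs. The work breaks down into the following steps: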

1. Overview of a web crawler's basic workflow

1.1 Fetching the page

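The information to be scraped usually lives in a web page, so the first step is to request the target site. Because the crawler has to behave like a browser, the request is normally sent with an HTTP client class such as java.net.HttpURLConnection or Apache HttpClient (org.apache.http.client.HttpClient).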
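As a rough sketch of this step, the class below sends a GET request with HttpURLConnection and reads the response body as text. The class name PageFetcher, its method names, the timeout values and the UTF-8 default are assumptions made for illustration; they do not come from the article.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; the name and defaults are assumptions for illustration.
final class PageFetcher {

    // Reads an entire stream as text, normalising line endings to '\n'.
    static String readBody(InputStream in, Charset charset) {
        StringBuilder body = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                body.append(line).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return body.toString();
    }

    // Sends a GET request and returns the body decoded as UTF-8 (an assumed charset).
    static String fetch(String url) throws IOException {
        HttpURLConnection connection = (HttpURLConnection) new URL(url).openConnection();
        connection.setRequestMethod("GET");
        connection.setConnectTimeout(5_000);  // assumed timeout, in milliseconds
        connection.setReadTimeout(10_000);    // assumed timeout, in milliseconds
        try {
            return readBody(connection.getInputStream(), StandardCharsets.UTF_8);
        } finally {
            connection.disconnect();
        }
    }
}
```

Keeping the stream-reading code separate from the network call makes it possible to exercise the decoding logic without touching a real site.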

1.2 Parsing the page content

Once the page has been fetched, its content has to be parsed. HTML is usually parsed with a library such as jsoup (org.jsoup.Jsoup), while JSON content can be parsed with com.alibaba.fastjson.JSONObject.

1.3 Extracting the data

With the content parsed, the next step is to pull out the data we actually need, usually by matching it with regular expressions or XPath expressions.

1.4 Saving the data

The extracted data then has to be persisted, either with Java's file I/O APIs or in a database.
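One simple way to persist scraped records with the file APIs is to append them to a CSV file. The sketch below is only an illustration: the class name CsvSaver, the two-column layout and the temporary file are assumptions, and a database would follow the same "one record per call" shape.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper class; the name and file layout are assumptions for illustration.
final class CsvSaver {

    // Quotes a field when it contains a comma, quote or line break, doubling embedded quotes.
    static String escape(String field) {
        if (field.contains(",") || field.contains("\"") || field.contains("\n") || field.contains("\r")) {
            return "\"" + field.replace("\"", "\"\"") + "\"";
        }
        return field;
    }

    // Joins the escaped fields into one CSV row (without the trailing line break).
    static String toCsvLine(List<String> fields) {
        List<String> escaped = new ArrayList<>();
        for (String field : fields) {
            escaped.add(escape(field));
        }
        return String.join(",", escaped);
    }

    // Appends one row to the file, creating the file if it does not exist yet.
    static void append(Path file, List<String> fields) throws IOException {
        Files.write(file, Collections.singletonList(toCsvLine(fields)), StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("crawl-", ".csv");
        append(file, Arrays.asList("Java crawler, part 1", "https://example.com/a"));
        append(file, Arrays.asList("Plain title", "https://example.com/b"));
        for (String row : Files.readAllLines(file, StandardCharsets.UTF_8)) {
            System.out.println(row);
        }
    }
}
```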

2. Implementation details

2.1 Building the URL

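Requesting a page starts with building its URL, which requires understanding the URL scheme of the target site. Once the URL string has been assembled from that pattern, the java.net.URL class can turn it into a URL object that is convenient for later processing. Query values that contain non-ASCII or reserved characters should be percent-encoded first.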
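A minimal sketch of building a query URL follows. The class name SearchUrlBuilder and its method are hypothetical; the base address and the wd parameter are the ones used in the article's Baidu example, and the keyword is written with Unicode escapes so the source file stays ASCII.

```java
import java.io.UnsupportedEncodingException;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLEncoder;

// Hypothetical helper class; the name is an assumption for illustration.
final class SearchUrlBuilder {

    // Builds "<base>?<param>=<value>" with the value percent-encoded as UTF-8.
    static String build(String base, String param, String value) {
        try {
            return base + "?" + param + "=" + URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) throws MalformedURLException {
        // The keyword is "Java" followed by the two characters U+722C U+866B.
        URL url = new URL(build("https://www.baidu.com/s", "wd", "Java\u722c\u866b"));
        System.out.println(url.getHost());   // www.baidu.com
        System.out.println(url.getPath());   // /s
        System.out.println(url.getQuery());  // wd=Java%E7%88%AC%E8%99%AB
    }
}
```

Note that URLEncoder produces form encoding, so a space in the value becomes "+" rather than "%20".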

2.2 Imitating a browser

Imitating browser behaviour is the core of a crawler implementation. It generally involves the following:

Spoofing the User-Agent header

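So that the target site treats the request as coming from an ordinary browser, add a User-Agent request header. The value is usually the string sent by a mainstream browser such as Chrome, which begins with the Mozilla/5.0 token.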

connection.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36");

Keeping cookies

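Many sites require a login before their other pages can be visited, so the crawler has to keep the cookies it receives and send them back on later requests. The java.net.CookieManager class in the JDK can manage cookies: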

CookieManager cookieManager = new CookieManager();
cookieManager.setCookiePolicy(CookiePolicy.ACCEPT_ALL);
CookieHandler.setDefault(cookieManager);
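To make the behaviour more concrete, the sketch below feeds a Set-Cookie header into a CookieManager by hand, which is what HttpURLConnection does behind the scenes once a default CookieHandler is installed, and then reads the stored cookie back. The class name CookieDemo, the example.com address and the sessionid cookie are made up for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.HttpCookie;
import java.net.URI;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical demo class; the name and sample values are assumptions for illustration.
final class CookieDemo {

    // Passes one Set-Cookie header value to a fresh CookieManager and returns
    // the cookies the manager has stored for the request URI.
    static List<HttpCookie> storeFromResponse(String requestUrl, String setCookieValue) {
        CookieManager manager = new CookieManager();
        manager.setCookiePolicy(CookiePolicy.ACCEPT_ALL);
        URI uri = URI.create(requestUrl);
        Map<String, List<String>> responseHeaders =
                Collections.singletonMap("Set-Cookie", Collections.singletonList(setCookieValue));
        try {
            manager.put(uri, responseHeaders);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return manager.getCookieStore().get(uri);
    }

    public static void main(String[] args) {
        List<HttpCookie> cookies = storeFromResponse("http://example.com/login", "sessionid=abc123; Path=/");
        for (HttpCookie cookie : cookies) {
            System.out.println(cookie.getName() + "=" + cookie.getValue());
        }
    }
}
```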

Handling redirects

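Some pages redirect to another page, so redirects have to be handled. HttpURLConnection follows redirects automatically by default, but not when the redirect switches protocol (for example from HTTP to HTTPS), and a crawler may also disable automatic following in order to inspect each hop. In those cases, check whether HttpURLConnection.getResponseCode() returns 302 (other redirect codes such as 301, 303, 307 and 308 can be treated the same way); if it does, read the target from the Location header and send a new request to it.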

if (httpConnection.getResponseCode() == HttpURLConnection.HTTP_MOVED_TEMP) {
    String redirectUrl = httpConnection.getHeaderField("Location");
    // Send a new request to redirectUrl (resolve it against the original URL first if it is relative)
}
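The Location header may hold either an absolute URL or a path relative to the page that was requested. The following sketch shows one way to recognise redirect status codes and resolve the target; the class name RedirectHelper and the example.com addresses are assumptions for illustration.

```java
import java.net.URI;

// Hypothetical helper class; the name is an assumption for illustration.
final class RedirectHelper {

    // True for the HTTP status codes that redirect to the URL in the Location header.
    static boolean isRedirect(int statusCode) {
        return statusCode == 301 || statusCode == 302 || statusCode == 303
                || statusCode == 307 || statusCode == 308;
    }

    // Resolves a Location value (absolute or relative) against the URL that was requested.
    static String resolve(String requestUrl, String location) {
        return URI.create(requestUrl).resolve(location).toString();
    }

    public static void main(String[] args) {
        System.out.println(isRedirect(302));                                 // true
        System.out.println(resolve("https://example.com/a/b", "/login"));    // https://example.com/login
        System.out.println(resolve("https://example.com/a/b", "c"));         // https://example.com/a/c
    }
}
```

A real crawler would also cap the number of hops it follows so that a redirect loop cannot run forever.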

2.3 Parsing the page content

Java crawlers mostly deal with data in two formats: HTML and JSON.

Parsing HTML

The org.jsoup.Jsoup class can parse HTML. With CSS selectors it is easy to locate specific elements and read their attributes and text:

Document document = Jsoup.parse(htmlString);
Elements links = document.select("#div_id .a_class");
for (Element link : links) {
    String href = link.attr("href");
    String text = link.text();
}

Parsing JSON

Alibaba's com.alibaba.fastjson.JSONObject class can be used to parse JSON data:

JSONObject jsonObject = JSONObject.parseObject(jsonString);
String name = jsonObject.getString("name");
int age = jsonObject.getIntValue("age");
JSONObject companyJson = jsonObject.getJSONObject("company");

2.4 Extracting the data

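After the page content has been parsed, the data we want has to be extracted from it. This can be done by matching with regular expressions or XPath expressions.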

Regular expressions

The java.util.regex package in the JDK supports regular expressions, which can be used to extract the target data.

// Replace the placeholder string with the actual pattern
Pattern pattern = Pattern.compile("regular expression");
Matcher matcher = pattern.matcher(content);
while (matcher.find()) {
    String result = matcher.group();
}
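As a concrete illustration of the loop above, the sketch below collects the href values of simple anchor tags with capture groups. The class name LinkExtractor and the pattern are assumptions for illustration; a regular expression like this only copes with straightforward markup, so an HTML parser is the safer choice for arbitrary pages.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the name and pattern are assumptions for illustration.
final class LinkExtractor {

    // Matches <a ... href="...">text</a>; group 1 is the href value, group 2 the link text.
    private static final Pattern LINK = Pattern.compile(
            "<a\\s+[^>]*?href=\"([^\"]*)\"[^>]*>(.*?)</a>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the href of every anchor tag the pattern finds, in document order.
    static List<String> extractHrefs(String html) {
        List<String> hrefs = new ArrayList<>();
        Matcher matcher = LINK.matcher(html);
        while (matcher.find()) {
            hrefs.add(matcher.group(1));
        }
        return hrefs;
    }

    public static void main(String[] args) {
        String html = "<a href=\"/a\">First</a> <a class=\"x\" href=\"/b\">Second</a>";
        System.out.println(extractHrefs(html)); // [/a, /b]
    }
}
```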

XPath expressions

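XPath is a language for selecting content in XML documents, and it can also be used to extract data from HTML documents. The javax.xml.xpath API in the JDK expects well-formed XML, so it works directly only on XHTML-style pages; ordinary HTML usually has to be cleaned up into a DOM first, for example with jsoup.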

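XPath xpath = XPathFactory.newInstance().newXPath();
// Replace the placeholder string with the actual expression; content must be well-formed XML
Node node = (Node) xpath.evaluate("XPath expression", new InputSource(new StringReader(content)), XPathConstants.NODE);
// evaluate returns null when no node matches
String result = node != null ? node.getTextContent() : null;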
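The following self-contained sketch runs an XPath query over a small well-formed fragment using only JDK classes. The class name XPathDemo and the sample markup are made up for illustration; real pages would first need to be converted into well-formed XML or a DOM.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical demo class; the name and sample markup are assumptions for illustration.
final class XPathDemo {

    // Evaluates the expression against well-formed XML and returns the result as a string.
    // An expression that matches nothing yields the empty string.
    static String evaluate(String xml, String expression) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            return xpath.evaluate(expression, new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Invalid XPath expression or malformed XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<div id=\"info\"><span class=\"pl\">Publisher:</span>"
                + "<span class=\"value\">Example Press</span></div>";
        System.out.println(evaluate(xml, "//div[@id='info']/span[@class='value']")); // Example Press
    }
}
```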

3. Examples

Two example Java crawlers follow:

3.1 Scraping Baidu search results

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class BaiduSearchCrawler {
    private static final String USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36";

    // Opens a GET connection that sends a browser-like User-Agent
    private static HttpURLConnection open(String url, boolean followRedirects) throws IOException {
        HttpURLConnection connection = (HttpURLConnection) new URL(url).openConnection();
        connection.setRequestMethod("GET");
        connection.setRequestProperty("User-Agent", USER_AGENT);
        connection.setInstanceFollowRedirects(followRedirects);
        return connection;
    }

    public static void main(String[] args) throws IOException {
        // URL-encode the keyword so that non-ASCII characters are valid in the query string
        String keyword = URLEncoder.encode("Java爬虫", "UTF-8");
        HttpURLConnection httpConnection = open("https://www.baidu.com/s?wd=" + keyword, false);
        // Handle the first redirect by hand; a relative Location is resolved against the request URL
        String location = httpConnection.getHeaderField("Location");
        if (location != null) {
            httpConnection = open(new URL(httpConnection.getURL(), location).toString(), true);
        }
        StringBuilder stringBuilder = new StringBuilder();
        // try-with-resources closes the reader and the underlying stream even if reading fails
        try (BufferedReader bufferedReader = new BufferedReader(
                new InputStreamReader(httpConnection.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = bufferedReader.readLine()) != null) {
                stringBuilder.append(line).append('\n');
            }
        }
        Document document = Jsoup.parse(stringBuilder.toString());
        // The selector targets the result titles; adjust it if the page markup differs
        Elements elements = document.select("#content_left h3.t a");
        for (Element element : elements) {
            String title = element.text();
            String link = element.attr("href");
            System.out.println(title + " -> " + link);
        }
    }
}

3.2 Scraping book information from Douban

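import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.jsoup.Jsoup;
import org.jsoup.helper.W3CDom;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class DoubanBookCrawler {
    // Returns the text that follows a label span, stopping at the line break that ends the field
    private static String valueAfter(Node label) {
        StringBuilder value = new StringBuilder();
        for (Node next = label.getNextSibling(); next != null && !"br".equals(next.getNodeName()); next = next.getNextSibling()) {
            value.append(next.getTextContent());
        }
        return value.toString().trim();
    }

    public static void main(String[] args) throws IOException, XPathExpressionException {
        String url = "https://book.douban.com/subject/1084336/";
        HttpURLConnection httpConnection = (HttpURLConnection) new URL(url).openConnection();
        httpConnection.setRequestMethod("GET");
        httpConnection.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36");
        Document dom;
        try (InputStream inputStream = httpConnection.getInputStream()) {
            // javax.xml.xpath needs a well-formed DOM, so let jsoup repair the HTML and convert it
            dom = new W3CDom().fromJsoup(Jsoup.parse(inputStream, "UTF-8", url));
        }
        XPath xpath = XPathFactory.newInstance().newXPath();
        // local-name() matches the div whether or not the converted DOM puts it in the XHTML namespace
        Node node = (Node) xpath.evaluate("//*[local-name()='div'][@id='info']", dom, XPathConstants.NODE);
        if (node == null) {
            System.err.println("No element with id 'info' found");
            return;
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE && child.getNodeName().equals("span")) {
                // The labels are matched in Chinese because that is how they appear on the page
                if (child.getTextContent().contains("出版社")) {
                    System.out.println("Publisher: " + valueAfter(child));
                }
                if (child.getTextContent().contains("原作名")) {
                    System.out.println("Original title: " + valueAfter(child));
                }
            }
        }
    }
}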

