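This guide walks through accurately extracting a web page's publication time in Java. It consists of the following steps:
1. Fetch the page's HTML source
Use a networking library such as HttpClient or Jsoup to send a request to the target page and read the HTML text it returns.
Example 1 - fetching the HTML source with HttpClient: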
import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class HtmlSourceExtractor {
    public static String getHtml(String url) throws Exception {
        HttpGet httpGet = new HttpGet(url);
        // try-with-resources closes both the client and the response
        try (CloseableHttpClient httpClient = HttpClients.createDefault();
             CloseableHttpResponse response = httpClient.execute(httpGet)) {
            HttpEntity entity = response.getEntity();
            // UTF-8 is the fallback when the response does not declare a charset
            return EntityUtils.toString(entity, "UTF-8");
        }
    }
}
Example 2 - fetching the HTML source with Jsoup:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class HtmlSourceExtractor {
    public static String getHtml(String url) throws Exception {
        // Jsoup fetches and parses the page; html() serializes the parsed document back to a string
        Document doc = Jsoup.connect(url).get();
        return doc.html();
    }
}
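2. Match the publication time with a regular expression
Once you have the HTML text, choose a regular expression that fits the page and use it to match the publication time. Common time formats include yyyy-MM-dd HH:mm:ss, yyyy/MM/dd HH:mm:ss, and yyyy年MM月dd日 HH:mm:ss.
Example 1 - extracting the publication time of a JD product page with a regular expression: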
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TimeExtractor {
    // Captures the content attribute that directly follows itemprop="datePublished"
    private static final Pattern JD_TIME_REGEX =
            Pattern.compile("itemprop=\"datePublished\" content=\"(.*?)\"");

    public static String extractTimeFromJdHtml(String htmlContent) {
        Matcher matcher = JD_TIME_REGEX.matcher(htmlContent);
        // Returns null when the page has no matching attribute pair
        return matcher.find() ? matcher.group(1) : null;
    }
}
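The pattern in Example 1 only matches when `content` comes immediately after `itemprop` with exactly one space between them. The following is a minimal sketch of a more tolerant variant, assuming the value sits on a `<meta>` tag (an assumption; the original pattern does not restrict the tag name). The class name `MetaDateExtractor` is hypothetical, and a regex is still no substitute for a real HTML parser on messy markup.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MetaDateExtractor {
    // Any <meta ...> tag, regardless of attribute order
    private static final Pattern META_TAG =
            Pattern.compile("<meta\\b[^>]*>", Pattern.CASE_INSENSITIVE);
    // itemprop="datePublished" with single or double quotes
    private static final Pattern ITEMPROP = Pattern.compile(
            "\\bitemprop\\s*=\\s*[\"']datePublished[\"']", Pattern.CASE_INSENSITIVE);
    // content="..." with the value captured in group 1
    private static final Pattern CONTENT = Pattern.compile(
            "\\bcontent\\s*=\\s*[\"']([^\"']*)[\"']", Pattern.CASE_INSENSITIVE);

    /** Returns the content value of the first meta tag marked datePublished. */
    static Optional<String> extract(String html) {
        Matcher tags = META_TAG.matcher(html);
        while (tags.find()) {
            String tag = tags.group();
            if (ITEMPROP.matcher(tag).find()) {
                Matcher content = CONTENT.matcher(tag);
                if (content.find()) {
                    return Optional.of(content.group(1));
                }
            }
        }
        return Optional.empty();
    }
}
```

Splitting the check into "find the tag, then inspect its attributes" keeps each pattern short and makes the attribute order irrelevant.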
Example 2 - extracting the publication time of a Zhihu question with a regular expression:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TimeExtractor {
    // \s* already absorbs line breaks and indentation around the text inside the span
    private static final Pattern ZHIHU_TIME_REGEX =
            Pattern.compile("<span class=\"MetaItem\">\\s*(.+?)\\s*</span>", Pattern.DOTALL);

    public static String extractTimeFromZhihuHtml(String htmlContent) {
        Matcher matcher = ZHIHU_TIME_REGEX.matcher(htmlContent);
        if (!matcher.find()) {
            return null;
        }
        // \s* (not \s+) so 年/月/日 are replaced even without surrounding spaces
        return matcher.group(1)
                .replaceAll("\\s*年\\s*", "-")
                .replaceAll("\\s*月\\s*", "-")
                .replaceAll("\\s*日\\s*", " ")
                // Note: dropping 上午/下午 (AM/PM) loses the half of the day for 12-hour times
                .replaceAll("上午|下午", "")
                .trim();
    }
}
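When the surrounding markup is unknown, the three common formats listed at the start of this step can also be matched directly in the text. The following is a rough sketch, not a definitive implementation: the class name `GenericTimeMatcher` is hypothetical, the pattern accepts mixed separators (for example `2021-07/15`), and it simply returns the first timestamp it finds, which may not be the publication time on a page that contains several dates.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GenericTimeMatcher {
    // \u5e74 = 年, \u6708 = 月, \u65e5 = 日 (written as escapes to avoid source-encoding issues).
    // Matches yyyy-MM-dd HH:mm:ss, yyyy/MM/dd HH:mm:ss and yyyy年MM月dd日 HH:mm:ss.
    private static final Pattern DATE_TIME = Pattern.compile(
            "(\\d{4})[-/\u5e74](\\d{1,2})[-/\u6708](\\d{1,2})\u65e5?"
            + "\\s+(\\d{1,2}):(\\d{2}):(\\d{2})");

    /** Returns the first timestamp in the text, normalized to yyyy-MM-dd HH:mm:ss. */
    static Optional<String> findFirst(String text) {
        Matcher m = DATE_TIME.matcher(text);
        if (!m.find()) {
            return Optional.empty();
        }
        // Zero-pad month, day and hour so single-digit values normalize consistently
        return Optional.of(String.format("%s-%02d-%02d %02d:%s:%s",
                m.group(1),
                Integer.parseInt(m.group(2)),
                Integer.parseInt(m.group(3)),
                Integer.parseInt(m.group(4)),
                m.group(5),
                m.group(6)));
    }
}
```

Normalizing inside the matcher means the conversion step only has to handle a single format.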
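3. Convert the time to a standard format
Convert the matched time string into a standard date-time value, for example by parsing it with Java's SimpleDateFormat class.
Example 1 - converting the JD publication time to a standard date-time: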
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

public class TimeFormatConverter {
    private static final String JD_TIME_PATTERN = "yyyy-MM-dd HH:mm:ss";

    public static Date convertJdTimeToStandardFormat(String jdTimeStr) throws ParseException {
        // SimpleDateFormat is not thread-safe, so a new instance is created per call
        SimpleDateFormat sdf = new SimpleDateFormat(JD_TIME_PATTERN);
        return sdf.parse(jdTimeStr);
    }
}
Example 2 - converting the Zhihu publication time to a standard date-time:
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

public class TimeFormatConverter {
    private static final String ZHIHU_TIME_PATTERN = "yyyy-MM-dd HH:mm:ss";

    public static Date convertZhihuTimeToStandardFormat(String zhihuTimeStr) throws ParseException {
        // SimpleDateFormat is not thread-safe, so a new instance is created per call
        SimpleDateFormat sdf = new SimpleDateFormat(ZHIHU_TIME_PATTERN);
        return sdf.parse(zhihuTimeStr);
    }
}
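As an alternative to SimpleDateFormat, the `java.time` API (Java 8 and later) offers immutable, thread-safe formatters. The sketch below is a swapped-in technique rather than the method used in the examples above: it tries each of the common formats from step 2 in turn. The class name `PublishTimeParser` is hypothetical, and the `MM`/`dd` patterns assume zero-padded months and days.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Optional;

final class PublishTimeParser {
    // DateTimeFormatter is immutable, so the instances can be shared safely.
    // \u5e74 = 年, \u6708 = 月, \u65e5 = 日; non-ASCII characters are treated as literals.
    private static final DateTimeFormatter[] FORMATS = {
        DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss"),
        DateTimeFormatter.ofPattern("yyyy/MM/dd HH:mm:ss"),
        DateTimeFormatter.ofPattern("yyyy\u5e74MM\u6708dd\u65e5 HH:mm:ss"),
    };

    /** Tries each known format in order; empty if none of them fits. */
    static Optional<LocalDateTime> parse(String text) {
        if (text == null) {
            return Optional.empty();
        }
        String trimmed = text.trim();
        for (DateTimeFormatter format : FORMATS) {
            try {
                return Optional.of(LocalDateTime.parse(trimmed, format));
            } catch (DateTimeParseException e) {
                // Not this format; fall through to the next candidate
            }
        }
        return Optional.empty();
    }
}
```

`LocalDateTime` carries no time zone; if the page's zone is known, the result can be attached to it with `atZone(...)`.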
4. Complete code
Putting it all together, the complete Java code is as follows:
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.commons.lang3.StringUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class TimeExtractor {
    private static final Pattern JD_TIME_REGEX =
            Pattern.compile("itemprop=\"datePublished\" content=\"(.*?)\"");
    // \s* already absorbs line breaks and indentation around the text inside the span
    private static final Pattern ZHIHU_TIME_REGEX =
            Pattern.compile("<span class=\"MetaItem\">\\s*(.+?)\\s*</span>", Pattern.DOTALL);
    private static final String STANDARD_TIME_PATTERN = "yyyy-MM-dd HH:mm:ss";

    public static void main(String[] args) throws Exception {
        String jdUrl = "https://item.jd.com/100011288958.html";
        String jdHtmlContent = getHtml(jdUrl);
        String jdTimeStr = extractTimeFromJdHtml(jdHtmlContent);
        System.out.println("[JD] publication time: " + jdTimeStr);
        Date jdTime = convertJdTimeToStandardFormat(jdTimeStr);
        System.out.println("[JD] converted time: " + jdTime);

        String zhihuUrl = "https://www.zhihu.com/question/471734788/answer/2009389214";
        String zhihuHtmlContent = getHtml(zhihuUrl);
        String zhihuTimeStr = extractTimeFromZhihuHtml(zhihuHtmlContent);
        System.out.println("[Zhihu] publication time: " + zhihuTimeStr);
        Date zhihuTime = convertZhihuTimeToStandardFormat(zhihuTimeStr);
        System.out.println("[Zhihu] converted time: " + zhihuTime);
    }

    // Returns the captured content value, or null when nothing matches
    public static String extractTimeFromJdHtml(String htmlContent) {
        Matcher matcher = JD_TIME_REGEX.matcher(htmlContent);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static String extractTimeFromZhihuHtml(String htmlContent) {
        Matcher matcher = ZHIHU_TIME_REGEX.matcher(htmlContent);
        if (!matcher.find()) {
            return null;
        }
        // \s* (not \s+) so 年/月/日 are replaced even without surrounding spaces
        return matcher.group(1)
                .replaceAll("\\s*年\\s*", "-")
                .replaceAll("\\s*月\\s*", "-")
                .replaceAll("\\s*日\\s*", " ")
                // Note: dropping 上午/下午 (AM/PM) loses the half of the day for 12-hour times
                .replaceAll("上午|下午", "")
                .trim();
    }

    public static Date convertJdTimeToStandardFormat(String jdTimeStr) throws ParseException {
        if (StringUtils.isBlank(jdTimeStr)) {
            return null;
        }
        // SimpleDateFormat is not thread-safe, so a new instance is created per call
        return new SimpleDateFormat(STANDARD_TIME_PATTERN).parse(jdTimeStr);
    }

    public static Date convertZhihuTimeToStandardFormat(String zhihuTimeStr) throws ParseException {
        if (StringUtils.isBlank(zhihuTimeStr)) {
            return null;
        }
        return new SimpleDateFormat(STANDARD_TIME_PATTERN).parse(zhihuTimeStr);
    }

    // Fetches the page with Jsoup and returns the serialized HTML
    public static String getHtml(String url) throws Exception {
        Document doc = Jsoup.connect(url).get();
        return doc.html();
    }
}
That concludes the guide to accurately extracting a web page's publication time in Java.