Node.js + Jade: Crawling All Articles of a Blog and Generating Static HTML Files

This walkthrough shows how to crawl a blog's articles with Node.js and render them into static HTML files with the Jade template engine.

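1. Preparation

A few things need to be in place before we start.

Install Node.js

First, install Node.js. Download the installer for your operating system from the official Node.js website and follow the prompts.

Initialize a Node project

Initialize a Node project from the command line with npm init: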

npm init

During initialization you are asked for the project name, version, entry file, and similar details; fill them in as prompted.

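Install the dependencies

Next, install the dependencies we need:

express: a web framework for Node.js that provides basic web development features;
request: an HTTP client library for Node.js, used to send HTTP requests (the package has been deprecated since 2020, but it still works for these examples);
cheerio: an HTML parsing library for Node.js with a jQuery-like API, used to extract data from HTML documents;
jade: a template engine used to generate HTML documents (the project was later renamed Pug).

Install all of them from the command line with npm install: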

npm install express request cheerio jade --save

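2. Crawling the site and generating static HTML files

With the preparation done, we can implement the two pieces: fetching data from the site and writing it out as static HTML.

Implementing the crawler

The crawler relies on request and cheerio. request sends the HTTP request and returns the page's HTML; cheerio parses that HTML so we can select the data we need.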

const request = require('request');
const cheerio = require('cheerio');

const url = 'http://www.example.com';

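request(url, (error, response, body) => {
  // Report failures instead of silently doing nothing
  if (error || response.statusCode !== 200) {
    console.error('Request failed:', error || response.statusCode);
    return;
  }
  const $ = cheerio.load(body);
  // Get the page title
  const title = $('title').text();
  console.log(title);
});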

In the code above, request fetches the HTML document at the given URL, and cheerio parses it and reads the page title.
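Because the request package is deprecated, the download step can also be written without it. The following is only a sketch, assuming Node.js 18 or later, where fetch is available as a global. The function name fetchHtml and the fetchImpl parameter are hypothetical and not part of the original example; fetchImpl exists only so the function can be tried without network access.

```javascript
// Sketch: download a page with the fetch API built into Node.js 18+.
// `fetchHtml` and `fetchImpl` are illustrative names (assumptions).
// `fetchImpl` defaults to the global fetch and can be replaced by a stub.
async function fetchHtml(url, fetchImpl = globalThis.fetch) {
  const response = await fetchImpl(url);
  // `response.ok` is true for status codes in the 200-299 range
  if (!response.ok) {
    throw new Error('Request to ' + url + ' failed with status ' + response.status);
  }
  // Resolves to the response body as a string
  return response.text();
}

// Usage (requires network access):
// fetchHtml('http://www.example.com').then((html) => console.log(html.length));
```

The returned string can be passed to cheerio.load in the same way as the body argument in the callback version.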

Implementing static HTML generation

To generate the static HTML files we use the Jade template engine, which renders our data into a template and returns the resulting HTML.

const jade = require('jade');
const fs = require('fs');

const data = {
  title: 'example',
  content: 'Hello, World!'
};

const html = jade.renderFile('./template.jade', data);

fs.writeFile('example.html', html, (err) => {
  if (err) throw err;
  console.log('HTML file generated!');
});

In the code above, Jade renders the data into the template and returns an HTML string, which the fs module then writes to a file.
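The article does not show the contents of template.jade. The sketch below writes out one possible version; the template bodies are assumptions chosen to match the data shapes used in this article ({ title, content } for a single page, { articles: [...] } for a list), and writeTemplate is a hypothetical helper name.

```javascript
const fs = require('fs');

// Hypothetical template for data shaped like { title, content }.
// Indentation is significant in Jade, so each line is listed explicitly.
const pageTemplate = [
  'doctype html',
  'html',
  '  head',
  "    meta(charset='utf-8')",
  '    title= title',
  '  body',
  '    h1= title',
  '    p= content'
].join('\n');

// Hypothetical template for data shaped like { articles: [{ title, link, content }] }.
const listTemplate = [
  'doctype html',
  'html',
  '  head',
  "    meta(charset='utf-8')",
  '    title Articles',
  '  body',
  '    each article in articles',
  '      h2',
  '        a(href=article.link)= article.title',
  '      p= article.content'
].join('\n');

// Write a template source string to disk so jade.renderFile can load it.
function writeTemplate(filePath, source) {
  fs.writeFileSync(filePath, source + '\n', 'utf8');
}

// Usage: writeTemplate('./template.jade', pageTemplate);
```

Which template is written to ./template.jade has to match the object passed to jade.renderFile, since the template reads its variables directly from that object.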

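3. Examples

Below are two examples that use Node.js and Jade to crawl blog articles and generate static HTML files.

Example 1

This is a simple blog crawler that collects the articles listed on a blog's index page and generates a single HTML file from them.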

const request = require('request');
const cheerio = require('cheerio');
const jade = require('jade');
const fs = require('fs');

const url = 'http://www.example.com';
const base = 'http://www.example.com';
const data = { articles: [] };

request(url, (error, response, body) => {
  // Report failures instead of silently doing nothing
  if (error || response.statusCode !== 200) {
    console.error('Request failed:', error || response.statusCode);
    return;
  }
  const $ = cheerio.load(body);
  // Collect the article list
  const $articles = $('article');
  $articles.each((i, article) => {
    const $a = $(article).find('.title a');
    const $p = $(article).find('.excerpt p');
    const title = $a.text();
    // Resolve the href against the base URL so relative and absolute links both work
    const link = new URL($a.attr('href'), base).href;
    const content = $p.text();
    data.articles.push({ title, link, content });
  });
  // Render the template and write the HTML file
  const html = jade.renderFile('./template.jade', data);
  fs.writeFile('example.html', html, (err) => {
    if (err) throw err;
    console.log('HTML file generated!');
  });
});

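In the code above, request and cheerio fetch the blog's HTML document and extract each article's title, link, and excerpt. Jade then renders the collected articles into the template and the result is written to an HTML file. The selectors (article, .title a, .excerpt p) depend on the markup of the target blog and must be adapted to it.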
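Building each article link deserves some care: joining a base URL and an href as plain strings only gives a valid result when the href is a root-relative path such as /post/1. The URL class built into Node.js resolves relative and absolute hrefs alike. The helper name resolveLink below is hypothetical; this is a minimal sketch rather than part of the original program.

```javascript
// Resolve an href found in a page against that page's URL.
// `resolveLink` is an illustrative helper name (an assumption).
function resolveLink(href, pageUrl) {
  // An <a> element without an href attribute yields undefined in cheerio
  if (!href) return null;
  try {
    return new URL(href, pageUrl).href;
  } catch (err) {
    // new URL throws a TypeError when the input cannot be parsed
    return null;
  }
}

console.log(resolveLink('/post/1', 'http://www.example.com'));
// http://www.example.com/post/1
console.log(resolveLink('post/2', 'http://www.example.com/blog/'));
// http://www.example.com/blog/post/2
console.log(resolveLink('http://other.example.org/a', 'http://www.example.com'));
// http://other.example.org/a
```

Returning null for unusable hrefs lets the caller skip such entries instead of writing broken links into the generated page.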

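Example 2

This is a more involved crawler that collects articles from several blogs and generates one HTML file per category. It additionally depends on the async library, which can be installed with npm install async --save.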

const request = require('request');
const cheerio = require('cheerio');
const jade = require('jade');
const fs = require('fs');
const async = require('async');

const blogs = [
  { name: 'example1', url: 'http://www.example1.com', category: 'tech' },
  { name: 'example2', url: 'http://www.example2.com', category: 'food' },
  { name: 'example3', url: 'http://www.example3.com', category: 'fashion' }
];

const data = {};

async.each(blogs, (blog, callback) => {
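  request(blog.url, (error, response, body) => {
    if (error || response.statusCode !== 200) {
      // Always pass an Error, so a non-200 response is not treated as success
      return callback(
        error || new Error(blog.url + ' responded with status ' + response.statusCode)
      );
    }
    const $ = cheerio.load(body);
    const $articles = $('article');
    // Create the category bucket on first use
    if (!data[blog.category]) {
      data[blog.category] = { name: blog.category, articles: [] };
    }
    $articles.each((i, article) => {
      const $a = $(article).find('.title a');
      const $p = $(article).find('.excerpt p');
      const title = $a.text();
      // Resolve the href against the blog URL so relative and absolute links both work
      const link = new URL($a.attr('href'), blog.url).href;
      const content = $p.text();
      data[blog.category].articles.push({ title, link, content });
    });
    callback();
  });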
}, (err) => {
  if (err) {
    console.error(err);
  } else {
    for (const category in data) {
      const html = jade.renderFile(
        './template.jade',
        { category: data[category] }
      );
      fs.writeFile(category + '.html', html, (err) => {
        if (err) throw err;
        console.log(category + ' HTML file generated!');
      });
    }
  }
});

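In the code above, the async library runs the crawl tasks for the individual blogs and calls the final callback once all of them have finished, or as soon as one of them fails. cheerio extracts each article's title, link, and excerpt, and the articles are grouped by category. For every category, Jade renders the template and the result is written to a file named after that category. Note that the template here receives a category object (with name and articles), so it differs from the one used in Example 1.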
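The grouping and file naming steps can be separated from the network code, which makes them easier to check on their own. The sketch below assumes each article object carries a category field; the function names groupByCategory and outputFileName are hypothetical, and the file name rule is one possible choice rather than what the original program does.

```javascript
// Group articles by their `category` field.
// Returns an object shaped like { tech: { name: 'tech', articles: [...] }, ... }.
function groupByCategory(articles) {
  const groups = {};
  for (const article of articles) {
    const key = article.category;
    if (!groups[key]) {
      groups[key] = { name: key, articles: [] };
    }
    groups[key].articles.push(article);
  }
  return groups;
}

// Derive a file name from a category name.
// Every run of characters other than a-z, 0-9, underscore, or hyphen
// becomes a single hyphen, so the name cannot contain path separators.
function outputFileName(category) {
  const safe = String(category).toLowerCase().replace(/[^a-z0-9_-]+/g, '-');
  return safe + '.html';
}

const groups = groupByCategory([
  { category: 'tech', title: 'a' },
  { category: 'food', title: 'b' },
  { category: 'tech', title: 'c' }
]);
console.log(Object.keys(groups));            // [ 'tech', 'food' ]
console.log(outputFileName('Food & Drink')); // food-drink.html
```

Sanitizing the name matters once categories come from scraped pages rather than from a fixed list, because the category is used directly as part of a file path.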
