A Complete Example of a Scheduled Crawler in Node.js

This is a complete walkthrough of building a scheduled crawler in Node.js.

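Introduction

This article walks through a complete example of a scheduled crawler built with Node.js. It covers:
- how to implement a one-off crawler
- how to implement a scheduled task
- in particular, how to implement scheduled tasks with node-schedule
- an analysis of the code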

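Implementing a One-Off Crawler

The crawler in this article relies on the third-party library cheerio. cheerio parses an HTML page into a DOM-like structure that can be queried and manipulated quickly, and its API closely resembles jQuery's.

The example crawls weather forecast data, which comes from the China Meteorological Administration.

First, install the dependencies:

npm install axios cheerio

Then fetch the page with axios, extract the needed data with cheerio, and print the result as an object. The code is as follows: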

// Import the dependencies
const axios = require('axios');
const cheerio = require('cheerio');

// Fetch the page
axios.get('http://www.weather.com.cn/weather/101200101.shtml').then(res => {

    // Parse the HTML into a queryable document
    const $ = cheerio.load(res.data);

    // Extract the needed data
    const city = $('.crumbs.fl > a').eq(1).text(); // city
    const temperature = $('.tem > span').eq(0).text(); // temperature
    const weather = $('.wea').eq(0).text(); // weather

    // Print the data as an object
    console.log({
        city,
        temperature,
        weather,
    });
}).catch(err => console.error(err));

Running the code above prints a result like the following (the values depend on the live page):

{
    city: '武汉',
    temperature: '6℃',
    weather: '多云转晴',
}

This completes a simple one-off crawler.

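Implementing a Scheduled Task

First, get familiar with the basic usage of node-schedule; the official documentation is a good reference.

Next, install the dependency:

npm install node-schedule

Then the following code runs a task once every minute: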

const schedule = require('node-schedule');

schedule.scheduleJob('*/1 * * * *', function() {
    console.log('The answer to life, the universe, and everything!');
});

The code above runs once every minute and prints "The answer to life, the universe, and everything!".
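The schedule strings passed to scheduleJob are cron-style expressions. node-schedule reads them as five fields (minute, hour, day of month, month, day of week), with an optional sixth field for seconds placed at the front. The sketch below only labels the fields to make that layout visible. labelCronFields is a hypothetical helper written for this illustration, not part of node-schedule, and it does not validate the field values.

```javascript
// Field names in the order node-schedule reads them.
// The leading "second" field is optional; without it an expression has five fields.
const FIELD_NAMES = ['second', 'minute', 'hour', 'dayOfMonth', 'month', 'dayOfWeek'];

// Hypothetical helper for illustration only (not a node-schedule API).
function labelCronFields(expression) {
    const parts = expression.trim().split(/\s+/);
    if (parts.length !== 5 && parts.length !== 6) {
        throw new Error(`expected 5 or 6 fields, got ${parts.length}`);
    }
    // A five-field expression starts at "minute", so skip the "second" name.
    const names = FIELD_NAMES.slice(6 - parts.length);
    const labeled = {};
    names.forEach((name, i) => {
        labeled[name] = parts[i];
    });
    return labeled;
}

console.log(labelCronFields('*/1 * * * *'));    // minute is '*/1': every minute
console.log(labelCronFields('*/10 * * * * *')); // second is '*/10': every 10 seconds
console.log(labelCronFields('0 6 * * *'));      // minute '0', hour '6': 06:00 every day
```

Counting the fields is a quick way to tell whether an expression is scheduled by the minute or by the second.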

Inside a scheduled task we can run the crawler again to fetch data, then append the result to a local file with fs.writeFile. Here is a code example:

const fs = require('fs');
const axios = require('axios');
const cheerio = require('cheerio');
const schedule = require('node-schedule');

const filePath = './data.json';

const spider = () => {
    axios.get('http://www.weather.com.cn/weather/101200101.shtml').then(res => {
        const $ = cheerio.load(res.data);
        const city = $('.crumbs.fl > a').eq(1).text();
        const temperature = $('.tem > span').eq(0).text();
        const weather = $('.wea').eq(0).text();

        const data = {
            city,
            temperature,
            weather,
            date: new Date(),
        };

        // Append one JSON object per line so the records stay separable
        fs.writeFile(filePath, JSON.stringify(data) + '\n', { flag: 'a' }, function (err) {
            if (err) {
                console.error(err);
            } else {
                console.log('success');
            }
        });
    }).catch(err => console.error(err));
};

const job = schedule.scheduleJob('*/10 * * * * *', spider); // run the task every 10 seconds

The code above runs the task every 10 seconds and appends the crawled data as JSON to the ./data.json file.

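Code Analysis

The code above pulls in three dependencies: axios, cheerio, and node-schedule.
In the one-off crawler, axios fetches the page and cheerio extracts the data.
In the scheduled task, schedule.scheduleJob() registers the job; the job calls the crawler function and writes the crawled data to a local file.
In the crawler function spider(), the data is collected in the data variable and written to a text file as JSON. Because each run should add its data to the end of the file, { flag: 'a' } sets the write mode to append. Note that the file then holds a sequence of JSON objects rather than one valid JSON document, so it has to be read back record by record.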
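One common way to keep an append-only file easy to read back is to store one JSON object per line. The following is a minimal sketch of that layout using only Node's built-in fs, os, and path modules. appendRecord and readRecords are hypothetical helper names, the records are made-up sample values, and the synchronous calls are used only to keep the demo short.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Append one record as a single line of JSON.
function appendRecord(filePath, record) {
    fs.appendFileSync(filePath, JSON.stringify(record) + '\n');
}

// Read every record back by parsing the file line by line.
function readRecords(filePath) {
    return fs.readFileSync(filePath, 'utf8')
        .split('\n')
        .filter(line => line.trim() !== '')
        .map(line => JSON.parse(line));
}

// Demo in a temporary directory with made-up sample values.
const demoDir = fs.mkdtempSync(path.join(os.tmpdir(), 'weather-'));
const demoFile = path.join(demoDir, 'data.jsonl');
appendRecord(demoFile, { city: '武汉', temperature: '6℃', weather: '多云转晴' });
appendRecord(demoFile, { city: '武汉', temperature: '8℃', weather: '晴' });

const records = readRecords(demoFile);
console.log(records.length); // 2
```

fs.appendFileSync behaves like writing with the 'a' flag: it creates the file if it does not exist and otherwise adds to the end.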

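Examples

Below are two complete examples of scheduled crawlers in Node.js:

Example 1

What it does: every 10 seconds, fetch the Toutiao news home page and print its page title to the console.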

const axios = require('axios');
const cheerio = require('cheerio');
const schedule = require('node-schedule');

const spider = () => {
    axios.get('https://www.toutiao.com/').then(res => {
        const $ = cheerio.load(res.data);
        const title = $('title').text(); // get the page title
        console.log(`${new Date()}: ${title}`);
    }).catch(err => console.error(err));
};

const job = schedule.scheduleJob('*/10 * * * * *', spider); // run the task every 10 seconds

Example 2

What it does: every day at 6:00 a.m., crawl the nationwide epidemic statistics once and append them as JSON to the ./data.json file.

const fs = require('fs');
const axios = require('axios');
const cheerio = require('cheerio');
const schedule = require('node-schedule');

const filePath = './data.json';

const spider = () => {
    axios.get('https://ncov.dxy.cn/ncovh5/view/pneumonia').then(res => {
        const $ = cheerio.load(res.data);
        const dataScript = $('#getStatisticsService').next().html();
        const regExp = /window\.getStatisticsService\s=\s(\{.+\})\scatch/; // extract the data with a regular expression
        const result = JSON.parse(dataScript.match(regExp)[1]).statistics; // parse the data

        const data = {
            confirmedCount: result.confirmedCount, // confirmed cases
            suspectedCount: result.suspectedCount, // suspected cases
            curedCount: result.curedCount, // recovered cases
            deadCount: result.deadCount, // deaths
            date: new Date(), // time of this crawl
        };
        };

        // Append one JSON object per line so the records stay separable
        fs.writeFile(filePath, JSON.stringify(data) + '\n', { flag: 'a' }, function (err) {
            if (err) {
                console.error(err);
            } else {
                console.log('success');
            }
        });
    }).catch(err => console.error(err));
};

const job = schedule.scheduleJob('0 6 * * *', spider); // run the task every day at 6:00 a.m.
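The regular-expression step in the second example can be tried in isolation, without any network request. The sketch below assumes the data sits in an inline script of the form try { window.NAME = {...} }catch(e){}; the sample string is made up for illustration and the real page may be laid out differently. extractAssignedJson is a hypothetical helper, its pattern is slightly more tolerant of whitespace than the one above, and it assumes the name is a plain identifier with no regex metacharacters. It returns null instead of throwing when nothing matches.

```javascript
// Made-up sample of an inline script body; the real page content may differ.
const sampleScript =
    'try { window.getStatisticsService = {"confirmedCount":100,"curedCount":80,"deadCount":5} }catch(e){}';

// Hypothetical helper: return the object literal assigned to window.<name>,
// or null when the script text does not match the assumed layout.
function extractAssignedJson(scriptText, name) {
    // Capture from the first "{" after the assignment up to the "}" that
    // precedes the closing brace of the surrounding try block.
    const pattern = new RegExp('window\\.' + name + '\\s*=\\s*(\\{.*\\})\\s*\\}\\s*catch', 's');
    const match = scriptText.match(pattern);
    if (!match) {
        return null;
    }
    return JSON.parse(match[1]);
}

const stats = extractAssignedJson(sampleScript, 'getStatisticsService');
console.log(stats.confirmedCount); // 100
console.log(extractAssignedJson('no data here', 'getStatisticsService')); // null
```

Checking for null before reading fields gives a clearer failure than a TypeError when the page layout changes.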

That concludes this complete guide to building a scheduled crawler with Node.js. I hope it helps.

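Unless otherwise noted, articles on this site are original. If you repost this article, please credit the source: A Complete Example of a Scheduled Crawler in Node.js - Python技术站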
