Building a Node.js Web Crawler with a GUI Using Electron

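Electron is an open-source framework that lets developers build cross-platform desktop applications with web technologies such as HTML, CSS, and JavaScript. This article walks through building a Node.js crawler with a graphical interface on top of Electron: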

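1. Install Electron

First install and set up Electron; the official Electron documentation covers the details. Node.js and npm must already be installed, and the Electron package itself is added to the project in the next step.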

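2. Create the project

Create a local folder for the crawler project, for example "electron-crawler", open a terminal in that folder, and run the following commands to initialize it. When npm init asks for the entry point, enter main.js, because Electron starts the file named in the "main" field of package.json:

npm init
npm install --save-dev electron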

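3. Add a crawling module

Add a Node.js crawling module to the project. Tools such as puppeteer (which drives a headless browser) or cheerio (which parses HTML) are common choices. This article uses puppeteer, installed as follows:

npm install --save puppeteer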

4. Build the interface

Build the interface with HTML, CSS, and JavaScript.

For example, create an index.html file:

<!DOCTYPE html>
<html>
  <head>
    <meta charset="UTF-8">
    <title>Electron Crawler App</title>
  </head>
  <body>
    <div>
      <input id="url-input" type="text" placeholder="Enter the URL of the site to crawl">
      <button id="start-btn">Start crawling</button>
    </div>
    <div>
      <ul id="result-list"></ul>
    </div>
    <script src="./renderer.js"></script>
  </body>
</html>

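Create a renderer.js file that handles the interface logic and talks to the crawling module:

const { ipcRenderer } = require('electron')
const puppeteer = require('puppeteer')

// Grab the "Start crawling" button
const startBtn = document.querySelector('#start-btn')

// Crawl function: open a headless browser with puppeteer and return the page's HTML
const startCrawling = async (url) => {
  const browser = await puppeteer.launch()
  const page = await browser.newPage()
  await page.goto(url)
  const result = await page.content()
  await browser.close()
  return result
}

// Show the crawl result in the result list as plain text.
// textContent is used instead of innerHTML so the fetched markup is not executed.
const showResult = (text) => {
  const resultList = document.querySelector('#result-list')
  const item = document.createElement('li')
  item.style.whiteSpace = 'pre-wrap'
  item.textContent = text
  resultList.replaceChildren(item)
}

// When the button is clicked, crawl the entered URL, display the result,
// and notify the main process so it can log the result
startBtn.addEventListener('click', async () => {
  const url = document.querySelector('#url-input').value.trim()
  try {
    const result = await startCrawling(url)
    showResult(result)
    ipcRenderer.send('crawling-done', result)
  } catch (err) {
    showResult(`Crawl failed: ${err.message}`)
  }
})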
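
The click handler passes whatever is typed into the input box straight to the crawler. A minimal sketch of validating that value first is shown below. The helper name normalizeUrl is hypothetical (it is not part of the original code), and the sketch assumes that input without a scheme should be treated as https://.

```javascript
// Hypothetical helper (not from the original article): validates the text
// typed into #url-input before it is handed to the crawler.
// Assumption: input without a scheme is treated as https://.
function normalizeUrl(input) {
  const text = String(input).trim()
  if (text === '') return null
  // Prepend https:// when no scheme such as "http://" is present
  const hasScheme = /^[a-z][a-z0-9+.-]*:\/\//i.test(text)
  const candidate = hasScheme ? text : `https://${text}`
  try {
    const url = new URL(candidate)
    // Only http and https pages are accepted in this sketch
    return url.protocol === 'http:' || url.protocol === 'https:' ? url.href : null
  } catch {
    return null // the text could not be parsed as a URL
  }
}
```

In the click handler, the crawl could be skipped and a message shown whenever normalizeUrl returns null.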

5. Start the application

Create the main-process file main.js, which starts the application and configures the window. The window loads index.html when it is created.

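const { app, BrowserWindow, ipcMain } = require('electron')

// Window object
let mainWindow = null

// Create the window
function createWindow () {
  mainWindow = new BrowserWindow({
    width: 800,
    height: 600,
    webPreferences: {
      // renderer.js calls require(), which needs Node integration enabled and
      // context isolation disabled (isolation is on by default since Electron 12).
      // Only load trusted local pages with these settings.
      nodeIntegration: true,
      contextIsolation: false
    }
  })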

  // Load the interface
  mainWindow.loadFile('index.html')

  // Open the developer tools
  mainWindow.webContents.openDevTools()

  // Window closed event
  mainWindow.on('closed', function () {
    mainWindow = null
  })
}

// Application is ready
app.on('ready', function () {
  createWindow()
})

// Quit when all windows are closed (except on macOS)
app.on('window-all-closed', function () {
  if (process.platform !== 'darwin') {
    app.quit()
  }
})

// Receive the crawl result sent by the renderer process and log it
ipcMain.on('crawling-done', (event, result) => {
  console.log(result)
})
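
Running puppeteer inside the renderer depends on Node integration being enabled there. An alternative layout, sketched here under assumptions, keeps the crawl in the main process and exposes it over a request/response channel built on Electron's ipcMain.handle and ipcRenderer.invoke. The helper name registerCrawlHandler, the channel name 'crawl', and the { ok, data, error } reply shape are hypothetical choices, not taken from the original. The ipcMain object is passed in as a parameter, so the helper itself does not import Electron.

```javascript
// Hypothetical helper: registers a request/response channel named 'crawl'.
// `ipc` is expected to be Electron's ipcMain (any object with a compatible
// handle(channel, listener) method works); `crawl` is an async function
// such as startCrawling that takes a URL and resolves to a string.
function registerCrawlHandler(ipc, crawl) {
  ipc.handle('crawl', async (_event, url) => {
    try {
      return { ok: true, data: await crawl(url) }
    } catch (err) {
      // Reply with a plain object so the error message reaches the caller intact
      return { ok: false, error: String(err && err.message ? err.message : err) }
    }
  })
}
```

In main.js this could be wired up as registerCrawlHandler(ipcMain, startCrawling), and the renderer would then request a crawl with await ipcRenderer.invoke('crawl', url) and inspect the ok field of the reply.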

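Example 1: Crawling the Baidu home page

First, modify the startCrawling function in renderer.js so that it always loads the Baidu home page instead of a URL passed in by the caller.

// Crawl the Baidu home page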
const startCrawling = async () => {
  const browser = await puppeteer.launch()
  const page = await browser.newPage()
  await page.goto('https://www.baidu.com')
  const result = await page.content()
  await browser.close()
  return result
}
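
In the function above, browser.close() is skipped if page.goto rejects (for example when the network is unreachable), which would leave a browser process running. A minimal sketch of guarding against that with try/finally follows. withBrowser is a hypothetical helper name, and the launch function is passed in as a parameter so the sketch itself does not depend on puppeteer.

```javascript
// Hypothetical helper: runs `task` with a freshly launched browser and
// always closes the browser afterwards, whether `task` succeeds or throws.
// `launch` is any async function returning an object with a close() method,
// for example () => puppeteer.launch().
async function withBrowser(launch, task) {
  const browser = await launch()
  try {
    return await task(browser)
  } finally {
    await browser.close()
  }
}

// Usage with puppeteer (assumes `puppeteer` has been required as in renderer.js):
// const startCrawling = () => withBrowser(() => puppeteer.launch(), async (browser) => {
//   const page = await browser.newPage()
//   await page.goto('https://www.baidu.com')
//   return page.content()
// })
```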

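Then start the Electron application to test it.

Run the following command from the project folder:

npx electron .

Click the start button in the application window. Because the URL is hard-coded in this example, the value in the input box is ignored: the HTML of the Baidu home page is fetched and displayed in the result list.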

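Example 2: Crawling housing price information

Modify the startCrawling function in renderer.js to parse the HTML with cheerio, a selector library, and extract housing price information for Shenzhen. cheerio has to be installed first with npm install --save cheerio. The selectors below depend on the markup of the target page and may need adjusting if the site changes.

const cheerio = require('cheerio')

// Crawl housing price information for Shenzhen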
const startCrawling = async () => {
  const browser = await puppeteer.launch()
  const page = await browser.newPage()
  await page.goto('http://fangjia.szhome.com/')
  const result = await page.content()

  const $ = cheerio.load(result)
  const resultArr = []
  $('.table_ul li').each((i, el) => {
    const title = $(el).find('.city_name a').text()
    const price = $(el).find('.average_price').text()
    resultArr.push(`${title}: ${price}`)
  })

  await browser.close()
  return resultArr.join('\n')
}
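
The function above returns one "name: price" entry per line in a single string. A sketch of splitting that string back into individual entries, so that each one can be shown as its own list item, is given below. toListItems is a hypothetical helper name, not part of the original code.

```javascript
// Hypothetical helper: turns the newline-joined crawl result into an array
// of trimmed, non-empty entries, one per future <li>.
function toListItems(result) {
  return String(result)
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
}

// Usage in renderer.js (browser-only, so shown as comments):
// const resultList = document.querySelector('#result-list')
// resultList.replaceChildren(...toListItems(result).map((entry) => {
//   const li = document.createElement('li')
//   li.textContent = entry // textContent avoids interpreting entries as HTML
//   return li
// }))
```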