Python 3 Web Crawler Example: Scraping Data and Storing It in a MySQL Database

Introduction

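This guide shows how to write a simple crawler in Python 3 that fetches a web page, extracts data from it, and stores the results in a MySQL database.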

To follow along, you need basic knowledge of Python 3 and MySQL, plus three Python libraries installed: requests, beautifulsoup4, and pymysql.

Prerequisites

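Install Python 3: download the latest Python 3 release from the official Python website and follow the installer.
Install MySQL: download the latest MySQL release from the official MySQL website and follow the installer. Once it is installed, create a database named crawl and, inside it, a table named news with the following structure: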

CREATE TABLE `news` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `title` varchar(255) NOT NULL,
  `url` varchar(255) NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

The table has three columns: id (primary key), title (news headline), and url (news link).

Install the Python libraries: run the following commands on the command line:

pip install requests
pip install beautifulsoup4
pip install pymysql
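To confirm that the installation worked, a quick check along the following lines may help. It is only a minimal sketch: the helper name check_libraries is made up for this example, and it relies on the fact that the beautifulsoup4 package is imported under the module name bs4.

```python
import importlib.util

# Import names of the three libraries. Note that the beautifulsoup4
# package is imported as "bs4", not under its pip name.
REQUIRED_MODULES = ['requests', 'bs4', 'pymysql']


def check_libraries(modules=REQUIRED_MODULES):
    """Return a dict mapping each module name to True if it can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in modules}


if __name__ == '__main__':
    for name, installed in check_libraries().items():
        print(name, 'OK' if installed else 'MISSING')
```

Any library reported as MISSING can be installed with the matching pip command above.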

The crawler program

The full source code of the crawler is:

import requests
from bs4 import BeautifulSoup
import pymysql

# Build the request headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.89 Safari/537.36'
}

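# Fetch a page and return its content as text
def crawl(url):
    # A timeout keeps the crawler from hanging on an unresponsive server
    response = requests.get(url, headers=headers, timeout=10)
    # Fail early on HTTP errors instead of parsing an error page
    response.raise_for_status()
    response.encoding = response.apparent_encoding
    return response.text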

# Parse the page and return a list of news items
def parse(html):
    soup = BeautifulSoup(html, 'html.parser')
    news_list = []
    for item in soup.select('div.news-item'):
        link = item.select_one('h3 a')
        # Skip items that have no headline link or no href
        if link is None or not link.get('href'):
            continue
        news_list.append({'title': link.text.strip(), 'url': link['href']})
    return news_list

# Save the list of news items to the database
def save(news_list):
    conn = pymysql.connect(host='localhost', user='root', password='123456', database='crawl', charset='utf8')
    try:
        with conn.cursor() as cursor:
            sql = 'INSERT INTO news (title, url) VALUES (%s, %s)'
            for news in news_list:
                cursor.execute(sql, (news['title'], news['url']))
            # Commit once so the whole batch is written in one transaction
            conn.commit()
    finally:
        conn.close()

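# Crawl the page, parse the news list, and save it to the database
def main():
    # Example URL; adjust it and the selectors in parse() to the page you actually crawl
    url = 'https://www.baidu.com/s?wd=news'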
    html = crawl(url)
    news_list = parse(html)
    save(news_list)

if __name__ == '__main__':
    main()
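The selectors in parse() imply a particular page layout: each news entry is a div with class news-item that contains an h3 whose link holds the headline and the address. The sketch below runs the same selectors against a small hand-written HTML snippet (invented for illustration, not taken from any real site). It also resolves relative links with urllib.parse.urljoin, a step the program above does not perform. The function name parse_with_base and the base URL are assumptions.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hand-written sample markup that matches the selectors used by parse()
SAMPLE_HTML = """
<div class="news-item"><h3><a href="/a/1.html"> First headline </a></h3></div>
<div class="news-item"><h3><a href="https://example.com/b/2.html">Second headline</a></h3></div>
<div class="news-item"><h3>No link here</h3></div>
"""


def parse_with_base(html, base_url):
    """Extract title/url pairs and turn relative hrefs into absolute URLs."""
    soup = BeautifulSoup(html, 'html.parser')
    news_list = []
    for item in soup.select('div.news-item'):
        link = item.select_one('h3 a')
        if link is None or not link.get('href'):
            continue  # skip entries without a usable link
        news_list.append({
            'title': link.text.strip(),
            'url': urljoin(base_url, link['href']),
        })
    return news_list


if __name__ == '__main__':
    for news in parse_with_base(SAMPLE_HTML, 'https://example.com/news/'):
        print(news['title'], news['url'])
```

Storing absolute URLs makes the url column directly usable later, and it keeps a duplicate check on url from treating /a/1.html and its absolute form as different links.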

Examples

The following two code examples show how to add only newly crawled content to the database:

Example 1: Updating the database based on a timestamp

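import time

import pymysql

# Save the list of news items to the database.
# Assumes the news table has an extra integer column add_time (not in the CREATE TABLE above).
def save(news_list):
    conn = pymysql.connect(host='localhost', user='root', password='123456', database='crawl', charset='utf8')
    try:
        with conn.cursor() as cursor:
            # Look up the timestamp of the most recent record
            cursor.execute('SELECT MAX(add_time) FROM news')
            # MAX() returns NULL on an empty table, so fall back to 0
            last_add_time = cursor.fetchone()[0] or 0
            # One timestamp for the whole batch: the time of this crawl
            add_time = int(time.time())
            # add_time is the crawl time, not a publication time, so this check only
            # skips a batch run in the same second as the previous one; it does not
            # detect duplicate links (Example 2 handles that case)
            if add_time > last_add_time:
                sql = 'INSERT INTO news (title, url, add_time) VALUES (%s, %s, %s)'
                for news in news_list:
                    cursor.execute(sql, (news['title'], news['url'], add_time))
                conn.commit()
    finally:
        conn.close()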
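A timestamp check is most useful when each item carries its own time, such as a publication time taken from the page. The following is a minimal sketch under that assumption: the published_at key (a Unix timestamp) and the helper filter_newer are hypothetical and do not appear in the program above, whose parse() extracts only title and url.

```python
def filter_newer(news_list, last_add_time):
    """Keep only items whose published_at is later than last_add_time.

    published_at is a hypothetical per-item Unix timestamp; items that
    lack it are dropped because their age cannot be determined.
    """
    last_add_time = last_add_time or 0  # MAX() on an empty table yields None
    return [
        news for news in news_list
        if news.get('published_at') is not None
        and news['published_at'] > last_add_time
    ]


if __name__ == '__main__':
    sample = [
        {'title': 'old', 'url': 'https://example.com/1', 'published_at': 100},
        {'title': 'new', 'url': 'https://example.com/2', 'published_at': 300},
    ]
    for news in filter_newer(sample, 200):
        print(news['title'])
```

With such a filter, save() would insert only the returned items and store each item's own timestamp in add_time.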

Example 2: Checking whether a news link already exists in the database

# Save the list of news items to the database
def save(news_list):
    conn = pymysql.connect(host='localhost', user='root', password='123456', database='crawl', charset='utf8')
    try:
        with conn.cursor() as cursor:
            select_sql = 'SELECT COUNT(*) FROM news WHERE url=%s'
            insert_sql = 'INSERT INTO news (title, url) VALUES (%s, %s)'
            # For each news item, check whether its link is already stored
            for news in news_list:
                cursor.execute(select_sql, (news['url'],))
                count = cursor.fetchone()[0]
                # Insert the item only if its link is not in the database yet
                if count == 0:
                    cursor.execute(insert_sql, (news['title'], news['url']))
            # Commit once after the whole batch has been processed
            conn.commit()
    finally:
        conn.close()
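An alternative to the SELECT-then-INSERT check is to let the database enforce uniqueness itself. The sketch below illustrates the idea with Python's built-in sqlite3 module and an in-memory database so that it runs without a MySQL server; it is a stand-in, not the MySQL code itself. In MySQL the corresponding pieces would be a UNIQUE index on url and INSERT IGNORE in place of SQLite's INSERT OR IGNORE. The names make_demo_db, count_rows, and save_unique are assumptions made for this example.

```python
import sqlite3


def make_demo_db():
    """Create an in-memory table shaped like news, with a UNIQUE url column."""
    conn = sqlite3.connect(':memory:')
    conn.execute(
        'CREATE TABLE news ('
        'id INTEGER PRIMARY KEY AUTOINCREMENT, '
        'title TEXT NOT NULL, '
        'url TEXT NOT NULL UNIQUE)'
    )
    return conn


def count_rows(conn):
    return conn.execute('SELECT COUNT(*) FROM news').fetchone()[0]


def save_unique(conn, news_list):
    """Insert news items, skipping URLs already stored; return rows added."""
    before = count_rows(conn)
    conn.executemany(
        'INSERT OR IGNORE INTO news (title, url) VALUES (?, ?)',
        [(news['title'], news['url']) for news in news_list],
    )
    conn.commit()
    return count_rows(conn) - before


if __name__ == '__main__':
    conn = make_demo_db()
    batch = [
        {'title': 'A', 'url': 'https://example.com/a'},
        {'title': 'A again', 'url': 'https://example.com/a'},
    ]
    print(save_unique(conn, batch))
    conn.close()
```

Compared with Example 2, this needs one statement per item instead of two, and the uniqueness rule holds even if several crawlers write to the table at the same time.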

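Unless otherwise noted, articles on this site are original. If you repost this article, please credit the source: Python 3 Web Crawler Example: Scraping Data and Storing It in a MySQL Database - Python技术站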
