Quick Start with Selenium Web Scraping: Scrape Anything

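Introduction

Selenium is a browser automation tool. Besides browser testing, it can also be used for web scraping. A Selenium driver can simulate what a human user does on a web page, for example clicking links, scrolling, filling in forms, and executing JavaScript. Selenium can automate all major browsers, including Chrome, Firefox, Edge, and Safari.

In web scraping, Selenium is useful for sites whose content is loaded dynamically by JavaScript or is only available after logging in. Because it drives a real browser the way a person would, it can get past some anti-scraping mechanisms, which is why it is widely used in scrapers. This article covers the basic steps for scraping with Selenium and the points to watch out for.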

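Basic Steps for Using Selenium

Step 1: Install the browser driver

Selenium needs a browser driver to work, for example ChromeDriver for Chrome or geckodriver for Firefox. Each browser vendor publishes download links for its driver, and the driver version has to match the installed browser version. Selenium 4.6 and later also ship with Selenium Manager, which can download a matching driver automatically when none is found.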
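
As a quick sanity check before running a scraper, the sketch below looks for a driver executable on PATH using only the standard library. The helper name find_driver and the default executable name chromedriver are assumptions made for illustration. On Selenium 4.6 or later a None result is not necessarily a problem, since Selenium Manager may fetch a driver on its own.

```python
import shutil


def find_driver(executable="chromedriver"):
    """Return the full path of a driver executable on PATH, or None if absent."""
    return shutil.which(executable)


if __name__ == "__main__":
    path = find_driver()
    if path:
        print("Driver found at:", path)
    else:
        print("No chromedriver on PATH; download one or rely on Selenium Manager.")
```

For Firefox, the same check would be find_driver("geckodriver").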

Step 2: Install the Selenium library

Install it with pip: pip install selenium
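
To confirm that the installation worked, and to see which Selenium release you are on, something like the following may help. It is a minimal sketch using the standard library's importlib.metadata (Python 3.8+); the function name installed_version is an assumption for illustration.

```python
from importlib import metadata


def installed_version(package="selenium"):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version()
    print("selenium version:", version if version else "not installed")
```

The version matters because the locator API changed between releases: the find_element_by_* helpers seen in older tutorials were removed in Selenium 4.3.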

Step 3: Write the code

Taking Baidu search results as an example, the code looks like this:

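from selenium import webdriver
from selenium.webdriver.common.by import By

# Start the Chrome browser
driver = webdriver.Chrome()

# Let element lookups retry for up to 10 seconds while the page loads
driver.implicitly_wait(10)

# Open Baidu
driver.get('https://www.baidu.com')

# Find the search box, type a keyword, and submit
# (find_element(By.ID, ...) replaces find_element_by_id, removed in Selenium 4.3)
search_box = driver.find_element(By.ID, 'kw')
search_box.send_keys('python')
search_box.submit()

# Collect the search results
results = driver.find_elements(By.CSS_SELECTOR, '.result .t a')
for result in results:
    print(result.get_attribute('href'), result.text)

# Close the browser
driver.quit()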

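What the code does:

webdriver.Chrome(): creates a Chrome driver instance. For another browser, replace Chrome with the corresponding class, such as webdriver.Firefox() or webdriver.Edge().
driver.get(url): opens the given URL.
driver.find_element(By.XXX, selector): finds the first element on the page that matches the selector. Common locator strategies are By.ID, By.XPATH, and By.CSS_SELECTOR. The find_element_by_xxx(selector) helpers used in older tutorials were removed in Selenium 4.3.
element.send_keys(keys): types text into an input field.
element.submit(): submits the form that contains the element.
driver.find_elements(By.XXX, selector): finds all matching elements and returns them as a list.
element.get_attribute(attr_name): returns the value of the given attribute of the element.
driver.quit(): closes the browser and ends the driver session.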
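
The result-collecting loop can be pulled out into a small helper, sketched below. The names extract_links and FakeElement are assumptions for illustration; FakeElement only mimics the two WebElement members the helper touches (get_attribute and text) so that the logic can be tried without a browser.

```python
def extract_links(elements):
    """Collect (href, text) pairs from WebElement-like objects, skipping empty hrefs."""
    links = []
    for element in elements:
        href = element.get_attribute("href")
        if href:  # get_attribute returns None when the attribute is absent
            links.append((href, element.text.strip()))
    return links


class FakeElement:
    """Stand-in for a Selenium WebElement so the helper can be tried offline."""

    def __init__(self, href, text):
        self._href = href
        self.text = text

    def get_attribute(self, name):
        return self._href if name == "href" else None


if __name__ == "__main__":
    sample = [
        FakeElement("https://example.com", " Example "),
        FakeElement(None, "no link here"),
    ]
    print(extract_links(sample))  # [('https://example.com', 'Example')]
```

With a live driver, the same helper would be called as extract_links(driver.find_elements(By.CSS_SELECTOR, '.result .t a')).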

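Things to Watch Out For

Handling dynamically loaded content

Some elements on a page are loaded dynamically: they are added by JavaScript after the initial page load has finished. In that case you can pause with Python's time.sleep() (a standard library function, not part of Selenium), or call driver.implicitly_wait(time_to_wait) so that element lookups keep retrying for up to time_to_wait seconds until the element appears.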
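
A fixed sleep either wastes time or is too short. A third option, not covered above, is an explicit wait that polls until a condition holds. The sketch below shows the idea in two forms: wait_until is a plain-Python polling loop written for this article (the name and defaults are assumptions), and wait_for_element wraps Selenium's own WebDriverWait with the presence_of_element_located condition.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call `condition` repeatedly until it returns a truthy value or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        value = condition()
        if value:
            return value
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll)


def wait_for_element(driver, css_selector, timeout=10):
    """Explicit wait using Selenium's WebDriverWait (needs selenium installed)."""
    # Imported here so wait_until above stays usable without Selenium.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    return WebDriverWait(driver, timeout).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, css_selector))
    )
```

In a scraper, wait_for_element(driver, '.result') would return as soon as the first matching element is in the DOM, and WebDriverWait raises a TimeoutException if nothing matches in time.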

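Dealing with CAPTCHAs

For sites that require a CAPTCHA before you can continue, you can either solve the CAPTCHA by hand or use a third-party CAPTCHA recognition service.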
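
The manual route can be as simple as pausing the script while a person solves the CAPTCHA in the open browser window. The following is a minimal sketch under that assumption: the function name solve_captcha_manually is made up for this article, and the locator passed in is whatever identifies the CAPTCHA on the target site.

```python
def solve_captcha_manually(driver, by, selector, prompt=input):
    """If a CAPTCHA element is present, wait for a human to solve it.

    Returns True when the script paused, False when no CAPTCHA was found.
    """
    # find_elements returns an empty list instead of raising when nothing matches
    if not driver.find_elements(by, selector):
        return False
    prompt("CAPTCHA detected. Solve it in the browser window, then press Enter: ")
    return True
```

A call might look like solve_captcha_manually(driver, By.CSS_SELECTOR, '#captcha'), where '#captcha' is a hypothetical selector. This only works when the browser window is visible, so it does not suit headless runs.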

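Handling anti-scraping mechanisms

Some sites deploy anti-scraping mechanisms. The following practices help avoid triggering them:

Simulate real human behavior and do not act too fast
Vary the browser the requests appear to come from
Avoid hitting the same site too frequently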
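
The first two points can be sketched with a random pause between actions and a randomly chosen user-agent string. Everything named here (polite_pause, chrome_user_agent_argument, and the sample strings in USER_AGENTS) is an assumption for illustration; --user-agent itself is a standard Chrome command-line switch.

```python
import random
import time

# Hypothetical sample values; replace them with user-agent strings of real browsers you target.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
]


def polite_pause(min_seconds=1.0, max_seconds=3.0, sleep=time.sleep):
    """Sleep for a random duration in [min_seconds, max_seconds] and return it."""
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay


def chrome_user_agent_argument(user_agents=USER_AGENTS):
    """Build a Chrome command-line argument that sets a randomly chosen user agent."""
    return "--user-agent=" + random.choice(user_agents)
```

To use it, create options = webdriver.ChromeOptions(), call options.add_argument(chrome_user_agent_argument()), and pass options=options to webdriver.Chrome(). Calling polite_pause() between page loads spaces out the requests.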

Examples

Example 1: Scraping a popular QQ Music playlist

from selenium import webdriver
from selenium.webdriver.common.by import By

# Start the Chrome browser
driver = webdriver.Chrome()

# Let element lookups retry for up to 10 seconds while content loads
driver.implicitly_wait(10)

# Open the QQ Music popular playlist page
driver.get('https://y.qq.com/n/yqq/playsquare/6354844333.html#stat=y_new.index.playlist.pic')

# Select the "latest" tab (the last item in the tab bar)
tab = driver.find_element(By.CSS_SELECTOR, '.tab__nav li:last-child a')
tab.click()

# Get the list of songs
songs = driver.find_elements(By.CSS_SELECTOR, '.songlist__list li')

# Print rank, title, and singer for each song
for song in songs:
    rank = song.find_element(By.CLASS_NAME, 'songlist__item_rank').text
    name = song.find_element(By.CLASS_NAME, 'songlist__item_name').text
    singer = song.find_element(By.CLASS_NAME, 'songlist__item_singer').text
    print(rank, name, singer)

# Close the browser
driver.quit()

Example 2: Simulating a login with Selenium

from selenium import webdriver
from selenium.webdriver.common.by import By
import time

# Start the Chrome browser
driver = webdriver.Chrome()

# Open the CSDN login page
driver.get('https://passport.csdn.net/login')

# Switch to username/password login
login_tab = driver.find_element(By.CSS_SELECTOR, '.login-tab .js-login-form>[data-type="account"]')
login_tab.click()

# Enter the username and password
input_username = driver.find_element(By.CSS_SELECTOR, 'input[name=username]')
input_username.send_keys('your_username')
input_password = driver.find_element(By.CSS_SELECTOR, 'input[name=password]')
input_password.send_keys('your_password')

# Click the login button
btn_login = driver.find_element(By.CSS_SELECTOR, '.btn.btn-primary.btn-block')
btn_login.click()

# Sleep for 3 seconds so the page can refresh
time.sleep(3)

# Read the username shown after login
username = driver.find_element(By.CSS_SELECTOR, '.header-user-nav span').text
print('Logged in as:', username)

# Close the browser
driver.quit()
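
Rather than hard-coding 'your_username' and 'your_password' in the script, the credentials can be read from the environment. This is a minimal sketch; the function name load_credentials and the variable names SCRAPER_USERNAME and SCRAPER_PASSWORD are hypothetical and chosen only for this example.

```python
import os


def load_credentials(env=os.environ):
    """Read the login name and password from environment variables.

    SCRAPER_USERNAME and SCRAPER_PASSWORD are hypothetical names used in this sketch.
    """
    try:
        return env["SCRAPER_USERNAME"], env["SCRAPER_PASSWORD"]
    except KeyError as missing:
        raise RuntimeError(f"missing environment variable: {missing.args[0]}") from None
```

In the login script, username, password = load_credentials() would then feed the two send_keys calls, which keeps secrets out of the source file.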

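Both examples use the Selenium library to drive a browser, one to scrape information and one to simulate a login. Running the code yourself is the best way to get a feel for how Selenium works and what to watch out for. Note that the selectors depend on each site's current page structure and may need updating.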

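Unless otherwise noted, all articles on this site are original. If you repost, please credit the source: Quick Start with Selenium Web Scraping: Scrape Anything - Python技术站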
