A Summary of Common Python Web Scraping Techniques

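This guide covers common Python web scraping techniques: sending HTTP requests with the requests library, parsing HTML documents with BeautifulSoup, extracting data with regular expressions, simulating browser behavior with Selenium, and using proxy IPs and user agents. It ends with two examples that apply these techniques to scrape web page data.

Step 1: Install the required libraries

Before starting, install the required libraries with the following command: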

pip install requests beautifulsoup4 selenium
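
After installation, a quick check can confirm that the packages are importable. The sketch below is an optional addition rather than part of the original steps, and the helper name missing_modules is made up for illustration. Note that the beautifulsoup4 package is imported under the name bs4.

```python
import importlib.util


def missing_modules(names):
    """Return the module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# beautifulsoup4 is installed under the import name "bs4".
missing = missing_modules(["requests", "bs4", "selenium"])
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("All required libraries are installed.")
```

If any name is reported as not installed, rerun the pip command above in the same Python environment that runs the script.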

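Step 2: Send HTTP requests with the requests library

requests is one of the most widely used HTTP libraries for Python. Its simple API makes it easy to send an HTTP request and read the response. The steps are as follows:

  1. Import the requests library.
import requests
  2. Send an HTTP request and read the response data.
url = 'http://example.com'
response = requests.get(url)
html = response.text

The code above defines a URL, sends a GET request to it with requests.get(), and reads the response body as HTML text through the response.text attribute.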
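
The request above assumes the server answers promptly and successfully. As a minimal sketch of a more defensive version, the function below adds a timeout and a status check. The name fetch_html and the 10-second default are assumptions chosen for illustration, not part of the original steps.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the response body of `url` as text, or None if the request fails."""
    try:
        # Without a timeout, requests.get() can wait indefinitely for a slow server.
        response = requests.get(url, timeout=timeout)
        # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses.
        response.raise_for_status()
    except requests.RequestException as exc:
        print(f"Request to {url} failed: {exc}")
        return None
    return response.text
```

Calling fetch_html('http://example.com') returns the page's HTML on success and None otherwise, so the caller can check the result before parsing it.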

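Step 3: Parse HTML documents with the BeautifulSoup library

BeautifulSoup is one of the most widely used HTML parsing libraries for Python. It parses an HTML document into Python objects and provides a simple API for extracting data from them. The steps are as follows:

  1. Import the BeautifulSoup library.
from bs4 import BeautifulSoup
  2. Parse the HTML document into Python objects.
soup = BeautifulSoup(html, 'html.parser')

The code above parses the HTML document into a BeautifulSoup object, using 'html.parser' (the parser built into Python's standard library) as the parser.

  3. Extract data.
title = soup.title.text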
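
The line above reads only the page title. As a rough sketch of pulling other data out of the same kind of object, the following example also collects link text and URLs. The HTML snippet and the variable names sample_html and links are invented for illustration.

```python
from bs4 import BeautifulSoup

# A small made-up HTML document, so the example runs without a network request.
sample_html = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <a href="/first">First link</a>
    <a href="/second">Second link</a>
    <a>No href here</a>
  </body>
</html>
"""

soup = BeautifulSoup(sample_html, 'html.parser')

# soup.title is None when the document has no <title> tag, so guard the access.
title = soup.title.text if soup.title else None

# find_all('a', href=True) returns only the <a> tags that have an href attribute.
links = [(a.get_text(strip=True), a['href']) for a in soup.find_all('a', href=True)]

print(title)
print(links)
```

Guarding against a missing tag matters on real pages, where soup.title.text would raise AttributeError if the document has no title.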

Here soup.title is the page's <title> tag, and its text attribute returns the text content of that tag.

Step 4: Extract data with regular expressions

Regular expressions are a powerful text-processing tool for matching and extracting data from text. The steps are as follows:

  1. Import the re library.
import re
  2. Write the regular expression.
pattern = r'<title>(.*?)</title>'

The code above defines a regular expression that matches the text content of the <title> tag.

  3. Match and extract the data with the re library.
match = re.search(pattern, html)
# re.search() returns None when nothing matches, so check before calling group()
title = match.group(1) if match else None

The code above uses re.search() to find the first match of the pattern and group(1) to extract the text captured by the parentheses. Because re.search() returns None when there is no match, the result is checked before group() is called.

Step 5: Simulate browser behavior with the Selenium library

Selenium is one of the most widely used web automation testing libraries for Python. It can simulate browser behavior such as clicking, typing, and scrolling. The steps are as follows:

  1. Import the Selenium library.
from selenium import webdriver
from selenium.webdriver.common.by import By
  2. Create a browser object.
driver = webdriver.Chrome()

The code above creates a Chrome browser object.

  3. Open a web page.
url = 'http://example.com'
driver.get(url)

The code above opens a web page with the get() method.

  4. Simulate browser behavior.
# Recent Selenium 4 releases removed find_element_by_xpath(); use find_element(By.XPATH, ...)
element = driver.find_element(By.XPATH, '//input[@name="q"]')
element.send_keys('Python')
element.submit()

The code above locates an input box with find_element() and an XPath expression, types text into it with send_keys(), and then submits the form with submit(). Recent Selenium 4 releases removed the older find_element_by_* helpers, so find_element() with By.XPATH is used here. The URL is a placeholder, and the XPath assumes the target page contains an input element named q.

Step 6: Use proxy IPs and user agents

Proxy IPs and user agents are common scraping techniques. A proxy hides your real IP address from the target site, and a User-Agent header changes the browser information your client reports, which reduces the chance of being blocked. The steps are as follows:

  1. Define the proxy IP and user agent.
proxies = {
    # The key is the scheme of the target URL; the value is the address used to reach the proxy
    'http': 'http://127.0.0.1:8888',
    'https': 'http://127.0.0.1:8888'
}
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
}

The code above defines a proxy and a user agent. The address 127.0.0.1:8888 is a placeholder for a proxy you actually have access to.

  2. Send an HTTP request through the proxy with the user agent.
url = 'http://example.com'
response = requests.get(url, proxies=proxies, headers=headers)
html = response.text

The code above sends the request with requests, passing the proxies and headers parameters to set the proxy and the user agent.

Example 1: Scrape web page data with requests and BeautifulSoup

The following example shows how to scrape web page data with requests and BeautifulSoup:

import requests
from bs4 import BeautifulSoup

url = 'http://example.com'
response = requests.get(url)
html = response.text
soup = BeautifulSoup(html, 'html.parser')
title = soup.title.text
print(title)

The code above first sends an HTTP request with requests and reads the HTML text from response.text. It then parses the HTML into a BeautifulSoup object, reads the text content of the <title> tag through the text attribute, and prints the title with print().

Example 2: Simulate browser behavior with Selenium

The following example shows how to simulate browser behavior with Selenium:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
url = 'http://example.com'
driver.get(url)
element = driver.find_element(By.XPATH, '//input[@name="q"]')
element.send_keys('Python')
element.submit()
print(driver.title)
driver.quit()

The code above first creates a Chrome browser object and opens a web page with get(). It then locates an input box with find_element(), types text into it with send_keys(), and submits the form with submit(). Finally, it reads the page title from the title attribute and closes the browser with quit().

Unless otherwise noted, articles on this site are original. If you repost this article, please credit the source: Python常用的爬虫技巧总结 - Python技术站 (https://pythonjishu.com/jonhakwgvtszmbt/)