Below is a complete walkthrough of how to scrape the Baidu homepage with Python.
1. Preparation
In Python, we can use the requests module to send HTTP requests and fetch page content, so install it before use:
pip install requests
2. Sending the HTTP request
Next, we send an HTTP GET request with the requests module to fetch the HTML source of the Baidu homepage.
import requests
response = requests.get('http://www.baidu.com')
The requests module's get method fetches the HTML source of the Baidu homepage. Once the response arrives, response.text holds the page's HTML source; it is also good practice to check response.status_code (200 means success) before parsing.
3. Parsing the HTML source
Next, we parse the HTML source to extract the information we need. The BeautifulSoup library handles this well.
from bs4 import BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')
To use BeautifulSoup, first pass the page's HTML source to its constructor. The resulting BeautifulSoup object lets us quickly pull out the information we need.
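As a self-contained sketch of the same constructor call (the HTML string below is a made-up stand-in for response.text, not Baidu's actual markup), you can see how the parse tree exposes tags as attributes:

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for response.text
html = ('<html><head><title>百度一下,你就知道</title></head>'
        '<body><a href="http://news.baidu.com">新闻</a></body></html>')

soup = BeautifulSoup(html, 'html.parser')  # build the parse tree
print(soup.title.string)   # the text inside <title>
print(soup.a['href'])      # the href attribute of the first <a>
```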
4. Extracting the information we need
Once we have the BeautifulSoup object, we can use it to pull out the information we want from the page.
For example, to get the title of the Baidu homepage:
print(soup.title.string)
Output:
百度一下,你就知道
Or, for another example, to get all the links on the Baidu homepage:
for link in soup.find_all('a'):
    print(link.get('href'))
Output:
javascript:void(0);
http://news.baidu.com
http://www.hao123.com
http://map.baidu.com
http://v.baidu.com
https://tieba.baidu.com
http://xueshu.baidu.com
https://zhidao.baidu.com
http://music.baidu.com
http://image.baidu.com
http://www.baidu.com/duty/
http://jianyi.baidu.com/
http://www.baidu.com/duty/
http://ir.baidu.com
http://www.baidu.com/about
http://www.baidu.com/home/feedback.html
javascript:;
http://top.baidu.com
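As the output above shows, the page mixes real URLs with javascript: pseudo-links. A minimal sketch of filtering those out (the inline HTML here is a hypothetical stand-in for the fetched page):

```python
from bs4 import BeautifulSoup

# Hypothetical snippet standing in for the fetched page
html = ('<a href="javascript:void(0);">x</a>'
        '<a href="http://news.baidu.com">新闻</a>'
        '<a href="https://tieba.baidu.com">贴吧</a>')

soup = BeautifulSoup(html, 'html.parser')
# Keep only real http/https URLs, skipping javascript: pseudo-links
urls = [a.get('href') for a in soup.find_all('a')
        if a.get('href', '').startswith(('http://', 'https://'))]
print(urls)
```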
Example 1: get all image links on the Baidu homepage
from bs4 import BeautifulSoup
import requests
url = 'http://www.baidu.com'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
for img in soup.find_all('img'):
    print(img.get('src'))
Output:
http://www.baidu.com/img/bd_logo1.png
http://s1.bdstatic.com/r/www/cache/res/static/baidu/result/bg.gif
http://s1.bdstatic.com/r/www/cache/static/global/img/weather/20x20/N/A.png
http://s1.bdstatic.com/r/www/cache/static/global/img/weather/20x20/d0.gif
http://s1.bdstatic.com/r/www/cache/static/global/img/weather/20x20/d1.gif
http://s1.bdstatic.com/r/www/cache/static/global/img/weather/20x20/d2.gif
http://s1.bdstatic.com/r/www/cache/static/global/img/weather/20x20/d3.gif
http://s1.bdstatic.com/r/www/cache/static/global/img/weather/20x20/d4.gif
http://s1.bdstatic.com/r/www/cache/static/global/img/weather/20x20/d5.gif
http://s1.bdstatic.com/r/www/cache/static/global/img/weather/20x20/d6.gif
http://s1.bdstatic.com/r/www/cache/static/global/img/weather/20x20/d7.gif
http://s1.bdstatic.com/r/www/cache/static/global/img/weather/20x20/d8.gif
http://s1.bdstatic.com/r/www/cache/static/global/img/weather/20x20/d9.gif
http://s1.bdstatic.com/r/www/cache/static/global/img/weather/20x20/e0.gif
http://s1.bdstatic.com/r/www/cache/static/global/img/weather/20x20/e1.gif
http://s1.bdstatic.com/r/www/cache/static/global/img/weather/20x20/e2.gif
http://s1.bdstatic.com/r/www/cache/static/global/img/weather/20x20/e3.gif
http://s1.bdstatic.com/r/www/cache/static/global/img/weather/20x20/e4.gif
http://s1.bdstatic.com/r/www/cache/static/global/img/weather/20x20/e5.gif
http://s1.bdstatic.com/r/www/cache/static/global/img/weather/20x20/e6.gif
http://s1.bdstatic.com/r/www/cache/static/global/img/weather/20x20/e7.gif
http://s1.bdstatic.com/r/www/cache/static/global/img/weather/20x20/e8.gif
http://s1.bdstatic.com/r/www/cache/static/global/img/weather/20x20/e9.gif
http://s1.bdstatic.com/r/www/cache/static/global/img/weather/20x20/f0.gif
http://s1.bdstatic.com/r/www/cache/static/global/img/weather/20x20/f1.gif
http://s1.bdstatic.com/r/www/cache/static/global/img/weather/25x29/9.png
http://s1.bdstatic.com/r/www/cache/static/global/img/weather/25x29/10.png
http://s1.bdstatic.com/r/www/cache/static/global/img/weather/25x29/11.png
http://s1.bdstatic.com/r/www/cache/static/global/img/weather/25x29/12.png
http://s1.bdstatic.com/r/www/cache/static/global/img/weather/25x29/13.png
http://s1.bdstatic.com/r/www/cache/static/global/img/weather/25x29/14.png
http://s1.bdstatic.com/r/www/cache/static/global/img/weather/25x29/15.png
http://s1.bdstatic.com/r/www/cache/static/global/img/weather/25x29/16.png
http://s1.bdstatic.com/r/www/cache/static/global/img/weather/25x29/17.png
http://s1.bdstatic.com/r/www/cache/static/global/img/weather/25x29/18.png
http://s1.bdstatic.com/r/www/cache/static/global/img/weather/25x29/19.png
http://s1.bdstatic.com/r/www/cache/static/global/img/weather/25x29/20.png
http://s1.bdstatic.com/r/www/cache/static/global/img/weather/25x29/21.png
http://s1.bdstatic.com/r/www/cache/static/global/img/weather/25x29/22.png
http://s1.bdstatic.com/r/www/cache/static/global/img/weather/25x29/23.png
http://s1.bdstatic.com/r/www/cache/static/global/img/weather/25x29/24.png
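Note that an img tag's src attribute may be a relative path rather than a full URL. The standard library's urllib.parse.urljoin resolves such paths against the page URL; a small sketch with hypothetical src values:

```python
from urllib.parse import urljoin

base = 'http://www.baidu.com'
# Hypothetical src values: pages often mix relative and absolute paths
srcs = ['/img/bd_logo1.png',
        'http://s1.bdstatic.com/r/www/cache/res/static/baidu/result/bg.gif']

# Relative paths get the base URL prefixed; absolute URLs pass through unchanged
absolute = [urljoin(base, src) for src in srcs]
print(absolute)
```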
Example 2: get the text of all links on the Baidu homepage
from bs4 import BeautifulSoup
import requests
url = 'http://www.baidu.com'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
for link in soup.find_all('a'):
    print(link.string)
Output:
None
新闻
hao123
地图
视频
贴吧
学术
知道
音乐
图片
更多产品
关于百度
使用百度前必读
意见反馈
京ICP证030173号
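The None at the top of the output appears because link.string returns None whenever a tag contains more than one child node; get_text() is more robust, since it concatenates the text of all descendants. A minimal sketch with hypothetical markup:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the first <a> nests two child tags, so .string is None
html = '<a href="#"><span>首页</span><b>!</b></a><a href="#">新闻</a>'
soup = BeautifulSoup(html, 'html.parser')

strings = [a.string for a in soup.find_all('a')]    # None for the nested tag
texts = [a.get_text() for a in soup.find_all('a')]  # joins all descendant text
print(strings)
print(texts)
```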