Below is a complete walkthrough of how to scrape the Baidu homepage with Python.
1. Preparation
In Python, we can use the requests module to send HTTP requests and fetch page content, so install it before use:
pip install requests
2. Sending the HTTP request
Next, we send an HTTP GET request with the requests module to fetch the HTML source of the Baidu homepage.
import requests
response = requests.get('http://www.baidu.com')
The requests module's get method fetches the HTML source of the Baidu homepage. Once the response arrives, response.text holds the page's HTML source; it is also good practice to check response.status_code (200 means success) before parsing.
3. Parsing the HTML source
Next, we parse the HTML source to extract the information we need. The BeautifulSoup library handles this well.
from bs4 import BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')
To use BeautifulSoup, first pass the page's HTML source to its constructor. The resulting BeautifulSoup object lets us quickly pull out the information we need.
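As a self-contained sketch of the same constructor call (the HTML string below is a made-up stand-in for response.text, not Baidu's actual markup), you can see how the parse tree exposes tags as attributes:

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for response.text
html = ('<html><head><title>百度一下,你就知道</title></head>'
        '<body><a href="http://news.baidu.com">新闻</a></body></html>')

soup = BeautifulSoup(html, 'html.parser')  # build the parse tree
print(soup.title.string)   # the text inside <title>
print(soup.a['href'])      # the href attribute of the first <a>
```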
4. Extracting the information we need
Once we have the BeautifulSoup object, we can use it to pull out the information we want from the page.
For example, to get the title of the Baidu homepage:
print(soup.title.string)
Output:
百度一下,你就知道
Or, for another example, to get all the links on the Baidu homepage:
for link in soup.find_all('a'):
    print(link.get('href'))
Output:
javascript:void(0);
http://news.baidu.com
http://www.hao123.com
http://map.baidu.com
http://v.baidu.com
https://tieba.baidu.com
http://xueshu.baidu.com
https://zhidao.baidu.com
http://music.baidu.com
http://image.baidu.com
http://www.baidu.com/duty/
http://jianyi.baidu.com/
http://www.baidu.com/duty/
http://ir.baidu.com
http://www.baidu.com/about
http://www.baidu.com/home/feedback.html
javascript:;
http://top.baidu.com
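As the output above shows, the page mixes real URLs with javascript: pseudo-links. A minimal sketch of filtering those out (the inline HTML here is a hypothetical stand-in for the fetched page):

```python
from bs4 import BeautifulSoup

# Hypothetical snippet standing in for the fetched page
html = ('<a href="javascript:void(0);">x</a>'
        '<a href="http://news.baidu.com">新闻</a>'
        '<a href="https://tieba.baidu.com">贴吧</a>')

soup = BeautifulSoup(html, 'html.parser')
# Keep only real http/https URLs, skipping javascript: pseudo-links
urls = [a.get('href') for a in soup.find_all('a')
        if a.get('href', '').startswith(('http://', 'https://'))]
print(urls)
```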
Example 1: get all image links on the Baidu homepage
from bs4 import BeautifulSoup
import requests
url = 'http://www.baidu.com'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
for img in soup.find_all('img'):
    print(img.get('src'))
Output:
http://www.baidu.com/img/bd_logo1.png
http://s1.bdstatic.com/r/www/cache/res/static/baidu/result/bg.gif
http://s1.bdstatic.com/r/www/cache/static/global/img/weather/20x20/N/A.png
http://s1.bdstatic.com/r/www/cache/static/global/img/weather/20x20/d0.gif
http://s1.bdstatic.com/r/www/cache/static/global/img/weather/20x20/d1.gif
http://s1.bdstatic.com/r/www/cache/static/global/img/weather/20x20/d2.gif
http://s1.bdstatic.com/r/www/cache/static/global/img/weather/20x20/d3.gif
http://s1.bdstatic.com/r/www/cache/static/global/img/weather/20x20/d4.gif
http://s1.bdstatic.com/r/www/cache/static/global/img/weather/20x20/d5.gif
http://s1.bdstatic.com/r/www/cache/static/global/img/weather/20x20/d6.gif
http://s1.bdstatic.com/r/www/cache/static/global/img/weather/20x20/d7.gif
http://s1.bdstatic.com/r/www/cache/static/global/img/weather/20x20/d8.gif
http://s1.bdstatic.com/r/www/cache/static/global/img/weather/20x20/d9.gif
http://s1.bdstatic.com/r/www/cache/static/global/img/weather/20x20/e0.gif
http://s1.bdstatic.com/r/www/cache/static/global/img/weather/20x20/e1.gif
http://s1.bdstatic.com/r/www/cache/static/global/img/weather/20x20/e2.gif
http://s1.bdstatic.com/r/www/cache/static/global/img/weather/20x20/e3.gif
http://s1.bdstatic.com/r/www/cache/static/global/img/weather/20x20/e4.gif
http://s1.bdstatic.com/r/www/cache/static/global/img/weather/20x20/e5.gif
http://s1.bdstatic.com/r/www/cache/static/global/img/weather/20x20/e6.gif
http://s1.bdstatic.com/r/www/cache/static/global/img/weather/20x20/e7.gif
http://s1.bdstatic.com/r/www/cache/static/global/img/weather/20x20/e8.gif
http://s1.bdstatic.com/r/www/cache/static/global/img/weather/20x20/e9.gif
http://s1.bdstatic.com/r/www/cache/static/global/img/weather/20x20/f0.gif
http://s1.bdstatic.com/r/www/cache/static/global/img/weather/20x20/f1.gif
http://s1.bdstatic.com/r/www/cache/static/global/img/weather/25x29/9.png
http://s1.bdstatic.com/r/www/cache/static/global/img/weather/25x29/10.png
http://s1.bdstatic.com/r/www/cache/static/global/img/weather/25x29/11.png
http://s1.bdstatic.com/r/www/cache/static/global/img/weather/25x29/12.png
http://s1.bdstatic.com/r/www/cache/static/global/img/weather/25x29/13.png
http://s1.bdstatic.com/r/www/cache/static/global/img/weather/25x29/14.png
http://s1.bdstatic.com/r/www/cache/static/global/img/weather/25x29/15.png
http://s1.bdstatic.com/r/www/cache/static/global/img/weather/25x29/16.png
http://s1.bdstatic.com/r/www/cache/static/global/img/weather/25x29/17.png
http://s1.bdstatic.com/r/www/cache/static/global/img/weather/25x29/18.png
http://s1.bdstatic.com/r/www/cache/static/global/img/weather/25x29/19.png
http://s1.bdstatic.com/r/www/cache/static/global/img/weather/25x29/20.png
http://s1.bdstatic.com/r/www/cache/static/global/img/weather/25x29/21.png
http://s1.bdstatic.com/r/www/cache/static/global/img/weather/25x29/22.png
http://s1.bdstatic.com/r/www/cache/static/global/img/weather/25x29/23.png
http://s1.bdstatic.com/r/www/cache/static/global/img/weather/25x29/24.png
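Note that an img tag's src attribute may be a relative path rather than a full URL. The standard library's urllib.parse.urljoin resolves such paths against the page URL; a small sketch with hypothetical src values:

```python
from urllib.parse import urljoin

base = 'http://www.baidu.com'
# Hypothetical src values: pages often mix relative and absolute paths
srcs = ['/img/bd_logo1.png',
        'http://s1.bdstatic.com/r/www/cache/res/static/baidu/result/bg.gif']

# Relative paths get the base URL prefixed; absolute URLs pass through unchanged
absolute = [urljoin(base, src) for src in srcs]
print(absolute)
```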
Example 2: get the text of all links on the Baidu homepage
from bs4 import BeautifulSoup
import requests
url = 'http://www.baidu.com'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
for link in soup.find_all('a'):
    print(link.string)
Output:
None
新闻
hao123
地图
视频
贴吧
学术
知道
音乐
图片
更多产品
关于百度
使用百度前必读
意见反馈
京ICP证030173号
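The None at the top of the output appears because link.string returns None whenever a tag contains more than one child node; get_text() is more robust, since it concatenates the text of all descendants. A minimal sketch with hypothetical markup:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the first <a> nests two child tags, so .string is None
html = '<a href="#"><span>首页</span><b>!</b></a><a href="#">新闻</a>'
soup = BeautifulSoup(html, 'html.parser')

strings = [a.string for a in soup.find_all('a')]    # None for the nested tag
texts = [a.get_text() for a in soup.find_all('a')]  # joins all descendant text
print(strings)
print(texts)
```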