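When developing a website or doing SEO work, you may need to collect every internal and external link on a web page. In Python, the third-party libraries beautifulsoup4 and requests make this straightforward.
The following is a complete walkthrough for extracting all internal and external links from a web page:
Step 1: Install the required libraries
Before extracting any links, make sure the required libraries are installed. The two needed here are beautifulsoup4 and requests, and both can be installed with pip install: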
pip install beautifulsoup4
pip install requests
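To confirm that both packages are importable before moving on, a quick check along the following lines can help. This is only a minimal sketch: the helper name `check_installed` is made up for illustration, and it relies solely on the standard library's `importlib.util.find_spec`. Note that beautifulsoup4 is imported under the module name `bs4`.

```python
from importlib.util import find_spec


def check_installed(modules=("bs4", "requests")):
    """Return a dict mapping each module name to True if it can be imported."""
    # find_spec returns None when a top-level module cannot be found
    return {name: find_spec(name) is not None for name in modules}


if __name__ == "__main__":
    for name, ok in check_installed().items():
        print(f"{name}: {'installed' if ok else 'missing'}")
```

If either package is reported as missing, rerun the matching pip install command in the same environment that runs the script.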
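Step 2: Write the Python code
With the libraries installed, we can write the Python code. The main steps are sending the request, parsing the HTML document, and separating the links into internal and external ones. In this example a link counts as external if its href starts with http, and as internal otherwise. Sample code: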
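import requests
from bs4 import BeautifulSoup


def get_links(url):
    # Send the request; the timeout keeps a stalled server from hanging the script
    response = requests.get(url, timeout=10)
    # Parse the HTML document
    soup = BeautifulSoup(response.text, 'html.parser')
    # Collect the href of every <a> tag, skipping anchors that have no href
    # (otherwise link.get('href') returns None and startswith() fails)
    links = []
    for link in soup.find_all('a', href=True):
        links.append(link['href'])
    # External links: hrefs that are absolute http(s) URLs
    external_links = []
    for link in links:
        if link.startswith('http'):
            external_links.append(link)
    # Internal links: everything else (relative paths, fragments, and so on)
    internal_links = []
    for link in links:
        if not link.startswith('http'):
            internal_links.append(link)
    # Return the internal and external links
    return internal_links, external_links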
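The prefix check above treats every absolute URL as external, even when it points back to the same site. A stricter split compares host names instead. The sketch below is one possible approach, not part of the original code: the function name `classify_links` is hypothetical, it uses only `urljoin` and `urlparse` from the standard library, and it assumes that "internal" means "exactly the same host as the page", so subdomains such as docs.python.org count as external.

```python
from urllib.parse import urljoin, urlparse


def classify_links(base_url, hrefs):
    """Split hrefs into (internal, external) absolute URLs by comparing hosts."""
    base_host = urlparse(base_url).netloc.lower()
    internal, external = [], []
    for href in hrefs:
        # Resolve relative paths such as "/downloads/" against the page URL
        absolute = urljoin(base_url, href.strip())
        parsed = urlparse(absolute)
        # Skip javascript:, mailto: and other non-HTTP schemes
        if parsed.scheme not in ("http", "https"):
            continue
        if parsed.netloc.lower() == base_host:
            internal.append(absolute)
        else:
            external.append(absolute)
    return internal, external
```

It could be combined with the function above by passing in all collected hrefs, for example classify_links(url, internal_links + external_links).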
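Step 3: Call the Python code
Once the code is written, call the function from a Python interpreter or a script with the target URL to get the internal and external links on that page. Two examples follow:
Example 1:
Input URL: https://www.baidu.com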
internal_links, external_links = get_links('https://www.baidu.com')
print('Internal Links:')
for link in internal_links:
    print(link)
print('External Links:')
for link in external_links:
    print(link)
Output:
Internal Links:
/
/s?ie=utf-8&f=8&rsv_bp=1&rsv_idx=1&tn=baidu&wd=
/s?ie=UTF-8&wd=&oq=&tn=baiduhome_pg&ie=utf-8&rsv_idx=2&rsv_pq=8b4877a400006ba0&rsv_t=3235a88aGeM1pMwMDZDgqO5lGms03qmZHTkyrz5vDGGIhb8bXkO9M9aMeiYJl6crxPh&rqlang=cn&rsv_enter=0&rsv_dl=tb&rsv_sug3=23&rsv_sug1=5&rsv_sug7=101&rsv_sug2=0&rsv_btype=i&inputT=0&rsv_sug4=57
...
External Links:
https://www.baidu.com/gaoji/preferences.html
http://www.baidu.com/duty/
http://www.miitbeian.gov.cn
http://www.beian.gov.cn/portal/registerSystemInfo?recordcode=11000002000001
https://www.baidu.com
Example 2:
Input URL: https://www.python.org/
internal_links, external_links = get_links('https://www.python.org/')
print('Internal Links:')
for link in internal_links:
    print(link)
print('External Links:')
for link in external_links:
    print(link)
Output:
Internal Links:
javascript:;
/downloads/
https://www.python.org/about/gettingstarted/
/about/apps/
/about/quotes/
/about/help/
javascript:;
/news/security/
/
#site-map
External Links:
http://events.python.org/
http://pypi.python.org/pypi
http://wiki.python.org/moin/
https://github.com/python/pythondotorg/
https://docs.python.org
http://planetpython.org/
https://www.python.org/psf/
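The output above contains repeated entries (for example, javascript:; appears twice), because the same href can occur in several anchors. If duplicates are not wanted, they can be dropped while keeping the original order. This is a small sketch with a hypothetical helper name, `unique_links`, and it is not part of the original code:

```python
def unique_links(links):
    """Remove duplicate links while keeping the order of first appearance."""
    # dict keys are unique and preserve insertion order (Python 3.7+)
    return list(dict.fromkeys(links))


print(unique_links(["javascript:;", "/downloads/", "javascript:;", "/"]))
# prints ['javascript:;', '/downloads/', '/']
```

The same helper can be applied to either list returned by get_links before printing.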