Here is the detailed guide:
Scraping 笔趣阁 (Biquge) Novels with a Python Crawler
1. Identify the target
First, we need to decide which 笔趣阁 novel page to scrape. Taking 《盗墓笔记》 as an example, we can use its index page: http://www.biquge.info/10_10945/
2. Analyze the page
We inspect the page with the browser's developer tools to locate the chapter list. The chapter list sits inside the div element with id "list"; each chapter is wrapped in an a tag whose href attribute is the link to that chapter and whose text is the chapter title.
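To make that structure concrete, here is a minimal, self-contained sketch that parses a hypothetical HTML fragment shaped the way the page is described above (the fragment and chapter titles are invented purely for illustration):

from bs4 import BeautifulSoup

# Hypothetical fragment: a div with id="list" containing one <a> per chapter.
html = '''
<div id="list">
  <dl>
    <dd><a href="100001.html">第一章</a></dd>
    <dd><a href="100002.html">第二章</a></dd>
  </dl>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')
for a in soup.find('div', {'id': 'list'}).find_all('a'):
    print(a['href'], a.text)   # relative link and chapter title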
3. Send the request
Use the requests library to fetch the page content:
import requests
url = 'http://www.biquge.info/10_10945/'
response = requests.get(url)
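In practice the site may reject requests that lack a browser-like User-Agent, and requests may mis-detect the page encoding. The following is a minimal, more defensive sketch (the header value and timeout are just example choices):

import requests

url = 'http://www.biquge.info/10_10945/'
headers = {'User-Agent': 'Mozilla/5.0'}           # present ourselves as a browser
response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()                       # fail fast on HTTP errors
response.encoding = response.apparent_encoding    # guard against garbled text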
4. Parse the page
Use the BeautifulSoup library to parse the page and extract the chapter list:
from bs4 import BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')
chapter_list = soup.find('div', {'id': 'list'}).find_all('a')  # one <a> per chapter inside div#list
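To verify that the selector works, you can print the first few entries (a quick sanity check, not part of the final script):

for chapter in chapter_list[:5]:
    print(chapter['href'], chapter.text)   # relative link and chapter title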
5. Crawl the chapter content
Iterate over the chapter list and request each chapter's link to fetch its content:
for chapter in chapter_list:
    chapter_url = url + chapter['href']
    chapter_response = requests.get(chapter_url)
    chapter_soup = BeautifulSoup(chapter_response.text, 'html.parser')
    content = chapter_soup.find('div', {'id': 'content'}).text
    # save the chapter content here
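Note that url + chapter['href'] only works when the href values are relative and the index URL ends with a slash. A more robust sketch uses urljoin, and adds a short pause between requests so the site is not hit too hard (the 0.5-second delay is an arbitrary choice):

import time
from urllib.parse import urljoin

for chapter in chapter_list:
    chapter_url = urljoin(url, chapter['href'])   # handles relative and absolute hrefs
    chapter_response = requests.get(chapter_url)
    chapter_soup = BeautifulSoup(chapter_response.text, 'html.parser')
    content = chapter_soup.find('div', {'id': 'content'}).text
    # save the chapter content here
    time.sleep(0.5)                               # be polite to the server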
6. Complete code example
import requests
from bs4 import BeautifulSoup

url = 'http://www.biquge.info/10_10945/'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
chapter_list = soup.find('div', {'id': 'list'}).find_all('a')

for chapter in chapter_list:
    chapter_url = url + chapter['href']
    chapter_response = requests.get(chapter_url)
    chapter_soup = BeautifulSoup(chapter_response.text, 'html.parser')
    content = chapter_soup.find('div', {'id': 'content'}).text
    # save the chapter content here
7. Worked examples
Example 1: Save chapter content to local files
import re
import requests
from bs4 import BeautifulSoup

url = 'http://www.biquge.info/10_10945/'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
chapter_list = soup.find('div', {'id': 'list'}).find_all('a')

for chapter in chapter_list:
    chapter_url = url + chapter['href']
    chapter_response = requests.get(chapter_url)
    chapter_soup = BeautifulSoup(chapter_response.text, 'html.parser')
    content = chapter_soup.find('div', {'id': 'content'}).text
    # strip characters that are not allowed in file names
    filename = re.sub(r'[\\/:*?"<>|]', '_', chapter.text.strip())
    with open(f'{filename}.txt', 'w', encoding='utf-8') as f:
        f.write(content)
The code above saves each chapter's content to a local file named after the chapter. Note that file names cannot contain certain special characters, so the chapter title is cleaned with a regular expression before it is used as the file name.
Example 2: Speed up the crawl with multiple threads
import re
import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor

url = 'http://www.biquge.info/10_10945/'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
chapter_list = soup.find('div', {'id': 'list'}).find_all('a')

def crawl_chapter(chapter):
    chapter_url = url + chapter['href']
    chapter_response = requests.get(chapter_url)
    chapter_soup = BeautifulSoup(chapter_response.text, 'html.parser')
    content = chapter_soup.find('div', {'id': 'content'}).text
    # strip characters that are not allowed in file names
    filename = re.sub(r'[\\/:*?"<>|]', '_', chapter.text.strip())
    with open(f'{filename}.txt', 'w', encoding='utf-8') as f:
        f.write(content)

with ThreadPoolExecutor(max_workers=16) as executor:
    executor.map(crawl_chapter, chapter_list)
The code above uses a thread pool to fetch several chapters concurrently, which speeds up the whole process. Note that the number of threads should not be too high, otherwise you waste resources and hit the site too frequently.
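If you want to keep the thread pool but cap the request rate, one simple option is to pause inside each worker. Here is a minimal sketch building on the crawl_chapter function from Example 2, with a 4-worker pool and a 0.5-second pause per request (both numbers are arbitrary choices):

import time
from concurrent.futures import ThreadPoolExecutor

def crawl_chapter_throttled(chapter):
    crawl_chapter(chapter)   # reuse the function defined in Example 2
    time.sleep(0.5)          # each thread issues at most ~2 requests per second

with ThreadPoolExecutor(max_workers=4) as executor:
    executor.map(crawl_chapter_throttled, chapter_list)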