I am a complete beginner — the road to Python is still long. Today I start here, keeping this post as a note to myself. 2018-04-05
I spent an afternoon learning to scrape a novel site. Since I have very little background, I just patched up whatever I didn't know as I went. It was a bumpy ride, but the script finally runs. I'm putting the code here first; later I'll ask someone more experienced to help fix the remaining problems.
# -*- coding: utf-8 -*-
# @Time     : 2018/4/5 13:46
# @Author   : ELEVEN
# @File     : crawerl--小说网.py
# @Software : PyCharm

import requests
import re
import os

header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:59.0) Gecko/20100101 Firefox/59.0'
}

def get_type_list(i):
    """Fetch the list page of category i and return (title, url) pairs."""
    url = 'http://www.quanshuwang.com/list/{}_1.html'.format(i)
    html = requests.get(url, headers=header)
    html.encoding = 'gbk'
    html = html.text
    novel_list = re.findall(
        r'<a target="_blank" title="(.*?)" href="(.*?)" class="clearfix stitle">',
        html, re.S)
    return novel_list

def get_chapter_list(type_url):
    """Follow a novel's detail page to its chapter index and return (url, name) pairs."""
    html = requests.get(type_url, headers=header)
    html.encoding = 'gbk'
    html = html.text
    novel_chapter_html = re.findall(
        r'<img src="/kukuku/images/only2.png" class="leftso png_bg"><a href="(.*?)" class="l mr11">',
        html, re.S)[0]
    # the headers argument was missing on this second request
    html = requests.get(novel_chapter_html, headers=header)
    html.encoding = 'gbk'
    html = html.text
    novel_chapter = re.findall(
        r'<li><a href="(http://www.quanshuwang.com/book/.*?)" title=".*?">(.*?)</a></li>',
        html, re.S)
    return novel_chapter

def get_chapter_info(chapter_url):
    """Fetch one chapter page and extract the body text between the two script blocks."""
    html = requests.get(chapter_url, headers=header)
    html.encoding = 'gbk'
    html = html.text
    chapter_info = re.findall(
        r'<div class="mainContenr".*?</script>(.*?)<script.*?</script></div>',
        html, re.S)[0]
    return chapter_info

if __name__ == '__main__':
    sort_dict = {
        1: '玄幻魔法', 2: '武侠修真', 3: '纯爱耽美', 4: '都市言情',
        5: '职场校园', 6: '穿越重生', 7: '历史军事', 8: '网游动漫',
        9: '恐怖灵异', 10: '科幻小说', 11: '美文名著'
    }
    try:
        if not os.path.exists('全书网'):
            os.mkdir('全书网')
        for sort_id, sort_name in sort_dict.items():
            if not os.path.exists('%s/%s' % ('全书网', sort_name)):
                os.mkdir('%s/%s' % ('全书网', sort_name))
            for type_name, type_url in get_type_list(sort_id):
                # ([::-1] on this list would save the chapters in reverse order)
                for chapter_url, chapter_name in get_chapter_list(type_url):
                    # append each chapter to one .txt file per novel
                    with open('%s/%s/%s.txt' % ('全书网', sort_name, type_name),
                              'a', encoding='utf-8') as f:
                        print('Saving...', chapter_name)
                        f.write('\n' + chapter_name + '\n')
                        f.write(get_chapter_info(chapter_url))
    except OSError as reason:
        print('wrong')
        print('The cause was: %s' % str(reason))
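Since the scraper relies entirely on regular expressions, it helps to test each pattern against a small HTML snippet before launching the full crawl, instead of debugging against live pages. A minimal sketch — the snippet below is an invented sample imitating the list-page markup, not real output from the site:

```python
import re

# Invented sample resembling one entry of the quanshuwang list page
sample = ('<a target="_blank" title="示例小说" '
          'href="http://www.quanshuwang.com/book_1.html" class="clearfix stitle">')

pattern = r'<a target="_blank" title="(.*?)" href="(.*?)" class="clearfix stitle">'
matches = re.findall(pattern, sample, re.S)
print(matches)  # [('示例小说', 'http://www.quanshuwang.com/book_1.html')]
```

Each tuple unpacks directly in a `for title, url in matches:` loop, which is exactly how the main block consumes `get_type_list`.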
Unresolved problems:
1. Error message: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response',))
My analysis: probably because the script hits the server repeatedly and rapidly, the server decides I'm a bot and blocks me. I tried changing the request headers as well, but the crawl still dies after scraping only a few novels.
Outcome: not solved.
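One common mitigation for the RemoteDisconnected error above (not verified against this particular site) is to set a timeout, pause between requests, and retry with a growing delay when the server drops the connection. A minimal sketch of such a fetch helper; the retry count and delay values are arbitrary assumptions:

```python
import time
import requests

HEADERS = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:59.0) '
                         'Gecko/20100101 Firefox/59.0'}

def fetch(url, retries=3, delay=2.0):
    """Fetch a page as gbk text, retrying on connection errors.

    Returns None if every attempt fails, so the caller can skip
    that chapter instead of crashing the whole crawl.
    """
    for attempt in range(retries):
        try:
            resp = requests.get(url, headers=HEADERS, timeout=10)
            resp.encoding = 'gbk'
            return resp.text
        except requests.exceptions.RequestException:
            # Server dropped or refused the connection; wait a little
            # longer before each retry to look less like a bot.
            time.sleep(delay * (attempt + 1))
    return None
```

The three `requests.get` calls in the script could all go through a helper like this; adding a short `time.sleep` between chapters may also reduce the chance of being blocked.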
Unless otherwise stated, articles on this site are original; if you repost, please credit the source: 跟潭州学院的强子老师学习网络爬虫—爬取全书网 - Python技术站