Tonight I had planned to write the Tieba crawler, but I ran into big trouble while analyzing the page! I picked a thread, and while crawling it I found that the regex wasn't matching everything... so awkward!! Take a look first:

#!/usr/bin/env python
# -*- coding:utf-8 -*-
__author__ = 'ziv·chan'


import requests
import re

url = 'http://tieba.baidu.com/p/3138733512?see_lz=1&pn=3'
html = requests.get(url)
html.encoding = 'utf-8'
pageCode = html.text

# Match each floor's post body; re.S lets '.' match newlines as well
pattern = re.compile('d_post_content j_d_post_content ">(.*?)</div><br>', re.S)
items = re.findall(pattern, pageCode)
i = 1
for item in items:
    hasImg = re.search('<img', item)
    hasHref = re.search('href', item)
    # Filter out image tags
    if hasImg:
        pattern_1 = re.compile('<img class="BDE_Image".*?<br><br>')
        item = re.sub(pattern_1, '', item)
    # Filter out href (@-mention) tags
    if hasHref:
        pattern_2 = re.compile('onclick="Stats.sendRequest.*?class="at">(.*?)</a>', re.S)
        item = re.findall(pattern_2, item)

    print str(i) + ':'
    # Print the usernames captured from the href tags
    if type(item) is list:
        for each in item:
            print each
    else:
        # Replace the leftover '<br>' tags with newlines
        pattern_Br = re.compile('<br>')
        item = re.sub(pattern_Br, '\n', item)
        # strip() removes the surrounding whitespace by default
        print item.strip()
    print '\n'
    i += 1
    # if not hasImg and not hasHref:
    #     print i
    #     print item.strip()
    #     i += 1
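
To see what the main pattern actually captures, here's a minimal sketch run against a hypothetical, simplified fragment of the floor markup (the snippet below is made up for illustration; the real page source carries far more attributes). re.S lets '.' match newlines, and the non-greedy (.*?) stops each match at the first </div><br>:

# -*- coding:utf-8 -*-
import re

# Hypothetical, simplified markup for two floors -- not the real page source
pageCode = ('<div class="d_post_content j_d_post_content ">floor one<br>'
            'line two</div><br>'
            '<div class="d_post_content j_d_post_content ">floor two</div><br>')

pattern = re.compile('d_post_content j_d_post_content ">(.*?)</div><br>', re.S)
print re.findall(pattern, pageCode)
# ['floor one<br>line two', 'floor two']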

I thought it was all done, but then... when extracting content that contains an @ mention, something is always missing, either this or that... So frustrating. I clearly haven't put enough work into regex yet, but at least the methods I learned earlier today got put straight to use. Got it!
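
At least part of the problem is visible in the code itself: when a floor contains an @ mention, item = re.findall(pattern_2, item) replaces the whole post body with just the list of mentioned usernames, so any ordinary text around the mention is thrown away. A minimal sketch on a made-up snippet (hypothetical markup, simplified from what Tieba actually emits):

# -*- coding:utf-8 -*-
import re

# Hypothetical @-mention floor: a mention followed by ordinary text
item = ('<a href="/home/main?un=xxx" onclick="Stats.sendRequest(...)" '
        'class="at">@SomeUser</a> thanks for the tip<br>see you tomorrow')

pattern_2 = re.compile('onclick="Stats.sendRequest.*?class="at">(.*?)</a>', re.S)
print re.findall(pattern_2, item)
# ['@SomeUser'] -- only the username survives; the surrounding text is lost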

Tomorrow I'll look at how 静觅 does it. Another big meal to digest properly. Keep going!!