url = http://www.hnu.edu.cn/xyxk/xkzy/zylb.htm
Part of the table, as shown in the figure:
Part of the HTML code:
<table class="MsoNormalTable" style="width:353.0pt;margin-left:4.65pt;border-collapse:collapse;border:none; mso-border-alt:solid windowtext .5pt;mso-padding-alt:0cm 5.4pt 0cm 5.4pt; mso-border-insideh:.5pt solid windowtext;mso-border-insidev:.5pt solid windowtext" width="471" cellspacing="0" cellpadding="0" border="1"> <tbody> <tr class="firstRow" style="mso-yfti-irow:0;mso-yfti-firstrow:yes;height:36.75pt"> <td style="width:170.0pt;border:solid windowtext 1.0pt;mso-border-alt: solid windowtext .5pt;padding:0cm 5.4pt 0cm 5.4pt;height:36.75pt" width="227"><p class="MsoNormal" style="text-align:center;margin-top:6.0pt;margin-right:0cm; margin-bottom:6.0pt;margin-left:0cm;mso-para-margin-top:.5gd;mso-para-margin-right: 0cm;mso-para-margin-bottom:.5gd;mso-para-margin-left:0cm; mso-pagination:widow-orphan"><strong><span style="font-size:9.0pt;font-family: 宋体;mso-bidi-font-family:宋体;mso-font-kerning:0pt">学院<span lang="EN-US"> <o:p></o:p></span></span></strong></p></td> <td style="width:183.0pt;border:solid windowtext 1.0pt; border-left:none;mso-border-left-alt:solid windowtext .5pt;mso-border-alt: solid windowtext .5pt;padding:0cm 5.4pt 0cm 5.4pt;height:36.75pt" width="244" nowrap=""><p class="MsoNormal" style="text-align:center;margin-top:6.0pt;margin-right:0cm; margin-bottom:6.0pt;margin-left:0cm;mso-para-margin-top:.5gd;mso-para-margin-right: 0cm;mso-para-margin-bottom:.5gd;mso-para-margin-left:0cm; mso-pagination:widow-orphan"><strong><span style="font-size:9.0pt;font-family: 宋体;mso-bidi-font-family:宋体;mso-font-kerning:0pt">专业名称<span lang="EN-US"> <o:p></o:p></span></span></strong></p></td> </tr> <tr style="mso-yfti-irow:1;height:16.5pt"> <td rowspan="4" style="width:170.0pt;border:solid windowtext 1.0pt; border-top:none;mso-border-top-alt:solid windowtext .5pt;mso-border-alt:solid windowtext .5pt; padding:0cm 5.4pt 0cm 5.4pt;height:16.5pt" width="227"><p class="MsoNormal" style="text-align:center;margin-top:6.0pt;margin-right:0cm; 
margin-bottom:6.0pt;margin-left:0cm;mso-para-margin-top:.5gd;mso-para-margin-right: 0cm;mso-para-margin-bottom:.5gd;mso-para-margin-left:0cm; mso-pagination:widow-orphan"><span style="font-size:9.0pt;font-family:宋体; mso-bidi-font-family:宋体;mso-font-kerning:0pt">土木工程学院<span lang="EN-US">450 <o:p></o:p></span></span></p></td> <td style="width:183.0pt;border-top:none;border-left:none; border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt; mso-border-top-alt:solid windowtext .5pt;mso-border-left-alt:solid windowtext .5pt; mso-border-alt:solid windowtext .5pt;padding:0cm 5.4pt 0cm 5.4pt;height:16.5pt" width="244" nowrap=""><p class="MsoNormal" style="text-align:center;margin-top:6.0pt;margin-right:0cm; margin-bottom:6.0pt;margin-left:0cm;mso-para-margin-top:.5gd;mso-para-margin-right: 0cm;mso-para-margin-bottom:.5gd;mso-para-margin-left:0cm; mso-pagination:widow-orphan"><span style="font-size:9.0pt;font-family:宋体; mso-bidi-font-family:宋体;mso-font-kerning:0pt">土木工程<span lang="EN-US"> <o:p></o:p></span></span></p></td> </tr> ...... </tbody> </table>
Parse the table with pandas. The code is as follows:
import pandas as pd

url = 'http://www.hnu.edu.cn/xyxk/xkzy/zylb.htm'
table = pd.read_html(url)  # returns a list of DataFrames, one per <table> found on the page
pd.set_option('display.max_rows', None)  # show all rows
with open("湖南大学学院与专业.txt", "wt", encoding='utf8') as out_file:  # save as a txt file
    for i in table:
        out_file.write(str(i) + '\n')
The output is as follows (partial):
Very concise and efficient!
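The table above is irregular because one college cell spans several rows through rowspan. The following is a minimal sketch of how pandas treats such a table, assuming a reasonably recent pandas with an HTML parser such as lxml installed. The inline markup, the placeholder names (College A, Major 1, and so on), and the helper parse_first_table are hypothetical stand-ins for illustration, not the real page content.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for the page's markup: the header row is built from
# <td> cells, and one cell spans two rows via rowspan.
SAMPLE_HTML = """
<table border="1">
  <tbody>
    <tr><td>College</td><td>Major</td></tr>
    <tr><td rowspan="2">College A</td><td>Major 1</td></tr>
    <tr><td>Major 2</td></tr>
    <tr><td>College B</td><td>Major 3</td></tr>
  </tbody>
</table>
"""


def parse_first_table(html: str) -> pd.DataFrame:
    """Parse the first <table> in html, using its first row as column names."""
    # header=0 is needed here because the header cells are <td>, not <th>.
    tables = pd.read_html(StringIO(html), header=0)
    return tables[0]


df = parse_first_table(SAMPLE_HTML)
print(df)
```

In this sketch the spanned cell is repeated on every row it covers, so each major keeps its college name and the resulting DataFrame is rectangular. If a file that is easier to reload is preferred over the str() dump, df.to_csv is one possible alternative.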
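Unless otherwise noted, articles on this site are original. If you repost, please credit the source: A Simple Python Crawler: Parsing Tables, Including Irregular Tables, with pandas - Python技术站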