Reading HTML Tables in Python with pd.read_html()

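When you need to pull table data out of an HTML page for further processing and analysis, the pd.read_html() function is a convenient and practical way to do it in Python.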

1. Introduction to pd.read_html()

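The pd.read_html() function is part of the pandas package. It reads the tables in an HTML page directly and returns them as a list of DataFrame objects, one per table, which can be used right away for further data processing and analysis. It depends on an HTML parser being installed, such as lxml (or BeautifulSoup4 together with html5lib).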

2. Basic usage of pd.read_html()

The basic usage of pd.read_html() is as follows:

import pandas as pd
table = pd.read_html(url)

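Here, url is the address of the HTML page or the path to a local HTML file. The table variable returned by the function is a list containing the data of every table in the page, and each element of the list is a DataFrame.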
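To make the return value concrete, here is a minimal sketch that needs no network access. The two-table HTML snippet and its column names are invented for illustration, and the markup is wrapped in io.StringIO so that it is passed as a file-like object. The same markup is also written to a temporary file to show the file-path form.

```python
import io
import os
import tempfile

import pandas as pd

# Hypothetical page with two tables, standing in for a real URL.
HTML = """
<html><body>
<table>
  <thead><tr><th>name</th><th>score</th></tr></thead>
  <tbody>
    <tr><td>alice</td><td>90</td></tr>
    <tr><td>bob</td><td>85</td></tr>
  </tbody>
</table>
<table>
  <thead><tr><th>item</th><th>qty</th></tr></thead>
  <tbody>
    <tr><td>pen</td><td>3</td></tr>
  </tbody>
</table>
</body></html>
"""

# Read from a file-like object: the result is a list of DataFrames.
tables = pd.read_html(io.StringIO(HTML))
print(type(tables), len(tables))
print(tables[0])

# Read the same markup from a local file path.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "page.html")
    with open(path, "w", encoding="utf-8") as f:
        f.write(HTML)
    tables_from_file = pd.read_html(path)
```

Both calls should produce one DataFrame per table in the markup, in document order.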

If the HTML page contains only one table, you can get its DataFrame directly by indexing into the list:

import pandas as pd
tables = pd.read_html(url)
table = tables[0]

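3. Examples

The following two examples show how to use pd.read_html().

Example 1: Reading a table from a Wikipedia page

We visit a Wikipedia page (https://en.wikipedia.org/wiki/List_of_S%26P_500_companies) that lists all the companies in the S&P 500 index. Part of the table looks like this: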

Ticker symbol	Security	SEC filings	GICS Sector	GICS Sub Industry	Headquarters Location	Date first added	CIK
MMM	3M Company	SEC filings	Industrials	Industrial Conglomerates	St. Paul, Minnesota	1976-08-09	66740
ABT	Abbott Laboratories	SEC filings	Health Care	Health Care Equipment	North Chicago, Illinois	1964-03-31	1800
...	...	...	...	...	...	...	...

With pd.read_html() we can read this table and load it into a DataFrame in a few lines:

import pandas as pd
url = 'https://en.wikipedia.org/wiki/List_of_S%26P_500_companies'
tables = pd.read_html(url)
df = tables[0]  # get the first table
print(df.head())  # print the first 5 rows

Output:

  Ticker symbol                Security SEC filings             GICS Sector  \
0           MMM              3M Company     reports             Industrials   
1           AOS         A. O. Smith Corp     reports             Industrials   
2           ABT     Abbott Laboratories     reports             Health Care   
3          ABBV             AbbVie Inc.     reports             Health Care   
4           ACN           Accenture plc     reports  Information Technology   

                             GICS Sub Industry  Headquarters Location  \
0                     Industrial Conglomerates           St. Paul, MN
1            Electrical Components & Equipment          Milwaukee, WI
2                        Health Care Equipment      North Chicago, IL
3                              Pharmaceuticals      North Chicago, IL
4  IT Consulting & Other Professional Services        Dublin, Ireland

  Date first added       CIK
0       1976-08-09   66740.0
1       2017-05-03   91142.0
2       1964-03-31    1800.0
3       2012-12-31  155115.0
4       2011-07-06  146737.0
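Picking a table by position works only as long as the page layout stays the same. As a possible alternative, the match parameter of pd.read_html() keeps only the tables whose text matches a given string or regular expression. The sketch below uses a small invented HTML snippet rather than the live Wikipedia page, so the markup and the "Symbol" header are assumptions made for illustration.

```python
import io

import pandas as pd

# Hypothetical markup: a small unrelated table followed by the table we want.
HTML = """
<table>
  <thead><tr><th>Note</th></tr></thead>
  <tbody><tr><td>navigation box</td></tr></tbody>
</table>
<table>
  <thead><tr><th>Symbol</th><th>Security</th></tr></thead>
  <tbody>
    <tr><td>MMM</td><td>3M Company</td></tr>
    <tr><td>ABT</td><td>Abbott Laboratories</td></tr>
  </tbody>
</table>
"""

# Keep only the tables that contain text matching "Symbol".
tables = pd.read_html(io.StringIO(HTML), match="Symbol")
df = tables[0]
print(df)
```

Here the unrelated first table is filtered out, so index 0 refers to the table that contains the matching text.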

Example 2: Reading a table from a stock quote website

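We visit a stock quote website (http://quotes.money.163.com/) and look up the price history of any company. The query page shows a table of historical stock data like the one below (the page's column headers are in Chinese and are translated here):

Date	Open	High	Low	Close	Change	Turnover rate	Total market cap (100M)	Float market cap (100M)	Volume (10k shares)	Amount (10k CNY)
2021-05-21	33.99	34.39	33.92	34.17	1.27%	1.90%	1347.95	1308.66	2319.39	79442.33
2021-05-20	32.90	33.91	32.72	33.72	3.59%	2.59%	1313.71	1273.33	2241.02	74855.88
2021-05-19	32.86	33.64	32.70	32.53	-2.01%	1.73%	1282.28	1242.85	2068.12	67241.50
...	...	...	...	...	...	...	...	...	...	...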

Code along the following lines reads this table and loads it into a DataFrame:

import pandas as pd
url = 'http://quotes.money.163.com/trade/lsjysj_600519.html?year=2021&season=1'
tables = pd.read_html(url)
df = tables[3]  # get the 4th table (the index depends on the page layout)
print(df.head())  # print the first 5 rows
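A hard-coded index such as tables[3] breaks as soon as the site adds or removes a table. One possible way around this is to select the table by its column names instead. The sketch below assumes the page's headers are the Chinese originals, such as 日期 (date), 收盘价 (close) and 涨跌幅 (change). The helper pick_table is hypothetical and not part of pandas, and the HTML snippet is a stand-in for the real page.

```python
import io

import pandas as pd


def pick_table(tables, required_columns):
    """Hypothetical helper: return the first DataFrame whose columns
    include every name in required_columns."""
    wanted = set(required_columns)
    for df in tables:
        if wanted.issubset(map(str, df.columns)):
            return df
    raise ValueError(f"no table has columns {sorted(wanted)}")


# Stand-in markup: a layout table followed by the data table.
HTML = """
<table><tbody><tr><td>menu</td><td>login</td></tr></tbody></table>
<table>
  <thead><tr><th>日期</th><th>开盘价</th><th>收盘价</th><th>涨跌幅</th></tr></thead>
  <tbody>
    <tr><td>2021-05-21</td><td>33.99</td><td>34.17</td><td>1.27%</td></tr>
    <tr><td>2021-05-20</td><td>32.90</td><td>33.72</td><td>3.59%</td></tr>
  </tbody>
</table>
"""

tables = pd.read_html(io.StringIO(HTML))
df = pick_table(tables, ["日期", "收盘价"])

# Percentages arrive as strings like "1.27%"; strip the sign to get floats.
df["涨跌幅"] = df["涨跌幅"].str.rstrip("%").astype(float)
print(df)
```

The same helper can be applied to the list returned for the real URL, provided the required column names match what the page actually uses.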