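A Detailed Guide to the Common Uses of DataFrame.groupby()

In pandas, groupby() is one of the most frequently used methods. It performs split-apply-combine operations on a DataFrame: rows are split into groups by the values of one or more columns, and each group can then be aggregated, transformed, or filtered.

This article walks through the main uses of DataFrame.groupby(): basic usage, grouping by multiple keys, applying functions, aggregation, transformation, filtering, and group indexes.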

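Basic usage

Let's start with the most basic use. In the signature DataFrame.groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, squeeze=False, observed=False), the first parameter, by, determines how the rows are grouped. It can be a column name, a list of column names, or a Series, among other things. Note that this signature matches older pandas releases: squeeze was removed in pandas 2.0, and axis has been deprecated since pandas 2.1.

Here is an example: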

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                   'C': np.random.randn(8),
                   'D': np.random.randn(8)})
print(df)
# Sample output (the values are random, so they differ on every run)
"""
     A      B         C         D
0  foo    one -1.267263 -0.152540
1  bar    one  0.381033  0.262162
2  foo    two  0.770615 -0.653483
3  bar  three  0.365949 -0.366258
4  foo    two  0.240061  0.390593
5  bar    two  0.456706 -0.462931
6  foo    one -1.638039 -1.457516
7  foo  three  1.381375  1.383267
"""

grouped = df.groupby('A')
# Select the numeric columns explicitly: in pandas 2.0 and later, mean()
# no longer drops the string column B silently and raises a TypeError instead.
print(grouped[['C', 'D']].mean())
"""
            C         D
A                      
bar  0.401229 -0.189009
foo -0.102650 -0.097936
"""
```

As the output shows, the rows are grouped by column A and the mean of each group is then computed.
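Because the data above is random, the printed numbers change on every run. The following is a minimal sketch with fixed, made-up values (the frame name sales and its contents are assumptions for illustration only). It makes the split step visible by iterating over the groups, and shows the effect of the sort parameter from the signature above.

```python
import pandas as pd

# Hypothetical fixed data so the results are reproducible
sales = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo'],
    'C': [1.0, 2.0, 3.0, 4.0, 8.0],
})

# Iterating over a GroupBy object yields (group key, sub-DataFrame) pairs
group_sizes = {key: len(sub) for key, sub in sales.groupby('A')}
print(group_sizes)  # {'bar': 2, 'foo': 3}

# With the default sort=True, the group keys are sorted in the result
sorted_means = sales.groupby('A')['C'].mean()
print(list(sorted_means.index))  # ['bar', 'foo']

# sort=False keeps the groups in order of first appearance
unsorted_means = sales.groupby('A', sort=False)['C'].mean()
print(list(unsorted_means.index))  # ['foo', 'bar']
```

Turning off sorting can be useful when the original row order carries meaning, and it avoids the cost of sorting the keys.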

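Grouping by multiple keys

pandas supports grouping by several keys at once. This produces hierarchical groups, and the result is indexed by a MultiIndex with one level per key. The by parameter of groupby() can be a column name, a list of column names, or a function.

Here is an example: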

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                   'C': np.random.randn(8),
                   'D': np.random.randn(8)})
print(df)
"""
     A      B         C         D
0  foo    one -0.088104 -0.304604
1  bar    one -1.781027 -1.603775
2  foo    two -1.162422 -0.143913
3  bar  three -0.155331 -1.086283
4  foo    two -1.171434 -1.225775
5  bar    two -1.162092 -0.312215
6  foo    one  0.102396  0.371927
7  foo  three  1.238346 -0.123702
"""

grouped = df.groupby(['A', 'B'])
print(grouped.mean())
"""
                  C         D
A   B                        
bar one   -1.781027 -1.603775
    three -0.155331 -1.086283
    two   -1.162092 -0.312215
foo one    0.007146  0.033661
    three  1.238346 -0.123702
    two   -1.166928 -0.684844
"""
```

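Besides a list of columns, pandas also lets a function determine the grouping key. One caveat applies: a function passed directly as by is called on each index label, not on the values of a column. To group on values derived from column A, map the function over that column and pass the resulting Series as the key:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                   'C': np.random.randn(8),
                   'D': np.random.randn(8)})
print(df)
"""
     A      B         C         D
0  foo    one  1.120573 -1.074750
1  bar    one  0.480526  0.496865
2  foo    two -0.109486 -0.327503
3  bar  three  0.750564  0.774156
4  foo    two  0.553905  0.174869
5  bar    two -0.077693 -0.827651
6  foo    one -0.713436  0.552360
7  foo  three -1.439282 -0.570739
"""

def func(x):
    if x == 'foo':
        return 1
    else:
        return 2

# df.groupby(func) would call func on the index labels 0..7 and put every
# row in group 2, so apply func to the values of column A instead.
grouped = df.groupby(df['A'].map(func))
print(grouped[['C', 'D']].mean())
"""
          C         D
A                    
1 -0.117545 -0.249153
2  0.384466  0.147790
"""
```

In the code above, func returns 1 when an element of column A is foo and 2 otherwise. Mapping func over column A produces a Series of 1s and 2s, and passing that Series as by yields the same two groups as grouping by A directly, only labeled 1 and 2. The last step computes the mean of each group.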
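To make the behavior of a function key concrete, here is a small sketch with fixed, made-up data (the frame df_fixed and the helper is_even are hypothetical names introduced for illustration). It shows that a function passed directly as by receives index labels, and that moving column A into the index is one way to let the function see the A values.

```python
import pandas as pd

# Hypothetical fixed data so the results are reproducible
df_fixed = pd.DataFrame({
    'A': ['foo', 'foo', 'bar', 'bar'],
    'C': [1.0, 2.0, 3.0, 4.0],
})

def is_even(label):
    # Receives an index label (0, 1, 2, 3 here), not a value from column A
    return 'even' if label % 2 == 0 else 'odd'

by_label = df_fixed.groupby(is_even)['C'].sum()
print(by_label.loc['even'], by_label.loc['odd'])  # 4.0 6.0

def func(x):
    return 1 if x == 'foo' else 2

# With A as the index, the function is called on the A values instead
by_value = df_fixed.set_index('A').groupby(func)['C'].sum()
print(by_value.loc[1], by_value.loc[2])  # 3.0 7.0
```

The two results differ because the first groups rows 0 and 2 together (even labels), while the second groups the two foo rows together.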

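Applying functions

Python and pandas provide many ready-made functions, and groupby can be combined with them to carry out a wide range of operations. pandas also lets you apply your own functions to each group.

Calling apply() on a groupby object applies a function to each group: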

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                   'C': np.random.randn(8),
                   'D': np.random.randn(8)})
print(df)
"""
     A      B         C         D
0  foo    one -2.491518  0.798655
1  bar    one -0.399449  2.135082
2  foo    two  0.659347  1.178351
3  bar  three -1.453829 -0.483430
4  foo    two -0.538053 -0.450482
5  bar    two  0.405166  0.894851
6  foo    one -1.460457  1.441877
7  foo  three -1.035417 -0.436459
"""

grouped = df.groupby('A')
# Restrict apply() to the numeric columns; summing the string columns
# would concatenate their values, which is rarely what is wanted.
print(grouped[['C', 'D']].apply(lambda x: x.sum()))
"""
            C         D
A                      
bar -1.448112  2.546504
foo -4.866098  2.531942
"""
```

In the code above, apply() is given a lambda that sums each group separately. The result is a DataFrame indexed by the grouping key.
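apply() is most useful when the per-group logic is not a single built-in aggregation. The following sketch uses fixed, made-up data and a hypothetical helper named value_range to compute the spread (maximum minus minimum) of each numeric column within each group.

```python
import pandas as pd

# Hypothetical fixed data so the results are reproducible
df_fixed = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo'],
    'C': [1.0, 2.0, 3.0, 4.0, 8.0],
    'D': [10.0, 20.0, 30.0, 40.0, 50.0],
})

def value_range(group):
    # `group` is a DataFrame holding the selected columns of one group
    return group.max() - group.min()

ranges = df_fixed.groupby('A')[['C', 'D']].apply(value_range)
print(ranges.loc['foo', 'C'])  # 7.0
print(ranges.loc['bar', 'D'])  # 20.0
```

Selecting the columns before calling apply() keeps the grouping column out of the function's input, which behaves the same way across pandas versions.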

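Aggregation

In data processing we often need to aggregate the elements within each group, for example to compute a mean, a sum, or a standard deviation. All of these can be done with agg().

Calling agg() on a groupby object performs such aggregations. It accepts built-in aggregations such as mean and sum, and it also accepts custom functions.

Here is an example: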

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                   'C': np.random.randn(8),
                   'D': np.random.randn(8)})
print(df)
"""
     A      B         C         D
0  foo    one  0.591862 -0.244575
1  bar    one -0.787763  0.706172
2  foo    two  0.036479 -2.146537
3  bar  three -0.108477  1.165005
4  foo    two  0.483595 -0.058508
5  bar    two -0.903083 -0.239948
6  foo    one -1.160032  1.601426
7  foo  three  0.041289  1.386960
"""

grouped = df.groupby('A')
# The string name 'mean' is preferred over np.mean, which triggers a
# FutureWarning in recent pandas versions (2.1 and later).
print(grouped['C'].agg('mean'))
"""
A
bar   -0.599774
foo   -0.001361
Name: C, dtype: float64
"""
```

agg() also accepts a list, which computes several aggregations in one call:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                   'C': np.random.randn(8),
                   'D': np.random.randn(8)})
print(df)
"""
     A      B         C         D
0  foo    one -0.204122 -1.985485
1  bar    one  1.104584 -0.062507
2  foo    two -1.493017  2.088506
3  bar  three -1.270319 -1.729019
4  foo    two  0.365857  1.435928
5  bar    two -0.467735  0.502714
6  foo    one -1.093101 -0.341111
7  foo  three -1.144569 -2.056467
"""

grouped = df.groupby('A')
# String names replace np.sum, np.mean, and np.std, which trigger a
# FutureWarning in recent pandas versions. 'std' is the sample standard
# deviation (ddof=1).
print(grouped['C'].agg(['sum', 'mean', 'std']))
"""
          sum      mean       std
A                                
bar -0.633470 -0.211157  1.208063
foo -3.568952 -0.713790  0.768447
"""
```

In the code above, agg() aggregates column C within each group, giving the sum, mean, and standard deviation of C per group.
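The text above mentions that agg() also works with custom functions. The sketch below uses fixed, made-up data and a hypothetical helper named spread. It shows three further call styles: a dict that assigns a different aggregation to each column, a custom function, and named aggregation, where each keyword sets the name of an output column.

```python
import pandas as pd

# Hypothetical fixed data so the results are reproducible
df_fixed = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo'],
    'C': [1.0, 2.0, 3.0, 4.0, 8.0],
    'D': [10.0, 20.0, 30.0, 40.0, 50.0],
})
grouped = df_fixed.groupby('A')

# A dict maps each column to its own aggregation
per_column = grouped.agg({'C': 'sum', 'D': 'max'})
print(per_column.loc['foo', 'C'], per_column.loc['foo', 'D'])  # 12.0 50.0

# A custom function receives the values of one group as a Series
def spread(values):
    return values.max() - values.min()

c_spread = grouped['C'].agg(spread)
print(c_spread.loc['foo'], c_spread.loc['bar'])  # 7.0 2.0

# Named aggregation: keyword=(column, aggregation) names the output columns
summary = grouped.agg(c_total=('C', 'sum'), d_mean=('D', 'mean'))
print(list(summary.columns))  # ['c_total', 'd_mean']
```

Named aggregation is convenient when the result feeds into later steps, because the output column names are chosen explicitly rather than derived from the function names.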

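Transformation

Unlike agg(), transform() does not reduce each group to a single value. It changes the values but returns a result with the same shape and index as the original data. For example, transform() can be used to standardize the data within each group.

Here is an example: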

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                   'C': np.random.randn(8),
                   'D': np.random.randn(8)})
print(df)
"""
     A      B         C         D
0  foo    one -1.360340 -1.002106
1  bar    one -0.617154 -0.627038
2  foo    two -0.905853 -0.710362
3  bar  three -0.617211 -0.174208
4  foo    two  1.170265 -0.066579
5  bar    two  1.540249 -1.309710
6  foo    one  0.315212 -0.054722
7  foo  three -0.898049 -1.185755
"""

grouped = df.groupby('A')
# Standardize within each group: subtract the group mean, divide by the group std
score = lambda x: (x - x.mean()) / x.std()
# Only the numeric columns can be standardized, so select C and D explicitly
print(grouped[['C', 'D']].transform(score))
"""
          C         D
0 -0.978819 -0.759766
1 -0.577327  0.134031
2 -0.544634 -0.203120
3 -0.577373  0.926225
4  1.438744  1.025214
5  1.154701 -1.060256
6  0.621886  1.047838
7 -0.537178 -1.110167
"""
```

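In the code above, we first define a standardization function named score and then pass it to transform(), which applies it to each group and returns the standardized values in the original row order.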
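The key property of transform() is that its result lines up with the original rows. The following sketch, using fixed, made-up data (df_fixed is a hypothetical name), broadcasts each group's mean back to its rows and then checks that standardized values really have mean 0 and standard deviation 1 within each group.

```python
import pandas as pd

# Hypothetical fixed data so the results are reproducible
df_fixed = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo'],
    'C': [1.0, 2.0, 3.0, 4.0, 8.0],
})
grouped_c = df_fixed.groupby('A')['C']

# transform('mean') broadcasts each group's mean back to its rows
group_mean = grouped_c.transform('mean')
print(group_mean.tolist())  # [4.0, 3.0, 4.0, 3.0, 4.0]

# The result shares the original index, so it combines directly with the data
demeaned = df_fixed['C'] - group_mean
print(demeaned.tolist())  # [-3.0, -1.0, -1.0, 1.0, 4.0]

# After standardizing, each group has mean 0 and standard deviation 1
z = grouped_c.transform(lambda x: (x - x.mean()) / x.std())
check = z.groupby(df_fixed['A']).agg(['mean', 'std'])
print(check.round(6))
```

Because the output keeps the original index, it can be assigned straight back as a new column, for example df_fixed['C_z'] = z.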

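Filtering

Filtering selects whole groups from a groupby object according to a condition: groups that satisfy the condition are kept, and the others are dropped.

Here is an example: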

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                   'C': np.random.randn(8),
                   'D': np.random.randn(8)})
print(df)
"""
     A      B         C         D
0  foo    one  0.431915 -1.233152
1  bar    one  0.703723 -0.601552
2  foo    two  0.782649  0.268690
3  bar  three -0.172513  0.105664
4  foo    two -0.421928  0.035700
5  bar    two -1.315922 -1.590813
6  foo    one -0.748517  1.518413
7  foo  three -0.987955 -0.594296
"""

grouped = df.groupby('A')
# Keep only the groups with more than 3 rows (foo has 5, bar has 3)
print(grouped.filter(lambda x: len(x) > 3))
"""
     A      B         C         D
0  foo    one  0.431915 -1.233152
2  foo    two  0.782649  0.268690
4  foo    two -0.421928  0.035700
6  foo    one -0.748517  1.518413
7  foo  three -0.987955 -0.594296
"""
```

In the code above, filter() takes a function that receives each group and returns True or False. Groups for which it returns False are dropped. Here only groups with more than 3 rows are kept: foo has 5 rows and stays, while bar has exactly 3 and is removed. A condition of len(x) >= 3 would keep both groups, since bar also has 3 rows.
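The filter condition does not have to be based on group size. As a sketch with fixed, made-up data (df_fixed and the 3.5 threshold are assumptions for illustration), the condition below keeps the groups whose mean of column C exceeds a threshold.

```python
import pandas as pd

# Hypothetical fixed data so the results are reproducible
df_fixed = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo'],
    'C': [1.0, 2.0, 3.0, 4.0, 8.0],
})

# Group means: foo -> 4.0, bar -> 3.0, so only the foo rows survive
kept = df_fixed.groupby('A').filter(lambda g: g['C'].mean() > 3.5)
print(kept.index.tolist())          # [0, 2, 4]
print(kept['A'].unique().tolist())  # ['foo']
```

Note that filter() returns the original rows of the surviving groups, with their original index labels, rather than one row per group.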

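Group indexes

groupby() itself returns a GroupBy object rather than a DataFrame. Once an aggregation is applied to it, the result is an ordinary DataFrame whose index, by default, is made up of the group keys.

Any regular DataFrame operation can be used on this result. For example, reset_index() moves the group keys out of the index and back into ordinary columns.

Here is an example: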

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                   'C': np.random.randn(8),
                   'D': np.random.randn(8)})
print(df)
"""
     A      B         C         D
0  foo    one -0.352356 -1.732711
1  bar    one  0.212173  1.459614
2  foo    two  0.582794 -1.076563
3  bar  three -0.075668  0.990016
4  foo    two  0.148267 -0.604720
5  bar    two -1.009211 -0.966056
6  foo    one  1.241528 -0.309842
7  foo  three -0.977597  0.535146
"""

grouped = df.groupby('A')
# Sum only the numeric columns; in pandas 2.0 and later, sum() would
# otherwise also concatenate the strings in column B.
print(grouped[['C', 'D']].sum().reset_index())
"""
     A         C         D
0  bar -0.872706  1.483575
1  foo  0.642635 -3.188690
"""
```

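In the code above, reset_index() turns the group key A from the index back into an ordinary column, and the result gets a default integer index.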
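The as_index parameter listed in the signature at the start of this article offers another route to the same layout. The sketch below uses fixed, made-up data (df_fixed is a hypothetical name) and compares the two approaches.

```python
import pandas as pd

# Hypothetical fixed data so the results are reproducible
df_fixed = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo'],
    'C': [1.0, 2.0, 3.0, 4.0, 8.0],
    'D': [10.0, 20.0, 30.0, 40.0, 50.0],
})

# Aggregate first, then move the group key out of the index
via_reset = df_fixed.groupby('A').sum().reset_index()

# Ask groupby to keep the group key as a regular column from the start
via_flag = df_fixed.groupby('A', as_index=False).sum()

print(list(via_flag.columns))  # ['A', 'C', 'D']
print(via_flag['A'].tolist())  # ['bar', 'foo']
print(via_flag['C'].tolist())  # [6.0, 12.0]
```

Both variants give a frame with A as a regular column. Which one reads better is largely a matter of taste, although reset_index() also works on results that were produced elsewhere.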

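Unless otherwise stated, articles on this site are original. If you repost this article, please credit the source: A Detailed Guide to the Common Uses of DataFrame.groupby() - Python技术站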
