Stratified Sampling with Pandas

March 27, 2023, 2:31 PM • python-answer

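Stratified sampling in Pandas means splitting the data into groups (strata) according to a predefined rule, such as the values of one level of a hierarchical index, and then drawing a random sample from within each group. Pandas offers several ways to do this, which are covered one by one below.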

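1. The same sampling fraction for every stratum

Method: call DataFrame.sample() on each group through groupby().apply()
Parameter: frac (the fraction of rows to draw from each group)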

import pandas as pd

# Create a DataFrame with a hierarchical index (MultiIndex)
data = pd.DataFrame({'A': [1, 2, 3, 4, 5],
                  'B': ['a', 'b', 'c', 'd', 'e'],
                  'C': ['foo', 'foo', 'bar', 'bar', 'bar'],
                  'D': [1.0, 2.0, 3.0, 4.0, 5.0]})
data = data.set_index(['B', 'C'])

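# Run the stratified sampling.
# Group on level 'C': every value of level 'B' occurs only once here,
# and a 50% sample of a one-row group rounds to zero rows.
sampled = data.groupby(level='C', group_keys=False).apply(lambda x: x.sample(frac=0.5))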
print(sampled)
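
For the common case of one fraction shared by all strata, the apply() call is not strictly needed. Below is a minimal sketch using DataFrameGroupBy.sample() (available since pandas 1.1). The DataFrame `df` and its values are illustrative assumptions, not data from this article, and `random_state` is added only to make the draw repeatable.

```python
import pandas as pd

# Illustrative data (an assumption): 4 'foo' rows and 6 'bar' rows
df = pd.DataFrame({
    'A': range(10),
    'B': list('abcdefghij'),
    'C': ['foo'] * 4 + ['bar'] * 6,
}).set_index(['B', 'C'])

# Draw 50% of the rows inside each stratum of level 'C';
# random_state fixes the seed so the result is repeatable
sampled = df.groupby(level='C').sample(frac=0.5, random_state=0)

# Rows kept per stratum: 2 of the 4 'foo' rows and 3 of the 6 'bar' rows
print(sampled.index.get_level_values('C').value_counts())
```

Which rows are picked depends on the seed, but the number of rows per stratum is fixed by the fraction and the group size.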

2. A different sampling fraction for each stratum (fractions dict)

Method: use pd.core.groupby.GroupBy.apply()
Parameter: fractions (a dict mapping each stratum to its sampling fraction)

import pandas as pd

# Create a DataFrame with a hierarchical index (MultiIndex)
data = pd.DataFrame({'A': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                  'B': ['a', 'a', 'b', 'b', 'c', 'c', 'c', 'd', 'd', 'd'],
                  'C': ['foo', 'foo', 'bar', 'bar', 'foo', 'foo', 'foo', 'foo', 'foo', 'foo'],
                  'D': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]})
data = data.set_index(['B', 'C'])

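# Sampling rule: look up this group's fraction by its key (group.name)
def stratified_sample(group, fractions):
    sample = group.sample(frac=fractions[group.name])
    return sample

# Sampling fraction for each stratum.
# With groups this small, a low fraction can round to zero rows
# (for example 0.1 of the three 'd' rows).
fractions = {'a': 0.5, 'b': 0.3, 'c': 0.2, 'd': 0.1}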

# Run the stratified sampling
sampled = data.groupby(level='B', group_keys=False).apply(stratified_sample, fractions=fractions)
print(sampled)
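
With strata this small, a fraction such as 0.1 of three rows rounds down to zero, so a stratum can vanish from the sample entirely. One possible way around that is sketched below. `sample_by_fraction` and its `min_rows` parameter are hypothetical names introduced here for illustration, not part of pandas, and the floor of one row per stratum is an assumption about what is wanted.

```python
import pandas as pd

def sample_by_fraction(df, level, fractions, min_rows=1, random_state=None):
    """Hypothetical helper: sample each stratum at its own fraction,
    keeping at least min_rows rows (but never more than the stratum has)."""
    parts = []
    for key, group in df.groupby(level=level):
        n = max(min_rows, round(len(group) * fractions[key]))
        n = min(n, len(group))
        parts.append(group.sample(n=n, random_state=random_state))
    return pd.concat(parts)

data = pd.DataFrame({'A': range(1, 11),
                     'B': ['a', 'a', 'b', 'b', 'c', 'c', 'c', 'd', 'd', 'd'],
                     'C': ['foo', 'foo', 'bar', 'bar'] + ['foo'] * 6})
data = data.set_index(['B', 'C'])

fractions = {'a': 0.5, 'b': 0.3, 'c': 0.2, 'd': 0.1}
sampled = sample_by_fraction(data, 'B', fractions, random_state=42)

# Every stratum is now represented by at least one row
print(sampled.index.get_level_values('B').value_counts().sort_index())
```

The floor slightly over-represents very small strata, so it is a trade-off rather than a free fix.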

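3. A fixed number of samples for each stratum

Method: use groupby().apply() with numpy.linspace() and DataFrame.iloc
Parameter: n (a dict mapping each stratum to the number of rows to take)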

import pandas as pd
import numpy as np

# Create a DataFrame with a hierarchical index (MultiIndex)
data = pd.DataFrame({'A': np.arange(50),
                  'B': ['a', 'a', 'b', 'b', 'c', 'c', 'c', 'd', 'd', 'd']*5,
                  'C': ['foo']*25+['bar']*25})
data = data.set_index(['B', 'C'])

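# Sampling rule: take evenly spaced rows from the group.
# Note that np.linspace gives a deterministic, systematic selection, not a random draw.
def stratified_sample(group, n):
    n_total = len(group)
    # n is a dict, so look up this stratum's count by its key (group.name)
    idx = np.linspace(0, n_total-1, n[group.name], dtype=int)
    return group.iloc[idx]

# Number of rows to take from each stratum
n_dict = {'a': 2, 'b': 3, 'c': 4, 'd': 1}

# Run the stratified sampling
sampled = data.groupby(level='B', group_keys=False).apply(stratified_sample, n=n_dict)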
print(sampled)
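
The evenly spaced selection above always returns the same rows. If a random draw with a fixed row count is wanted instead, a sketch along the following lines may work. It relies on DataFrameGroupBy.sample() and DataFrame.sample(), both of which accept `n`. The names `equal` and `unequal` and the seed value are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'A': np.arange(50),
                     'B': ['a', 'a', 'b', 'b', 'c', 'c', 'c', 'd', 'd', 'd'] * 5,
                     'C': ['foo'] * 25 + ['bar'] * 25})
data = data.set_index(['B', 'C'])

# Same number of random rows from every stratum
equal = data.groupby(level='B').sample(n=2, random_state=0)

# A different number of random rows per stratum
n_dict = {'a': 2, 'b': 3, 'c': 4, 'd': 1}
unequal = pd.concat([
    group.sample(n=n_dict[key], random_state=0)
    for key, group in data.groupby(level='B')
])

print(equal.index.get_level_values('B').value_counts().sort_index())
print(unequal.index.get_level_values('B').value_counts().sort_index())
```

sample(n=...) raises an error when a stratum has fewer than n rows (unless replace=True is passed), so the requested counts should not exceed the group sizes.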

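Those are a few common ways to do stratified sampling with Pandas. The first two draw rows at random within each stratum, while the third picks evenly spaced rows; all three make it easy to sample from data organized into strata.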
