Question

I have a dataframe that looks like this:

import pandas as pd
import numpy as np

d = {'category': [1, 1, 2, 1, 3, 2], 'cost': [33, 33, 18, np.nan, 8, np.nan]}
df = pd.DataFrame(data=d)

   category  cost
0         1  33.0
1         1  33.0
2         2  18.0
3         1   NaN
4         3   8.0
5         2   NaN

I want to replace the NaN values in the "cost" column with the median of the corresponding category group. In this example, the first NaN (row 3) would be replaced with 33 and the second (row 5) with 18.

I am looking for something like this:

df[['cost', 'category']].groupby(['category']).median()

but applied only to the NaN values.

Answer 1

Here is one way.

df = df.replace(np.nan, df.groupby("category").transform("median"))

You can pass a Series as the second argument to replace. Using groupby + transform ensures that each group median is aligned with the rows of its group.

   category  cost
0         1  33.0
1         1  33.0
2         2  18.0
3         1  33.0
4         3   8.0
5         2  18.0
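
For comparison, the same group-median fill can be sketched with fillna in place of replace. This is an alternative idiom rather than the code from this answer, and it assumes the sample frame from the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "category": [1, 1, 2, 1, 3, 2],
    "cost": [33, 33, 18, np.nan, 8, np.nan],
})

# transform("median") returns one value per row, aligned to df's index
group_median = df.groupby("category")["cost"].transform("median")

# fillna only touches the NaN rows; existing costs are left unchanged
df["cost"] = df["cost"].fillna(group_median)
print(df)
```

Because the medians are already aligned row by row, no index manipulation is needed and the original row order is kept.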

Answer 2

Setup

df.set_index('category', inplace=True)

`Series.update`

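# Caution: update overwrites every row with its group median, not just the NaN rows.
# The result matches here only because each non-NaN cost already equals its group median.
df.cost.update(df.groupby('category').cost.median())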
df

          cost
category      
1         33.0
1         33.0
2         18.0
1         33.0
3          8.0
2         18.0

`Series.combine_first`

df['cost'] = (
   df.cost.combine_first(df.groupby('category').cost.median()))
df

          cost
category      
1         33.0
1         33.0
2         18.0
1         33.0
3          8.0
2         18.0

Because actions speak louder than words, here are some timings:

a = np.random.randint(1, 1000, 100000)
b = np.random.choice((1, 2, 3, np.nan), 100000)
df = pd.DataFrame({'category': a, 'cost': b})

%%timeit 
(df.groupby('category')
   .apply(lambda x: x.cost.fillna(x.cost.median()))
   .reset_index(level=0))

%%timeit
df2 = df.set_index('category')
df2.cost.update(df.groupby('category').cost.median())
df2.reset_index()

%%timeit
df2 = df.set_index('category')
df2['cost'] = (
   df2.cost.combine_first(df.groupby('category').cost.median()))
df2.reset_index()

664 ms ± 24.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
10 ms ± 107 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
17.1 ms ± 305 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
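
The approaches above rely on moving category into the index. A minimal sketch that keeps the original index instead maps each row's category to its group median and fills only the NaN rows. The cost values below are made up (they are not the question's data) so that a non-NaN value differs from its group median and the "NaN only" behavior is visible.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: group 1 has costs 10 and 30, so its median is 20
df = pd.DataFrame({
    "category": [1, 1, 2, 1, 3, 2],
    "cost": [10, 30, 18, np.nan, 8, np.nan],
})

# One median per category, as a Series indexed by category
medians = df.groupby("category")["cost"].median()

# map() looks up each row's category in medians; fillna uses it only where cost is NaN
df["cost"] = df["cost"].fillna(df["category"].map(medians))
print(df)
```

Rows 0 and 1 keep their original costs of 10 and 30, while row 3 receives the group median of 20.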

Answer 3

Here is one possible approach:

In [82]: df
Out[82]:
   category  cost
0         1  33.0
1         1  33.0
2         2  18.0
3         1   NaN
4         3   8.0
5         2   NaN

In [83]: df.groupby('category').apply(lambda x: x.cost.fillna(x.cost.median())).reset_index(level=0)
Out[83]:
   category  cost
0         1  33.0
1         1  33.0
3         1  33.0
2         2  18.0
5         2  18.0
4         3   8.0
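
Note that the result above is ordered by group rather than by the original row order. As a sketch of a variant (an assumption on my part, not part of this answer), the same lambda can be passed to transform, which returns values aligned to the original index.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "category": [1, 1, 2, 1, 3, 2],
    "cost": [33, 33, 18, np.nan, 8, np.nan],
})

# transform applies the function per group but returns a Series
# with the same index (and order) as the original frame
filled = df.groupby("category")["cost"].transform(lambda s: s.fillna(s.median()))

# assign returns a new frame; df itself is left untouched
result = df.assign(cost=filled)
print(result)
```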
