Question

How can I select only the columns with a non-zero null count from a DataFrame, in descending order?

Here is the DataFrame:

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, np.nan, np.nan],
                   'b': [10, 20, 30, 40],
                   'c': [1, np.nan, np.nan, np.nan]})
     a   b    c
0  1.0  10  1.0
1  2.0  20  NaN
2  NaN  30  NaN
3  NaN  40  NaN

I can do this:

df.isnull().sum().sort_values(ascending=False)
c    3
a    2
b    0

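However, I want to chain the commands into a single expression that produces the result in one line.

I tried: df.isnull().sum().sort_values(ascending=False).filter(lambda x: x>0), but it fails.

I know this works: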

temp = df.isnull().sum().sort_values(ascending=False)
temp[temp>0]
c    3
a    2

But I am looking for a way to chain everything in a single line.

Required:

df.isnull().sum().sort_values(ascending=False).somefunction( x > 0)

Update
I found a way: convert the Series to a DataFrame and then use query.

df.isnull().sum().sort_values(ascending=False).to_frame().rename(columns={0:'temp'}).query("temp > 0")

This looks long and redundant. Is there a better way?

Answer 1

You are confusing what filter does: it operates on the index labels, not on the values. Use .loc with a callable instead:

df.isnull().sum().loc[lambda x : x>0].sort_values(ascending=False)
Out[147]: 
c    3
a    2
dtype: int64
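
To illustrate the difference, here is a minimal sketch using the question's DataFrame. Series.filter selects by index label (through items, like, or regex), while .loc with a callable selects by value. The names counts, by_label, and by_value are only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, np.nan, np.nan],
                   'b': [10, 20, 30, 40],
                   'c': [1, np.nan, np.nan, np.nan]})

# Null count per column: a -> 2, b -> 0, c -> 3
counts = df.isnull().sum()

# filter looks only at the index labels, never at the values
by_label = counts.filter(items=['a', 'b'])

# .loc with a callable receives the Series and returns a boolean mask on its values
by_value = counts.loc[lambda s: s > 0].sort_values(ascending=False)

print(by_label)
print(by_value)
```

Because the callable is passed the intermediate Series, this form fits into a method chain without a temporary variable.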

Answer 2

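There are certainly many ways to do this, but in general I would not recommend lambda or filter, or anything else where you pass a Python function, because this can make things slow if your Series is large. In your case, you could instead:

1. Replace 0 with NaN and drop the NaNs.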

df.isnull().sum().replace(0, np.nan).dropna().sort_values(ascending=False).astype(int)

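The downside is that the type is converted twice (NaN is always a float, not an int, so the counts become floats and have to be cast back).

2. Use the query function.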

df.isnull().sum().sort_values(ascending=False).to_frame('value').query('value!=0')['value'].rename(None)

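The downside of this method is that query only exists on DataFrames, so you need to convert the Series to one first. However, for a large Series this should be cheaper than the type conversion, because the underlying array stays unchanged.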
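
The dtype point can be checked with a small sketch on the question's data. It only shows that both options give the same counts and that the first one passes through float64; it says nothing about speed. The names as_float, option1, and option2 are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, np.nan, np.nan],
                   'b': [10, 20, 30, 40],
                   'c': [1, np.nan, np.nan, np.nan]})

counts = df.isnull().sum().sort_values(ascending=False)

# Option 1: replacing 0 with NaN upcasts the integer counts to float64,
# so astype(int) is needed to get integers back
as_float = counts.replace(0, np.nan).dropna()
option1 = as_float.astype(int)

# Option 2: round trip through a one-column DataFrame; the counts stay integers
option2 = counts.to_frame('value').query('value != 0')['value'].rename(None)

print(as_float.dtype)  # float64
print(option1)
print(option2)
```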

Answer 3

Use isna with any (axis=0, the default) to build a column mask for .loc:

df.loc[:, df.isna().any()].isna().sum().sort_values(ascending=False)

Out[1845]:
c    3
a    2
dtype: int64
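
Broken into steps, the one-liner above works roughly as follows. This is only a sketch to show the intermediate mask; the names has_null, subset, and result are not part of the answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, np.nan, np.nan],
                   'b': [10, 20, 30, 40],
                   'c': [1, np.nan, np.nan, np.nan]})

# One boolean per column: True if the column contains at least one NaN
has_null = df.isna().any()

# Keep only the columns flagged by the mask (here 'a' and 'c')
subset = df.loc[:, has_null]

# Count the NaNs in the remaining columns and sort in descending order
result = subset.isna().sum().sort_values(ascending=False)
print(result)
```

Note that isna is evaluated twice on this path, once for the mask and once for the counts.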

Answer 4

A more efficient way to do this is with numpy:

s = df.isnull().sum()      # null count per column
mask = (s.values > 0)      # boolean numpy array: True where the count is non-zero
pd.Series(s.values[mask], s.index[mask]).sort_values(ascending=False)

Clear comparison of time complexity among all methods
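
A rough way to run such a comparison yourself is sketched below with timeit. The test DataFrame (1,000 rows by 200 columns, with every other column free of NaN) and the function names are assumptions made for illustration, not the setup used for the comparison above, and the timings depend on the machine and the pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical test data: about 10% of the cells missing in the odd-numbered columns
rng = np.random.default_rng(0)
values = rng.random((1000, 200))
values[values < 0.1] = np.nan
values[:, ::2] = 1.0  # every other column has no NaN at all
wide = pd.DataFrame(values)

def with_loc(df):
    # Answer 1: boolean selection through .loc with a callable
    return df.isnull().sum().loc[lambda s: s > 0].sort_values(ascending=False)

def with_column_mask(df):
    # Answer 3: select the columns first, then count
    return df.loc[:, df.isna().any()].isna().sum().sort_values(ascending=False)

def with_numpy(df):
    # Answer 4: build the mask on the underlying numpy array
    s = df.isnull().sum()
    mask = s.values > 0
    return pd.Series(s.values[mask], s.index[mask]).sort_values(ascending=False)

for func in (with_loc, with_column_mask, with_numpy):
    seconds = timeit.timeit(lambda: func(wide), number=20)
    print(f"{func.__name__}: {seconds / 20 * 1e3:.3f} ms per call")
```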

Keep only non-zero missing-value counts in a pandas DataFrame
