Question

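I have a large pandas DataFrame (df) with several hundred columns. I want to drop every column that has more than 50% NaN values, but only among the columns whose header contains the word "test". Is there a simple way to do this? Thanks for your help!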

Answer 1

IIUC, you can do it like this:

In [122]:
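import numpy as np
import pandas as pd

# np.nan replaces np.NaN, which was removed in NumPy 2.0
df = pd.DataFrame({'asd': 0, 'test': np.nan, 'test 1': [0, 1, np.nan, 3, 4]})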
df

Out[122]:
   asd  test  test 1
0    0   NaN     0.0
1    0   NaN     1.0
2    0   NaN     NaN
3    0   NaN     3.0
4    0   NaN     4.0

In [138]:
cols = df.columns[df.columns.str.contains('test')]
to_remove = cols[df[cols].isnull().sum() > len(df)/2]
to_remove

Out[138]:
Index(['test'], dtype='object')

In [140]:
df.drop(to_remove, axis=1)

Out[140]:
   asd  test 1
0    0     0.0
1    0     1.0
2    0     NaN
3    0     3.0
4    0     4.0

First, we get the list of columns whose names contain 'test', using str.contains:

In [142]:
df.columns[df.columns.str.contains('test')]

Out[142]:
Index(['test', 'test 1'], dtype='object')
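
One caveat worth illustrating: str.contains does substring matching (and treats the pattern as a regular expression by default), so it is case-sensitive and also matches names such as 'contest'. The sketch below uses made-up column names, which are an assumption for illustration and not part of the original data, to compare three ways of matching:

```python
import pandas as pd

# Hypothetical column names, chosen only to show how the matching behaves.
columns = pd.Index(['test', 'Test 2', 'contest', 'asd'])

# Plain substring match: case-sensitive, also hits 'contest'.
substring = columns[columns.str.contains('test')]

# Case-insensitive substring match.
ignore_case = columns[columns.str.contains('test', case=False)]

# Whole-word match using regex word boundaries, ignoring case.
whole_word = columns[columns.str.contains(r'\btest\b', case=False)]

print(list(substring))
print(list(ignore_case))
print(list(whole_word))
```

Which variant is appropriate depends on how the real column headers are written; the plain substring match used in this answer is enough for the example frame.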

Then we test this subset of columns for NaN values using isnull:

In [143]:
df[cols].isnull()

Out[143]:
   test  test 1
0  True   False
1  True   False
2  True    True
3  True   False
4  True   False

If we call sum on this, the booleans are converted to the integers 1 and 0, which gives the number of NaN values per column:

In [144]:
df[cols].isnull().sum()

Out[144]:
test      5
test 1    1
dtype: int64

Then we can create a boolean mask by comparing these counts with half the length of df:

In [145]:
df[cols].isnull().sum() > len(df)/2

Out[145]:
test       True
test 1    False
dtype: bool

Finally, we use this mask to filter cols down to the columns to drop:

In [146]:
cols[df[cols].isnull().sum() > len(df)/2]

Out[146]:
Index(['test'], dtype='object')
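
The steps above could be wrapped into a small helper. This is only a sketch: the function name drop_sparse_columns and its parameters pattern and threshold are hypothetical, not part of pandas. It also uses isna().mean() to get the NaN fraction directly, which should be equivalent to comparing the NaN count with len(df)/2 when the threshold is 0.5.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(df, pattern='test', threshold=0.5):
    """Return a new frame without the columns whose name contains
    `pattern` and whose fraction of NaN values exceeds `threshold`.

    Hypothetical helper for illustration; not a pandas function.
    """
    # Columns whose name contains the pattern (substring/regex match).
    cols = df.columns[df.columns.str.contains(pattern)]
    # Fraction of NaN values in each of those columns.
    nan_fraction = df[cols].isna().mean()
    # Names of the columns that are too sparse.
    to_remove = nan_fraction[nan_fraction > threshold].index
    return df.drop(columns=to_remove)


df = pd.DataFrame({'asd': 0, 'test': np.nan, 'test 1': [0, 1, np.nan, 3, 4]})
result = drop_sparse_columns(df)
print(list(result.columns))
```

Note that drop returns a new DataFrame and leaves df unchanged, so the result has to be assigned if it should be kept.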
