Question

I have the following pandas DataFrame:

import numpy as np
import pandas as pd

new = pd.Series(np.array([0, 1, 0, 0, 2, 2]))
df = pd.DataFrame(new, columns=['a'])

I print the number of occurrences of each value:

print(df['a'].value_counts())

and get the following:

0    3
2    2
1    1
dtype: int64

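Now I want to remove the rows whose value in column 'a' occurs fewer than 2 times. I could iterate over every value in df['a'] and drop it if its count is below 2, but that takes a long time on a large DataFrame with many columns. I can't figure out an efficient way to do this.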

Answer 1

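You can subset the value_counts result by a condition and take the index of the resulting Series. Then use isin to test which of the original values appear in that index, and pass the resulting boolean mask back to the original DataFrame: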

s = df['a'].value_counts()
df[df.isin(s.index[s >= 2]).values]

How it works:

In [133]: s.index[s >= 2]
Out[133]: Int64Index([0, 2], dtype='int64')


In [134]: df.isin(s.index[s >= 2]).values
Out[134]:
array([[ True],
       [False],
       [ True],
       [ True],
       [ True],
       [ True]], dtype=bool)


In [135]: df[df.isin(s.index[s >= 2]).values]
Out[135]:
   a
0  0
2  0
3  0
4  2
5  2
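
For a frame with more columns than the one in the question, the same idea can be applied by building the mask from column 'a' alone, so that a 1-D boolean Series selects whole rows. A minimal sketch, assuming a hypothetical second column 'b' that is not part of the original post:

```python
import numpy as np
import pandas as pd

# 'b' is a hypothetical extra column, added only to show a multi-column frame
df = pd.DataFrame({'a': np.array([0, 1, 0, 0, 2, 2]),
                   'b': list('uvwxyz')})

s = df['a'].value_counts()          # occurrences of each value in 'a'
keep = s[s >= 2].index              # values that occur at least twice
filtered = df[df['a'].isin(keep)]   # 1-D boolean mask selects whole rows

print(filtered)
```

Because the mask is computed from a single column, the other columns do not take part in the isin test and are simply carried along.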

Answer 2

One approach is to join the count data with the original df.

df2 = pd.DataFrame(df['a'].value_counts())
df2.reset_index(inplace=True)
df2.columns = ['a','counts']

# df2 = 
#   a   counts
# 0 0   3
# 1 2   2
# 2 1   1

df3 = df.merge(df2,on='a')

# df3 = 
#   a   counts
# 0 0   3
# 1 0   3
# 2 0   3
# 3 1   1
# 4 2   2
# 5 2   2

# filter
df3[df3.counts>=2]
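
If the original row index should survive the filter, a merge-free variant may be worth considering. The sketch below swaps the merge for Series.map, which looks up each row's value in the value_counts result. It is an alternative under that assumption, not the answerer's method, and the names counts and result are illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.array([0, 1, 0, 0, 2, 2])})

# Look up, for every row, how often its value occurs in column 'a'
counts = df['a'].map(df['a'].value_counts())

# Keep rows whose value occurs at least twice; the original index is preserved
result = df[counts >= 2]
print(result)
```

No helper frame or extra counts column is created here, so the result has the same columns as df.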

Remove low counts from pandas DataFrame column on condition