Question

I have a dataframe that looks like this:

index, othercols, FPN
ts1, otherStuff, val1
ts2, otherStuff, val2
ts3, otherStuff, val3
ts4, otherStuff, val4
....
tsn, otherStuff, valn

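Because of the external data source, many of these values are repeated: in a million-row dataframe there are multiple runs, up to 10,000 entries long, that just repeat the same data for consecutive timestamps. For my purposes, at least, this repetition is unnecessary, so I would like to remove all the repeated rows except the start and end of each run, like this: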

1, 0
2, 0
3, 0
4, 5
5, 0
6, 0

becomes

1, 0
3, 0
4, 5
5, 0
6, 0

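I have managed to do this, but it is slower than I would like (2 minutes for a 60 MB file, most of it spent in the apply step shown below), and I think there must be a better way.

Here is my cobbled-together solution. Is there a neater or faster way?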

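import copy

data=df['FPN']

# Two list copies of the column: one shifted up a row, one shifted down a row.
shft_up=(copy.deepcopy(data)).tolist()
shft_dn=(copy.deepcopy(data)).tolist()

del shft_up[0]
shft_up=shft_up+[None]

del shft_dn[-1]
shft_dn=[None]+shft_dn

df['shft_up']=shft_up
df['shft_dn']=shft_dn

# A row is a repeat when both of its neighbours hold the same FPN value.
def is_rep(row):
    if row['shft_dn']==row['FPN'] and row['shft_up']==row['FPN']:
        return 1
    else:
        return 0

# Row-wise apply: this is the slow step. (The reduce argument no longer exists in current pandas.)
df['mask_col']=df.apply(lambda row:is_rep(row),axis=1)

df=(df[df['mask_col']==0]).drop(['shft_up','shft_dn','mask_col'],axis=1)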

Answer 1

I think I have the logic right: I add a new column 'run', a boolean that indicates whether the value is the same as the previous row's value:

In [438]:

df['run'] = (df['val'] == df['val'].shift())
df
Out[438]:
   id  val    run
0   1    0  False
1   2    0   True
2   3    0   True
3   4    5  False
4   5    0  False
5   6    0   True

Then I filter out the rows where 'run' is True and the next row's 'run' is also True:

In [442]:

df[~((df['run']==True) & (df['run'].shift(-1) == True))]
Out[442]:
   id  val    run
0   1    0  False
2   3    0   True
3   4    5  False
4   5    0  False
5   6    0   True
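
The two steps above can also be sketched without storing a helper column. This is only a minimal illustration on the six-row example from the question; the names same_as_prev, same_as_next and trimmed are made up for this sketch and do not appear in the original answer.

```python
import pandas as pd

# The six-row example from the question.
df = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6], "val": [0, 0, 0, 5, 0, 0]})

# True where a row repeats the value of the row directly above it.
same_as_prev = df["val"].eq(df["val"].shift())
# True where the row directly below repeats this row's value.
same_as_next = df["val"].eq(df["val"].shift(-1))

# A row is redundant only when it sits strictly inside a run,
# i.e. it matches both its previous and its next neighbour.
trimmed = df[~(same_as_prev & same_as_next)]
print(trimmed)
```

Because both masks are computed on whole columns, no Python-level function is called per row, which is where the row-wise apply in the question spends its time.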

Edit

The following one-liner also works, as confirmed by the OP:

In [447]:

df = df[(df['val'].shift()!=df['val'].shift(-1)) | (df['val']!=df['val'].shift(-1))]
df
Out[447]:
   id  val
0   1    0
2   3    0
3   4    5
4   5    0
5   6    0
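
For a frame shaped like the one in the question, the same condition could be wrapped in a small helper. The sketch below assumes the column is called FPN as in the question; the function name drop_run_interiors and the sample values are hypothetical and chosen only for illustration.

```python
import pandas as pd

def drop_run_interiors(df, col):
    """Keep the first and last row of every run of equal values in `col`."""
    prev_val = df[col].shift()
    next_val = df[col].shift(-1)
    # A row is dropped only when previous == next AND current == next,
    # so it is kept when either of those comparisons fails.
    keep = (prev_val != next_val) | (df[col] != next_val)
    return df[keep]

# Hypothetical sample data: runs of length 4, 2 and 2.
df = pd.DataFrame(
    {"othercols": ["otherStuff"] * 8, "FPN": [7, 7, 7, 7, 3, 3, 9, 9]},
    index=["ts1", "ts2", "ts3", "ts4", "ts5", "ts6", "ts7", "ts8"],
)
result = drop_run_interiors(df, "FPN")
print(result)
```

One caveat worth checking against real data: NaN never compares equal to NaN, so a run of missing values would be kept in full by this condition.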

Filtering out redundant repeated data from a pandas dataframe
