Suppose we have a file named any_csv.csv with the following contents:
A,B,random
1,2,300
3,4,300
5,6,300
1,2,300
3,4,350
8,9,350
4,5,350
5,6,320
7,8,300
3,3,300
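I want to keep every row at which the random column changes, that is, the rows on either side of each change, plus the first and last rows.
I wrote the small program below to do this. I'd like to learn more about pandas, and my program is slower than expected (about 130 seconds for a log file of 1.2 million rows), so I'm asking for your help.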
import pandas as pd
import numpy as np

df = pd.read_csv('any_csv.csv')
mask = np.zeros(len(df.index), dtype=bool)
# Initializing my current value for comparison purposes.
mask[0] = 1
previous_val = df.iloc[0]['random']
for index, row in df.iterrows():
    if row['random'] != previous_val:
        # If a variation has been detected, switch to True current, and previous index.
        previous_val = row['random']
        mask[index] = 1
        mask[index - 1] = 1
# Keeping the last item.
mask[-1] = 1
df = df.loc[mask]
df.to_csv('any_other_csv.csv', index=False)
In short, I'd like to know how to apply my if condition without this hand-written for loop, which is very slow.
Result:
A,B,random
1,2,300
1,2,300
3,4,350
4,5,350
5,6,320
7,8,300
3,3,300
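Answer 0 (score: 3)
You can use pd.Series.shift to build a Boolean mask. The mask marks each row whose value differs from the value directly above or directly below it in the series.
You can then apply the Boolean mask directly to the dataframe.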
mask = (df['random'] != df['random'].shift()) | \
(df['random'] != df['random'].shift(-1))
df = df[mask]
print(df)
   A  B  random
0  1  2     300
3  1  2     300
4  3  4     350
6  4  5     350
7  5  6     320
8  7  8     300
9  3  3     300
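To make this answer reproducible end to end, here is a minimal sketch that applies the same mask to the sample data from the question. It reads the data from an in-memory buffer rather than from any_csv.csv, which is an assumption made only so the snippet runs without a file on disk. The names csv_text, changed_from_prev and changed_to_next are illustrative.

```python
import io

import pandas as pd

# Sample data from the question, held in memory (assumption: stands in for any_csv.csv).
csv_text = """A,B,random
1,2,300
3,4,300
5,6,300
1,2,300
3,4,350
8,9,350
4,5,350
5,6,320
7,8,300
3,3,300
"""
df = pd.read_csv(io.StringIO(csv_text))

col = df['random']
# True where the value differs from the row above (the first row compares against NaN).
changed_from_prev = col != col.shift()
# True where the value differs from the row below (the last row compares against NaN).
changed_to_next = col != col.shift(-1)

result = df[changed_from_prev | changed_to_next]
print(result.to_csv(index=False))
```

Writing result with to_csv('any_other_csv.csv', index=False) instead of printing it would match the last step of the original program.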
Answer 1 (score: 2)
Use boolean indexing with two masks, built with shift and ne, to check whether each value differs from its previous or next neighbor:
df = df[df['random'].ne(df['random'].shift()) | df['random'].ne(df['random'].shift(-1))]
print (df)
   A  B  random
0  1  2     300
3  1  2     300
4  3  4     350
6  4  5     350
7  5  6     320
8  7  8     300
9  3  3     300
To verify, add each mask as a column of the original dataframe and inspect it:
df['mask1'] = df['random'].ne(df['random'].shift())
df['mask2'] = df['random'].ne(df['random'].shift(-1))
df['mask3'] = df['random'].ne(df['random'].shift()) | df['random'].ne(df['random'].shift(-1))
print (df)
   A  B  random  mask1  mask2  mask3
0  1  2     300   True  False   True
1  3  4     300  False  False  False
2  5  6     300  False  False  False
3  1  2     300  False   True   True
4  3  4     350   True  False   True
5  8  9     350  False  False  False
6  4  5     350  False   True   True
7  5  6     320   True   True   True
8  7  8     300   True  False   True
9  3  3     300  False   True   True
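The first and last rows are always kept because shift() fills the vacated position with NaN, and NaN compares as not equal to any value. The sketch below illustrates this on a constant column. It also shows a side effect worth knowing: if the column itself contains missing values, every NaN row is kept, since NaN is not equal to NaN either. The series s and t are made-up examples, not data from the question.

```python
import numpy as np
import pandas as pd

# A constant column: only the two edge rows differ from their (NaN) neighbor.
s = pd.Series([7, 7, 7, 7])
keep = s.ne(s.shift()) | s.ne(s.shift(-1))
print(keep.tolist())  # [True, False, False, True]

# A column with a run of missing values: NaN != NaN, so the whole run is kept,
# and so are the two 1.0 rows that border it.
t = pd.Series([1.0, np.nan, np.nan, np.nan, 1.0])
keep_t = t.ne(t.shift()) | t.ne(t.shift(-1))
print(keep_t.tolist())  # [True, True, True, True, True]
```

If a run of missing values should be treated as unchanged, one option is to fill the column with a sentinel value before building the masks.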
Timings:
N = 1000
In [157]: %timeit orig(df)
56.8 ms ± 1.47 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [158]: %timeit (df[df['random'].ne(df['random'].shift()) |
df['random'].ne(df['random'].shift(-1))])
939 µs ± 7.61 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# jpp's solution - a bit slower
In [159]: %timeit df[(df['random'] != df['random'].shift()) | (df['random'] != df['random'].shift(-1))]
1.11 ms ± 8.71 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
N = 10000
In [160]: %timeit orig(df)
538 ms ± 3.28 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [161]: %timeit (df[df['random'].ne(df['random'].shift()) | df['random'].ne(df['random'].shift(-1))])
1.16 ms ± 75 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# jpp's solution - a bit slower
In [162]: %timeit df[(df['random'] != df['random'].shift()) | (df['random'] != df['random'].shift(-1))]
1.28 ms ± 8.51 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
np.random.seed(123)
N = 1000
df = pd.DataFrame({'random':np.random.randint(2, size=N)})
print (df)
def orig(df):
    mask = np.zeros(len(df.index), dtype=bool)
    # Initializing my current value for comparison purposes.
    mask[0] = 1
    previous_val = df.iloc[0]['random']
    for index, row in df.iterrows():
        if row['random'] != previous_val:
            # If a variation has been detected, switch to True current, and previous index.
            previous_val = row['random']
            mask[index] = 1
            mask[index - 1] = 1
    # Keeping the last item.
    mask[-1] = 1
    return df.loc[mask]
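As a sanity check alongside the timings, the following sketch confirms that the vectorized mask selects the same rows as the loop on the random test data. The helpers loop_mask, shift_mask and numpy_mask are hypothetical names introduced here. numpy_mask is a NumPy-only variant offered as an illustration, not something proposed in the answers. All three assume a non-empty frame with the default integer index, as the original loop does.

```python
import numpy as np
import pandas as pd

np.random.seed(123)
N = 1000
df = pd.DataFrame({'random': np.random.randint(2, size=N)})


def loop_mask(df):
    """Mask built by the question's row-by-row loop."""
    mask = np.zeros(len(df.index), dtype=bool)
    mask[0] = True
    previous_val = df.iloc[0]['random']
    for index, row in df.iterrows():
        if row['random'] != previous_val:
            previous_val = row['random']
            mask[index] = True
            mask[index - 1] = True
    mask[-1] = True
    return mask


def shift_mask(df):
    """Mask built with shift and ne, as in the answers."""
    col = df['random']
    return (col.ne(col.shift()) | col.ne(col.shift(-1))).to_numpy()


def numpy_mask(df):
    """Illustrative NumPy-only variant (assumption: not from the answers)."""
    values = df['random'].to_numpy()
    changed = values[1:] != values[:-1]  # changed[i]: row i+1 differs from row i
    mask = np.zeros(len(values), dtype=bool)
    mask[0] = True
    mask[-1] = True
    mask[1:] |= changed   # the row after each change
    mask[:-1] |= changed  # the row before each change
    return mask


assert np.array_equal(loop_mask(df), shift_mask(df))
assert np.array_equal(loop_mask(df), numpy_mask(df))
print(int(shift_mask(df).sum()), "of", N, "rows kept")
```

Whether the NumPy variant is faster than the shift-based one depends on the data and the pandas version, so it is worth timing on the real log file before choosing.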
Answer 2 (score: 0)
You could try something like the following:
df.groupby(["A", "random"]).filter(lambda g: g.shape[0] == 1)