Identifying duplicate records across DataFrames in Python

Time: 2018-06-12 11:09:13

Tags: python pandas

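I am new to Python and need some help comparing data from two different dataframes.

What I want to do is compare the "New" column from second_dataset (a dataframe) with the "New" column in first_dataset (a dataframe). If the value in a row of second_dataset exists in first_dataset, I want to add a "Status" column containing the string "Yes"; otherwise I want it to say "No". I have copied my code below.

So far I have tried a few things, but I keep running into errors. Any advice would be helpful. Please.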

for row in second_dataset["New"]:
    if row in first_dataset["New"] == second_dataset["New"]:
        second_dataset["Status"] = "Yes"
    elif row != first_dataset["New"]:
        second_dataset["Status"] = "No"
    else:
        second_dataset["Status"] = "Error"

2 Answers:

Answer 0 (score: 0)

I think you need to compare the columns with isin and set the new values with numpy.where:

import numpy as np
import pandas as pd

first_dataset = pd.DataFrame({'New': [5,6,7,8,10]})
second_dataset = pd.DataFrame({'New': [1,4,5]})
print (first_dataset)
   New
0    5
1    6
2    7
3    8
4   10

print (second_dataset)
   New
0    1
1    4
2    5

mask = second_dataset["New"].isin(first_dataset["New"])
second_dataset['Status'] = np.where(mask, 'Yes', 'No')
print (second_dataset)
   New Status
0    1     No
1    4     No
2    5    Yes
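
As a side note on why the loop in the question misbehaves, here is a minimal sketch using the same sample frames. It illustrates two pandas behaviors: the `in` operator on a Series tests the index labels rather than the stored values, and assigning a scalar to a column fills every row at once. The variable names `label_hit`, `value_miss`, `value_hit` and `all_rows_same` are made up for this illustration.

```python
import pandas as pd

first_dataset = pd.DataFrame({'New': [5, 6, 7, 8, 10]})
second_dataset = pd.DataFrame({'New': [1, 4, 5]})

# `in` on a Series checks the index labels (0..4 here), not the stored values.
label_hit = 4 in first_dataset['New']          # 4 is an index label
value_miss = 5 in first_dataset['New']         # 5 is a value, but not a label
value_hit = 5 in first_dataset['New'].values   # checks the underlying values

# Assigning a scalar to a column overwrites the whole column, so doing this
# inside a loop leaves every row with whatever was assigned last.
second_dataset['Status'] = 'Yes'
all_rows_same = second_dataset['Status'].tolist()

print(label_hit, value_miss, value_hit)
print(all_rows_same)
```

This is why a vectorized membership test such as isin tends to be the simpler route than a row-by-row loop.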

Detail:

print (mask)
0    False
1    False
2     True
Name: New, dtype: bool
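
If you would rather not depend on NumPy for the labelling step, one possible alternative (a sketch, not part of the original answer) is to map the boolean mask through a dictionary with `Series.map`. It uses the same sample data as above.

```python
import pandas as pd

first_dataset = pd.DataFrame({'New': [5, 6, 7, 8, 10]})
second_dataset = pd.DataFrame({'New': [1, 4, 5]})

# Boolean Series: True where the value also occurs in first_dataset['New'].
mask = second_dataset['New'].isin(first_dataset['New'])

# Translate True/False into the desired labels.
second_dataset['Status'] = mask.map({True: 'Yes', False: 'No'})
print(second_dataset)
```

The result should match the numpy.where version; which one to use is mostly a matter of taste.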

Timings:

np.random.seed(123)
first_dataset = pd.DataFrame({'New': np.random.randint(100, size=500)})
second_dataset = pd.DataFrame({'New': np.random.randint(100, size=1000)})
print (first_dataset)

second_dataset['status_column'] = ['Yes' if x in first_dataset['New'].tolist() else 'No' for x in second_dataset['New'].tolist()]

second_dataset['Status'] = np.where(second_dataset["New"].isin(first_dataset["New"]), 'Yes', 'No')

In [146]: %timeit second_dataset['status_column'] = ['Yes' if x in first_dataset['New'].tolist() else 'No' for x in second_dataset['New'].tolist()]
20.9 ms ± 299 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [147]: %timeit second_dataset['Status'] = np.where(second_dataset["New"].isin(first_dataset["New"]), 'Yes', 'No')
455 µs ± 3.98 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
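
Part of the cost of the list comprehension timed above likely comes from calling `first_dataset['New'].tolist()` again for every element and then scanning that list linearly. A hedged sketch of a cheaper pure-Python variant builds a set once and checks it against the isin result; the names `lookup`, `via_set` and `via_isin` are introduced only for this example, and no timings are claimed for it.

```python
import numpy as np
import pandas as pd

np.random.seed(123)
first_dataset = pd.DataFrame({'New': np.random.randint(100, size=500)})
second_dataset = pd.DataFrame({'New': np.random.randint(100, size=1000)})

# Build the lookup structure once; set membership is a hash lookup on average.
lookup = set(first_dataset['New'].tolist())
via_set = ['Yes' if x in lookup else 'No' for x in second_dataset['New'].tolist()]

# Vectorized version from the answer, for comparison.
via_isin = np.where(second_dataset['New'].isin(first_dataset['New']), 'Yes', 'No')

# Both approaches should produce identical labels.
print(via_set == via_isin.tolist())
```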

Answer 1 (score: 0)

import pandas as pd
dd1 = {'New': [1,2,3], 'b':[4,5,6]}
dd2 = {'New': [1,2,3], 'b':[4,5,6]}

df1 = pd.DataFrame(dd1)
df2 = pd.DataFrame(dd2)

df1_new = df1['New'].tolist()
df2_new = df2['New'].tolist()
print(df1_new)

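# Label each value from df2 by whether it also appears in df1's 'New' column
df2_status = ['Yes' if x in df1_new else 'No' for x in df2_new]
dd2['status_column'] = df2_status

# Rebuild df2 from the updated dict so it includes the new column
df2 = pd.DataFrame(dd2)
print(df2)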
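
If a "duplicate record" should mean a match on several columns rather than only "New", one option (a sketch under that assumption, not something either answer proposes) is a left merge with `indicator=True`. The sample values below are made up so that only the first row of df2 matches a full row of df1.

```python
import numpy as np
import pandas as pd

# Hypothetical data: only (1, 4) appears as a complete row in both frames.
df1 = pd.DataFrame({'New': [1, 2, 3], 'b': [4, 5, 6]})
df2 = pd.DataFrame({'New': [1, 2, 4], 'b': [4, 9, 6]})

# Left merge keeps every row of df2; `_merge` says whether a match was found.
# drop_duplicates() on df1 prevents a repeated row from multiplying df2's rows.
merged = df2.merge(df1.drop_duplicates(), on=['New', 'b'], how='left', indicator=True)

df2['status_column'] = np.where(merged['_merge'] == 'both', 'Yes', 'No')
print(df2)
```

For a single-column comparison like the one in the question, isin remains the shorter choice.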