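I'm new to Python and need some help comparing data from two different dataframes.
What I want to do is compare the column "New" in second_dataset (a dataframe) with the column "New" in first_dataset (a dataframe). If the value in a row of second_dataset exists in first_dataset, I want to add a "Status" column containing the string "Yes"; otherwise I want it to say "No". I've copied my code below.
So far I've tried a few things, but I keep getting errors. Any advice would be helpful. Thanks.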
for row in second_dataset["New"]:
    if row in first_dataset["New"] == second_dataset["New"]:
        second_dataset["Status"] = "Yes"
    elif row != first_dataset["New"]:
        second_dataset["Status"] = "No"
    else:
        second_dataset["Status"] = "Error"
Answer 0 (score: 0)
I think you need isin to compare the columns and numpy.where to set the new values:
import numpy as np
import pandas as pd

first_dataset = pd.DataFrame({'New': [5, 6, 7, 8, 10]})
second_dataset = pd.DataFrame({'New': [1, 4, 5]})
print(first_dataset)
   New
0    5
1    6
2    7
3    8
4   10
print(second_dataset)
   New
0    1
1    4
2    5
# True where a value of second_dataset['New'] also occurs in first_dataset['New']
mask = second_dataset["New"].isin(first_dataset["New"])
second_dataset['Status'] = np.where(mask, 'Yes', 'No')
print(second_dataset)
   New Status
0    1     No
1    4     No
2    5    Yes
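As a side note on why the loop in the question misbehaves, here is a minimal sketch of two pandas behaviours that are likely involved. The names label_hit, value_hit, index_hit, demo and statuses are made up for this illustration. In addition, row != first_dataset["New"] yields a whole boolean Series, and using that as an if/elif condition raises a ValueError because its truth value is ambiguous.

```python
import pandas as pd

first_dataset = pd.DataFrame({'New': [5, 6, 7, 8, 10]})

# `in` on a Series tests the index labels (0..4 here), not the values.
label_hit = 5 in first_dataset['New']          # 5 is a value, not a label
value_hit = 5 in first_dataset['New'].values   # membership among the values
index_hit = 0 in first_dataset['New']          # 0 is an index label

# Assigning a scalar to a column broadcasts it to every row, so inside
# a loop the last assignment overwrites the whole column each time.
demo = pd.DataFrame({'New': [1, 4, 5]})        # hypothetical demo frame
demo['Status'] = 'Yes'
statuses = demo['Status'].tolist()

print(label_hit, value_hit, index_hit)
print(statuses)
```

This is why a vectorised membership test such as isin is the safer tool here: it compares values and returns one result per row.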
Detail:
print(mask)
0    False
1    False
2     True
Name: New, dtype: bool
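If you would rather stay within pandas, one possible variation (not what the answer uses) is to map the boolean mask to labels with Series.map. The sketch below rebuilds the same sample frames so that it runs on its own; labels is a name introduced only for this example.

```python
import pandas as pd

first_dataset = pd.DataFrame({'New': [5, 6, 7, 8, 10]})
second_dataset = pd.DataFrame({'New': [1, 4, 5]})

# Boolean mask: does each value of second_dataset['New'] occur in first_dataset['New']?
mask = second_dataset['New'].isin(first_dataset['New'])

# Translate True/False into the desired labels with a dictionary lookup
second_dataset['Status'] = mask.map({True: 'Yes', False: 'No'})

labels = second_dataset['Status'].tolist()
print(labels)
```

For two labels, np.where and map should give the same column; the choice is mostly a matter of taste.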
Timings:
np.random.seed(123)
first_dataset = pd.DataFrame({'New': np.random.randint(100, size=500)})
second_dataset = pd.DataFrame({'New': np.random.randint(100, size=1000)})
print (first_dataset)
second_dataset['status_column'] = ['Yes' if x in first_dataset['New'].tolist() else 'No' for x in second_dataset['New'].tolist()]
second_dataset['Status'] = np.where(second_dataset["New"].isin(first_dataset["New"]), 'Yes', 'No')
In [146]: %timeit second_dataset['status_column'] = ['Yes' if x in first_dataset['New'].tolist() else 'No' for x in second_dataset['New'].tolist()]
20.9 ms ± 299 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [147]: %timeit second_dataset['Status'] = np.where(second_dataset["New"].isin(first_dataset["New"]), 'Yes', 'No')
455 µs ± 3.98 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
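Part of the gap is probably down to how the list comprehension is written: first_dataset['New'].tolist() is evaluated again for every element, and each lookup scans a list. A minimal sketch of a cheaper pure-Python variant, assuming the values are hashable, builds a set once. The names known, status_set and same are made up for this illustration, and no timing is claimed for it here.

```python
import numpy as np
import pandas as pd

np.random.seed(123)
first_dataset = pd.DataFrame({'New': np.random.randint(100, size=500)})
second_dataset = pd.DataFrame({'New': np.random.randint(100, size=1000)})

# Build the lookup set once instead of converting to a list per element
known = set(first_dataset['New'].tolist())
second_dataset['status_set'] = [
    'Yes' if x in known else 'No' for x in second_dataset['New'].tolist()
]

# Vectorised version from the answer, used here as a cross-check
second_dataset['Status'] = np.where(
    second_dataset['New'].isin(first_dataset['New']), 'Yes', 'No'
)

# Both approaches should label every row identically
same = (second_dataset['status_set'] == second_dataset['Status']).all()
print(same)
```

Running %timeit on both lines on your own data is the way to confirm which one is faster in practice.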
Answer 1 (score: 0)
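import pandas as pd

# Two sample frames that share a 'New' column
dd1 = {'New': [1, 2, 3], 'b': [4, 5, 6]}
dd2 = {'New': [1, 2, 3], 'b': [4, 5, 6]}
df1 = pd.DataFrame(dd1)
df2 = pd.DataFrame(dd2)

# Pull both columns out as plain Python lists
df1_new = df1['New'].tolist()
df2_new = df2['New'].tolist()
print(df1_new)

# 'Yes' where a value from df2 also appears in df1, otherwise 'No'
df2_status = ['Yes' if x in df1_new else 'No' for x in df2_new]

# Attach the result directly as a new column
df2['status_column'] = df2_status
print(df2)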