Checking for existence in pandas

Time: 2019-06-06 12:50:43

Tags: python pandas dataframe

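I have two DataFrames, F1 and F2, that both contain the columns id1 and id2.

F1 contains 5 columns. F2 contains three columns: [id1, id2, Description]. I want to test whether F1['id1'] exists in F2['id1'] or F1['id2'] exists in F2['id2']. Where there is a match, I need to add a column to F1 holding the Description that F2 has for that id1 or id2. The contents of F1 and F2 and the output I expect for F1 are shown below.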

I created F1 and F2 like this:

import pandas as pd

F1 = {'id1': ['x22', 'x13','NaN','x421'],'id2':['NaN',223,788,'NaN']}
F1 = pd.DataFrame(data=F1)
F2 = {'id1': ['x22', 'NaN','NaN','x413','x421'],'id2':['NaN','223','788','NaN','233'],'Description':['California','LA','NY','Havnover','Munich']}
F2 = pd.DataFrame(data=F2)

This is what I did:

s1 = F2.drop_duplicates('id1').dropna(subset=['id1']).set_index('id1')['Description']
s2 = F2.drop_duplicates('id2').dropna(subset=['id2']).set_index('id2')['Description']
F1['Description'] = F1['id1'].map(s1).combine_first(F1['id2'].map(s2))

How can I correct my code to get this result?

Expected result for F1:

F1 = {'id1': ['x22', 'x13','NaN','x421'],'id2':['NaN',223,788,'NaN'],'Name':['NNNN','AAAA','XXXX','OOO'],'V1':['oo','li','la','lo'],'Description':['California','LA','NY','Munich']}
F1 = pd.DataFrame(data=F1)

2 answers:

Answer 0: (score: 1)

You can use the isin() function to check whether the IDs exist in both DataFrames:

import pandas as pd

F1 = {'id1': ['x22', 'x13','NaN','x421'],'id2':['NaN', 223, 788,'NaN']}
# Convert non-string ids to strings so they compare equal to the strings in F2
F1['id2'] = [str(x) if not isinstance(x, str) else x for x in F1['id2']]
F1 = pd.DataFrame(data=F1)
F2 = {'id1': ['x22', 'NaN','NaN','x413','x421'],'id2':['NaN','223','788','NaN','233'],'Description':['California','LA','NY','Havnover','Munich']}
F2 = pd.DataFrame(data=F2)
F1['Description'] = ''

id1_F1 = (F1[F1['id1']!='NaN']['id1'].isin(F2['id1']))
id1_F2 = (F2[F2['id1']!='NaN']['id1'].isin(F1['id1']))
id2_F1 = (F1[F1['id2']!='NaN']['id2'].isin(F2['id2']))
id2_F2 = (F2[F2['id2']!='NaN']['id2'].isin(F1['id2']))


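# Note: the values are assigned by position, so this relies on the matching rows
# appearing in the same order, and the same number of times, in F1 and F2.
F1.loc[id1_F1[id1_F1].index.values, 'Description'] = F2.loc[id1_F2[id1_F2].index.values, 'Description'].values
F1.loc[id2_F1[id2_F1].index.values, 'Description'] = F2.loc[id2_F2[id2_F2].index.values, 'Description'].values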

Output:

    id1  id2 Description
0   x22  NaN  California
1   x13  223          LA
2   NaN  788          NY
3  x421  NaN      Munich
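The positional assignment above works for the sample data because the matching rows happen to appear in the same order in both frames. As a hedged alternative, the sketch below keeps the isin() check but looks each Description up by key with map(), so row order does not matter. The names `valid`, `lookup` and `mask` are illustrative, and giving id1 priority over id2 is an assumption taken from the order used in the question.

```python
import pandas as pd

F1 = pd.DataFrame({'id1': ['x22', 'x13', 'NaN', 'x421'],
                   'id2': ['NaN', 223, 788, 'NaN']})
F2 = pd.DataFrame({'id1': ['x22', 'NaN', 'NaN', 'x413', 'x421'],
                   'id2': ['NaN', '223', '788', 'NaN', '233'],
                   'Description': ['California', 'LA', 'NY', 'Havnover', 'Munich']})

# Make the id2 types comparable: F1 mixes ints and strings, F2 holds strings only.
F1['id2'] = F1['id2'].astype(str)
F1['Description'] = ''

for col in ['id1', 'id2']:
    # Lookup table key -> Description, ignoring the 'NaN' placeholder strings.
    valid = F2[F2[col] != 'NaN'].drop_duplicates(col)
    lookup = valid.set_index(col)['Description']
    # Rows of F1 with a real key that exists in F2 and no Description assigned yet.
    mask = (F1[col] != 'NaN') & F1[col].isin(lookup.index) & (F1['Description'] == '')
    F1.loc[mask, 'Description'] = F1.loc[mask, col].map(lookup)

print(F1)
```

Rows without a match in either column keep the empty string in Description.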

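Answer 1: (score: 0)

Your solution works well, but there are two problems in the data. First, the NaN values are not missing values but strings, so replace is necessary. Second, the values in F2['id2'] are string representations of numbers, so add to_numeric with errors='coerce':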

import numpy as np
import pandas as pd

F1 = {'id1': ['x22', 'x13','NaN','x421'],'id2':['NaN',223,788,'NaN']}
F1 = pd.DataFrame(data=F1)
F2 = {'id1': ['x22', 'NaN','NaN','x413','x421'],'id2':['NaN','223','788','NaN','233'],
      'Description':['California','LA','NY','Havnover','Munich']}
F2 = pd.DataFrame(data=F2)

#solution for sample data
F1 = F1.replace('NaN', np.nan)
F2 = F2.replace('NaN', np.nan)
F1['id2'] = pd.to_numeric(F1['id2'], errors='coerce').fillna(F1['id2'])
F2['id2'] = pd.to_numeric(F2['id2'], errors='coerce').fillna(F2['id2'])
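To see why this cleaning step matters, here is a small sketch that uses only the id2 values from the question. The variable names (`f1_id2`, `f2_id2`, `raw_matches`, `clean_matches`) are illustrative and not part of the original post.

```python
import pandas as pd

f1_id2 = pd.Series(['NaN', 223, 788, 'NaN'])
f2_id2 = pd.Series(['NaN', '223', '788', 'NaN', '233'])

# Raw data: the 'NaN' strings match each other, while the int 223
# never equals the str '223', so the real ids are not found.
raw_matches = f1_id2.isin(f2_id2).tolist()

# After conversion the placeholders are real missing values and the ids are floats.
f1_num = pd.to_numeric(f1_id2, errors='coerce')
f2_num = pd.to_numeric(f2_id2, errors='coerce')

# Exclude missing values explicitly so that NaN is never counted as a match.
clean_matches = (f1_num.notna() & f1_num.isin(f2_num.dropna())).tolist()

print(raw_matches)
print(clean_matches)
```

On the raw values only the placeholder rows appear to match. After conversion only the rows with real ids match, which is what the map() lookup needs.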

A general solution that replaces only the id columns in both DataFrames and converts their values to numeric where possible:

cols = ['id1','id2']
F1[cols] = F1[cols].replace('NaN', np.nan)
F1[cols] = F1[cols].apply(lambda x: pd.to_numeric(x, errors='coerce')).fillna(F1[cols])
F2[cols] = F2[cols].replace('NaN', np.nan)
F2[cols] = F2[cols].apply(lambda x: pd.to_numeric(x, errors='coerce')).fillna(F2[cols])

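Another solution with a custom function:

def func(x):
    # float('NaN') returns a real missing value and float('223') returns 223.0;
    # anything that cannot be converted (e.g. 'x22') is returned unchanged.
    try:
        return float(x)
    except Exception:
        return x

cols = ['id1','id2']
# applymap is deprecated since pandas 2.1 in favor of DataFrame.map
F1[cols] = F1[cols].applymap(func)
F2[cols] = F2[cols].applymap(func)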

print (F1)
    id1    id2
0   x22    NaN
1   x13  223.0
2   NaN  788.0
3  x421    NaN

print (F2)
    id1    id2 Description
0   x22    NaN  California
1   NaN  223.0          LA
2   NaN  788.0          NY
3  x413    NaN    Havnover
4  x421  233.0      Munich

s1 = F2.drop_duplicates('id1').dropna(subset=['id1']).set_index('id1')['Description']
s2 = F2.drop_duplicates('id2').dropna(subset=['id2']).set_index('id2')['Description']

F1['Description1'] = F1['id1'].map(s1).combine_first(F1['id2'].map(s2))
print (F1)
    id1    id2 Description1
0   x22    NaN   California
1   x13  223.0           LA
2   NaN  788.0           NY
3  x421    NaN       Munich
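The steps of this answer can also be wrapped in a reusable helper. The sketch below is one possible arrangement, not part of the original answer: `to_key` and `add_description` are hypothetical names, and the helper assumes that 'NaN' strings mean "missing" and that earlier columns in `id_cols` take priority.

```python
import pandas as pd

def to_key(x):
    """Convert numeric-looking values (and the string 'NaN') to float."""
    try:
        return float(x)
    except (TypeError, ValueError):
        return x

def add_description(f1, f2, id_cols=('id1', 'id2'), target='Description'):
    """Return a copy of f1 with `target` looked up in f2 via the id columns."""
    mapped = []
    for col in id_cols:
        left_keys = f1[col].map(to_key)
        right_keys = f2[col].map(to_key)
        # Lookup table key -> target, without missing or duplicated keys.
        lookup = (f2.assign(**{col: right_keys})
                    .dropna(subset=[col])
                    .drop_duplicates(col)
                    .set_index(col)[target])
        mapped.append(left_keys.map(lookup))
    # Earlier id columns win; later ones only fill the remaining gaps.
    result = mapped[0]
    for other in mapped[1:]:
        result = result.combine_first(other)
    out = f1.copy()
    out[target] = result
    return out

F1 = pd.DataFrame({'id1': ['x22', 'x13', 'NaN', 'x421'],
                   'id2': ['NaN', 223, 788, 'NaN']})
F2 = pd.DataFrame({'id1': ['x22', 'NaN', 'NaN', 'x413', 'x421'],
                   'id2': ['NaN', '223', '788', 'NaN', '233'],
                   'Description': ['California', 'LA', 'NY', 'Havnover', 'Munich']})

result = add_description(F1, F2)
print(result)
```

The helper leaves F1 and F2 untouched and keeps the original id values in the returned frame, since the converted keys are used only for the lookup.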