Question

I have a DataFrame:

import pandas as pd

# Around 100000 rows
df = pd.DataFrame({'text': ['Apple is healthy', 'Potato is round', 'Apple might be green'],
                   'category': ["", "", ""],
                   })

And a second DataFrame:

# Around 3000 rows
df_2 = pd.DataFrame({'keyword': ['Apple ', 'Potato'],
                     'category': ["fruit", "vegetable"],
                     })

Desired result:

# Around 100000 rows
df = pd.DataFrame({'text': ['Apple is healthy', 'Potato is round', 'Apple might be green'],
                   'category': ["fruit", "vegetable", "fruit"],
                   })

I am currently trying:

df.set_index('text')
df_2.set_index('keyword')
df.update(df_2)

As a result, no category is added for the last row. How can I achieve this?

Answer 1

You need to assign the output of DataFrame.set_index, because unlike DataFrame.update it is not an in-place operation. To match rows against the df_2["keyword"] column, use Series.str.extract:

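# Join the keywords into one regex alternation and index df by the keyword found in each text
df = df.set_index(df['text'].str.extract(f'({"|".join(df_2["keyword"])})', expand=False))
# set_index returns a new DataFrame, so the result has to be assigned back
df_2 = df_2.set_index('keyword')
print(df)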
                        text category
text                                 
Apple       Apple is healthy         
Potato       Potato is round         
Apple   Apple might be green  



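# update works in place and aligns on the index, so every row whose keyword matches gets its category
df.update(df_2)
print(df)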
                        text   category
text                                   
Apple       Apple is healthy      fruit
Potato       Potato is round  vegetable
Apple   Apple might be green      fruit
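
The difference between the two methods can be checked directly. The following is a minimal sketch, not part of the original answer; the two-row frame and the names `indexed` and `other` are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({'text': ['Apple is healthy', 'Potato is round'],
                   'category': ['', '']})

# set_index returns a new DataFrame; without assignment df keeps its default index.
df.set_index('text')
index_unchanged = list(df.index) == [0, 1]
print(index_unchanged)

# Assigning the result keeps the re-indexed frame.
indexed = df.set_index('text')

# update modifies the frame in place and returns None.
# It aligns on index labels and column names and skips missing (NaN) values.
other = pd.DataFrame({'category': ['fruit']}, index=['Apple is healthy'])
result = indexed.update(other)
print(result)
print(indexed)
```

Only the row whose index label appears in `other` is changed; the other row keeps its empty category.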

If you only need to add one column, use Series.str.extract and Series.map:

# Extract the first keyword found in each text
s = df['text'].str.extract(f'({"|".join(df_2["keyword"])})', expand=False)
# Look up the category for each extracted keyword
df['category'] = s.map(df_2.set_index(['keyword'])['category'])
print(df)
                   text   category
0      Apple is healthy      fruit
1       Potato is round  vegetable
2  Apple might be green      fruit
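
With a few thousand real keywords, some may contain regex metacharacters or stray whitespace, and some texts may match no keyword at all. The sketch below is one possible way to harden the extract-and-map approach, not part of the original answer. It assumes that surrounding whitespace in keywords is not significant and that unmatched rows should keep an empty category; the `'C++'` keyword and the two extra text rows are hypothetical additions for illustration:

```python
import re
import pandas as pd

df = pd.DataFrame({'text': ['Apple is healthy', 'Potato is round',
                            'Apple might be green', 'C++ is a language',
                            'Nothing here']})
df_2 = pd.DataFrame({'keyword': ['Apple ', 'Potato', 'C++'],
                     'category': ['fruit', 'vegetable', 'language']})

# Assumption: leading/trailing spaces in keywords are accidental, so strip them.
keywords = df_2['keyword'].str.strip()
lookup = pd.Series(df_2['category'].to_numpy(), index=keywords)

# Escape each keyword so characters such as '+' are matched literally,
# and try longer keywords first so a short keyword cannot shadow a longer one.
ordered = sorted(keywords, key=len, reverse=True)
pattern = '(' + '|'.join(re.escape(k) for k in ordered) + ')'

matched = df['text'].str.extract(pattern, expand=False)

# Rows without any keyword yield NaN; replace it with an empty category.
df['category'] = matched.map(lookup).fillna('')
print(df)
```

Building the lookup Series once and mapping against it avoids a row-by-row loop, which matters at roughly 100000 rows.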

Set the value of a DataFrame column based on a condition from another DataFrame
