Question

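I have two Pandas DataFrames of different lengths. DF1 has about 1.2 million rows (and only 1 column), and DF2 has about 300,000 rows (and a single column). I'm trying to find similar items between the two lists.

DF1 is about 75% company names and 25% people, and DF2 is the reverse, but both are alphanumeric. What I want is to write a function that highlights the most similar items across the two lists, ranked by a score (or percentage). For example,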

Apple -> Apple Inc. (0.95) 
Apple -> Applebees (0.68)
Banana Boat -> Banana Bread (0.25)

So far I have tried two approaches, and both failed.

Approach 1: Find the Jaccard coefficient of the two lists.

import numpy as np
from sklearn.metrics import jaccard_similarity_score
jaccard_similarity_score(df_1, df_2)

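This doesn't work, possibly because the two DataFrames have different lengths, and I get this error:

ValueError: Found arrays with inconsistent numbers of samples

Approach 2: Use the sequence matcher.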

from difflib import SequenceMatcher
def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

Then call it on the DataFrames:

similar(df_1, df_2)

This results in the error:

pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:3979)()

pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:3843)()

pandas/hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12265)()

pandas/hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12216)()

KeyError: 0

How can I work around this problem?

Answer 1

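Solution

I had to install the distance module because that was quicker than figuring out how to use jaccard_similarity_score in this context. I couldn't recreate your numbers from that function.

Install `distance`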

pip install distance

Use `distance`

import distance

jd = lambda x, y: 1 - distance.jaccard(x, y)
df_1.head().iloc[:, 0].apply(lambda x: df_2.head().iloc[:, 0].apply(lambda y: jd(x, y)))

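The head() calls are there for your protection. I'm pretty sure removing them will blow up your computer, since it would produce a 1.2M x 0.3M matrix.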
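For reference, the score that `jd` produces can be sketched in plain Python. This assumes `distance.jaccard` treats each string as a set of characters, which is my understanding of the module; `char_jaccard` is a name made up for this illustration.

```python
def char_jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the character sets of two strings
    (assumed to match 1 - distance.jaccard(a, b))."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:  # both strings empty: treat them as identical
        return 1.0
    return len(set_a & set_b) / len(union)


# "Apple" has 4 distinct characters, all of which occur among the
# 9 distinct characters of "Apple Inc.", so the score is 4/9.
score = char_jaccard("Apple", "Apple Inc.")
print(round(score, 3))
```

Only the set of characters matters here, so character order and repetition are ignored. That may be one reason the scores differ from the numbers in the question.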

Try this. I'm not really sure what you want in the end. We can adjust as you gain clarity.
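If the goal is the ranked output from the question (`Apple -> Apple Inc. (0.95)`), one possible sketch is below. It applies `difflib.SequenceMatcher` to pairs of strings rather than to whole DataFrames, and keeps only the best few candidates per name so the full matrix is never stored. `top_matches` and its parameters are hypothetical names, the sample lists stand in for `df_1.iloc[:, 0].tolist()` and `df_2.iloc[:, 0].tolist()`, and the scores will not necessarily match the numbers in the question.

```python
import heapq
from difflib import SequenceMatcher


def top_matches(query, candidates, k=3):
    """Return the k candidates most similar to query as
    (score, candidate) pairs, best first."""
    scored = ((SequenceMatcher(None, query, c).ratio(), c) for c in candidates)
    return heapq.nlargest(k, scored)


names_1 = ["Apple", "Banana Boat"]                        # stands in for df_1
names_2 = ["Apple Inc.", "Applebees", "Banana Bread"]     # stands in for df_2

for name in names_1:
    for score, match in top_matches(name, names_2, k=2):
        print(f"{name} -> {match} ({score:.2f})")
```

This still compares every pair, so on 1.2M x 0.3M rows it would take a very long time. It only avoids holding all the scores in memory at once.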

Answer 2

Alternatively, limit the comparison to items at the same element position.

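import distance
import pandas as pd

# Jaccard similarity (1 - Jaccard distance) between two strings
jd = lambda x, y: 1 - distance.jaccard(x, y)

# Align the first column of each DataFrame side by side by index
test_df = pd.concat([df.iloc[:, 0] for df in [df_1, df_2]], axis=1, keys=['one', 'two'])
# Rows beyond the shorter DataFrame hold NaN, so drop them before scoring
test_df.dropna().apply(lambda x: jd(x['one'], x['two']), axis=1)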
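
The same position-wise idea can be sketched without pandas. Because the two columns have different lengths, something has to happen to the unmatched tail; this sketch simply stops at the end of the shorter list, which is an assumption and not necessarily what you want. `pairwise_scores` is a hypothetical helper, and the character-set Jaccard is written out by hand so the example does not depend on `distance`.

```python
def pairwise_scores(left, right):
    """Score items at the same position; zip() stops at the shorter list."""
    scores = []
    for a, b in zip(left, right):
        set_a, set_b = set(a), set(b)
        union = set_a | set_b
        # Character-set Jaccard similarity; two empty strings count as identical.
        scores.append(len(set_a & set_b) / len(union) if union else 1.0)
    return scores


left = ["Apple", "Banana Boat", "Cherry"]   # stands in for df_1's column
right = ["Apple", "Zzz"]                    # stands in for df_2's column
print(pairwise_scores(left, right))
```

Only two scores come back here, because the third item on the left has no partner on the right.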
