Question

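My question is a bit different from plain word similarity. Which algorithm can be used to compute the similarity between an email address and a name?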

    for example:
    mail Abd_tml_1132@gmail.com
    Name Abdullah temel
    Levenshtein, Hamming distance  11
    jaro distance  0.52

Yet most likely this email address belongs to this name.

Answer 1

There is no package that does this directly, but the following should solve your problem:

Split the email ID into a list of words:

    a = 'Abd_tml_1132@gmail.com'
    rest = a.split('@', 1)[0]  # keep only the part before the @
    result = ''.join(ch for ch in rest if not ch.isdigit())  # drop digits, since names do not contain them
    list_of_email_words = result.split('_')  # split into words; change the separator (e.g. '_' or '.') to suit the email ID
    list_of_email_words = list(filter(None, list_of_email_words))  # remove any empty strings

Split the name into a list as well:

    b = 'Abdullah temel'
    list_of_name_words = b.split(' ')

Apply fuzzy matching to the two lists:

    from fuzzywuzzy import fuzz

    # compare every email word against every name word
    score = []
    for email_word in list_of_email_words:
        for name_word in list_of_name_words:
            score.append(fuzz.partial_ratio(email_word, name_word))

Now you only need to check whether any element of score exceeds a threshold that you define. For example:

    threshold = 70
    if any(x > threshold for x in score):
        print("matched")
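
For reference, the steps above can be rolled into one function. This is only a rough sketch under a few assumptions: it uses the standard library's `difflib.SequenceMatcher` as a stand-in for `fuzz.partial_ratio` (so the scores will not be the same as fuzzywuzzy's), it lowercases both sides before comparing, and the helper names `local_part_words`, `similarity` and `email_matches_name` are made up for this example.

```python
import re
from difflib import SequenceMatcher


def local_part_words(email):
    """Return the lowercase alphabetic words found before the '@'."""
    local_part = email.split("@", 1)[0]
    # Split on any run of non-letters: digits, '_', '.', '-', '+', ...
    return [w.lower() for w in re.split(r"[^A-Za-z]+", local_part) if w]


def similarity(word_a, word_b):
    """Similarity on a 0-100 scale (difflib ratio, NOT fuzzywuzzy's score)."""
    return round(100 * SequenceMatcher(None, word_a, word_b).ratio())


def email_matches_name(email, name, threshold=70):
    """True if any email word scores above the threshold against any name word."""
    email_words = local_part_words(email)
    name_words = name.lower().split()
    scores = [similarity(e, n) for e in email_words for n in name_words]
    return any(s > threshold for s in scores)


if __name__ == "__main__":
    print(local_part_words("Abd_tml_1132@gmail.com"))
    print(email_matches_name("Abd_tml_1132@gmail.com", "Abdullah temel"))
```

The threshold of 70 is carried over from the snippet above; with a different scoring function you would probably need to tune it on your own data.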

Answer 2

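Fuzzywuzzy can give you the solution you need. First use a regular expression to remove the "@" and the domain name from the email string. After that you will have 2 strings: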

    from fuzzywuzzy import fuzz as fz

    str1 = "Abd_tml_1132"
    str2 = "Abdullah temel"

    count_ratio = fz.ratio(str1, str2)
    print(count_ratio)
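
The regex step mentioned at the start of this answer is not shown in the code, so here is one possible way to do it. This is a sketch, not necessarily what the answer had in mind; the function name `strip_domain` is hypothetical.

```python
import re


def strip_domain(email):
    """Remove the '@' and everything after it (the domain name)."""
    return re.sub(r"@.*$", "", email)


str1 = strip_domain("Abd_tml_1132@gmail.com")
print(str1)  # Abd_tml_1132
```

The result can then be passed to `fz.ratio` as `str1` in the snippet above.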
