
Time: 2014-08-04 08:17:06

Tags: python string hash locality-sensitive-hash

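I want to use locality-sensitive hashing (LSH) to approximately match strings. I have a large number of strings (more than 10M) that may contain typos. For each string, I want to compare it against all the other strings and select those whose edit distance falls within some threshold.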

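That is, the naive solution requires O(n^2) comparisons. To avoid this, I am thinking of using locality-sensitive hashing: similar strings should then land in the same bucket, so I only need to search within a bucket. That makes it O(n * C), where C is the bucket size.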
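
To put rough numbers on that difference, here is a back-of-the-envelope sketch. The value of n comes from the figure above; C = 100 is an assumed bucket size chosen only for illustration, since the real value depends on the data and the LSH parameters.

```python
# Rough comparison counts for the naive approach vs. a bucketed approach.
n = 10_000_000   # number of strings, as stated in the question
C = 100          # assumed average bucket size (illustrative only)

all_pairs = n * (n - 1) // 2   # every unordered pair compared once
bucketed = n * C               # each string compared only within its bucket

print(f"all pairs: {all_pairs:.2e}")
print(f"bucketed:  {bucketed:.2e}")
print(f"ratio:     {all_pairs / bucketed:,.0f}x")
```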



1 Answer:

Answer 0 (score: 10)

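The best academic resource I have found on this subject is Chapter 3 of "Mining of Massive Datasets", which gives an excellent overview of locality-sensitive hashing and minhashing.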
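
As a rough illustration of the two ideas, here is a minimal standard-library sketch of minhashing and LSH banding. It is not how any particular library implements them: the helper names (`shingles`, `minhash_signature`, `estimate_jaccard`, `band_keys`), the use of salted BLAKE2b as the hash family, and the choice of 128 hashes in 32 bands are all assumptions made for the example.

```python
import hashlib

def shingles(text, k=3):
    """Return the set of character k-grams of text (text must have >= k chars)."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def minhash_signature(items, num_perm=128):
    """For each of num_perm salted hash functions, keep the minimum hash over the set."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a different salt acts as a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for item in items
        ))
    return signature

def estimate_jaccard(sig_a, sig_b):
    """The fraction of matching signature positions estimates the Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

def jaccard(set_a, set_b):
    """Exact Jaccard similarity, for comparison with the estimate."""
    return len(set_a & set_b) / len(set_a | set_b)

def band_keys(signature, bands=32):
    """Split a signature into bands; two items become candidates if any band key matches."""
    rows = len(signature) // bands
    return [(b, tuple(signature[b * rows:(b + 1) * rows])) for b in range(bands)]

a = "minhash is a probabilistic data structure for estimating the similarity between datasets"
b = "finhash dis fa frobabilistic fata ftructure for festimating the fimilarity fetween fatasets"

sig_a = minhash_signature(shingles(a))
sig_b = minhash_signature(shingles(b))

print("exact Jaccard:    ", round(jaccard(shingles(a), shingles(b)), 3))
print("estimated Jaccard:", round(estimate_jaccard(sig_a, sig_b), 3))
print("shared band keys: ", len(set(band_keys(sig_a)) & set(band_keys(sig_b))))
```

Using each band key as a dictionary key gives the buckets: only strings that share at least one band key ever need to be compared directly.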


The Python datasketch library (pip install datasketch) has a great implementation. Here is an example showing that it can capture fuzzy string similarity:

from datasketch import MinHash, MinHashLSH
from nltk import ngrams

data = ['minhash is a probabilistic data structure for estimating the similarity between datasets',
  'finhash dis fa frobabilistic fata ftructure for festimating the fimilarity fetween fatasets',
  'weights controls the relative importance between minizing false positive',
  'wfights cfntrols the rflative ifportance befween minizing fflse posftive',
]

# Create a MinHashLSH index optimized for Jaccard threshold 0.5
# that accepts MinHash objects built with 128 permutation functions
lsh = MinHashLSH(threshold=0.5, num_perm=128)

# Create a MinHash object for each string from its character 3-grams
minhashes = {}
for c, i in enumerate(data):
  minhash = MinHash(num_perm=128)
  for d in ngrams(i, 3):
    # update() expects bytes, so join the 3-gram tuple and encode it
    minhash.update("".join(d).encode('utf-8'))
  lsh.insert(c, minhash)
  minhashes[c] = minhash

# Query the index with each string's own MinHash
for i in range(len(minhashes)):
  result = lsh.query(minhashes[i])
  print("Candidates with Jaccard similarity > 0.5 for input", i, ":", result)


Candidates with Jaccard similarity > 0.5 for input 0 : [0, 1]
Candidates with Jaccard similarity > 0.5 for input 1 : [0, 1]
Candidates with Jaccard similarity > 0.5 for input 2 : [2, 3]
Candidates with Jaccard similarity > 0.5 for input 3 : [2, 3]
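
Note that these candidates are chosen by approximate Jaccard similarity of 3-grams, not by edit distance, so if an edit-distance threshold is what matters, each candidate pair probably still needs an exact check. The sketch below shows one way that second step might look. The `levenshtein` and `verify` helpers, the sample strings, and the `candidates` dictionary are all hypothetical stand-ins for real query results.

```python
def levenshtein(a, b):
    """Dynamic-programming edit distance (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # drop ca
                current[j - 1] + 1,            # drop cb
                previous[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]

def verify(strings, candidates, max_distance):
    """Keep candidate pairs (i, j), i < j, whose edit distance is at most max_distance."""
    confirmed = set()
    for i, neighbours in candidates.items():
        for j in neighbours:
            if i < j and levenshtein(strings[i], strings[j]) <= max_distance:
                confirmed.add((i, j))
    return sorted(confirmed)

# Hypothetical data: candidates maps each index to the indices an LSH query returned.
strings = ["locality", "lokality", "hashing", "hushing", "bucket"]
candidates = {0: [0, 1], 1: [0, 1], 2: [2, 3, 4], 3: [2, 3], 4: [4, 2]}

print(verify(strings, candidates, max_distance=1))  # prints [(0, 1), (2, 3)]
```

Only the pairs that the index proposes are checked, so the expensive edit-distance computation runs on a small fraction of all possible pairs.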