Fuzzy grouping: grouping similar words

Date: 2012-07-18 06:30:24

Tags: python algorithm machine-learning nlp cluster-analysis

This question was asked here before:

What is a good strategy to group similar words?

But it gave no clear answer on how to "group" the items. The difflib-based solution is basically a search: for a given item, difflib can return the most similar words from a list. But how can that be used for grouping?

I want to reduce

['ape', 'appel', 'apple', 'peach', 'puppy']

to

['ape', 'appel', 'peach', 'puppy']

or

['ape', 'apple', 'peach', 'puppy']

One idea I tried was to iterate over the list for each item and, if get_close_matches returns more than one match, use it; otherwise leave the word unchanged. This partly worked, but it can suggest apple for appel and then appel for apple: those words would just swap places and nothing would change.
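
For illustration, here is a minimal sketch of that naive idea (standard-library difflib only) that reproduces the swap problem just described:

import difflib

words = ['ape', 'appel', 'apple', 'peach', 'puppy']

canonical = []
for w in words:
    others = [x for x in words if x != w]
    matches = difflib.get_close_matches(w, others)
    # Replace the word by its closest match if there is one, else keep it.
    canonical.append(matches[0] if matches else w)

print(canonical)
# 'appel' maps to 'apple' and 'apple' maps back to 'appel', so the two
# words merely trade places and the list never shrinks.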

I would appreciate any pointers, library names, etc.

Note: performance-wise we have a list of 300,000 items, and get_close_matches seems to be a bit slow. Does anyone know of a C/C++ based solution?

Thanks,

Note: further investigation showed that k-medoids is the right algorithm (as is hierarchical clustering), since k-medoids does not need a "center"; it takes/uses the data points themselves as centers (those points are called medoids, hence the name). In the word-grouping case, the medoid would be the representative element of that group/cluster.
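
For reference, here is a minimal sketch of the hierarchical-clustering route mentioned above. It assumes the python-Levenshtein package, and the cut-off distance of 3 is an arbitrary illustrative choice, not something from the question:

import itertools
import numpy as np
import Levenshtein as leven
from scipy.cluster.hierarchy import linkage, fcluster

words = ['ape', 'appel', 'apple', 'peach', 'puppy']

# Condensed pairwise Levenshtein distances (upper triangle, row by row),
# which is the input format scipy's linkage expects.
condensed = np.array([leven.distance(a, b)
                      for a, b in itertools.combinations(words, 2)],
                     dtype=float)

# Average-linkage hierarchical clustering, cut at distance 3.
Z = linkage(condensed, method='average')
labels = fcluster(Z, t=3, criterion='distance')

for lab in set(labels):
    print([w for w, l in zip(words, labels) if l == lab])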

5 Answers:

Answer 0 (score: 5):

You need to normalize the groups. In each group, pick one word or one coding that represents the group, then group the words by their representative.

Some possible approaches:

  • Pick the first encountered word.
  • Pick the lexicographically first word.
  • Derive a pattern from all the words.
  • Pick a unique index.
  • Use soundex as the pattern (see the sketch below).

Grouping the words may be difficult, however. If A is similar to B, and B is similar to C, A and C are not necessarily similar to each other. For example, with a distance limit of 1, 'bat' is close to 'cat' and 'cat' is close to 'car', yet 'bat' and 'car' are 2 apart. If B is the representative, both A and C can be included in the group; but if A or C is the representative, the other one cannot be included.
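
As an aside, here is a minimal sketch of the soundex alternative; the soundex function below is a hand-rolled version of standard American Soundex, not something from the original answer:

from itertools import groupby

def soundex(word):
    # Standard American Soundex digit table.
    codes = {}
    for letters, digit in (('bfpv', '1'), ('cgjkqsxz', '2'), ('dt', '3'),
                           ('l', '4'), ('mn', '5'), ('r', '6')):
        for ch in letters:
            codes[ch] = digit
    word = word.lower()
    out = word[0].upper()
    prev = codes.get(word[0], '')
    for ch in word[1:]:
        code = codes.get(ch, '')
        if code and code != prev:
            out += code
        if ch not in 'hw':  # h and w do not break a run of equal codes
            prev = code
    return (out + '000')[:4]

words = sorted(['ape', 'appel', 'apple', 'peach', 'puppy'], key=soundex)
for code, grp in groupby(words, key=soundex):
    print(code, list(grp))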


Going with the first alternative (picking the first encountered word):

class Seeder:
    def __init__(self):
        self.seeds = set()
        self.cache = dict()

    def get_seed(self, word):
        LIMIT = 2
        seed = self.cache.get(word)
        if seed is not None:
            return seed
        # Reuse an existing seed if the word is close enough to it;
        # otherwise the word becomes a new seed itself.
        for seed in self.seeds:
            if self.distance(seed, word) <= LIMIT:
                self.cache[word] = seed
                return seed
        self.seeds.add(word)
        self.cache[word] = word
        return word

    def distance(self, s1, s2):
        # Levenshtein edit distance via the usual dynamic-programming matrix.
        l1 = len(s1)
        l2 = len(s2)
        matrix = [list(range(zz, zz + l1 + 1)) for zz in range(l2 + 1)]
        for zz in range(0, l2):
            for sz in range(0, l1):
                if s1[sz] == s2[zz]:
                    matrix[zz+1][sz+1] = min(matrix[zz+1][sz] + 1, matrix[zz][sz+1] + 1, matrix[zz][sz])
                else:
                    matrix[zz+1][sz+1] = min(matrix[zz+1][sz] + 1, matrix[zz][sz+1] + 1, matrix[zz][sz] + 1)
        return matrix[l2][l1]

import itertools

def group_similar(words):
    seeder = Seeder()
    # Sorting by seed makes the words of each group adjacent, so one
    # groupby pass collects them (the cache keeps the keys consistent).
    words = sorted(words, key=seeder.get_seed)
    groups = itertools.groupby(words, key=seeder.get_seed)
    return [list(v) for k, v in groups]

Example:

import pprint

pprint.pprint(group_similar([
    'the', 'be', 'to', 'of', 'and', 'a', 'in', 'that', 'have',
    'I', 'it', 'for', 'not', 'on', 'with', 'he', 'as', 'you',
    'do', 'at', 'this', 'but', 'his', 'by', 'from', 'they', 'we',
    'say', 'her', 'she', 'or', 'an', 'will', 'my', 'one', 'all',
    'would', 'there', 'their', 'what', 'so', 'up', 'out', 'if',
    'about', 'who', 'get', 'which', 'go', 'me', 'when', 'make',
    'can', 'like', 'time', 'no', 'just', 'him', 'know', 'take',
    'people', 'into', 'year', 'your', 'good', 'some', 'could',
    'them', 'see', 'other', 'than', 'then', 'now', 'look',
    'only', 'come', 'its', 'over', 'think', 'also', 'back',
    'after', 'use', 'two', 'how', 'our', 'work', 'first', 'well',
    'way', 'even', 'new', 'want', 'because', 'any', 'these',
    'give', 'day', 'most', 'us'
]), width=120)

Output:

[['after'],
 ['also'],
 ['and', 'a', 'in', 'on', 'as', 'at', 'an', 'one', 'all', 'can', 'no', 'want', 'any'],
 ['back'],
 ['because'],
 ['but', 'about', 'get', 'just'],
 ['first'],
 ['from'],
 ['good', 'look'],
 ['have', 'make', 'give'],
 ['his', 'her', 'if', 'him', 'its', 'how', 'us'],
 ['into'],
 ['know', 'new'],
 ['like', 'time', 'take'],
 ['most'],
 ['of', 'I', 'it', 'for', 'not', 'he', 'you', 'do', 'by', 'we', 'or', 'my', 'so', 'up', 'out', 'go', 'me', 'now'],
 ['only'],
 ['over', 'our', 'even'],
 ['people'],
 ['say', 'she', 'way', 'day'],
 ['some', 'see', 'come'],
 ['the', 'be', 'to', 'that', 'this', 'they', 'there', 'their', 'them', 'other', 'then', 'use', 'two', 'these'],
 ['think'],
 ['well'],
 ['what', 'who', 'when', 'than'],
 ['with', 'will', 'which'],
 ['work'],
 ['would', 'could'],
 ['year', 'your']]

Answer 1 (score: 3):

You have to decide which word from the close matches you want to use. You could take the first element of the list that get_close_matches returns, or just apply a random choice on that list and pick one element of the close matches.

There has to be some kind of rule for it...

In [19]: import difflib

In [20]: a = ['ape', 'appel', 'apple', 'peach', 'puppy']

In [21]: a = ['appel', 'apple', 'peach', 'puppy']

In [22]: b = difflib.get_close_matches('ape',a)

In [23]: b
Out[23]: ['apple', 'appel']

In [24]: import random

In [25]: c = random.choice(b)

In [26]: c
Out[26]: 'apple'


Now remove c from the initial list, and that's it... For C++, you could use Levenshtein_distance.
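
Regarding the performance note in the question: the python-Levenshtein package wraps a C implementation, so it may already be fast enough before reaching for hand-written C/C++. A minimal usage sketch:

import Levenshtein  # pip install python-Levenshtein (C extension)

print(Levenshtein.distance('appel', 'apple'))  # 2 edits
print(Levenshtein.ratio('appel', 'apple'))     # similarity in [0, 1]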

Answer 2 (score: 3):

Here is another version, using the Affinity Propagation algorithm.

import numpy as np
import Levenshtein as leven
import itertools
from sklearn.cluster import AffinityPropagation

words = np.array(
    ['the', 'be', 'to', 'of', 'and', 'a', 'in', 'that', 'have',
     'I', 'it', 'for', 'not', 'on', 'with', 'he', 'as', 'you',
     'do', 'at', 'this', 'but', 'his', 'by', 'from', 'they', 'we',
     'say', 'her', 'she', 'or', 'an', 'will', 'my', 'one', 'all',
     'would', 'there', 'their', 'what', 'so', 'up', 'out', 'if',
     'about', 'who', 'get', 'which', 'go', 'me', 'when', 'make',
     'can', 'like', 'time', 'no', 'just', 'him', 'know', 'take',
     'people', 'into', 'year', 'your', 'good', 'some', 'could',
     'them', 'see', 'other', 'than', 'then', 'now', 'look',
     'only', 'come', 'its', 'over', 'think', 'also', 'back',
     'after', 'use', 'two', 'how', 'our', 'work', 'first', 'well',
     'way', 'even', 'new', 'want', 'because', 'any', 'these',
     'give', 'day', 'most', 'us'])

print "calculating distances..."

(dim,) = words.shape

f = lambda (x,y): -leven.distance(x,y)

res=np.fromiter(itertools.imap(f, itertools.product(words, words)), dtype=np.uint8)
A = np.reshape(res,(dim,dim))

af = AffinityPropagation().fit(A)
cluster_centers_indices = af.cluster_centers_indices_
labels = af.labels_

unique_labels = set(labels)
for i in unique_labels:
    print words[labels==i]

The distances have to be converted to similarities, which I did by negating the distance. The output is

['to' 'you' 'do' 'by' 'so' 'who' 'go' 'into' 'also' 'two']
['it' 'with' 'at' 'if' 'get' 'its' 'first']
['of' 'for' 'from' 'or' 'your' 'look' 'after' 'work']
['the' 'be' 'have' 'I' 'he' 'we' 'her' 'she' 'me' 'give']
['this' 'his' 'which' 'him']
['and' 'a' 'in' 'an' 'my' 'all' 'can' 'any']
['on' 'one' 'good' 'some' 'see' 'only' 'come' 'over']
['would' 'could']
['but' 'out' 'about' 'our' 'most']
['make' 'like' 'time' 'take' 'back']
['that' 'they' 'there' 'their' 'when' 'them' 'other' 'than' 'then' 'think'
 'even' 'these']
['not' 'no' 'know' 'now' 'how' 'new']
['will' 'people' 'year' 'well']
['say' 'what' 'way' 'want' 'day']
['because']
['as' 'up' 'just' 'use' 'us']

Answer 3 (score: 1):

Another approach could be matrix factorization using SVD. First we create the word distance matrix; for 100 words this is a 100 x 100 matrix holding the distance from each word to every other word. Then we run SVD on this matrix; the u of the resulting u, s, v can be seen as the membership strength of each word in each cluster.

Code

import numpy as np
import scipy.linalg as lin
import Levenshtein as leven
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
import itertools

words = np.array(
    ['the', 'be', 'to', 'of', 'and', 'a', 'in', 'that', 'have',
     'I', 'it', 'for', 'not', 'on', 'with', 'he', 'as', 'you',
     'do', 'at', 'this', 'but', 'his', 'by', 'from', 'they', 'we',
     'say', 'her', 'she', 'or', 'an', 'will', 'my', 'one', 'all',
     'would', 'there', 'their', 'what', 'so', 'up', 'out', 'if',
     'about', 'who', 'get', 'which', 'go', 'me', 'when', 'make',
     'can', 'like', 'time', 'no', 'just', 'him', 'know', 'take',
     'people', 'into', 'year', 'your', 'good', 'some', 'could',
     'them', 'see', 'other', 'than', 'then', 'now', 'look',
     'only', 'come', 'its', 'over', 'think', 'also', 'back',
     'after', 'use', 'two', 'how', 'our', 'work', 'first', 'well',
     'way', 'even', 'new', 'want', 'because', 'any', 'these',
     'give', 'day', 'most', 'us'])

print "calculating distances..."

(dim,) = words.shape

f = lambda (x,y): leven.distance(x,y)
res=np.fromiter(itertools.imap(f, itertools.product(words, words)),
                dtype=np.uint8)
A = np.reshape(res,(dim,dim))

print "svd..."

u,s,v = lin.svd(A, full_matrices=False)

print u.shape
print s.shape
print s
print v.shape

data = u[:,0:10]
k=KMeans(init='k-means++', k=25, n_init=10)
k.fit(data)
centroids = k.cluster_centers_
labels = k.labels_
print labels

for i in range(np.max(labels)):
    print words[labels==i]

def dist(x,y):   
    return np.sqrt(np.sum((x-y)**2, axis=1))

print "centroid points.."
for i,c in enumerate(centroids):
    idx = np.argmin(dist(c,data[labels==i]))
    print words[labels==i][idx]
    print words[labels==i]

plt.plot(centroids[:,0],centroids[:,1],'x')
plt.hold(True)
plt.plot(u[:,0], u[:,1], '.')
plt.show()

from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure()
ax = Axes3D(fig)
ax.plot(u[:,0], u[:,1], u[:,2],'.', zs=0,
        zdir='z', label='zs=0, zdir=z')
plt.show()

Results

any
['and' 'an' 'can' 'any']
do
['to' 'you' 'do' 'so' 'go' 'no' 'two' 'how']
when
['who' 'when' 'well']
my
['be' 'I' 'by' 'we' 'my' 'up' 'me' 'use']
your
['for' 'or' 'out' 'about' 'your' 'our']
its
['it' 'his' 'if' 'him' 'its']
could
['would' 'people' 'could']
this
['this' 'think' 'these']
she
['the' 'he' 'she' 'see']
back
['all' 'back' 'want']
one
['of' 'on' 'one' 'only' 'even' 'new']
just
['but' 'just' 'first' 'most']
come
['some' 'come']
that
['that' 'than']
way
['say' 'what' 'way' 'day']
like
['like' 'time' 'give']
in
['in' 'into']
get
['her' 'get' 'year']
because
['because']
will
['with' 'will' 'which']
over
['other' 'over' 'after']
as
['a' 'as' 'at' 'also' 'us']
them
['they' 'there' 'their' 'them' 'then']
good
['not' 'from' 'know' 'good' 'now' 'look' 'work']
have
['have' 'make' 'take']

The choice of k for the number of clusters matters; k=25, for instance, gives much better results than k=20.
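
As an aside (not from the original answer), one common way to choose k is a silhouette-score scan over candidate values, reusing the data matrix from above:

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

best_k, best_score = None, -1.0
for k in range(10, 31, 5):
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(data)
    score = silhouette_score(data, labels)
    if score > best_score:
        best_k, best_score = k, score
print(best_k, best_score)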

The code also picks a representative word for each cluster, by choosing the word whose u[..] coordinates are closest to the cluster centroid.

Answer 4 (score: 0):

Here is a medoids-based approach. First install MlPy. On Ubuntu:

sudo apt-get install python-mlpy

Then

import numpy as np
import mlpy

class distance:
    def compute(self, s1, s2):
        # Levenshtein edit distance via the usual dynamic-programming matrix.
        l1 = len(s1)
        l2 = len(s2)
        matrix = [list(range(zz, zz + l1 + 1)) for zz in range(l2 + 1)]
        for zz in range(0, l2):
            for sz in range(0, l1):
                if s1[sz] == s2[zz]:
                    matrix[zz+1][sz+1] = min(matrix[zz+1][sz] + 1, matrix[zz][sz+1] + 1, matrix[zz][sz])
                else:
                    matrix[zz+1][sz+1] = min(matrix[zz+1][sz] + 1, matrix[zz][sz+1] + 1, matrix[zz][sz] + 1)
        return matrix[l2][l1]

x = np.array(['ape', 'appel', 'apple', 'peach', 'puppy'])

km = mlpy.Kmedoids(k=3, dist=distance())
medoids, clusters, a, b = km.compute(x)

print(medoids)
print(clusters)
print(a)

print(x[medoids])
for i, c in enumerate(x[medoids]):
    print("medoid", c)
    print(x[clusters[a == i]])

Output

[4 3 1]
[0 2]
[2 2]
['puppy' 'peach' 'appel']
medoid puppy
[]
medoid peach
[]
medoid appel
['ape' 'apple']

With the bigger word list and using k = 10:

medoid he
['or' 'his' 'my' 'have' 'if' 'year' 'of' 'who' 'us' 'use' 'people' 'see'
 'make' 'be' 'up' 'we' 'the' 'one' 'her' 'by' 'it' 'him' 'she' 'me' 'over'
 'after' 'get' 'what' 'I']
medoid out
['just' 'only' 'your' 'you' 'could' 'our' 'most' 'first' 'would' 'but'
 'about']
medoid to
['from' 'go' 'its' 'do' 'into' 'so' 'for' 'also' 'no' 'two']
medoid now
['new' 'how' 'know' 'not']
medoid time
['like' 'take' 'come' 'some' 'give']
medoid because
[]
medoid an
['want' 'on' 'in' 'back' 'say' 'and' 'a' 'all' 'can' 'as' 'way' 'at' 'day'
 'any']
medoid look
['work' 'good']
medoid will
['with' 'well' 'which']
medoid then
['think' 'that' 'these' 'even' 'their' 'when' 'other' 'this' 'they' 'there'
 'than' 'them']