Question

Given the following input:

Text  id
$aKropotkin$bPetr Alekseevich$cKniaz',$f1842-1921.  34
$aKropotkin$bPetr Alekseevich$cKniaz',$f1842-1921.  98
$aKropotkin$bPetr Alekseevich$ckniaz',$f1842-1921.  152
$aKropotkin$bPetr Alekseevich$ckniaz',$f1842-1921.  245
$aKropotkin$bPetr Alekseevich$ckniaz,$f1842-1921    365
$aKropotkin$bPetr Alekseevich$ckniaz,$f1842-1921.   654
$aDescartes$bRene$f1596-1650.   964
$aDescartes$bRene$f1596-1650. 1364
$aDescartes$bRene$f1596-1650. 2547
$aDescartes$bRene$f1596-1650. 3547
$aDescartes$bRene$f1596-1650. 3678
$aDescartes$bRene$f1596-1650    54656
$aDescartes$bRené$f1596-1650    698545
$aDescartes$bRené$f1596-1650.   65455233
$aVoltaire,$f1694-1778. 54666
$aVoltaire,$f1694-1778  365421
$aVoltaire$f1694-1778.  654564

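I only want to build clusters of records that are true duplicates, based on the text in the first column.

I tried the sample code from the following post, but all of the texts end up clustered together: https://rajmak.wordpress.com/2013/04/27/clustering-text-map-reduce-in-python/

I need an approach that uses an algorithm with no false positives at all, producing output like this: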

Cluster 1:

$aKropotkin$bPetr Alekseevich$cKniaz',$f1842-1921.  34,98,152,245,365,654
$aKropotkin$bPetr Alekseevich$ckniaz',$f1842-1921.
$aKropotkin$bPetr Alekseevich$ckniaz,$f1842-1921
$aKropotkin$bPetr Alekseevich$ckniaz,$f1842-1921.

Cluster 2:

$aDescartes$bRene$f1596-1650.   964,1364,2547,3547,3678,54656,698545,65455233
$aDescartes$bRene$f1596-1650.
$aDescartes$bRene$f1596-1650.
$aDescartes$bRene$f1596-1650.
$aDescartes$bRene$f1596-1650.
$aDescartes$bRene$f1596-1650
$aDescartes$bRené$f1596-1650
$aDescartes$bRené$f1596-1650.

Cluster 3:

$aVoltaire,$f1694-1778. 54666,365421,654564
$aVoltaire,$f1694-1778
$aVoltaire$f1694-1778.

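Edit:

Here is what I have tried, and I think it is the closest I have come to what I am trying to do (but I am asking here for an elegant and efficient solution):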

# -*- coding: utf-8 -*-

import re
import string

from unidecode import unidecode

# Matches any single ASCII punctuation character.
PUNCTUATION = re.compile('[%s]' % re.escape(string.punctuation))

class Fingerprinter(object):
    '''
    Python implementation of Google Refine fingerprinting algorithm described here:
    https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth
    Requires the unidecode module: https://github.com/iki/unidecode
    '''
    def __init__(self, text):
        self.string = self._preprocess(text)

    def _preprocess(self, text):
        '''
        Strip leading and trailing whitespace, lowercase the string, remove all punctuation,
        in that order.
        '''
        return PUNCTUATION.sub('', text.strip().lower())

    def _latinize(self, text):
        '''
        Replaces unicode characters with closest Latin equivalent. For example,
        Alejandro González Iñárritu becomes Alejandro Gonzalez Inarritu.
        '''
        # Python 3 strings are already unicode, so no decode step is needed.
        return unidecode(text)

    def _unique_preserving_order(self, seq):
        '''
        Drops repeated items while keeping the first occurrence of each.
        '''
        seen = set()
        seen_add = seen.add
        return [x for x in seq if not (x in seen or seen_add(x))]

    def get_fingerprint(self):
        '''
        Gets conventional fingerprint.
        '''
        return self._latinize(' '.join(
            self._unique_preserving_order(
                sorted(self.string.split())
            )
        ))

    def get_ngram_fingerprint(self, n=1):
        '''
        Gets ngram fingerprint based on n-length shingles of the string.
        Default is 1.
        '''
        return self._latinize(''.join(
            self._unique_preserving_order(
                sorted([self.string[i:i + n] for i in range(len(self.string) - n + 1)])
            )
        ))

if __name__ == '__main__':
    f = Fingerprinter('Tom Cruise')
    print(f.get_fingerprint())
    print(f.get_ngram_fingerprint(n=1))

    f = Fingerprinter('Cruise, Tom')
    print(f.get_fingerprint())
    print(f.get_ngram_fingerprint(n=1))

    f = Fingerprinter('Paris')
    print(f.get_fingerprint())
    print(f.get_ngram_fingerprint(n=2))
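
To make the goal concrete, here is a rough sketch of how a normalization key in the spirit of the fingerprint above might be used to group the sample rows by exact key match. It is only an illustration under several assumptions: the names `normalize_key`, `cluster_records`, and `SAMPLE` are made up for this example, accents are stripped with the standard library's `unicodedata` instead of `unidecode`, and each input line is assumed to be the text followed by whitespace and the id. Because two rows are grouped only when their keys are identical, there is no similarity threshold to tune, but two genuinely different records could still collide if they differ only in case, accents, or punctuation.

```python
import re
import string
import unicodedata

# Matches any single ASCII punctuation character (including '$', ',', '.', "'", '-').
PUNCTUATION = re.compile('[%s]' % re.escape(string.punctuation))

# A few rows taken from the sample input in the question.
SAMPLE = """
$aKropotkin$bPetr Alekseevich$cKniaz',$f1842-1921.  34
$aKropotkin$bPetr Alekseevich$ckniaz,$f1842-1921    365
$aDescartes$bRene$f1596-1650.   964
$aDescartes$bRené$f1596-1650    698545
$aVoltaire,$f1694-1778. 54666
$aVoltaire$f1694-1778.  654564
"""


def normalize_key(text):
    """Build a comparison key: no accents, lowercase, no punctuation."""
    # NFKD splits 'é' into 'e' plus a combining accent, which is then dropped.
    decomposed = unicodedata.normalize('NFKD', text)
    unaccented = ''.join(ch for ch in decomposed if not unicodedata.combining(ch))
    no_punct = PUNCTUATION.sub('', unaccented.lower())
    # Collapse runs of whitespace so spacing differences do not matter.
    return ' '.join(no_punct.split())


def cluster_records(lines):
    """Group 'text  id' lines whose normalized keys are identical.

    Returns a list of dicts with the distinct text variants and the ids,
    in the order the clusters were first seen.
    """
    clusters = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        # The id is assumed to be the last whitespace-separated token.
        text, record_id = line.rsplit(None, 1)
        entry = clusters.setdefault(normalize_key(text), {'variants': [], 'ids': []})
        if text not in entry['variants']:
            entry['variants'].append(text)
        entry['ids'].append(record_id)
    return list(clusters.values())


if __name__ == '__main__':
    for number, entry in enumerate(cluster_records(SAMPLE.splitlines()), start=1):
        print('Cluster %d: %s' % (number, ','.join(entry['ids'])))
        for variant in entry['variants']:
            print('  ' + variant)
```

Stripping the `$` delimiters along with the other punctuation keeps the key simple; if subfield boundaries matter for the data, a variant that splits on `$` first and normalizes each subfield separately may be a safer choice.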

Python: identifying clusters in text

0 answers: