Question

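In PHP, I use the levenshtein() function to calculate the Levenshtein distance. It works as expected for simple characters, but for characters with diacritics, as in this example: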

echo levenshtein('à', 'a');

it returns "2". Only one substitution is needed in this case, so I expected it to return "1".

Am I missing something?

Answer 1

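I think it might be useful to post this comment from the PHP manual as an answer to this question, so here it is:

The levenshtein function processes each byte of the input string individually. For multibyte encodings such as UTF-8, it may therefore give misleading results.

Example with French accented words:
- levenshtein('notre', 'votre') = 1
- levenshtein('notre', 'nôtre') = 2 (huh?!)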
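
To see where the 2 comes from, here is a quick check. It assumes the string is UTF-8 encoded; the bytes are written as escapes so the result does not depend on how the file itself is saved.

```php
<?php
// "à" in UTF-8 is two bytes (0xC3 0xA0).
$accented = "\xC3\xA0";

$byteCount = strlen($accented);           // counts bytes, not characters
$hex       = bin2hex($accented);          // shows the raw bytes
$distance  = levenshtein($accented, 'a'); // compares byte by byte

echo "bytes: $byteCount ($hex), distance: $distance\n";
// prints "bytes: 2 (c3a0), distance: 2"
```

One byte is substituted and the other deleted, hence two edits instead of one.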

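You can easily find a multibyte-compliant PHP implementation of the levenshtein function, but it will of course be much slower than the C implementation.

Another option is to convert the strings to a single-byte (lossless) encoding so that they can be fed to the fast core levenshtein function.

Here is the conversion function I used with a search engine storing UTF-8 strings, along with a quick benchmark. I hope it helps.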

<?php
// Convert a UTF-8 encoded string to a single-byte string suitable for
// functions such as levenshtein.
// 
// The function simply uses (and updates) a tailored dynamic encoding
// (in/out map parameter) where non-ascii characters are remapped to
// the range [128-255] in order of appearance.
//
// Thus it supports up to 128 different multibyte code points max over
// the whole set of strings sharing this encoding.
//
function utf8_to_extended_ascii($str, &$map)
{
    // find all multibyte characters (cf. utf-8 encoding specs)
    $matches = array();
    if (!preg_match_all('/[\xC0-\xF7][\x80-\xBF]+/', $str, $matches))
        return $str; // plain ascii string

    // update the encoding map with the characters not already met
    foreach ($matches[0] as $mbc)
        if (!isset($map[$mbc]))
            $map[$mbc] = chr(128 + count($map));

    // finally remap non-ascii characters
    return strtr($str, $map);
}

// Didactic example showing the usage of the previous conversion function but,
// for better performance, in a real application with a single input string
// matched against many strings from a database, you will probably want to
// pre-encode the input only once.
//
function levenshtein_utf8($s1, $s2)
{
    $charMap = array();
    $s1 = utf8_to_extended_ascii($s1, $charMap);
    $s2 = utf8_to_extended_ascii($s2, $charMap);

    return levenshtein($s1, $s2);
}
?>

Results (for about 6000 calls):
- reference time, core C function (single-byte): 30 ms
- utf8 to ext-ascii conversion + core function: 90 ms
- full PHP implementation: 3000 ms
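
The code comments above suggest pre-encoding the input only once when matching it against many strings. A minimal sketch of that idea follows. The helper name closest_match_sketch is hypothetical, and the conversion function is repeated (behind a guard) only so the example runs on its own.

```php
<?php
// Copy of the answer's conversion function, guarded against redeclaration.
if (!function_exists('utf8_to_extended_ascii')) {
    function utf8_to_extended_ascii($str, &$map)
    {
        $matches = array();
        if (!preg_match_all('/[\xC0-\xF7][\x80-\xBF]+/', $str, $matches)) {
            return $str; // plain ASCII string
        }
        foreach ($matches[0] as $mbc) {
            if (!isset($map[$mbc])) {
                $map[$mbc] = chr(128 + count($map));
            }
        }
        return strtr($str, $map);
    }
}

// Hypothetical helper: returns array(closest candidate, distance).
// The map is shared by all strings, so the 128-character limit described
// in the answer applies to the whole candidate set.
function closest_match_sketch($input, array $candidates)
{
    $map = array();
    $encodedInput = utf8_to_extended_ascii($input, $map); // encoded only once
    $best = null;
    $bestDistance = PHP_INT_MAX;
    foreach ($candidates as $candidate) {
        $encoded = utf8_to_extended_ascii($candidate, $map);
        $distance = levenshtein($encodedInput, $encoded);
        if ($distance < $bestDistance) {
            $best = $candidate;
            $bestDistance = $distance;
        }
    }
    return array($best, $bestDistance);
}

// "nôtre", written with explicit UTF-8 bytes
list($word, $distance) = closest_match_sketch("n\xC3\xB4tre", array('votre', 'notre'));
echo "$word: $distance\n"; // prints "notre: 1"
```

Characters added to the map by later candidates do not invalidate the already encoded input, because existing entries are never changed.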

Answer 2

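Like many PHP functions, the default PHP levenshtein() is not multibyte aware. So when it processes strings containing Unicode characters, it handles each byte separately and ends up changing both bytes of a two-byte character.

There is no multibyte version (i.e. mb_levenshtein()), so you have two options:

1) Re-implement the function yourself using the mb_ functions. Possible example code from a Gist: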

<?php
function levenshtein_php($str1, $str2){
    $length1 = mb_strlen( $str1, 'UTF-8');
    $length2 = mb_strlen( $str2, 'UTF-8');
    if( $length1 < $length2) return levenshtein_php($str2, $str1);
    if( $length1 == 0 ) return $length2;
    if( $str1 === $str2) return 0;
    $prevRow = range( 0, $length2);
    $currentRow = array();
    for ( $i = 0; $i < $length1; $i++ ) {
        $currentRow=array();
        $currentRow[0] = $i + 1;
        $c1 = mb_substr( $str1, $i, 1, 'UTF-8') ;
        for ( $j = 0; $j < $length2; $j++ ) {
            $c2 = mb_substr( $str2, $j, 1, 'UTF-8' );
            $insertions = $prevRow[$j+1] + 1;
            $deletions = $currentRow[$j] + 1;
            $substitutions = $prevRow[$j] + (($c1 !== $c2)?1:0);
            $currentRow[] = min($insertions, $deletions, $substitutions);
        }
        $prevRow = $currentRow;
    }
    return $prevRow[$length2];
}
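
As a variation on the function above (not the Gist's code), the strings can be split into characters once up front instead of calling mb_substr() inside the loops. This is only a sketch: the name levenshtein_chars_sketch is hypothetical, and it assumes valid UTF-8 input.

```php
<?php
// Hypothetical variant: character-level Levenshtein distance for UTF-8.
function levenshtein_chars_sketch($a, $b)
{
    // The /u modifier makes PCRE split on UTF-8 characters, not bytes.
    $charsA = preg_split('//u', $a, -1, PREG_SPLIT_NO_EMPTY);
    $charsB = preg_split('//u', $b, -1, PREG_SPLIT_NO_EMPTY);
    if ($charsA === false || $charsB === false) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }

    $lengthB = count($charsB);
    $prevRow = range(0, $lengthB);
    foreach ($charsA as $i => $charA) {
        $currentRow = array($i + 1);
        foreach ($charsB as $j => $charB) {
            $currentRow[] = min(
                $prevRow[$j + 1] + 1,                      // drop $charA
                $currentRow[$j] + 1,                       // add $charB
                $prevRow[$j] + ($charA === $charB ? 0 : 1) // substitute
            );
        }
        $prevRow = $currentRow;
    }
    return $prevRow[$lengthB];
}

echo levenshtein_chars_sketch("\xC3\xA0", 'a'), "\n"; // "à" vs "a": prints 1
```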

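2) Convert your string's Unicode characters to ASCII. If you specifically want to measure the Levenshtein difference between characters with and without diacritics, this is probably not what you want.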
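
To illustrate the caveat in option 2, here is a sketch that folds a few accented characters to ASCII before calling levenshtein(). The function name and its tiny table are hypothetical; a real application would need a far larger table or a transliteration library.

```php
<?php
// Hypothetical, deliberately small folding table (UTF-8 bytes => ASCII).
function fold_to_ascii_sketch($str)
{
    static $table = array(
        "\xC3\xA0" => 'a', // à
        "\xC3\xA9" => 'e', // é
        "\xC3\xB4" => 'o', // ô
        "\xC3\xBC" => 'u', // ü
    );
    return strtr($str, $table);
}

$folded   = fold_to_ascii_sketch("n\xC3\xB4tre"); // "nôtre" becomes "notre"
$distance = levenshtein($folded, 'notre');

echo "$folded: $distance\n"; // prints "notre: 0"
```

The distance is 0 rather than 1, because the diacritic is discarded before the comparison.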

Levenshtein distance on characters with diacritics
