如何计算给定2个字符串的距离相似性度量?

时间:2012-02-26 14:05:46

标签: c# .net levenshtein-distance measure similarity

我需要计算2个字符串之间的相似度。那究竟是什么意思呢?让我用一个例子来解释:

  • 真实的词:hospital
  • 误区:haspita

现在我的目标是确定修改错误单词以获得真实单词所需的字符数。在这个例子中,我需要修改2个字母。那么百分比是多少?我总是把真正的词长度。所以它变为2/8 = 25%所以这两个给定的字符串DSM是75%。

如何以性能为关键考虑来实现这一目标?

7 个答案:

答案 0 :(得分:66)

几周前我刚刚解决了这个问题。既然有人现在问,我会分享代码。在我的详尽测试中,即使没有提供最大距离,我的代码也比维基百科上的C#示例快10倍。当提供最大距离时,此性能增益会增加到30x - 100x +。请注意性能的几个关键点:

  • 如果需要反复比较相同的单词,请先将单词转换为整数数组。 Damerau-Levenshtein算法包括许多>,<,==比较,intschars快得多。
  • 如果距离超过提供的最大值
  • ,它包括一个退出的短路机制
  • 使用旋转的三个数组,而不是像我在别处看到的所有实现中的大规模矩阵
  • 确保您的数组切片跨越较短的字宽。

代码(如果在参数声明中用int[]替换String,它的工作方式完全相同:

/// <summary>
/// Computes the Damerau-Levenshtein Distance between two strings, represented as arrays of
/// integers, where each integer represents the code point of a character in the source string.
/// Includes an optional threshhold which can be used to indicate the maximum allowable distance.
/// </summary>
/// <param name="source">An array of the code points of the first string</param>
/// <param name="target">An array of the code points of the second string</param>
/// <param name="threshold">Maximum allowable distance</param>
/// <returns>Int.MaxValue if threshhold exceeded; otherwise the Damerau-Leveshteim distance between the strings</returns>
public static int DamerauLevenshteinDistance(int[] source, int[] target, int threshold) {

    int length1 = source.Length;
    int length2 = target.Length;

    // Return trivial case - difference in string lengths exceeds threshhold
    if (Math.Abs(length1 - length2) > threshold) { return int.MaxValue; }

    // Ensure arrays [i] / length1 use shorter length 
    if (length1 > length2) {
        Swap(ref target, ref source);
        Swap(ref length1, ref length2);
    }

    int maxi = length1;
    int maxj = length2;

    int[] dCurrent = new int[maxi + 1];
    int[] dMinus1 = new int[maxi + 1];
    int[] dMinus2 = new int[maxi + 1];
    int[] dSwap;

    for (int i = 0; i <= maxi; i++) { dCurrent[i] = i; }

    int jm1 = 0, im1 = 0, im2 = -1;

    for (int j = 1; j <= maxj; j++) {

        // Rotate
        dSwap = dMinus2;
        dMinus2 = dMinus1;
        dMinus1 = dCurrent;
        dCurrent = dSwap;

        // Initialize
        int minDistance = int.MaxValue;
        dCurrent[0] = j;
        im1 = 0;
        im2 = -1;

        for (int i = 1; i <= maxi; i++) {

            int cost = source[im1] == target[jm1] ? 0 : 1;

            int del = dCurrent[im1] + 1;
            int ins = dMinus1[i] + 1;
            int sub = dMinus1[im1] + cost;

            //Fastest execution for min value of 3 integers
            int min = (del > ins) ? (ins > sub ? sub : ins) : (del > sub ? sub : del);

            if (i > 1 && j > 1 && source[im2] == target[jm1] && source[im1] == target[j - 2])
                min = Math.Min(min, dMinus2[im2] + cost);

            dCurrent[i] = min;
            if (min < minDistance) { minDistance = min; }
            im1++;
            im2++;
        }
        jm1++;
        if (minDistance > threshold) { return int.MaxValue; }
    }

    int result = dCurrent[maxi];
    return (result > threshold) ? int.MaxValue : result;
}

Swap的位置:

static void Swap<T>(ref T arg1,ref T arg2) {
    T temp = arg1;
    arg1 = arg2;
    arg2 = temp;
}

答案 1 :(得分:45)

您要找的是编辑距离Levenshtein distance。维基百科的文章解释了它是如何计算的,底部有一段很好的伪代码,可以帮助你很容易地用C#编写这个算法。

以下是第一个网站的实施方案:

private static int  CalcLevenshteinDistance(string a, string b)
    {
    if (String.IsNullOrEmpty(a) && String.IsNullOrEmpty(b)) {
        return 0;
    }
    if (String.IsNullOrEmpty(a)) {
        return b.Length;
    }
    if (String.IsNullOrEmpty(b)) {
        return a.Length;
    }
    int  lengthA   = a.Length;
    int  lengthB   = b.Length;
    var  distances = new int[lengthA + 1, lengthB + 1];
    for (int i = 0;  i <= lengthA;  distances[i, 0] = i++);
    for (int j = 0;  j <= lengthB;  distances[0, j] = j++);

    for (int i = 1;  i <= lengthA;  i++)
        for (int j = 1;  j <= lengthB;  j++)
            {
            int  cost = b[j - 1] == a[i - 1] ? 0 : 1;
            distances[i, j] = Math.Min
                (
                Math.Min(distances[i - 1, j] + 1, distances[i, j - 1] + 1),
                distances[i - 1, j - 1] + cost
                );
            }
    return distances[lengthA, lengthB];
    }

答案 2 :(得分:35)

可以使用大量的字符串相似性距离算法。这里列出的一些(但没有详尽列出):

包含所有这些实现的库称为SimMetrics 它有java和c#实现。

答案 3 :(得分:16)

我发现LevenshteinJaro Winkler非常适合字符串之间的小差异,例如:

  • 拼写错误;或
  • ö而不是人名中的 o

然而,当比较一些文章标题时,文本的重要部分相同但边缘有“噪音”,Smith-Waterman-Gotoh非常棒:

比较这两个标题(虽然相同,但措辞与不同来源不同):

  

来自大肠杆菌的核酸内切酶,在紫外线照射的DNA中引入单核苷酸链断裂

     

内切核酸酶III:来自大肠杆菌的核酸内切酶,在紫外线照射的DNA中引入单个多核苷酸链断裂

This site that provides algorithm comparison字符串显示:

  • Levenshtein:81
  • Smith-Waterman Gotoh 94
  • Jaro Winkler 78

Jaro Winkler和Levenshtein在检测相似性方面不如Smith Waterman Gotoh能干。如果我们比较不同文章但有一些匹配文本的两个标题:

  

高等植物的脂肪代谢。酰基硫酯酶在酰基辅酶 A和酰基 - 酰基载体蛋白代谢中的作用

     

高等植物脂肪代谢。复合脂质混合物中酰基 - 酰基载体蛋白酰基辅酶 A的测定

Jaro Winkler给出了误报,但Smith Waterman Gotoh没有:

  • Levenshtein:54
  • Smith-Waterman Gotoh 49
  • Jaro Winkler 89

正如Anastasiosyal所指出的那样,SimMetrics具有这些算法的java代码。我使用SmithWatermanGotoh java code from SimMetrics成功了。

答案 4 :(得分:6)

这是我对Damerau Levenshtein Distance的实现,它不仅返回相似系数,还返回更正后的单词中的错误位置(此功能可用于文本编辑器)。此外,我的实现支持不同的错误权重(替换,删除,插入,转置)。

public static List<Mistake> OptimalStringAlignmentDistance(
  string word, string correctedWord,
  bool transposition = true,
  int substitutionCost = 1,
  int insertionCost = 1,
  int deletionCost = 1,
  int transpositionCost = 1)
{
    int w_length = word.Length;
    int cw_length = correctedWord.Length;
    var d = new KeyValuePair<int, CharMistakeType>[w_length + 1, cw_length + 1];
    var result = new List<Mistake>(Math.Max(w_length, cw_length));

    if (w_length == 0)
    {
        for (int i = 0; i < cw_length; i++)
            result.Add(new Mistake(i, CharMistakeType.Insertion));
        return result;
    }

    for (int i = 0; i <= w_length; i++)
        d[i, 0] = new KeyValuePair<int, CharMistakeType>(i, CharMistakeType.None);

    for (int j = 0; j <= cw_length; j++)
        d[0, j] = new KeyValuePair<int, CharMistakeType>(j, CharMistakeType.None);

    for (int i = 1; i <= w_length; i++)
    {
        for (int j = 1; j <= cw_length; j++)
        {
            bool equal = correctedWord[j - 1] == word[i - 1];
            int delCost = d[i - 1, j].Key + deletionCost;
            int insCost = d[i, j - 1].Key + insertionCost;
            int subCost = d[i - 1, j - 1].Key;
            if (!equal)
                subCost += substitutionCost;
            int transCost = int.MaxValue;
            if (transposition && i > 1 && j > 1 && word[i - 1] == correctedWord[j - 2] && word[i - 2] == correctedWord[j - 1])
            {
                transCost = d[i - 2, j - 2].Key;
                if (!equal)
                    transCost += transpositionCost;
            }

            int min = delCost;
            CharMistakeType mistakeType = CharMistakeType.Deletion;
            if (insCost < min)
            {
                min = insCost;
                mistakeType = CharMistakeType.Insertion;
            }
            if (subCost < min)
            {
                min = subCost;
                mistakeType = equal ? CharMistakeType.None : CharMistakeType.Substitution;
            }
            if (transCost < min)
            {
                min = transCost;
                mistakeType = CharMistakeType.Transposition;
            }

            d[i, j] = new KeyValuePair<int, CharMistakeType>(min, mistakeType);
        }
    }

    int w_ind = w_length;
    int cw_ind = cw_length;
    while (w_ind >= 0 && cw_ind >= 0)
    {
        switch (d[w_ind, cw_ind].Value)
        {
            case CharMistakeType.None:
                w_ind--;
                cw_ind--;
                break;
            case CharMistakeType.Substitution:
                result.Add(new Mistake(cw_ind - 1, CharMistakeType.Substitution));
                w_ind--;
                cw_ind--;
                break;
            case CharMistakeType.Deletion:
                result.Add(new Mistake(cw_ind, CharMistakeType.Deletion));
                w_ind--;
                break;
            case CharMistakeType.Insertion:
                result.Add(new Mistake(cw_ind - 1, CharMistakeType.Insertion));
                cw_ind--;
                break;
            case CharMistakeType.Transposition:
                result.Add(new Mistake(cw_ind - 2, CharMistakeType.Transposition));
                w_ind -= 2;
                cw_ind -= 2;
                break;
        }
    }
    if (d[w_length, cw_length].Key > result.Count)
    {
        int delMistakesCount = d[w_length, cw_length].Key - result.Count;
        for (int i = 0; i < delMistakesCount; i++)
            result.Add(new Mistake(0, CharMistakeType.Deletion));
    }

    result.Reverse();

    return result;
}

public struct Mistake
{
    public int Position;
    public CharMistakeType Type;

    public Mistake(int position, CharMistakeType type)
    {
        Position = position;
        Type = type;
    }

    public override string ToString()
    {
        return Position + ", " + Type;
    }
}

public enum CharMistakeType
{
    None,
    Substitution,
    Insertion,
    Deletion,
    Transposition
}

此代码是我项目的一部分:Yandex-Linguistics.NET

我写了一些tests,在我看来这种方法正在发挥作用。

但欢迎提出意见和评论。

答案 5 :(得分:4)

这是另一种方法:

评论太长了。

找到相似性的典型方法是Levenshtein距离,毫无疑问是一个可用代码的库。

不幸的是,这需要与每个字符串进行比较。如果距离大于某个阈值,您可以编写一个专门版本的代码来使计算短路,您仍然需要进行所有比较。

另一个想法是使用一些三元组或n元语法的变体。这些是n个字符的序列(或n个字或n个基因组序列或n个)。保持三元组到字符串的映射,并选择具有最大重叠的那些。 n的典型选择是&#34; 3&#34;因此得名。

例如,英语会有这些三元组:

  • 主机
  • NGL
  • GLI
  • LIS
  • ISH

英格兰会:

  • 主机
  • NGL
  • GLA
  • LAN

嗯,7场比赛中有2场(或10投4中)。如果这适合您,您可以索引trigram / string表并获得更快的搜索。

你也可以将它与Levenshtein结合使用,以减少与那些具有最小n-gram数量的比较集。

答案 6 :(得分:0)

这是一个VB.net实现:

Public Shared Function LevenshteinDistance(ByVal v1 As String, ByVal v2 As String) As Integer
    Dim cost(v1.Length, v2.Length) As Integer
    If v1.Length = 0 Then
        Return v2.Length                'if string 1 is empty, the number of edits will be the insertion of all characters in string 2
    ElseIf v2.Length = 0 Then
        Return v1.Length                'if string 2 is empty, the number of edits will be the insertion of all characters in string 1
    Else
        'setup the base costs for inserting the correct characters
        For v1Count As Integer = 0 To v1.Length
            cost(v1Count, 0) = v1Count
        Next v1Count
        For v2Count As Integer = 0 To v2.Length
            cost(0, v2Count) = v2Count
        Next v2Count
        'now work out the cheapest route to having the correct characters
        For v1Count As Integer = 1 To v1.Length
            For v2Count As Integer = 1 To v2.Length
                'the first min term is the cost of editing the character in place (which will be the cost-to-date or the cost-to-date + 1 (depending on whether a change is required)
                'the second min term is the cost of inserting the correct character into string 1 (cost-to-date + 1), 
                'the third min term is the cost of inserting the correct character into string 2 (cost-to-date + 1) and 
                cost(v1Count, v2Count) = Math.Min(
                    cost(v1Count - 1, v2Count - 1) + If(v1.Chars(v1Count - 1) = v2.Chars(v2Count - 1), 0, 1),
                    Math.Min(
                        cost(v1Count - 1, v2Count) + 1,
                        cost(v1Count, v2Count - 1) + 1
                    )
                )
            Next v2Count
        Next v1Count

        'the final result is the cheapest cost to get the two strings to match, which is the bottom right cell in the matrix
        'in the event of strings being equal, this will be the result of zipping diagonally down the matrix (which will be square as the strings are the same length)
        Return cost(v1.Length, v2.Length)
    End If
End Function