Java中的相似性字符串比较

时间:2009-06-05 09:55:00

标签: java string-comparison

我想比较几个字符串,找到最相似的字符串。我想知道是否有任何库,方法或最佳实践会返回我哪些字符串更类似于其他字符串。例如:

  • “快速狐狸跳” - > “狐狸跳了起来”
  • “快速狐狸跳” - > “狐狸”

这种比较会返回第一个比第二个更相似。

我想我需要一些方法,例如:

double similarityIndex(String s1, String s2)

某处有这样的事吗?

编辑:为什么我这样做?我正在编写一个脚本,将MS Project文件的输出与处理任务的某些遗留系统的输出进行比较。由于遗留系统的字段宽度非常有限,因此在添加值时,将缩写描述。我想要一些半自动的方式来查找MS Project中哪些条目与系统上的条目类似,这样我就可以获得生成的密钥。它有缺点,因为它必须仍然手动检查,但它会节省大量的工作

11 个答案:

答案 0 :(得分:139)

在许多图书馆中使用以0%-100%的方式计算两个字符串之间的相似性的常用方法是测量您需要更改多少(以%为单位)较长的字符串将其变为较短的字符串:

/**
 * Calculates the similarity (a number within 0 and 1) between two strings.
 */
public static double similarity(String s1, String s2) {
  String longer = s1, shorter = s2;
  if (s1.length() < s2.length()) { // longer should always have greater length
    longer = s2; shorter = s1;
  }
  int longerLength = longer.length();
  if (longerLength == 0) { return 1.0; /* both strings are zero length */ }
  return (longerLength - editDistance(longer, shorter)) / (double) longerLength;
}
// you can use StringUtils.getLevenshteinDistance() as the editDistance() function
// full copy-paste working code is below


计算editDistance()

上面的editDistance()函数可以计算两个字符串之间的编辑距离。这一步有several implementations,每个都可能更适合特定情况。最常见的是 Levenshtein distance algorithm ,我们将在下面的示例中使用它(对于非常大的字符串,其他算法可能会表现得更好)。< / p>

以下是计算编辑距离的两个选项:


工作示例:

See online demo here.

public class StringSimilarity {

  /**
   * Calculates the similarity (a number within 0 and 1) between two strings.
   */
  public static double similarity(String s1, String s2) {
    String longer = s1, shorter = s2;
    if (s1.length() < s2.length()) { // longer should always have greater length
      longer = s2; shorter = s1;
    }
    int longerLength = longer.length();
    if (longerLength == 0) { return 1.0; /* both strings are zero length */ }
    /* // If you have Apache Commons Text, you can use it to calculate the edit distance:
    LevenshteinDistance levenshteinDistance = new LevenshteinDistance();
    return (longerLength - levenshteinDistance.apply(longer, shorter)) / (double) longerLength; */
    return (longerLength - editDistance(longer, shorter)) / (double) longerLength;

  }

  // Example implementation of the Levenshtein Edit Distance
  // See http://rosettacode.org/wiki/Levenshtein_distance#Java
  public static int editDistance(String s1, String s2) {
    s1 = s1.toLowerCase();
    s2 = s2.toLowerCase();

    int[] costs = new int[s2.length() + 1];
    for (int i = 0; i <= s1.length(); i++) {
      int lastValue = i;
      for (int j = 0; j <= s2.length(); j++) {
        if (i == 0)
          costs[j] = j;
        else {
          if (j > 0) {
            int newValue = costs[j - 1];
            if (s1.charAt(i - 1) != s2.charAt(j - 1))
              newValue = Math.min(Math.min(newValue, lastValue),
                  costs[j]) + 1;
            costs[j - 1] = lastValue;
            lastValue = newValue;
          }
        }
      }
      if (i > 0)
        costs[s2.length()] = lastValue;
    }
    return costs[s2.length()];
  }

  public static void printSimilarity(String s, String t) {
    System.out.println(String.format(
      "%.3f is the similarity between \"%s\" and \"%s\"", similarity(s, t), s, t));
  }

  public static void main(String[] args) {
    printSimilarity("", "");
    printSimilarity("1234567890", "1");
    printSimilarity("1234567890", "123");
    printSimilarity("1234567890", "1234567");
    printSimilarity("1234567890", "1234567890");
    printSimilarity("1234567890", "1234567980");
    printSimilarity("47/2010", "472010");
    printSimilarity("47/2010", "472011");
    printSimilarity("47/2010", "AB.CDEF");
    printSimilarity("47/2010", "4B.CDEFG");
    printSimilarity("47/2010", "AB.CDEFG");
    printSimilarity("The quick fox jumped", "The fox jumped");
    printSimilarity("The quick fox jumped", "The fox");
    printSimilarity("kitten", "sitting");
  }

}

<强>输出:

1.000 is the similarity between "" and ""
0.100 is the similarity between "1234567890" and "1"
0.300 is the similarity between "1234567890" and "123"
0.700 is the similarity between "1234567890" and "1234567"
1.000 is the similarity between "1234567890" and "1234567890"
0.800 is the similarity between "1234567890" and "1234567980"
0.857 is the similarity between "47/2010" and "472010"
0.714 is the similarity between "47/2010" and "472011"
0.000 is the similarity between "47/2010" and "AB.CDEF"
0.125 is the similarity between "47/2010" and "4B.CDEFG"
0.000 is the similarity between "47/2010" and "AB.CDEFG"
0.700 is the similarity between "The quick fox jumped" and "The fox jumped"
0.350 is the similarity between "The quick fox jumped" and "The fox"
0.571 is the similarity between "kitten" and "sitting"

答案 1 :(得分:77)

是的,有许多记录良好的算法,如:

  • 余弦相似度
  • Jaccard相似度
  • 骰子系数
  • 匹配相似度
  • 重叠相似性
  • 等等

另外you can check this

还检查这些项目:

答案 2 :(得分:13)

我将Levenshtein distance algorithm翻译成了JavaScript:

String.prototype.LevenshteinDistance = function (s2) {
    var array = new Array(this.length + 1);
    for (var i = 0; i < this.length + 1; i++)
        array[i] = new Array(s2.length + 1);

    for (var i = 0; i < this.length + 1; i++)
        array[i][0] = i;
    for (var j = 0; j < s2.length + 1; j++)
        array[0][j] = j;

    for (var i = 1; i < this.length + 1; i++) {
        for (var j = 1; j < s2.length + 1; j++) {
            if (this[i - 1] == s2[j - 1]) array[i][j] = array[i - 1][j - 1];
            else {
                array[i][j] = Math.min(array[i][j - 1] + 1, array[i - 1][j] + 1);
                array[i][j] = Math.min(array[i][j], array[i - 1][j - 1] + 1);
            }
        }
    }
    return array[this.length][s2.length];
};

答案 3 :(得分:11)

您可以使用Levenshtein距离来计算两个字符串之间的差异。 http://en.wikipedia.org/wiki/Levenshtein_distance

答案 4 :(得分:9)

确实存在很多字符串相似性度量:

  • Levenshtein编辑距离;
  • Damerau-Levenshtein距离;
  • Jaro-Winkler的相似性;
  • 最长公共子序列编辑距离;
  • Q-Gram(Ukkonen);
  • n-Gram距离(Kondrak);
  • Jaccard指数;
  • Sorensen-Dice系数;
  • 余弦相似度;
  • ...

你可以在这里找到解释和java实现: https://github.com/tdebatty/java-string-similarity

答案 5 :(得分:5)

您可以使用apache commons java library来实现此目的。看看其中的这两个功能:
  - getLevenshteinDistance
  - getFuzzyDistance

答案 6 :(得分:3)

理论上,您可以比较edit distances

答案 7 :(得分:3)

这通常使用edit distance指标来完成。搜索“编辑距离java”会显示许多库,例如this one

答案 8 :(得分:3)

如果您的字符串变成文档,对我来说就像plagiarism finder。也许用这个术语进行搜索会产生好的结果。

“编程集体智慧”有一章确定两个文件是否相似。代码是用Python编写的,但它很干净且易于移植。

答案 9 :(得分:3)

感谢第一个回答者,我认为有两个计算的computeEditDistance(s1,s2)。由于花费了大量时间,决定提高代码的性能。所以:

public class LevenshteinDistance {

public static int computeEditDistance(String s1, String s2) {
    s1 = s1.toLowerCase();
    s2 = s2.toLowerCase();

    int[] costs = new int[s2.length() + 1];
    for (int i = 0; i <= s1.length(); i++) {
        int lastValue = i;
        for (int j = 0; j <= s2.length(); j++) {
            if (i == 0) {
                costs[j] = j;
            } else {
                if (j > 0) {
                    int newValue = costs[j - 1];
                    if (s1.charAt(i - 1) != s2.charAt(j - 1)) {
                        newValue = Math.min(Math.min(newValue, lastValue),
                                costs[j]) + 1;
                    }
                    costs[j - 1] = lastValue;
                    lastValue = newValue;
                }
            }
        }
        if (i > 0) {
            costs[s2.length()] = lastValue;
        }
    }
    return costs[s2.length()];
}

public static void printDistance(String s1, String s2) {
    double similarityOfStrings = 0.0;
    int editDistance = 0;
    if (s1.length() < s2.length()) { // s1 should always be bigger
        String swap = s1;
        s1 = s2;
        s2 = swap;
    }
    int bigLen = s1.length();
    editDistance = computeEditDistance(s1, s2);
    if (bigLen == 0) {
        similarityOfStrings = 1.0; /* both strings are zero length */
    } else {
        similarityOfStrings = (bigLen - editDistance) / (double) bigLen;
    }
    //////////////////////////
    //System.out.println(s1 + "-->" + s2 + ": " +
      //      editDistance + " (" + similarityOfStrings + ")");
    System.out.println(editDistance + " (" + similarityOfStrings + ")");
}

public static void main(String[] args) {
    printDistance("", "");
    printDistance("1234567890", "1");
    printDistance("1234567890", "12");
    printDistance("1234567890", "123");
    printDistance("1234567890", "1234");
    printDistance("1234567890", "12345");
    printDistance("1234567890", "123456");
    printDistance("1234567890", "1234567");
    printDistance("1234567890", "12345678");
    printDistance("1234567890", "123456789");
    printDistance("1234567890", "1234567890");
    printDistance("1234567890", "1234567980");

    printDistance("47/2010", "472010");
    printDistance("47/2010", "472011");

    printDistance("47/2010", "AB.CDEF");
    printDistance("47/2010", "4B.CDEFG");
    printDistance("47/2010", "AB.CDEFG");

    printDistance("The quick fox jumped", "The fox jumped");
    printDistance("The quick fox jumped", "The fox");
    printDistance("The quick fox jumped",
            "The quick fox jumped off the balcany");
    printDistance("kitten", "sitting");
    printDistance("rosettacode", "raisethysword");
    printDistance(new StringBuilder("rosettacode").reverse().toString(),
            new StringBuilder("raisethysword").reverse().toString());
    for (int i = 1; i < args.length; i += 2) {
        printDistance(args[i - 1], args[i]);
    }


 }
}

答案 10 :(得分:0)

您还可以使用z算法在字符串中查找相似性。点击此处https://teakrunch.com/2020/05/09/string-similarity-hackerrank-challenge/