Question

I'm working on shingling code to compare near-duplicate documents, and I'm a bit stuck on the comparison step. Here is my rough attempt so far.

// Shingles are already hashed integers; I'm still working out how to turn the
// float "similar" threshold into a true/false result.
public static boolean compareShingles(float similar, CompareObject comp1, CompareObject comp2) {
    int intersections = 0;
    // Positional comparison: only indices present in both lists can be compared,
    // so stop at the shorter list to avoid an IndexOutOfBoundsException.
    int limit = Math.min(comp1.getShingle().size(), comp2.getShingle().size());
    for (int i = 0; i < limit; i++) {
        if (comp1.getShingle().get(i).equals(comp2.getShingle().get(i))) {
            intersections++;
        }
    }
    return true; // not functional yet; still working on when to return true
}

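I'm unsure whether I should compare the shingles one-to-one by array position, or loop and compare each shingle against all the shingles in the other document.

For example, if I loop and compare every shingle against every other shingle, these two documents would count as identical: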

{blah blah blah, Once upon a, time blah blah}
{Once upon a, time blah blah, blah blah blah}

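If I did a positional comparison on the same documents, position 1 would be "blah blah blah" versus "Once upon a", so it would return false.

I think the loop would be more time-consuming, but it may be the right choice. Any thoughts?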

Answer 1

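Order does not matter.

Basically, you need to build sets of the shingles and compare them using "Jaccard similarity". Using hashing to automatically discard duplicate shingles will help. Simply count the number of matching shingles between each pair of documents, then decide how many need to match for the documents to be considered similar.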

http://ethen8181.github.io/machine-learning/clustering_old/text_similarity/text_similarity.html
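
As a rough illustration of the set-based comparison described in this answer, here is a minimal Java sketch. It is not the asker's actual code: the class name `ShingleSimilarity`, the `jaccard` helper, and the sample hash values are all made up for the example, and it takes plain collections of hashed shingles rather than the `CompareObject` type from the question. It also assumes two empty documents should count as identical, which you may want to handle differently.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ShingleSimilarity {

    // Jaccard similarity = |A ∩ B| / |A ∪ B|, computed over distinct shingle hashes.
    public static double jaccard(Collection<Integer> shinglesA, Collection<Integer> shinglesB) {
        Set<Integer> a = new HashSet<>(shinglesA); // a HashSet drops duplicate shingles
        Set<Integer> b = new HashSet<>(shinglesB);
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0; // assumption: two empty documents are treated as identical
        }
        Set<Integer> intersection = new HashSet<>(a);
        intersection.retainAll(b); // keep only shingles present in both sets
        int unionSize = a.size() + b.size() - intersection.size();
        return (double) intersection.size() / unionSize;
    }

    // Returns true when the Jaccard similarity reaches the "similar" threshold (0.0 to 1.0).
    public static boolean compareShingles(float similar,
                                          Collection<Integer> shinglesA,
                                          Collection<Integer> shinglesB) {
        return jaccard(shinglesA, shinglesB) >= similar;
    }

    public static void main(String[] args) {
        // Hypothetical hashed shingles; doc2 is doc1 with its shingles reordered.
        List<Integer> doc1 = List.of(101, 202, 303);
        List<Integer> doc2 = List.of(202, 303, 101);
        List<Integer> doc3 = List.of(101, 202, 404);

        System.out.println(jaccard(doc1, doc2));               // 1.0 (order is ignored)
        System.out.println(jaccard(doc1, doc3));               // 0.5 (2 shared out of 4 distinct)
        System.out.println(compareShingles(0.8f, doc1, doc3)); // false
    }
}
```

Because each set lookup is a hash lookup, this avoids the nested loop the question worries about: the cost grows roughly with the total number of shingles rather than with the product of the two sizes.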
