Question

我采用以下方法计算多个文本文件文档之间的相似性：

对于给定目录中的每个文档，将文档分解为块（从基本滑动窗口算法计算的长字节序列）。
计算每个块的指纹，即使用诸如MD5
比较不同文件中出现的块

到目前为止，我已经实施了前两个步骤。我正在使用Google Guava的MultiMap将散列块与文件名相关联。至于第3步，我现在要做以下事情：

For each file {
   get the file's hashed chunks
   compare the hashed chunks with those of every other file
}

这里的想法是比较常见条目的文件签名，并仅报告其签名交集高于给定相似性阈值的那些文件的聚类。但我最终想调整这个逻辑，以便在将每个文件与源文件进行比较时获得相似性得分。

我的问题：比较第3步的散列块的最佳方法是什么？另外，我怎样才能将相似性的概念与前者相结合？这是我到目前为止的代码：

public class ContentBasedChunkingMain implements Similarity {

    Multimap<String, byte[]> hashMap = ArrayListMultimap.create();

    public ContentBasedChunkingMain() {
    }

    @Override
    public void getSimilarity(String dir) {

        // Document directory
        File directory = new File(dir);

        Collection<File> collection = FileUtils.listFiles(directory,
                TrueFileFilter.INSTANCE, TrueFileFilter.INSTANCE);

        ArrayList<File> files = new ArrayList<File>(Arrays.asList(collection
                .toArray(new File[0])));

        // For each file in the directory
        for (File f : files) {

            // Get chunks
            ArrayList<String> chunks = Chunker.getChunks(f);

            for (String s : sentences) {

                // MD5 Hash each sentence
                MD5HashFunction h = new MD5HashFunction();
                System.out.println(h.byteToHex(h.hash(s)));

                // Store filename and hashed chunk into MultiMap
                hashMap.put(f.getAbsolutePath(), h.hash(s));

            }

        }

        // Step 3    
        for (File f : files) {

            // for each file, get the file's hashed chunks
            Collection<byte[]> bytes = hashMap.get(f.getAbsolutePath());

            // compare ...

        }

    }

我所遵循的算法，基于内容的分块算法，在以下论文中有更详细的描述：

http://www.hpl.hp.com/techreports/2009/HPL-2009-90.pdf

http://webglimpse.net/pubs/TR93-33.pdf

http://www.hpl.hp.com/techreports/2005/HPL-2005-42R1.pdf

Answer 1

为了进行比较，我采用了两个哈希块集合的交集。

比较文档相似性/聚类的哈希列表

1 个答案: