What is the best algorithm to check the similarity percentage between submitted assignments?

Asked: 2019-02-26 13:47:08

Tags: algorithm project cosine-similarity sentence-similarity

For my final-year project, I plan to build a similarity checker. In this project, I plan to check the similarity percentage between submitted assignments (i.e., offline).

For example:

  1. When the first student submits an assignment, it is not checked against any other.

  2. When the second student submits an assignment, it is checked against the first.

  3. When the third student submits an assignment, it is checked against both the first and the second.

  4. Similarly, when there are 35 students, the 35th submission is checked against the other 34 submissions.
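For context, the incremental scheme above amounts to comparing each new submission against every previously stored one. A minimal sketch, where `similarity` is a placeholder for whichever measure is eventually chosen (the toy `word_overlap` function here is for illustration only, not a real plagiarism metric):

```python
# Sketch of the incremental checking scheme: each new submission is
# compared against every previously stored submission, then stored.
def check_submission(new_doc, previous_docs, similarity):
    reports = [(i, similarity(new_doc, old)) for i, old in enumerate(previous_docs)]
    previous_docs.append(new_doc)
    return reports

# Toy similarity: Jaccard overlap of word sets (illustration only).
def word_overlap(a, b):
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / max(len(wa | wb), 1)

submissions = []
print(check_submission("first essay text", submissions, word_overlap))  # [] - nothing to compare yet
print(check_submission("first essay text copied", submissions, word_overlap))
```

The first submission produces no report; each later one produces a list of (submission index, score) pairs.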

Now the question arises: how do I compare two assignments? In this case, the comparison is of the similarity between the texts of the documents. I want a result like this:

Similarity checking between the documents

I just want to display the percentage of similar sentences and what those sentences are.

What I have done:

I studied different algorithms such as tf-idf and cosine similarity, but I could not interpret the results of those algorithms correctly.

So I would like to know which algorithm is best in this situation, and how to implement it. Are there any websites or blogs that would help as a reference?

1 Answer:

Answer 0 (score: 0)

It depends on how the algorithm you use returns its comparison results.

For example, the following function compares a list of document contents and returns a dictionary mapping document pairs to the list of common word sequences between them. It does not distinguish word sequences that are contained within each other, because a longer and a shorter word sequence may overlap a different number of times.

import re
from itertools import combinations

def wordList(document): return re.findall(r"\w+",document.lower())

def compareDocs(documents, minSize=2, maxSize=25):
    result  = dict() # { (documentIndex,documentIndex) : [CommonExpressions] }
    def tallyDuplicates(expressionDocs):
        for expression,docIndexes in expressionDocs.items():
            for docIndex,otherDoc in combinations(docIndexes,2):
                result.setdefault((docIndex,otherDoc),[]).append(expression)

    documentWords    = [ wordList(document) for document in documents ]
    wordCounts       = [ len(words) for words in documentWords ]
    expressionRanges = dict()
    for docIndex,words in enumerate(documentWords):
        for wordIndex,word in enumerate(words):
            expressionRanges.setdefault(word,[]).append((docIndex,wordIndex))

    size = 1    
    while size == 1 or expressionDocs and size <= maxSize:        
        nextExpressions   = dict()
        expressionDocs    = dict()
        for expression,starts in expressionRanges.items():
            for docIndex,startIndex in starts:
                endIndex = startIndex+size
                if endIndex >= wordCounts[docIndex]: continue
                extended = " ".join([expression,documentWords[docIndex][endIndex]])
                expressionDocs.setdefault(extended,set()).add(docIndex)
                nextExpressions.setdefault(extended,[]).append( (docIndex,startIndex) )
        expressionDocs   = { expression:docIndexes for expression,docIndexes in expressionDocs.items() if len(docIndexes) > 1 }
        expressionRanges = { expression:ranges for expression,ranges in nextExpressions.items() if expression in expressionDocs }  
        if size >= minSize: tallyDuplicates(expressionDocs)
        size += 1

    return result

Based on these comparison results, you will need to analyze the content of each document pair to count the words covered by the common expressions (word sequences). Given that an expression contains several words, each expression accounts for several words in the similarity ratio: (words in matched expressions) / (words in the document).

[EDIT] I put the result analysis into its own function and added HTML output to highlight the expressions in the document texts:

def analyzeComparison(doc1,doc2,expressions):
    words1  = wordList(doc1)
    words2  = wordList(doc2)
    normalizedDoc1 = " ".join(words1)
    normalizedDoc2 = " ".join(words2)
    expressions.sort(key=lambda s:len(s),reverse=True)
    matches = []
    for expression in expressions:
        pattern = r"\b" + expression + r"\b"  # word boundaries avoid matching inside longer words
        count1 = len(re.findall(pattern,normalizedDoc1))
        count2 = len(re.findall(pattern,normalizedDoc2))
        commonCount = min(count1,count2)
        if commonCount == 0: continue
        expressionId = "<#"+str(len(matches))+"#>"
        normalizedDoc1 = re.sub(pattern,expressionId,normalizedDoc1,commonCount)
        normalizedDoc2 = re.sub(pattern,expressionId,normalizedDoc2,commonCount)
        matches.append((expression,commonCount))
    commonWords = sum( count*len(expr.split(" ")) for expr,count in matches)
    percent1 = 100*commonWords/len(words1)
    percent2 = 100*commonWords/len(words2)
    for index,match in enumerate(matches):
        expressionId = "<#"+str(index)+"#>"
        expressionHighlight = "<span style='background-color:yellow'>"+match[0]+"</span>"
        normalizedDoc1 = re.sub(expressionId,expressionHighlight,normalizedDoc1)
        normalizedDoc2 = re.sub(expressionId,expressionHighlight,normalizedDoc2)
    return (percent1,percent2,matches,normalizedDoc1,normalizedDoc2)

For example, if you have the following 3 documents (you would normally read them from files):

doc1 = """
Plagiarism, one of the main scourges of the academic life, is quite an easy concept, but, nonetheless, harmful. In short, to plagiarize means to steal someone else’s idea or part of work and use it as your own. But why exactly it is considered to be so bad and immoral? And it is really considered immoral and a serious offence. In case it is discovered, it may lead to very unpleasant consequences; the higher the position of the offender is, the more unpleasant they are.
copy and paste
There are two major kinds of harm plagiarism causes. First, it is something as simple as stealing and lying – you just steal someone else’s work and trick somebody into believing it was you who had written it, which is as immoral as any other kind of theft is. It means that somebody had actually spent time and effort in order to create something, while you did nothing but ripping it off and submitting it.
copy and paste function
Second, it is a crime you commit against yourself. If you study at an educational institution, there are certain tasks copy and paste you are given in order to ensure that you learn something. When you resort to plagiarism, you undo all these efforts for, instead of actually doing something and understanding it in process, you use someone else’s work and the certain amount of experience that you were supposed to get just misses you.
"""
doc2 = """
Plagiarism has always been a problem in schools. However, with the invention of the internet,copy and paste  it has made plagiarism even more of a challenge. Plagiarism.org, “estimates that nearly 30 percent of all students may be plagiarizing on all their written assignments and that the use of the Internet has made plagiarism much worse.” [1] The act of plagiarism can be defined as, “To steal and pass off (the ideas or words of another) as one’s own, to use (another’s production) without crediting the source, to commit literary theft, to present as new and original as idea or product derived from an existing source”2. Plagiarism has become such a concern for colleges that almost all the sites on this topic are sponsored by schools. The three main topics with plagiarism are the copy and paste function, “paper mills” and the ways that can be used to prevent students from doing this. 
it is quite an easy concept
The first major concern with the internet would be the copy and paste function. Wittenberg copy and paste function lists that “Widespread availability of the internet and increased access to full text databases has made cut and paste plagiarism very easy”.3 While the function is actually very nice to have, people are using it the wrong way. Instead of just using it to copy quotes from websites, than pasting it to their word document and giving it the proper credit, people are passing it off as their own. This is where the problem occurs.
"""

doc3 = """
Plagiarism has always been a problem in schools. However, it is something as simple as stealing and lying
it is a crime you. some other text
"""

You would first call compareDocs() on the list of document contents, and, for each document pair (returned by the function), you would use analyzeComparison() to get the percentages, counts, and highlights:

documents   = [doc1,doc2,doc3]
comparisons = compareDocs( documents )
for documentPair,expressions in comparisons.items():
    docIndex1,docIndex2 = documentPair
    doc1 = documents[docIndex1]
    doc2 = documents[docIndex2]        
    pct1,pct2,matches,doc1,doc2 = analyzeComparison(doc1,doc2,expressions)

    # print result on console ...
    print(int(pct1//1)," % of document #",docIndex1," is same as document #", docIndex2)
    print(int(pct2//1)," % of document #",docIndex2," is same as document #", docIndex1)
    print("Common expressions are:")
    for expression,count in matches:
        print( "    ",expression,"(",count,"times )")
    print("")

    # output comparison result to an HTML file...
    htmlPage = "<html><body><table border='1'>"
    htmlPage += "<tr><th>#" + str(docIndex1) + ": Source " + str(int(pct1//1)) + "% duplicate</th>"
    htmlPage += "<th>#" + str(docIndex2) + ": Target  " + str(int(pct2//1)) + "% duplicate</th></tr>"
    htmlPage += "<tr><td width='50%' valign='top'>" + doc1 + "</td><td valign='top'>" + doc2 + "</td></tr>"
    htmlPage +="</table></body></html>"        
    fileName = str(docIndex1)+"-"+str(docIndex2)+".html"
    with open(fileName,"w") as f: f.write(htmlPage)       

This prints the following information and creates a bunch of HTML files that look similar to the result you expect:

3.0  % of document # 1  is same as document # 2
34.0  % of document # 2  is same as document # 1
Common expressions are:
     plagiarism has always been a problem in schools however ( 1 times )

6.0  % of document # 0  is same as document # 1
5.0  % of document # 1  is same as document # 0
Common expressions are:
     is quite an easy concept ( 1 times )
     copy and paste function ( 1 times )
     copy and paste ( 2 times )

5.0  % of document # 0  is same as document # 2
53.0  % of document # 2  is same as document # 0
Common expressions are:
     it is something as simple as stealing and lying ( 1 times )
     it is a crime you ( 1 times )

To summarize, the whole process works as follows:

1) Run a comparison function that identifies the expressions (word sequences) that are common to each pair of documents.

  • Given a list of document texts, the compareDocs function does this in a single call.
  • If you use a different comparison algorithm, it may be designed to perform the comparison between only two documents, or, in the case of a classifier, it may simply return a list of word/term frequencies for one document.
  • Depending on the algorithm's inputs and outputs, you will need to wrap more or less of the logic in your own code to obtain the desired result.
  • What you should be looking for at this stage is a list of common expressions (word sequences) between the various document pairs.
  • If you are working with an algorithm that only extracts term frequencies (such as tf-idf), you will face a problem of very high complexity: cross-matching the term frequencies between document pairs.

    For example, a classifier may return the frequencies "cut" = 25 times, "and" = 97 times, "paste" = 31 times for a given document. This gives you no indication that the expression "cut and paste" is actually present in the document, or how many times it appears. The document could be talking about toothpaste and never have these three words in sequence. Comparing documents based only on word frequencies will find a high correlation between essays on the same topic, but that does not mean there is plagiarism.

    Furthermore, even if your classifier manages to return all expressions of two or more words, each document will produce close to w * 2^n expressions, where w is the number of words in the document and n is the maximum length of the expressions in words (a maximum that you will have to decide on). This easily reaches millions of expressions per document, which you then need to match against millions in the other documents. This may not be a problem if you have Google's resources, but it is for the rest of us.
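To make that limitation concrete, here is a small sketch (plain Python, no classifier library) showing that bag-of-words term frequencies are identical for two texts with completely different word order, so a pure frequency comparison cannot detect whether any phrase is actually shared:

```python
from collections import Counter
import math

def tf_cosine(a, b):
    # Cosine similarity between raw term-frequency vectors.
    ca, cb = Counter(a.split()), Counter(b.split())
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(sum(v * v for v in cb.values()))
    return dot / norm

# Same words in a different order: the frequency vectors are identical,
# so cosine similarity is 1.0 even though no phrase is shared.
print(tf_cosine("cut and paste text", "text paste and cut"))  # 1.0
```

This is why a sequence-based method like compareDocs above is more appropriate when the goal is detecting copied passages rather than topical similarity.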

2) To measure the percentage of similarity between documents, you will need to locate the common expressions on both sides and measure how many of each document's words are covered by the common expressions.

  • Locating the expressions is a simple text-search process.
  • However, you will need to avoid counting any given word more than once, because the denominator of the percentage is the number of words in the document (and you don't want to overestimate or exceed 100%).
  • This can be achieved by processing the longer expressions first and removing them from the text (or masking them) so that their words are not counted again by subsequent (shorter) expressions.
  • The analyzeComparison() function masks the expressions it finds in the text by replacing them with a placeholder, which is later used to re-inject the text with a highlighting tag (HTML).
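The longest-first masking idea can be isolated in a minimal sketch (function name hypothetical, separate from analyzeComparison): masking each matched expression before searching for shorter ones prevents overlapping expressions from counting the same words twice.

```python
import re

def count_covered_words(text, expressions):
    # Process the longest expressions first, masking each occurrence so
    # that shorter overlapping expressions cannot count the same words.
    covered = 0
    for expr in sorted(expressions, key=len, reverse=True):
        matches = re.findall(re.escape(expr), text)
        covered += len(matches) * len(expr.split())
        text = re.sub(re.escape(expr), "#", text)  # mask matched words
    return covered

# "copy and paste function" (4 words) absorbs "copy and paste" (3 words),
# so the total is 4, not 4 + 3 = 7.
print(count_covered_words("copy and paste function is easy",
                          ["copy and paste function", "copy and paste"]))  # 4
```

Dividing such a count by the document's word count gives a percentage that can never exceed 100%.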

3) Use the document comparison analysis in your own program. This depends on how you want to present the information and on whether you need to store the results (up to you). For example, you could decide on a threshold of similarity and only output the document pairs that are suspicious. This threshold could be based on the percentages, the number of common expressions, the maximum or average length of the common expressions, etc.
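As one possible thresholding policy (the function name and cutoff values here are arbitrary choices, not part of the answer's code), pairs could be flagged when they share several expressions or at least one long one, given the dictionary returned by compareDocs:

```python
def flag_suspicious(comparisons, min_expressions=2, min_expr_words=4):
    # Keep only document pairs that share enough expressions,
    # or at least one sufficiently long expression.
    flagged = {}
    for pair, expressions in comparisons.items():
        longest = max((len(e.split()) for e in expressions), default=0)
        if len(expressions) >= min_expressions or longest >= min_expr_words:
            flagged[pair] = expressions
    return flagged

sample = {(0, 1): ["copy and paste"],
          (0, 2): ["it is something as simple as stealing and lying",
                   "it is a crime you"]}
print(flag_suspicious(sample))  # only the (0, 2) pair is flagged
```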

[EDIT2] How compareDocs works ...

  • The function creates a dictionary of expressions, mapping them to the position of their first word in each document. This is stored in the expressionRanges variable.

    • Example: { "copy and paste": [(0,57), (1,7), (1,32)] ... }
    • This means that the 3-word expression "copy and paste" is found in document #0 at position 57 (the position of the word "copy") and in document #1 at positions 7 and 32.
  • The expression dictionary (expressionRanges) starts out with one-word expressions and uses them to build 2-word expressions, then 3-word, and so on.

  • Before moving on to the next expression size, the expression dictionary is cleaned up by removing all expressions that are found in only one document.

    • size 1 ==> { "copy": [(0,57), (0,72), (1,7), (1,32), (1,92)] ... }
    • ... cleanup ...
    • size 2 ==> { "copy and": [(0,57), (1,7), (1,32), (1,92)] ... }
    • ... cleanup ...
    • size 3 ==> { "copy and paste": [(0,57), (1,7), (1,32)] ... }
  • This cleanup is achieved by maintaining a separate dictionary (expressionDocs) that maps each expression to the set of document indexes that contain it. Expressions that end up with only one document in their set are removed from both dictionaries.

  • The expressionDocs dictionary is also used to produce the function's output. Expressions that appear in more than one document are mapped to document pairs (combinations of 2), forming a dictionary of { (document pair): [list of expressions] }, which is the result of the function.
  • The tallyDuplicates sub-function performs the conversion from { expression: [list of document indexes] } to { (document pair): [list of expressions] } by adding the expression to every combination of 2 taken from its list of document indexes.
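That pair-building step is just itertools.combinations applied to each expression's document set; in isolation (standalone function name hypothetical, data made up for illustration):

```python
from itertools import combinations

def tally(expression_docs):
    # {expression: [doc indexes]} -> {(doc pair): [expressions]}
    result = {}
    for expression, doc_indexes in expression_docs.items():
        for pair in combinations(sorted(doc_indexes), 2):
            result.setdefault(pair, []).append(expression)
    return result

# An expression found in 3 documents contributes to all 3 document pairs.
print(tally({"copy and paste": [0, 1, 2], "it is a crime you": [0, 2]}))
```

Here "copy and paste" ends up in the lists for pairs (0,1), (0,2), and (1,2), while "it is a crime you" only appears for (0,2).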

The successive refinements of expressionRanges considerably reduce the number of word matches to perform. Each pass merely adds one word to each expression, and the dictionary is cleaned up immediately before moving on to the next expression size. The expressionRanges dictionary starts out with as many entries as there are distinct words in the documents, but it quickly shrinks to a much smaller size (unless the documents are practically identical).

One drawback of this approach is that documents with a large number of very long matching expressions will cause the dictionary to grow instead of shrink, and the while loop will run much longer. The worst case would be two identical documents. This can be avoided by introducing a maximum expression size to make the loop stop earlier. If you set the maximum size to 25, for example, the function will report a 25-word common expression and a 5-word common expression instead of a 30-word one. This may be an acceptable compromise to avoid very long processing times for documents that are nearly identical. As far as the similarity percentage is concerned, the difference would be minimal (i.e., one common word could be ignored if there is a 26-word common expression and the maximum is 25, but a 27-word expression would come out as a 25-word and a 2-word match).
