Question

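How do I compute the intersection between two text files based on the raw text? It does not matter whether the solution uses shell commands or is written in Python, Elisp, or another common scripting language.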

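I know about comm and grep -Fxv -f file1 file2. Both assume that I am interested in the intersection of lines, whereas I am interested in the intersection of characters (with a minimum number of characters required to count as a match).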

Bonus points for efficiency.

Example

If file 1 contains

foo bar baz-fee

and file 2 contains

fee foo bar-faa

then I would like to see

foo bar
fee

assuming a minimum match length of 3.

Answer 1

You are looking for Python's difflib module (in the standard library), specifically difflib.SequenceMatcher.
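To make that suggestion concrete, here is a minimal sketch of one way SequenceMatcher could be applied to the question's example. The helper name common_substrings and the greedy "take the longest match, mask it out, repeat" strategy are my own assumptions, not something difflib provides directly. Note that get_matching_blocks() only reports matches that appear in the same order in both inputs, so it would miss "fee" here; that is why the sketch calls find_longest_match() repeatedly instead.

```python
from difflib import SequenceMatcher


def common_substrings(text1, text2, min_len=3):
    """Greedily collect substrings shared by both texts (hypothetical helper)."""
    if min_len < 1:
        raise ValueError("min_len must be at least 1")
    # Work on lists so matched characters can be swapped for unique sentinels.
    a, b = list(text1), list(text2)
    found = []
    while True:
        # autojunk=False keeps difflib from ignoring very frequent characters.
        matcher = SequenceMatcher(None, a, b, autojunk=False)
        m = matcher.find_longest_match(0, len(a), 0, len(b))
        if m.size < min_len:
            break
        found.append("".join(a[m.a:m.a + m.size]))
        # object() instances compare equal only to themselves, so a masked
        # position can never take part in a later match.
        a[m.a:m.a + m.size] = [object() for _ in range(m.size)]
        b[m.b:m.b + m.size] = [object() for _ in range(m.size)]
    return found


if __name__ == "__main__":
    for s in common_substrings("foo bar baz-fee", "fee foo bar-faa"):
        print(s)
```

On the two example strings from the question this should print "foo bar" and then "fee". The matcher is rebuilt on every round, so this is unlikely to earn the efficiency bonus on large files; for those, a suffix-based structure would probably be a better fit.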

Answer 2

OK, here is a very simple Python script that does this. It could be improved, but it should do the job.

temp.txt

xx yy xyz zz aa
xx yy xyz zz   aa
xx yy xyz zz aa
xx yy 111   aa cc

temp2.txt

yy aa cc dd
ff xx ee 11
oo mm aa tt

common.py

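#!/usr/bin/env python3
import sys

def main():
    # Expect exactly two file names on the command line.
    if len(sys.argv) != 3:
        print('usage: %s file1 file2' % sys.argv[0])
        sys.exit(2)
    f1, f2 = tryOpen(sys.argv[1], sys.argv[2])
    commonWords(f1, f2)
    f1.close()
    f2.close()

def tryOpen(fn1, fn2):
    # Open both files for reading, or report the error and exit.
    try:
        f1 = open(fn1, 'r')
        f2 = open(fn2, 'r')
        return f1, f2
    except Exception as e:
        print('Oh No! => %s' % e)
        sys.exit(2)  # Unix programs generally use 2 for
                     # command line syntax errors
                     # and 1 for all other kinds of errors.

def commonWords(f1, f2):
    # Collect every whitespace-separated word of the first file.
    words = set()
    for line in f1:
        for word in line.strip().split():
            words.add(word)
    # Report each word of the second file that also occurs in the first.
    for line in f2:
        for word in line.strip().split():
            if word in words:
                print('common word found => %s' % word)

if __name__ == '__main__':
    main()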

Output

./common.py temp.txt temp2.txt
common word found => yy
common word found => aa
common word found => cc
common word found => xx
common word found => aa

Answer 3

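You could try diff and its options: http://ss64.com/bash/diff.html

I am still not clear on what exactly you are asking for. What counts as a word in your definition? How is this intersection process defined?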
