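How do I count the differences between two files on Linux?

Asked: 2009-10-14 14:06:21

Tags: shell count diff

I need to work with large files and have to find the differences between two of them. I don't need the differing content itself, only the number of differences.

To find the number of differing lines, I came up with: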

diff --suppress-common-lines --speed-large-files -y File1 File2 | wc -l

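It works, but is there a better way?

And how can I count the exact number of differences (with standard tools such as bash, diff, awk, sed, or some old version of perl)?

7 answers: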

Answer 0: (score: 44)

If you want to count the number of lines that differ, use:

diff -U 0 file1 file2 | grep ^@ | wc -l

Doesn't John's answer double-count the differing lines?
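As a rough check of what this pipeline counts, here is a minimal sketch using two made-up sample files (the function name count_hunks and the file contents are illustrative assumptions, not part of the answer). With -U 0, each @@ line introduces one hunk, that is, one contiguous run of changed lines, so the number printed is the count of changed regions rather than of individual lines.

```shell
# count_hunks: number of @@ hunk headers in a zero-context unified diff,
# i.e. the number of contiguous changed regions (not changed lines).
count_hunks() {
    diff -U 0 "$1" "$2" | grep -c '^@@'
}

# Hypothetical sample data in a temporary directory.
tmp=$(mktemp -d)
printf '1\n2\n3\n4\n5\n' > "$tmp/a"
printf '1\nX\nY\n4\nZ\n' > "$tmp/b"

count_hunks "$tmp/a" "$tmp/b"   # prints 2
rm -r "$tmp"
```

For these files the changed regions are lines 2-3 and line 5, so the count is 2 even though three lines differ on each side.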

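Answer 1: (score: 41)

diff -U 0 file1 file2 | grep -v ^@ | wc -l

Subtract 2 from the result to account for the two file-name lines at the top of the diff listing. The unified format is probably a bit faster than the side-by-side format.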
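The subtraction can be folded into a small function. This is only a sketch under assumptions: the name count_changed_lines and the sample files are made up, and identical files are special-cased because diff then prints no header lines to subtract. Removed and added lines are counted separately, so a modified line contributes 2 to the total.

```shell
# count_changed_lines: removed lines plus added lines in a zero-context
# unified diff, excluding the two "---" / "+++" file-name header lines.
count_changed_lines() {
    n=$(diff -U 0 "$1" "$2" | grep -c -v '^@')
    # Identical files produce no output, so there is nothing to subtract.
    if [ "$n" -ge 2 ]; then
        n=$((n - 2))
    fi
    echo "$n"
}

# Hypothetical sample data in a temporary directory.
tmp=$(mktemp -d)
printf '1\n2\n3\n4\n5\n' > "$tmp/a"
printf '1\nX\nY\n4\nZ\n' > "$tmp/b"

count_changed_lines "$tmp/a" "$tmp/b"   # prints 6 (3 removed + 3 added)
rm -r "$tmp"
```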

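Answer 2: (score: 6)

If you are on Linux/Unix, how about comm -23 file1 file2 to print the lines in file1 that are not in file2, comm -23 file1 file2 | wc -l to count them, and similarly comm -13 ... for the other direction? Note that comm expects both input files to be sorted.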
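A minimal sketch of the comm approach follows, with assumed names (count_only_in_first) and made-up sample files. Two points are worth keeping in mind: comm requires sorted input, so the sketch sorts copies of both files first, and -23 suppresses column 2 (lines only in the second file) and column 3 (common lines), leaving the lines unique to the first file. Unlike diff, this compares the files as sets of lines and ignores line order.

```shell
# count_only_in_first: number of lines present in file $1 but not in file $2.
# comm needs sorted input, so sorted copies are written to a scratch directory.
count_only_in_first() {
    d=$(mktemp -d)
    sort "$1" > "$d/first"
    sort "$2" > "$d/second"
    comm -23 "$d/first" "$d/second" | wc -l
    rm -r "$d"
}

# Hypothetical sample data in a temporary directory.
tmp=$(mktemp -d)
printf '1\n2\n3\n4\n5\n' > "$tmp/a"
printf '1\nX\nY\n4\nZ\n' > "$tmp/b"

count_only_in_first "$tmp/a" "$tmp/b"   # lines 2, 3 and 5 exist only in a
count_only_in_first "$tmp/b" "$tmp/a"   # lines X, Y and Z exist only in b
rm -r "$tmp"
```

Swapping the arguments gives the count for the other direction, which is equivalent to using comm -13 with the original argument order.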

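Answer 3: (score: 5)

Since every differing line in the output starts with a < or > character, I would suggest:

diff file1 file2 | grep '^[<>]' | wc -l

By using only < or only > in the pattern, you can count the differences in just one of the files.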
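The per-file variant can be sketched as below. The function name count_side and the sample files are assumptions made for illustration. In diff's default output format, lines prefixed with < come from the first file and lines prefixed with > come from the second.

```shell
# count_side: count diff output lines that start with the given marker.
#   $1 = '<' (lines from the first file) or '>' (lines from the second file)
#   $2, $3 = the two files to compare
count_side() {
    diff "$2" "$3" | grep -c "^$1"
}

# Hypothetical sample data in a temporary directory.
tmp=$(mktemp -d)
printf '1\n2\n3\n4\n5\n' > "$tmp/a"
printf '1\nX\nY\n4\nZ\n' > "$tmp/b"

count_side '<' "$tmp/a" "$tmp/b"   # prints 3 (lines 2, 3, 5 of a)
count_side '>' "$tmp/a" "$tmp/b"   # prints 3 (lines X, Y, Z of b)
rm -r "$tmp"
```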

Answer 4: (score: 1)

I believe the correct solution is the one in this answer, namely:

$ diff -y --suppress-common-lines a b | grep '^' | wc -l
1
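To see how this differs from counting - and + lines, here is a small sketch with made-up sample files and an assumed function name, count_rows. In side-by-side output a modified line appears as a single row with both versions, so it is counted once rather than twice. The -y and --suppress-common-lines options are taken from the command above and assume a diff that supports them, such as GNU diff.

```shell
# count_rows: number of rows in side-by-side diff output once common lines
# are suppressed; a modified line occupies one row, so it is counted once.
count_rows() {
    diff -y --suppress-common-lines "$1" "$2" | wc -l
}

# Hypothetical sample data in a temporary directory.
tmp=$(mktemp -d)
printf '1\n2\n3\n4\n5\n' > "$tmp/a"
printf '1\nX\nY\n4\nZ\n' > "$tmp/b"

count_rows "$tmp/a" "$tmp/b"   # three modified lines, one row each
rm -r "$tmp"
```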

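Answer 5: (score: 0)

If you are dealing with files that have analogous content and should match line for line in the same order (for example CSV files describing similar things), and you want to find, say, the 2 differences in the following files:

File a:    File b:
min,max    min,max
1,5        2,5
3,4        3,4
-2,10      -1,1

you could implement it in Python like this:

# file1 and file2 are the paths of the two files to compare
different_lines = 0
with open(file1) as a, open(file2) as b:
    for line in a:
        # Read the line at the same position in the other file
        # (returns '' once file2 is exhausted)
        other_line = b.readline()
        if line != other_line:
            different_lines += 1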

Answer 6: (score: 0)

Here is a way to count any kind of difference between two files, with a regex specifying the unit of comparison; here . is used, meaning any character except a newline:

git diff --patience --word-diff=porcelain --word-diff-regex=. file1 file2 | pcre2grep -M "^@[\s\S]*" | pcre2grep -M --file-offsets "(^-.*\n)(^\+.*\n)?|(^\+.*\n)" | wc -l

Excerpt from man git-diff:

--patience
    Generate a diff using the "patience diff" algorithm.

--word-diff[=<mode>]
    Show a word diff, using the <mode> to delimit changed words. By default, words are delimited by whitespace; see --word-diff-regex below.

    porcelain
        Use a special line-based format intended for script consumption. Added/removed/unchanged runs are printed in the usual unified diff format, starting with a +/-/` ` character at the beginning of the line and extending to the end of the line. Newlines in the input are represented by a tilde ~ on a line of its own.

--word-diff-regex=<regex>
    Use <regex> to decide what a word is, instead of considering runs of non-whitespace to be a word. Also implies --word-diff unless it was already enabled.

    Every non-overlapping match of the <regex> is considered a word. Anything between these matches is considered whitespace and ignored(!) for the purposes of finding differences. You may want to append |[^[:space:]] to your regular expression to make sure that it matches all non-whitespace characters. A match that contains a newline is silently truncated(!) at the newline.

    For example, --word-diff-regex=. will treat each character as a word and, correspondingly, show differences character by character.

pcre2grep is part of the pcre2-utils package on Ubuntu 20.04.