Question

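I have two text files whose contents I need to compare, because one of them is missing an item that the other has, but I can't tell which since both are long. I tried diff and vimdiff with no luck. Both files are laid out in no particular order, like this: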

item1    item2    item3
item8    item10   item6
item32   item12   item7

How can I find the items that are in one text file but missing from the other, ignoring formatting and order?

Answer 1

Cyrus's example is a lot shorter and more to the point, but I figured I'd get in some practice with (verbose) awk ...

Sample data:

$ cat file1
         item2    item3
item8    item10   item6
item32   item12   item7

$ cat file2
item1    item2    item3
item8             item6
         item12   item7

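Assumptions:

although the description says one file may be missing some items, I'll assume items could be missing from either file
not worried about sort order (input or output)
there is no guidance on how to display the output, so I'll do my own thing, including showing the name of the file the item is missing from

One possible awk solution: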

$ cat text.awk
BEGIN { RS="" }

NR==FNR { afile=FILENAME ; for (i=1;i<=NF;i++) a[$i]=1 ; next }
        { bfile=FILENAME ; for (i=1;i<=NF;i++) b[$i]=1        }

END {
    for (x in a)
        { if ( ! b[x] )
             { printf "missing from %s : %s\n",bfile,x }
        }
    for (x in b)
        { if ( ! a[x] )
             { printf "missing from %s : %s\n",afile,x }
        }
}

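RS="": set the record separator (RS) to the empty string; this puts awk in paragraph mode, where records are separated by blank lines, so a file without blank lines is read as one long record
NR==FNR: if this is the first (of two) files ...
afile=FILENAME: save the filename for later printing
for/a[$i]=1: use input fields 1 through NF as indexes into associative array a, setting each value to 1 (aka 'true')
next: skip the remaining rules and read the next record, which in this case comes from the next file
NR!=FNR (implicit in the second rule): if this is the second (of two) files ...
same as above, except that bfile and associative array b are populated
END ...: process our arrays ...
for (x in a): loop over the indexes of array a, assigning each one to variable x; if array b has no matching index (! b[x]), print a message saying the item is missing from bfile
for (x in b): same as the previous loop, but looks for items that are in bfile but not in afile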

This awk script in action:

$ awk -f text.awk file1 file2
missing from file2 : item10
missing from file2 : item32
missing from file1 : item1

# switch the order of the input files => same messages, just different order
$ awk -f text.awk file2 file1
missing from file1 : item1
missing from file2 : item10
missing from file2 : item32
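
The order in which awk's for (x in array) visits indexes is unspecified, so the order of the messages above can vary between awk implementations. The following is a minimal sketch, not the script from this answer, that inlines equivalent logic and pipes the report through sort to get a stable order. It assumes mktemp is available; the variable names tmpdir and report are made up for illustration, and the "x in b" membership test is swapped in for ! b[x] so that the check does not create empty array entries.

```shell
# Scratch directory holding the sample data from this answer.
tmpdir=$(mktemp -d)
printf '%s\n' '         item2    item3' \
              'item8    item10   item6' \
              'item32   item12   item7' > "$tmpdir/file1"
printf '%s\n' 'item1    item2    item3' \
              'item8             item6' \
              '         item12   item7' > "$tmpdir/file2"

# Collect every whitespace-separated item per file, report the items seen in
# only one file, then sort the report so its order is stable.
report=$(
    cd "$tmpdir" &&
    awk '
        NR == FNR { afile = FILENAME; for (i = 1; i <= NF; i++) a[$i] = 1; next }
                  { bfile = FILENAME; for (i = 1; i <= NF; i++) b[$i] = 1 }
        END {
            # "x in b" tests membership without creating b[x] as a side effect.
            for (x in a) if (!(x in b)) printf "missing from %s : %s\n", bfile, x
            for (x in b) if (!(x in a)) printf "missing from %s : %s\n", afile, x
        }
    ' file1 file2 | LC_ALL=C sort
)
printf '%s\n' "$report"
rm -rf "$tmpdir"
```

Because awk splits each line on runs of blanks by default, this variant works line by line and does not need RS="".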

Answer 2

I believe you can use the comm command, but both files need to be sorted before you compare them:

comm -23 f1 f2 # lines that appear only in f1
comm -12 f1 f2 # lines that appear in both files
comm -13 f1 f2 # lines that appear only in f2
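
Since comm compares whole lines, multi-column files like the ones in the question would first need to be reduced to one item per line and sorted. Below is a rough sketch of that preparation step, assuming mktemp is available. The raw1 and raw2 inputs are hypothetical data invented for this example, and the variable names are illustrative only.

```shell
tmpdir=$(mktemp -d)
# Hypothetical raw inputs with uneven spacing and ordering.
printf '%s\n' 'item1    item2    item3' 'item8    item10   item6' > "$tmpdir/raw1"
printf '%s\n' 'item2  item1' '   item6 item8  item3 item9' > "$tmpdir/raw2"

# Write one item per line, sorted and de-duplicated. LC_ALL=C keeps sort and
# comm on the same byte-wise ordering.
for n in 1 2; do
    awk '{ for (i = 1; i <= NF; i++) print $i }' "$tmpdir/raw$n" |
        LC_ALL=C sort -u > "$tmpdir/f$n"
done

only_in_f1=$(LC_ALL=C comm -23 "$tmpdir/f1" "$tmpdir/f2")
only_in_f2=$(LC_ALL=C comm -13 "$tmpdir/f1" "$tmpdir/f2")
in_both=$(LC_ALL=C comm -12 "$tmpdir/f1" "$tmpdir/f2")

printf 'only in f1: %s\n' "$only_in_f1"
printf 'only in f2: %s\n' "$only_in_f2"
rm -rf "$tmpdir"
```

With this made-up data, item10 is the only item unique to f1 and item9 the only item unique to f2.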

Answer 3

Use comm to compare your files and find what they have in common or what differs between them.

$ cat file1
item1    item2    item3
item8    item10   item6
item32   item12   item5

$ cat file2
item1    item2    item3
item8    item15   item6
item32   item12   item7

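comm -23 file1 file2 returns the lines that are in file1 but not in file2.
comm -13 file1 file2 returns the lines that are in file2 but not in file1.
comm -12 file1 file2 returns the lines common to both files.

comm requires its input files to be sorted. We first use sed to turn each run of spaces into \n, then sort the result with sort.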

$ comm -23 <(sed 's/ \+/\n/g' file1 | sort ) <(sed 's/ \+/\n/g' file2 | sort)
item10
item5

$ comm -13 <(sed 's/ \+/\n/g' file1 | sort ) <(sed 's/ \+/\n/g' file2 | sort)
item15
item7

$ comm -12 <(sed 's/ \+/\n/g' file1 | sort ) <(sed 's/ \+/\n/g' file2 | sort)
item1
item12
item2
item3
item32
item6
item8

--- My answer ends here. ---

But just for information, the comm man page says:

   With no options, comm produce three-column output.  Column one contains lines unique to FILE1, column two contains lines unique to FILE2, and column three contains lines common to both files.

   -1     suppress column 1 (lines unique to FILE1)

   -2     suppress column 2 (lines unique to FILE2)

   -3     suppress column 3 (lines that appear in both files)

Therefore:

$ comm  <(sed 's/ \+/\n/g' file1 | sort ) <(sed 's/ \+/\n/g' file2 | sort)
                item1
item10
                item12
        item15
                item2
                item3
                item32
item5
                item6
        item7
                item8
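
The three-column layout can be hard to read at a glance. As a rough sketch, and not part of the original answer, the tab-separated columns can be turned into explicit labels with awk. It avoids process substitution by using temporary files, assumes mktemp is available, and uses illustrative names such as tmpdir and labeled.

```shell
tmpdir=$(mktemp -d)
# Sample data from this answer.
printf '%s\n' 'item1    item2    item3' \
              'item8    item10   item6' \
              'item32   item12   item5' > "$tmpdir/file1"
printf '%s\n' 'item1    item2    item3' \
              'item8    item15   item6' \
              'item32   item12   item7' > "$tmpdir/file2"

# One item per line, sorted byte-wise so that sort and comm agree.
for f in file1 file2; do
    awk '{ for (i = 1; i <= NF; i++) print $i }' "$tmpdir/$f" |
        LC_ALL=C sort -u > "$tmpdir/$f.items"
done

# comm indents column two with one tab and column three with two tabs, so
# splitting on tabs yields 1, 2 or 3 fields depending on the column.
labeled=$(
    LC_ALL=C comm "$tmpdir/file1.items" "$tmpdir/file2.items" |
    awk -F '\t' '
        NF == 1 { print "only in file1: " $1 }
        NF == 2 { print "only in file2: " $2 }
        NF == 3 { print "in both: " $3 }
    '
)
printf '%s\n' "$labeled"
rm -rf "$tmpdir"
```

Filtering the labeled output with grep, for example grep '^only in', then shows just the items that differ.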
