I have two files. One, ref.txt, has two columns. The other, file.txt, has three columns.
In ref.txt:
1 2
2 3
3 5
In file.txt:
1 2 4 <---here matching
3 4 5
6 9 4
2 3 10 <---here matching
4 7 9
3 5 7 <---here matching
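I want to compare the first two columns of each file, and print only the lines of file.txt whose first two columns match a line in ref.txt.
So the output should be: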
1 2 4
2 3 10
3 5 7
I was thinking of comparing two dictionaries, like this:
mydict = {}
mydict1 = {}
with open('ref.txt') as f1:
    for line in f1:
        key, key1 = line.split()
        sp1 = mydict[key, key1]
with open('file.txt') as f2:
    for lines in f2:
        item1, item2, value = lines.split()
        sp2 = mydict1[item1, item2]
        if sp1 == sp2:
            print value
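How can I properly compare the two files, using dictionaries or some other approach?
I found some Perl and Python code that solves this for two files with the same number of columns.
In my case, one file has two columns and the other has three.
How can I compare the two files and print only the matching rows?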
Answer 0 (score: 1)
Here is a revised and commented version that should also work on your larger dataset:
# read in your reference and the file
with open("ref.txt") as f:
    reference = f.read()
with open("file.txt") as f:
    filetext = f.read()
# split the reference file into a list of lines (no empty trailing entry, unlike split("\n"))
splitReference = reference.splitlines()
# do the same for the file
splitFile = filetext.splitlines()
# then, for each line in the reference,
for referenceLine in splitReference:
    # split that line into a list of strings, splitting each time you encounter a stretch of whitespace
    referenceCells = referenceLine.split()
    # then, for each line in your 'file',
    for fileLine in splitFile:
        # split that line into a list of strings, splitting each time you encounter a stretch of whitespace
        lineCells = fileLine.split()
        # only compare rows that both have at least two columns (this also skips blank lines)
        if len(referenceCells) > 1 and len(lineCells) > 1:
            # if the first and second columns are both equal, print the file line
            if referenceCells[:2] == lineCells[:2]:
                print(fileLine)
Output:
1 2 4
2 3 10
3 5 7
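The nested loops above compare every reference line against every file line, so the work grows with the product of the two file sizes. Below is a minimal sketch of the lookup-based approach the question was reaching for, assuming both files are whitespace-separated and the reference pairs fit in memory. The function name `matching_lines` is hypothetical, and the results come out in file.txt order rather than ref.txt order.

```python
def matching_lines(ref_lines, file_lines):
    """Return the lines of file_lines whose first two columns appear in ref_lines."""
    # Collect the (column 1, column 2) pairs from the reference lines.
    pairs = set()
    for line in ref_lines:
        cells = line.split()
        if len(cells) >= 2:
            pairs.add((cells[0], cells[1]))

    # Keep each data line whose first two columns form a known pair.
    matched = []
    for line in file_lines:
        cells = line.split()
        if len(cells) >= 2 and (cells[0], cells[1]) in pairs:
            matched.append(line.rstrip("\n"))
    return matched


# In-memory copy of the sample data from the question.
ref_lines = ["1 2", "2 3", "3 5"]
file_lines = ["1 2 4", "3 4 5", "6 9 4", "2 3 10", "4 7 9", "3 5 7"]

for line in matching_lines(ref_lines, file_lines):
    print(line)
```

With real files, the same function should accept open file objects, for example `matching_lines(open("ref.txt"), open("file.txt"))`, since it only iterates over lines.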
Answer 1 (score: 1)
grep -Ff ref.txt file.txt
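This is sufficient if the amount of whitespace between the fields is the same in both files. If it is not, you can do
awk '{print "^" $1 "[[:space:]]+" $2 "([[:space:]]|$)"}' ref.txt | xargs -I {} grep -E {} file.txt
This combines three of my favorite utilities: awk, grep, and xargs. The latter approach also ensures that a match can only occur at the beginning of the line (comparing column 1 with column 1 and column 2 with column 2).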
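To make the difference between the two commands concrete, here is a small Python sketch. It is not part of the answer: the helper names and the sample row `11 2 3 5` are made up for illustration, and the end-of-column check in the pattern is an assumption of this sketch. A fixed-string search, which is what `grep -F` performs, accepts the reference text anywhere in the line, while a pattern anchored at the start of the line only accepts it in the first two columns.

```python
import re


def fixed_string_match(ref_line, line):
    # Like grep -F: the reference text may appear anywhere in the line.
    return ref_line in line


def anchored_match(ref_line, line):
    # Column 1, a run of whitespace, then column 2, anchored at the line start.
    col1, col2 = ref_line.split()[:2]
    pattern = "^" + re.escape(col1) + r"\s+" + re.escape(col2) + r"(\s|$)"
    return re.search(pattern, line) is not None


line = "11 2 3 5"  # hypothetical row, not in the sample data
print(fixed_string_match("2 3", line))    # substring search finds "2 3" mid-line
print(anchored_match("2 3", line))        # anchored search rejects it
print(anchored_match("2 3", "2   3 10"))  # extra whitespace is tolerated
```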
Answer 2 (score: 1)
Here's another option:
use strict;
use warnings;
my $file = pop;
my %hash = map { chomp; $_ => 1 } <>;
push @ARGV, $file;
while (<>) {
    print if /^(\d+\s+\d+)/ and $hash{$1};
}
Usage: perl script.pl ref.txt file.txt [>outFile]
The last, optional parameter directs the output to a file.
Output on your dataset:
1 2 4
2 3 10
3 5 7
Hope this helps!