Question

Is there a way to match a list of words against a file? I have two files, A and B. A contains a list of words:

A
abcd
xyzt

File B:

B
abcdefgh abcd
abcdytqw wert
zswertyu xyzt

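I want to extract lines 1 and 3 from file B. That is, I want to match the words in A against the second column of B and, on a match, print that line of B.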

Output:

abcdefgh abcd
zswertyu xyzt

I tried this with grep in Perl inside a for loop, but it is too slow. My list has more than 100K entries.

Answer 1

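You can speed this up by loading all of A into a set. If A is not loaded into memory, every line of A has to be compared against the whole of file B. With A in memory, you only need to iterate over each file once. In addition, because A is held in a set, checking whether the second column of B is in A is a hash lookup that takes constant time on average, rather than a scan of the whole list.

Here is an example in Python: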

#!/usr/bin/env python

def load_data(filename):
    with open(filename, 'r') as infile:
        Aset = set()
        for line in infile:
            word = line.strip()
            if word == '':
                continue
            Aset.add(word)
    return Aset

if __name__ == '__main__':
    Aset = load_data('A')

    with open('B', 'r') as infile:
        for line in infile:
            # Assumes the column you are checking against is the last one.
            # Blank lines are skipped instead of raising an IndexError.
            fields = line.split()
            if not fields:
                continue
            if fields[-1] in Aset:
                print(line.strip())

This will not work if the machine does not have enough (free) memory to load all of file A into a set.
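
As a quick check of the set-based approach against the sample data from the question, here is a minimal, self-contained sketch. The function name `match_lines` and its `column` parameter are made up for illustration, and the two files are simulated with `io.StringIO` instead of being read from disk. Unlike the script above, it matches on an explicit column index (1, i.e. the second column) rather than on the last column.

```python
import io

def match_lines(words, rows, column=1):
    """Yield rows whose whitespace-separated field at index `column` is in `words`.

    `match_lines` and `column` are hypothetical names used for illustration.
    """
    # Build the lookup set once; blank lines in the word list are ignored.
    wanted = {w.strip() for w in words if w.strip()}
    for row in rows:
        fields = row.split()
        # Rows with too few columns are skipped rather than raising an error.
        if len(fields) > column and fields[column] in wanted:
            yield row.rstrip("\n")

# Sample data from the question, held in memory instead of in files A and B.
file_a = io.StringIO("abcd\nxyzt\n")
file_b = io.StringIO("abcdefgh abcd\nabcdytqw wert\nzswertyu xyzt\n")

result = list(match_lines(file_a, file_b))
for line in result:
    print(line)
```

With real files you would pass the open file objects for A and B in place of the `StringIO` objects. Each file is still read only once.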
