How to merge multiple files row by row without pandas merge, which reads all dataframes into memory

Asked: 2020-01-02 12:02:22

Tags: python pandas dask

I want to merge multiple files with a single file (f1.txt), matching on two columns against that file. I can do this in pandas, but it reads everything into memory, and that gets big very quickly. I figure reading line by line would avoid loading everything into memory; pandas is not an option right now anyway. How do I do the merge while filling nulls into the cells that have no match against f1.txt?

Here I used a dictionary, but I'm not sure it will stay within memory, and I also couldn't find a way to add nulls for the rows of f1.txt that have no match in the other files. There can be up to 1000 different files to compare against. Time doesn't matter, as long as I don't read everything into memory.

Files (tab-delimited)

f1.txt
A B  num  val scol
1 a1 1000 2 3
2 a2 456 7 2
3 a3 23 2 7
4 a4 800 7 3
5 a5 10 8 7

a1.txt
A B num val scol fcol dcol
1 a1 1000 2 3 0.2 0.77
2 a2 456 7 2 0.3 0.4
3 a3 23 2 7 0.5 0.6
4 a4 800 7 3 0.003 0.088

a2.txt
A B num val scol fcol2 dcol1
2 a2 456 7 2 0.7 0.8
4 a4 800 7 3 0.9 0.01
5 a5 10 8 7 0.03 0.07

Current code

import os
import csv

m1 = os.getcwd() + '/f1.txt'
files_to_compare = [i for i in os.listdir('dir')]
dictionary = dict()
dictionary1 = dict()
with open(m1, 'rt') as a:
    reader1 = csv.reader(a, delimiter='\t')
    for x in files_to_compare:
        with open(os.getcwd() + '/dir/' + x, 'rt') as b:
            reader2 = csv.reader(b, delimiter='\t')
            # reader1 is exhausted after the first file, so both
            # dicts are only built on the first pass
            for row1 in list(reader1):
                dictionary[row1[0]] = list()
                dictionary1[row1[0]] = list(row1)
            # collect each file's extra columns, keyed on column A
            for row2 in list(reader2):
                try:
                    dictionary[row2[0]].append(row2[5:])
                except KeyError:
                    pass
print(dictionary)
print(dictionary1)

What I'm trying to achieve is something like: df.merge(df1, on=['A', 'B'], how='left').fillna('null')

Current result
{'A': [['fcol1', 'dcol1'], ['fcol', 'dcol']], '1': [['0.2', '0.77']], '2': [['0.7', '0.8'], ['0.3', '0.4']], '3': [['0.5', '0.6']], '4': [['0.9', '0.01'], ['0.003', '0.088']], '5': [['0.03', '0.07']]}

{'A': ['A', 'B', 'num', 'val', 'scol'], '1': ['1', 'a1', '1000', '2', '3'], '2': ['2', 'a2', '456', '7', '2'], '3': ['3', 'a3', '23', '2', '7'], '4': ['4', 'a4', '800', '7', '3'], '5': ['5', 'a5', '10', '8', '7']}
Desired result
{'A': [['fcol1', 'dcol1'], ['fcol', 'dcol']], '1': [['0.2', '0.77'],['null', 'null']], '2': [['0.7', '0.8'], ['0.3', '0.4']], '3': [['0.5', '0.6'],['null', 'null']], '4': [['0.9', '0.01'], ['0.003', '0.088']], '5': [['null', 'null'],['0.03', '0.07']]}

{'A': ['A', 'B', 'num', 'val', 'scol'], '1': ['1', 'a1', '1000', '2', '3'], '2': ['2', 'a2', '456', '7', '2'], '3': ['3', 'a3', '23', '2', '7'], '4': ['4', 'a4', '800', '7', '3'], '5': ['5', 'a5', '10', '8', '7']}
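
For reference, here is the pandas one-liner above expanded into a runnable sketch on the sample files (illustration only: it still loads everything into memory, and it merges on all shared columns, an assumption on my part, so that num/val/scol do not get duplicated with _x/_y suffixes):

import pandas as pd

f1 = pd.read_csv('f1.txt', sep='\t')
for name in ['a1.txt', 'a2.txt']:
    other = pd.read_csv(name, sep='\t')
    # merge on every column the two frames share (A, B, num, val, scol)
    shared = [c for c in f1.columns if c in other.columns]
    f1 = f1.merge(other, on=shared, how='left')
print(f1.fillna('null'))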

My ultimate goal is to write the dictionary out to a text file. I don't know how much memory it will use or whether it will fit in memory. If there is a better way to do this without pandas that would be great; otherwise, how do I make the dictionary approach work?

Dask attempt

import dask.dataframe as dd

directory = 'input_dir/'
first_file = dd.read_csv('f1.txt', sep='\t')
df = dd.read_csv(directory + '*.txt', sep='\t')
df2 = dd.merge(first_file, df, on=['A', 'B'])

I kept getting ValueError: Metadata mismatch found in `from_delayed`

+--------+-------+----------+
| column | Found | Expected |
+--------+-------+----------+
| fcol   | int64 | float64  |
+--------+-------+----------+

I googled and found similar complaints but couldn't resolve it, which is why I decided to give this approach a try. I checked my files and all the dtypes seem consistent. My dask version is 2.9.1.
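
For what it's worth, the commonly suggested fix for this class of dask error is to pin the mismatched column's dtype explicitly at read time. A minimal sketch under that assumption (not verified against these files; it also would not address a1.txt and a2.txt having different column sets):

import dask.dataframe as dd

# assumed fix: force a single dtype for the column dask inferred
# inconsistently across partitions
first_file = dd.read_csv('f1.txt', sep='\t')
df = dd.read_csv('input_dir/*.txt', sep='\t', dtype={'fcol': 'float64'})
df2 = dd.merge(first_file, df, on=['A', 'B'])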

1 Answer:

Answer 0 (score: 1)

If you want a hand-crafted solution, have a look at heapq.merge and itertools.groupby. This assumes your files are sorted by the first two columns (the key).
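
A minimal sketch of those two building blocks on toy data (not the files above): merge() lazily interleaves already-sorted iterables and is stable for equal keys, and groupby() then clusters consecutive items sharing a key:

from heapq import merge
from itertools import groupby

a = [(1, 'a1'), (2, 'a2'), (4, 'a4')]  # already sorted by key
b = [(2, 'b2'), (3, 'b3'), (4, 'b4')]  # already sorted by key

for key, group in groupby(merge(a, b, key=lambda t: t[0]), key=lambda t: t[0]):
    print(key, list(group))
# 1 [(1, 'a1')]
# 2 [(2, 'a2'), (2, 'b2')]
# 3 [(3, 'b3')]
# 4 [(4, 'a4'), (4, 'b4')]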

I put together a simple example that merges and groups the files and produces two output files rather than a dictionary (so (almost) nothing is held in memory; everything is read from and written to disk):

from heapq import merge
from itertools import groupby

first_file_name = 'f1.txt'
other_files = ['a1.txt', 'a2.txt']

def get_lines(filename):
    # tag every row with its source filename so we can tell
    # which file a merged line came from
    with open(filename, 'r') as f_in:
        for line in f_in:
            yield [filename, *line.strip().split()]

def get_values(lines):
    # yield the real lines, then an endless 'null' sentinel so the
    # per-file loop below can call next() safely past the end
    for line in lines:
        yield line
    while True:
        yield ['null']

opened_files = [get_lines(f) for f in [first_file_name] + other_files]

# consume and save the header row of each file
headers = [next(f) for f in opened_files]

with open('out1.txt', 'w') as out1, open('out2.txt', 'w') as out2:
    # print headers to files
    print(*headers[0][1:6], sep='\t', file=out1)

    new_header = []
    for h in headers[1:]:
        new_header.extend(h[6:])

    print(*(['ID'] + new_header), sep='\t', file=out2)

    # merge() interleaves the pre-sorted files by the (A, B) key and is
    # stable, so within each group the f1.txt line comes first;
    # groupby() then yields all lines sharing one key at a time
    for v, g in groupby(merge(*opened_files, key=lambda k: (k[1], k[2])), lambda k: (k[1], k[2])):
        lines = [*g]

        print(*lines[0][1:6], sep='\t', file=out1)

        # walk the other files in order: take their extra columns if
        # they contributed a line for this key, otherwise write nulls
        out_line = [lines[0][1]]
        iter_lines = get_values(lines[1:])
        current_line = next(iter_lines)
        for current_file in other_files:
            if current_line[0] == current_file:
                out_line.extend(current_line[6:])
                current_line = next(iter_lines)
            else:
                out_line.extend(['null', 'null'])
        print(*out_line, sep='\t', file=out2)

This produces two files:

out1.txt

A   B   num val scol
1   a1  1000    2   3
2   a2  456 7   2
3   a3  23  2   7
4   a4  800 7   3
5   a5  10  8   7

out2.txt

ID  fcol    dcol    fcol2   dcol1
1   0.2 0.77    null    null
2   0.3 0.4 0.7 0.8
3   0.5 0.6 null    null
4   0.003   0.088   0.9 0.01
5   null    null    0.03    0.07
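
Since merge() and groupby() are both lazy, peak memory is bounded by one key group plus one buffered line per input file, no matter how large the files are. One caveat: a generator (and its open file handle) stays alive per input file for the whole run, which may matter at the scale of ~1000 files mentioned in the question.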