Python: Unpacking a list of objects into a dictionary

Date: 2016-08-22 19:35:39

Tags: python numpy dictionary multiprocessing

I have a list of objects that I need to unpack into a dictionary efficiently. There are more than 2,000,000 objects in the list, and the operation takes over 1.5 hours to complete. I would like to know whether this can be done more efficiently. The objects in the list are based on this class:

class ResObj:
    def __init__(self, index, result):
        self.loc = index   # the location where the values go in the final result dictionary
        self.res = result  # a dictionary holding the values for this location

# Example instance:
#   obj.loc = 2
#   obj.res = {'value1': 5.4, 'value2': 2.3,
#              'valuen': {'sub_value1': 4.5, 'sub_value2': 3.4, 'sub_value3': 7.6}}

Currently I do this with the following function:

import numpy

def make_final_result(list_of_results):
    no_sub_result_variables = ['value1', 'value2']
    sub_result_variables = ['valuen']
    sub_value_variables = ['sub_value1', 'sub_value2', 'sub_value3']

    final_result = {}
    num_of_results = len(list_of_results)
    for var in no_sub_result_variables:
        final_result[var] = numpy.zeros(num_of_results)
    for var in sub_result_variables:
        final_result[var] = {sub_var:numpy.zeros(num_of_results) for sub_var in sub_value_variables}

    for obj in list_of_results:
        i = obj.loc
        result = obj.res
        for var in no_sub_result_variables:
            final_result[var][i] = result[var]
        for var in sub_result_variables:
            for name in sub_value_variables:
                try:
                    final_result[var][name][i] = result[var][name]
                except KeyError as e:
                    ##TODO Add a debug check
                    pass

I tried to parallelize this with multiprocessing.Manager().dict() and Manager().Array(), but I could only ever get 2 processes to do work (even though I manually set the number of processes to the CPU count, 24). Can you suggest a faster way to do this? Thank you.

3 Answers:

Answer 0 (score: 2):

Nested numpy arrays don't seem to be the best way to structure your data. You can use numpy's structured arrays to build a more intuitive data structure.

Generating the data in the following way creates 2,000,000-element arrays in about 2 seconds on my machine:

import numpy as np

# example values
values = [
    {
        "v1": 0,
        "v2": 1,
        "vs": {
            "x": 2,
            "y": 3,
            "z": 4,
        }
    },
    {
        "v1": 5,
        "v2": 6,
        "vs": {
            "x": 7,
            "y": 8,
            "z": 9,
        }
    }
]

def value_to_record(value):
    """Take a dictionary and convert it to an array-like format"""
    return (
        value["v1"],
        value["v2"],
        (
            value["vs"]["x"],
            value["vs"]["y"],
            value["vs"]["z"]
        )
    )

# define what a record looks like -- f8 is an 8-byte float
dtype = [
    ("v1", "f8"),
    ("v2", "f8"),
    ("vs", [
        ("x", "f8"),
        ("y", "f8"),
        ("z", "f8")
    ])
]

# create the actual array
arr = np.fromiter(map(value_to_record, values), dtype=dtype, count=len(values))

# access an individual record
print(arr[0])  # prints (0.0, 1.0, (2.0, 3.0, 4.0))

# access a specific value
assert arr[0]['vs']['x'] == 2

# access all values of a specific field
print(arr['v2'])  # prints [ 1.  6.]
assert arr['v2'].sum() == 7

To make this work for your ResObj objects, sort them by the loc attribute, then pass the res attribute of each one to the value_to_record function.
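For instance, a minimal sketch of that last step, assuming a dtype and a converter adapted to the question's key names (res_dtype and res_to_record are hypothetical names, not part of the answer above):

# hypothetical adaptation to the question's key names;
# continues from the example above (np is numpy)
res_dtype = [
    ("value1", "f8"),
    ("value2", "f8"),
    ("valuen", [
        ("sub_value1", "f8"),
        ("sub_value2", "f8"),
        ("sub_value3", "f8"),
    ]),
]

def res_to_record(res):
    """Convert one ResObj.res dictionary into a record tuple."""
    sub = res["valuen"]
    return (res["value1"], res["value2"],
            (sub["sub_value1"], sub["sub_value2"], sub["sub_value3"]))

# sort by target location, then build the whole array in one pass
sorted_results = sorted(list_of_results, key=lambda obj: obj.loc)
arr = np.fromiter((res_to_record(obj.res) for obj in sorted_results),
                  dtype=res_dtype, count=len(sorted_results))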

Answer 1 (score: 1):

You can distribute the work between processes by key name. Here I create a pool of workers and pass each one a var name and an optional sub-variable name. The huge dataset is shared with the workers through a cheap fork. Unpacker.unpack picks the specified variable out of the ResObj objects and returns it as an np.array. The main loop in make_final_result combines the arrays into final_result. Python 2:

from collections import defaultdict
from multiprocessing import Pool
import numpy as np

class ResObj(object):
    def __init__(self, index=None, result=None):
        self.loc = index ### This is the location, where the values should go in the final result dictionary
        self.res = result ### This is a dictionary that has values for this location.

        self.loc = 2
        self.res = {'value1':5.4, 'value2':2.3, 'valuen':{'sub_value1':4.5, 'sub_value2':3.4, 'sub_value3':7.6}}

class Unpacker(object):
    @classmethod
    def cls_init(cls, list_of_results):
        cls.list_of_results = list_of_results

    @classmethod
    def unpack(cls, var, name):

        list_of_results = cls.list_of_results
        result = np.zeros(len(list_of_results))
        if name is None:
            for i, it in enumerate(list_of_results):
                result[i] = it.res[var]
        else:
            for i, it in enumerate(list_of_results):
                result[i] = it.res[var][name]
        return var, name, result

# Pool.map doesn't accept instance methods, so we use a wrapper
def Unpacker_unpack((var, name),):
    return Unpacker.unpack(var, name)


def make_final_result(list_of_results):
    no_sub_result_variables = ['value1', 'value2']
    sub_result_variables = ['valuen']
    sub_value_variables = ['sub_value1', 'sub_value2', 'sub_value3']

    pool = Pool(initializer=Unpacker.cls_init, initargs=(list_of_results, ))
    final_result = defaultdict(dict)

    def key_generator():
        for var in no_sub_result_variables:
            yield var, None
        for var in sub_result_variables:
            for name in sub_value_variables:
                yield var, name

    for var, name, result in pool.imap(Unpacker_unpack, key_generator()):
        if name is None:
            final_result[var] = result
        else:
            final_result[var][name] = result
    return final_result

if __name__ == '__main__':
    print make_final_result([ResObj() for x in xrange(10)])

Make sure you are not on Windows: it lacks fork, so multiprocessing has to transmit the whole dataset to each of the 24 worker processes. Hope this helps.

Answer 2 (score: 0):

Remove some indentation to make the loops non-nested:

for obj in list_of_results:
    i = obj.loc
    result = obj.res
    for var in no_sub_result_variables:
        final_result[var][i] = result[var]
    for var in sub_result_variables:
        for name in sub_value_variables:
            try:
                final_result[var][name][i] = result[var][name]
            except KeyError as e:
                ##TODO Add a debug check
                pass
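The answer doesn't show the dedented version, so here is one hedged reading of that advice, an interpretation rather than the answerer's confirmed intent: invert the loops so that each output array is filled in its own flat pass over the list, hoisting the repeated dictionary lookups out of the hot loop.

# Hypothetical un-nested version: one flat pass per output array.
# Assumes the same variables as make_final_result in the question.
for var in no_sub_result_variables:
    out = final_result[var]              # hoist the lookup out of the hot loop
    for obj in list_of_results:
        out[obj.loc] = obj.res[var]

for var in sub_result_variables:
    for name in sub_value_variables:
        out = final_result[var][name]    # hoisted lookup
        for obj in list_of_results:
            sub = obj.res[var]
            if name in sub:              # replaces the try/except KeyError
                out[obj.loc] = sub[name]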