Question

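I am working with a very large dataset (about 75 million entries), and I am trying to reduce the time my code takes to run (with a loop it currently takes a couple of days) while keeping memory usage extremely low.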

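I have two numpy arrays of the same length (clients and units). My goal is to find, for each value that appears in the first array (clients), the list of indices at which it occurs, and then to sum the entries of the second array (units) at those indices.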

This is what I have tried (np is the previously imported numpy library):

# create a list of each value that appears in clients
unq = np.unique(clients)
arr = np.zeros(len(unq))
tmp = np.arange(len(clients))
# for each unique value i in clients
for i in range(len(unq)) :
    #create a list inds of all the indices that i occurs in clients
    inds = tmp[clients==unq[i]]
    # add the sum of all the elements in units at the indices inds to a list
    arr[i] = sum(units[inds])

Does anyone know a way to compute these sums without looping over every element of unq?

Answer 1

With Pandas, this can be done easily using the groupby() function:

import pandas as pd

# some fake data
df = pd.DataFrame({'clients': ['a', 'b', 'a', 'a'], 'units': [1, 1, 1, 1]})

# print() call form works on Python 3 (and on Python 2 for a single argument)
print(df.groupby(['clients'], sort=False).sum())

This gives you the desired output:

         units
clients       
a            3
b            1

I use the sort=False option because it may give a speed-up (by default the group keys are sorted, which can take some time on large datasets).
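For the two numpy arrays in the question, the same idea can be applied without first building a two-column DataFrame. The following is only a minimal sketch: the small clients and units arrays are made-up stand-ins for the real data, and it assumes both arrays are one-dimensional and of equal length.

```python
import numpy as np
import pandas as pd

# Made-up stand-ins for the question's arrays (assumption: 1-D, same length)
clients = np.array([3, 1, 3, 2, 1, 3])
units = np.array([10.0, 5.0, 2.0, 7.0, 1.0, 4.0])

# Wrap units in a Series and group it by the matching client ids.
# sort=False keeps the groups in order of first appearance.
totals = pd.Series(units).groupby(clients, sort=False).sum()

# totals is a Series indexed by client id; totals.index holds the unique
# clients and totals.values holds the per-client sums.
print(totals)
```

Whether this is fast enough and light enough on memory for 75 million entries would need to be measured on the real data.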

Answer 2

This is a typical grouping operation, which can be performed elegantly and efficiently with the numpy-indexed package (disclaimer: I am its author):

import numpy_indexed as npi
unique_clients, units_per_client = npi.group_by(clients).sum(units)

Note that, unlike the pandas approach, there is no need to create a temporary data structure to perform this kind of basic operation.
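For comparison, a grouped sum of this kind can also be sketched in plain numpy with np.unique and np.bincount. This is not taken from the answer above and is not how numpy-indexed is necessarily implemented; the arrays are made-up stand-ins, and the sketch assumes units is numeric and both arrays are one-dimensional.

```python
import numpy as np

# Made-up stand-ins for the question's arrays (assumption: 1-D, same length)
clients = np.array([3, 1, 3, 2, 1, 3])
units = np.array([10.0, 5.0, 2.0, 7.0, 1.0, 4.0])

# unq: sorted unique client values
# inverse: for each entry of clients, the position of its value in unq
unq, inverse = np.unique(clients, return_inverse=True)

# Sum units into one bin per unique client, replacing the explicit loop
arr = np.bincount(inverse, weights=units, minlength=len(unq))

print(unq)  # unique clients, sorted
print(arr)  # sum of units for each entry of unq
```

Here arr[i] corresponds to unq[i], matching the layout of the loop in the question. Note that np.unique sorts the input internally, so the cost of that sort on the full dataset should be checked.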
