Question

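I have to write a script in Python. I have a long list of integers, all of them lengths on a particular scale, with repeats of course. I need to find the best "intervals" to obtain balanced chunks. An example: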

[1,2,2,5,2,3,5,4,5]

Using Counter and sorting the result, I get

[(1,1), (2,3), (3,1), (4,1), (5,3)]

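If I need two buckets, I count the total number of elements (9 in this case) and divide it by the number of buckets (2), so each bucket should hold about 4 elements. In my code I walk through the list of tuples, summing the counts until the running total is at least 4, so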

(1,1) >= 4? False
(1,1) + (2,3) = 4 >=4? True, break;

So the first interval is 1-2, and then

(3,1) >=4? False
(3,1)+(4,1) >=4? False
(3,1)+(4,1)+(5,3) >=4? True

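So the second interval is 3-5.

My dataset has hundreds of thousands of elements, so this task (counting, sorting, parsing) is very time-consuming. Is there a way to speed it up?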
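For reference, the procedure described in the question can be sketched roughly as follows. This is only an illustration of the steps as stated, not the asker's actual code: the function name `greedy_intervals` and its parameters are made up here, and the input list is chosen to match the Counter output shown in the question. Note that counting is linear in the number of elements, while sorting only touches the distinct values.

```python
from collections import Counter


def greedy_intervals(values, n_buckets):
    """Close an interval each time the running count reaches the target size."""
    counts = sorted(Counter(values).items())       # [(value, count), ...]
    total = sum(c for _, c in counts)
    target = max(1, total // n_buckets)            # about 4 in the example
    intervals, start, running = [], None, 0
    for value, count in counts:
        if start is None:
            start = value                          # first value of a new interval
        running += count
        if running >= target:                      # same ">= 4?" check as in the question
            intervals.append((start, value))
            start, running = None, 0
    if start is not None:                          # leftover values below the target
        intervals.append((start, value))
    return intervals


print(greedy_intervals([1, 2, 2, 2, 3, 4, 5, 5, 5], 2))
# [(1, 2), (3, 5)]
```

Because the target is rounded down, this sketch may yield one more interval than requested on some inputs.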

Answer 1

Here is one way to create contiguous buckets of roughly equal size. It uses collections.Counter, heapq.merge, itertools.accumulate and itertools.groupby, making good use of the standard library.

from itertools import groupby, accumulate
from heapq import merge
from collections import Counter
from math import sin, pi
import random

# make test data a bit uneven
def mock_data(N):
    return [int(sin(2*pi*random.random())*50 + 50) for _ in range(N)]

N = 1000000

data = mock_data(N)

counts = Counter(data)
srtcnts = sorted(counts.items())

k = 7 # number of buckets

slabels, scounts = zip(*srtcnts)
# cumulative count at the midpoint of each value's run of duplicates
bincntrs = (a - c/2 for a, c in zip(accumulate(scounts), scounts))
# mix in the ideal boundaries 0, ceil(N/k), 2*ceil(N/k), ... as 1-tuples
split = merge(zip(bincntrs, slabels), zip(range(0, N, -(-N//k))))
# groupby(len) separates boundaries (length 1) from (center, label) pairs
# (length 2); keep only the labels between consecutive boundaries
res = [[v[1] for v in grp] for size, grp in groupby(split, len) if size == 2]

print(res)
# show they are balanced
print([sum(counts[i] for i in chunk) for chunk in res])

Sample output:

[[0, 1, 2, 3, 4], [5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18], [19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38], [39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60], [61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80], [81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94], [95, 96, 97, 98, 99]]
[143297, 143387, 142010, 141358, 143224, 143617, 143107]
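To see what the merge/groupby step does on a small input, the same idea can be wrapped in a function and run on data matching the question's counts. This is a sketch under assumptions: the name `balanced_chunks` is hypothetical, and the input is assumed to be non-empty.

```python
from collections import Counter
from heapq import merge
from itertools import accumulate, groupby


def balanced_chunks(values, n_buckets):
    """Group sorted distinct values into contiguous chunks of similar total count."""
    counts = Counter(values)
    labels, sizes = zip(*sorted(counts.items()))   # assumes values is non-empty
    total = sum(sizes)
    step = -(-total // n_buckets)                  # ceiling division
    # cumulative count at the midpoint of each value's run of duplicates
    centers = (a - c / 2 for a, c in zip(accumulate(sizes), sizes))
    # interleave (center, label) pairs with 1-tuple boundaries, in sorted order
    marked = merge(zip(centers, labels), zip(range(0, total, step)))
    # runs of 2-tuples between boundaries form the chunks
    return [[pair[1] for pair in grp]
            for size, grp in groupby(marked, len) if size == 2]


print(balanced_chunks([1, 2, 2, 2, 3, 4, 5, 5, 5], 2))
# [[1, 2, 3], [4, 5]]
```

Here the midpoints are 0.5, 2.5, 4.5, 5.5, 7.5 and the boundaries are 0 and 5, so the chunks hold 5 and 4 elements. The question's running-sum approach splits the same data as 1-2 and 3-5 (4 and 5 elements); both splits are about equally balanced.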

Divide a list of lengths into balanced chunks
