Question

I want to chunk an input stream for batch processing. Given an input list or generator,

x_in = [1, 2, 3, 4, 5, 6 ...]

I want a function that returns chunks of that input. Say, with chunk_size=4, then,

x_chunked = [[1, 2, 3, 4], [5, 6, ...], ...]

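This is something I do over and over, and I was wondering whether there is a more standard way than writing it myself. Am I missing something in itertools? (One could solve the problem with enumerate and groupby, but that feels clunky.) In case anyone wants to see an implementation, here it is,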

def chunk_input_stream(input_stream, chunk_size):
    """partition a generator in a streaming fashion"""
    assert chunk_size >= 1
    accumulator = []
    for x in input_stream:
        accumulator.append(x)
        if len(accumulator) == chunk_size:
            yield accumulator
            accumulator = []
    if accumulator:
        yield accumulator
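
For comparison, the enumerate and groupby route mentioned above could look roughly like this. It is only a sketch, and the name chunk_with_groupby is made up for illustration. Each item is paired with its index, and items are grouped on index // chunk_size, so it still works in a streaming fashion.

```python
from itertools import groupby


def chunk_with_groupby(input_stream, chunk_size):
    """Chunk a stream by grouping items on index // chunk_size.

    Hypothetical helper, shown only to illustrate the enumerate/groupby idea.
    """
    assert chunk_size >= 1
    indexed = enumerate(input_stream)
    # Consecutive items share the same key until the index crosses a multiple
    # of chunk_size, so groupby emits one group per chunk.
    for _, group in groupby(indexed, key=lambda pair: pair[0] // chunk_size):
        yield [item for _, item in group]


print(list(chunk_with_groupby(iter(range(1, 7)), 4)))  # [[1, 2, 3, 4], [5, 6]]
```

It works, but the index bookkeeping is exactly the clunky part.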

Edit

Inspired by kreativitea's answer, here is a solution using islice, which is straightforward and doesn't require post-filtering,

from itertools import islice

def chunk_input_stream(input_stream, chunk_size):
    while True:
        chunk = list(islice(input_stream, chunk_size))
        if chunk:
            yield chunk
        else:
            return

# test it with list(chunk_input_stream(iter([1, 2, 3, 4]), 3))
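
A note for readers on newer Python versions: itertools gained a batched function in Python 3.12, which does this chunking directly and yields tuples rather than lists. The sketch below uses it when available and otherwise falls back to an islice loop. The fallback is my own stand-in, not the standard library's code.

```python
from itertools import islice

try:
    from itertools import batched  # available on Python 3.12 and later
except ImportError:
    def batched(iterable, n):
        """Fallback sketch for older versions: yield tuples of up to n items."""
        if n < 1:
            raise ValueError("n must be at least one")
        iterator = iter(iterable)
        while True:
            chunk = tuple(islice(iterator, n))
            if not chunk:
                return
            yield chunk


print(list(batched(iter([1, 2, 3, 4]), 3)))  # [(1, 2, 3), (4,)]
```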

Answer 1

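From the itertools recipes:

from itertools import izip_longest  # Python 2 name; in Python 3 it is zip_longest

def grouper(n, iterable, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx
    # The same iterator repeated n times, so each output tuple pulls n consecutive items.
    args = [iter(iterable)] * n
    return izip_longest(fillvalue=fillvalue, *args)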
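
On Python 3, izip_longest is called zip_longest. Here is a sketch of the same recipe there, plus one possible way to strip the padding from the last chunk when a ragged final chunk is wanted, as in the question. The grouper_unpadded helper and the _PAD sentinel are names invented for this example.

```python
from itertools import zip_longest


def grouper(n, iterable, fillvalue=None):
    """Collect data into fixed-length chunks, padding the last one."""
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)


_PAD = object()  # unique sentinel, so real None values in the data survive


def grouper_unpadded(n, iterable):
    """Like grouper, but drop the padding from the final chunk (hypothetical helper)."""
    for group in grouper(n, iterable, fillvalue=_PAD):
        yield [item for item in group if item is not _PAD]


print(list(grouper(3, 'ABCDEFG', 'x')))
# [('A', 'B', 'C'), ('D', 'E', 'F'), ('G', 'x', 'x')]
print(list(grouper_unpadded(3, range(8))))
# [[0, 1, 2], [3, 4, 5], [6, 7]]
```

This is the post-filtering step the question's islice version avoids.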

Answer 2

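[Thanks to the OP for the updated version: I've been throwing yield from at everything in sight since I upgraded, and it didn't even occur to me that I didn't need it here.]

Oh, what the heck: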

from itertools import takewhile, islice, count

def chunk(stream, size):
    return takewhile(bool, (list(islice(stream, size)) for _ in count()))

gives:

>>> list(chunk((i for i in range(3)), 3))
[[0, 1, 2]]
>>> list(chunk((i for i in range(6)), 3))
[[0, 1, 2], [3, 4, 5]]
>>> list(chunk((i for i in range(8)), 3))
[[0, 1, 2], [3, 4, 5], [6, 7]]

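Warning: the above suffers from the same problem as the OP's chunk_input_stream if the input is a list. You can get around this with an extra iter() wrap, but that's less pretty. Conceptually, using repeat or cycle might make more sense than count(), but I was character-counting for some reason. :^)

[FTR: no, I'm still not entirely serious about this, but hey, it's a Monday.]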
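
To make the list problem concrete: islice on a list starts from the beginning on every call, so the same first chunk comes back forever. A sketch of the iter() wrap mentioned above, which assumes nothing beyond the standard library:

```python
from itertools import count, islice, takewhile

data = [0, 1, 2, 3, 4, 5, 6, 7]

# A list is not consumed by islice, so each call restarts at index 0.
print(list(islice(data, 3)), list(islice(data, 3)))  # [0, 1, 2] [0, 1, 2]


def chunk(stream, size):
    # Wrapping in iter() gives one shared iterator that is actually consumed,
    # so successive islice calls pick up where the previous one stopped.
    iterator = iter(stream)
    return takewhile(bool, (list(islice(iterator, size)) for _ in count()))


print(list(chunk(data, 3)))  # [[0, 1, 2], [3, 4, 5], [6, 7]]
```

Calling iter() on something that is already an iterator returns it unchanged, so generators still work.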

Answer 3

Is there any reason you're not using something like this?:

# data is your stream, n is your chunk length
[data[i:i+n] for i in xrange(0,len(data),n)]

Edit

Since people are making generators......

def grouper(data, n):
    results = [data[i:i+n] for i in xrange(0,len(data),n)]
    for result in results:
        yield result

Edit 2:

I was thinking, if you keep the input stream in memory as a deque, you can .popleft very efficiently to yield n objects.

from collections import deque

stream = deque(data)

def chunk(stream, n):
    """ Returns the next chunk from a data stream. """
    return [stream.popleft() for i in xrange(n)]

def chunks(stream, n, reps):
    """ If you want to yield more than one chunk. """
    for item in [chunk(stream, n) for i in xrange(reps)]:
        yield item
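
One caveat with the deque version: popleft raises IndexError once the deque is empty, so a final chunk shorter than n fails. A possible variant that keeps going until the deque is drained and allows a short last chunk is sketched below. The name chunks_from_deque is made up for this example, and it is written for Python 3 (range instead of xrange).

```python
from collections import deque


def chunks_from_deque(stream, n):
    """Yield lists of up to n items, consuming the deque from the left.

    Hypothetical variant of the chunk/chunks pair above.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    while stream:
        # Never pop more items than are left, so the last chunk may be short.
        yield [stream.popleft() for _ in range(min(n, len(stream)))]


stream = deque(range(8))
print(list(chunks_from_deque(stream, 3)))  # [[0, 1, 2], [3, 4, 5], [6, 7]]
```

Like the original, this needs the whole input in memory first, so it doesn't help with a true generator.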

Python: is there a library function for chunking an input stream?

3 answers: