Question

I have access to a generator that yields pairs of values:

def get_document_values():
    docs = query_database()  # returns a cursor to database documents
    for doc in docs:
        # doc is a dictionary such as {'x': 1, 'y': 99}
        yield doc['x'], doc['y']

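I also have another function, process_x, which I cannot change. It takes a generator as input and processes the x of every document (if the generator yields tuples, it processes only the first element of each tuple and ignores the rest):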

X = process_x(get_document_values())  # This processes x but ignores y

However, I also need to store all the y values from the generator. My only solution so far is to run get_document_values twice:

Y = [y for x,y in get_document_values()]  #Throw away x
X = process_x(get_document_values())      #Throw away y

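Technically this works, but when there are many documents to process, a new document may be inserted into the database between the two passes, and X and Y will then have different lengths. X and Y need a one-to-one mapping, and I would like to call get_document_values only once instead of twice.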

I have considered something like this:

Y = []

def process_y(doc_generator):
    global Y
    for x,y in doc_generator:
        Y.append(y)
        yield x

X = process_x(process_y(get_document_values()))

However:

This does not feel Pythonic
Y needs to be declared as a global variable

I am hoping there is a cleaner, more Pythonic way to do this.

Update

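In practice, the x values returned by get_document_values are too large to hold in memory all at once, and process_x is what actually reduces that memory requirement. Caching all the x values is therefore not possible. Caching all the y values is fine.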

Answer 1

One option is to cache all the values in a list first (this only works when every x fits in memory):

all_values = [(x,y) for x,y in get_document_values()] #or list(get_document_values())

You can then get an iterator over the y values with:

from operator import itemgetter
Y = map(itemgetter(1), all_values)

For x, simply use:

X = process_x(map(itemgetter(0), all_values))

Another option is to separate the generators, for example:

def get_document_values(getter):
    docs = query_database()  # returns a cursor to database documents
    for doc in docs:
        # doc is a dictionary such as {'x': 1, 'y': 99}
        yield getter(doc)

from operator import itemgetter
X = process_x(get_document_values(itemgetter("x")))
Y = list(get_document_values(itemgetter("y")))

This way you have to run the query twice. If you find a way to run the query once and duplicate the cursor, you can abstract that as well:

def get_document_values(cursor, getter):
    for doc in cursor:
        # doc is a dictionary such as {'x': 1, 'y': 99}
        yield getter(doc)

Answer 2

Without needing to store the data:

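import itertools

def process_entry(x, y):
    process_x((x,))  # feed process_x a one-element iterable holding this x
    return y

# your_generator is the pair generator, e.g. get_document_values()
ys = itertools.starmap(process_entry, your_generator)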

Keep in mind that each x value is processed only when its corresponding y value is retrieved.

If you want both, return them together as a tuple:

def process_entry(x, y):
    return next(process_x((x,))), y
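To make the laziness concrete, here is a minimal runnable sketch. The process_x below is a hypothetical stand-in (a lazy generator that doubles each value and logs what it has seen), and get_document_values is mocked with an in-memory list; neither is the asker's real function or database.

```python
import itertools

processed = []  # records the order in which x values are processed

def process_x(items):
    """Hypothetical stand-in for the real process_x: lazily doubles each item."""
    for item in items:
        x = item[0] if isinstance(item, tuple) else item
        processed.append(x)
        yield x * 2

def get_document_values():
    """Mock of the question's generator, backed by an in-memory list."""
    docs = [{'x': 1, 'y': 99}, {'x': 2, 'y': 98}, {'x': 3, 'y': 97}]
    for doc in docs:
        yield doc['x'], doc['y']

def process_entry(x, y):
    # next() forces the lazy stand-in to process this single x
    return next(process_x((x,))), y

pairs = itertools.starmap(process_entry, get_document_values())

first = next(pairs)                 # only the first document is touched so far
seen_after_first = list(processed)  # snapshot of the log at this point
rest = list(pairs)                  # draining the iterator processes the rest
```

Because each x and y come from the same pass over the generator, they stay aligned one to one. The cost is that process_x is invoked once per document rather than once for the whole stream, which may or may not suit the real function.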

Answer 3

You may want to use itertools.tee to create two iterators from a single one, then use one of them for process_x and the other for your other purpose.
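A minimal sketch of the tee approach follows. The in-memory get_document_values and the summing process_x are hypothetical stand-ins for the asker's database generator and real function.

```python
import itertools
from operator import itemgetter

def get_document_values():
    """Mock of the question's generator, backed by an in-memory list."""
    docs = [{'x': 1, 'y': 99}, {'x': 2, 'y': 98}, {'x': 3, 'y': 97}]
    for doc in docs:
        yield doc['x'], doc['y']

def process_x(items):
    """Hypothetical stand-in: sums the first element of every tuple."""
    return sum(item[0] for item in items)

# tee gives two independent iterators over a single pass of the source
for_x, for_y = itertools.tee(get_document_values())

X = process_x(for_x)                 # consumes the first copy
Y = list(map(itemgetter(1), for_y))  # y values from the second copy
```

One caveat: tee buffers every item that one copy has consumed and the other has not. Since process_x drains for_x before for_y is read, all (x, y) pairs are held in memory at that point, so this likely does not fit the case where the x values are too large to cache.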

Answer 4

This may not be Pythonic, but if you are allowed to change the main generator slightly and make use of function attributes, you can cheat:

from random import randrange

def get_vals():
    # mock creation of an x/y dict list
    docs = [{k: k + str(randrange(50)) for k in ('x', 'y')} for _ in range(10)]
    # create a list as a function attribute (reset each time a new generator starts)
    get_vals.y = []
    for doc in docs:
        # store the y value in the attribute
        get_vals.y.append(doc['y'])
        yield doc['x'], doc['y']
        # if doc['y'] is purely for storage, you might opt not to yield it at all

Testing it:

# mock the consuming of the generator by process_x
for i in get_vals():
    print(i)
# ('x13', 'y9'), ('x15', 'y40'), ('x41', 'y49')...

# access the ys stored in the get_vals function attribute after consumption
print(get_vals.y)
# ['y9', 'y40', 'y49', ...]

# instantiate the generator a second time a la process_x...
for i in get_vals():
    print(i)
# ('x18', 'y0'), ('x6', 'y35'), ('x24', 'y45')...

# access the cached y again
print(get_vals.y)
# ['y0', 'y35', 'y45', ...]

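This essentially caches the y values while the generator yields its x values on each call.
It gets rid of your global keyword.
You can be sure the x/y mapping is correct.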

Some may consider this a hack, but I like to think of it as a feature: since everything in Python is an object, it lets you get away with this...
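The same idea of keeping the y values on an object rather than in a global can also be sketched with a small iterator class. This is a variation and not the answer's own code: YCollector, its ys attribute, the mocked get_document_values, and the summing process_x are all hypothetical names chosen for illustration.

```python
class YCollector:
    """Iterates over (x, y) pairs, yielding x and storing each y in self.ys."""

    def __init__(self, pairs):
        self._pairs = iter(pairs)
        self.ys = []

    def __iter__(self):
        return self

    def __next__(self):
        x, y = next(self._pairs)  # StopIteration propagates when exhausted
        self.ys.append(y)
        return x

def get_document_values():
    """Mock of the question's generator, backed by an in-memory list."""
    docs = [{'x': 1, 'y': 99}, {'x': 2, 'y': 98}, {'x': 3, 'y': 97}]
    for doc in docs:
        yield doc['x'], doc['y']

def process_x(items):
    """Hypothetical stand-in for the real process_x: sums the x values."""
    return sum(items)

collector = YCollector(get_document_values())
X = process_x(collector)  # a single pass; only one x is held at a time
Y = collector.ys          # the y values gathered during that same pass
```

Each collector has its own ys list, so two passes do not overwrite each other's results the way a shared function attribute would.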

Python: splitting a generator's output into two parts
