Question

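I have a 2D NumPy array whose number of columns is 100 times its number of rows. For example, with 1000 rows there are 100,000 columns, and all values are integers. My goal is to return 1000 unique integers, one for each of the 1000 row indices. The values are not all unique (there may be duplicates), so for each row I have to scan its values and pick the first integer that has not already been picked for a previous row. I have this reproducible loop, which works for a smaller num_rows of around 1000. With more than 10,000 rows, however, it is very slow. Is there a more efficient way to solve this?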

import numpy as np
maxval = 5000
matrix = np.random.randint(maxval,size=(maxval, maxval*100))
neighbours = maxval - 1
indices = [] #this array will contain the outputs after the loop gets completed
for e in matrix:
    i = 0
    while i < neighbours:
        if e[i] in indices:
            i += 1
        else:
            indices.append(e[i])
            break

Answer 1

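This is not a NumPy approach, but if row has 100,000 elements, then

import random

# random.sample needs a sequence; passing a set directly raises TypeError since Python 3.11
random.sample(list(set(row)), 1000)

is a random sample of 1000 of its unique elements.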

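Notes:

If one number occurs more often than another, both still have the same chance of being picked.
If there are fewer than 1000 unique values, this raises ValueError.
There is probably a NumPy equivalent of both steps that I am not aware of.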
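A possible NumPy counterpart of the two steps above, as a minimal sketch rather than a drop-in replacement: `np.unique` plays the role of `set(row)`, and `Generator.choice` with `replace=False` plays the role of `random.sample`. The seed, the value range and the `row` array are assumptions made up for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)                 # seed chosen only for repeatability
row = rng.integers(0, 5000, size=100_000)      # stand-in for one row of the matrix

unique_vals = np.unique(row)                   # sorted distinct values in the row
# replace=False draws without replacement, so the 1000 picks are all different;
# it raises ValueError if there are fewer than 1000 distinct values.
sample = rng.choice(unique_vals, size=1000, replace=False)
```

As with `random.sample`, how often a value occurs in the row does not affect its chance of being drawn, because duplicates are removed first.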

Answer 2

You can use a set instead of a list for the lookups:

import numpy as np
maxval = 50
matrix = np.random.randint(maxval,size=(maxval, maxval*100))
neighbours = maxval - 1
indices = set() # this set will contain the outputs once the loop completes
for e in matrix:
    i = 0
    while i < neighbours:
        if e[i] in indices:
            i += 1
        else:
            indices.add(e[i])
            break

Here is a live example.
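One thing a bare set does not record is which row picked which value, or whether a row found nothing new. The sketch below keeps a set for the membership test and a list for the per-row result. The function name `first_unseen_per_row` and the `limit` parameter are hypothetical names introduced for this illustration; `limit` stands in for `neighbours` in the loop above.

```python
import numpy as np

def first_unseen_per_row(matrix, limit=None):
    """For each row, pick the first value not already picked by an earlier row.

    Scans at most `limit` leading columns per row (all columns if None).
    A row with no unseen value in that range gets None.
    """
    seen = set()    # average O(1) membership test, unlike a list
    picks = []      # one entry per row, in row order
    for row in matrix:
        candidates = row if limit is None else row[:limit]
        choice = None
        for value in candidates:
            value = int(value)          # store plain Python ints
            if value not in seen:
                seen.add(value)
                choice = value
                break
        picks.append(choice)
    return picks

demo = np.array([[1, 2, 3],
                 [1, 1, 2],
                 [1, 2, 2]])
print(first_unseen_per_row(demo))  # [1, 2, None]
```

The third row only contains values that earlier rows already took, so it yields None instead of being silently skipped.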

Answer 3

Using a dictionary will be faster, but I don't know whether it is fast enough:

from collections import OrderedDict

indx = OrderedDict()

for e in matrix:
    i = 0
    while i < neighbours:
        v = e[i]
        if indx.get(v) is None:
            indx[v] = True
            break
        i += 1

results = list(indx.keys())
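On Python 3.7 and later a plain dict also preserves insertion order, so the same idea can be sketched without `OrderedDict`. The small matrix, the seed and the name `picked` are assumptions for this illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)                  # seed chosen only for repeatability
matrix = rng.integers(0, 50, size=(50, 5000))   # small stand-in for the real data
neighbours = matrix.shape[0] - 1

picked = {}     # plain dict: keys stay in insertion order on Python 3.7+
for row in matrix:
    for v in row[:neighbours]:
        v = int(v)
        if v not in picked:     # hash lookup, same cost as with OrderedDict
            picked[v] = True
            break

results = list(picked)          # picked values, in the order they were chosen
```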
