Question

I have a list of ~1000 unique items:

import numpy as np
import pandas as pd
np.random.seed(0)
unique1 = sorted(list(np.random.choice(np.arange(2000), 1000, False)))

and a large pandas DataFrame column that contains only integers from this list:

df = pd.DataFrame({'a': np.sort(np.random.choice(unique1[1:], 12000000))})

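What I need to do is create a new column in which each element is the value that comes immediately before the original column's element in the unique list.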

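I tried doing this with apply, but the efficiency is laughable. A plain loop works (about 2 minutes on my system), but I would like to know whether there is a more efficient way to get there (smaller numbers are used below for illustration):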

np.random.seed(0)
unique1 = sorted(list((np.random.choice(np.arange(20), 10, False))))
df = pd.DataFrame({'a': np.sort(np.random.choice(unique1[1:], 15))})

unique2 = unique1[1:]
df['b'] = df.a.apply(lambda x: unique1[unique2.index(x)])

newCol = []
for item in list(df.a):
    newCol.append(unique1[unique2.index(item)])
df['c'] = newCol
print(df, unique1)
     a   b   c
0    2   1   1
1    2   1   1
2    4   2   2
3    6   4   4
4    8   6   6
5    8   6   6
6    8   6   6
7   10   8   8
8   13  10  10
9   13  10  10
10  17  13  13
11  18  17  17
12  18  17  17
13  19  18  18
14  19  18  18 [1, 2, 4, 6, 8, 10, 13, 17, 18, 19]

Answer 1

The problem here is that you are using list.index, which does a linear search over all of the unique values for every row.

If you have the space to build a dictionary, you can turn this into a constant-time lookup:

unique2 = {value: index for index, value in enumerate(unique1[1:])}
df['b'] = df.a.apply(lambda x: unique1[unique2[x]])
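
As a possible variation on the dictionary idea (a sketch, not benchmarked here), the dictionary can map each value straight to its predecessor, and Series.map can then take the place of apply with a lambda. The name prev_of is made up for this illustration, and the small data set from the question is reused:

```python
import numpy as np
import pandas as pd

np.random.seed(0)
unique1 = sorted(np.random.choice(np.arange(20), 10, False).tolist())
df = pd.DataFrame({'a': np.sort(np.random.choice(unique1[1:], 15))})

# Hypothetical name: map every value to the one just before it in unique1.
prev_of = dict(zip(unique1[1:], unique1[:-1]))

# Series.map accepts a dict and looks up each element of the column.
df['b'] = df['a'].map(prev_of)
```

This assumes every value in df.a is a key of prev_of; a value that is missing from the dictionary would come back as NaN rather than raising an error.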

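If you can't (although in that case you should be keeping the values in an array or Series rather than a list in the first place…), then as long as you keep them sorted, you can at least get logarithmic rather than linear time by using bisect or np.searchsorted: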

unique2 = unique1[1:]  # the sorted list again, not the dict built above
df['b'] = df.a.apply(lambda x: unique1[np.searchsorted(unique2, x)])

(This will be faster if unique2 is an array rather than a list, but only by a constant factor; it is still logarithmic time with a list.)
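
For illustration, the same binary-search idea can also be applied to the whole column in a single call, because np.searchsorted accepts an array of query values; a bisect version is included for comparison. This is only a sketch that assumes every value in df.a occurs in unique1[1:]; the names u and pos are made up here:

```python
import bisect
import numpy as np
import pandas as pd

np.random.seed(0)
unique1 = sorted(np.random.choice(np.arange(20), 10, False).tolist())
unique2 = unique1[1:]
df = pd.DataFrame({'a': np.sort(np.random.choice(unique2, 15))})

# Vectorized: one searchsorted call for the entire column.
u = np.asarray(unique1)
pos = np.searchsorted(u[1:], df['a'].to_numpy())  # position of each value in unique2
df['b'] = u[pos]  # the same position in unique1 holds the predecessor

# Per-element equivalent using the standard-library bisect module.
df['c'] = [unique1[bisect.bisect_left(unique2, x)] for x in df['a']]
```

Both columns should match the output of the original loop; the vectorized form avoids calling a Python function once per row.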

Replace column values with the preceding one in a unique list
