Question

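I'm currently working on a high-performance Python 2.7 project that works with lists of several thousand elements. Obviously, every operation has to be as fast as possible.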

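So, I have two lists. One is a list of unique, arbitrary numbers; let's call it A. The other, called B, is a linear list that starts at 1 and has the same length as A; it represents the indices into A (counting from 1).

Something like enumerate, starting at 1.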

For example:

A = [500, 300, 400, 200, 100] # The order here is arbitrary, they can be any integers, but every integer can only exist once
B = [  1,   2,   3,   4,   5] # This is fixed, starting from 1, with exactly as many elements as A

If I have an element of B (call it e_B) and want the corresponding element of A, I can simply do correspond_e_A = A[e_B - 1]. No problem.

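But now I have a huge list of random, non-unique integers, and I want to know the indices of the integers that are in A, and what the corresponding elements of B are.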
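To pin down the expected output, here is a small illustrative case. The values in random_list are made up for this example, and the naive lookups below are only a reference for correctness, not a proposal for the fast version:

```python
A = [500, 300, 400, 200, 100]
B = [1, 2, 3, 4, 5]
random_list = [200, 100, 50, 200]  # made-up sample input

# First question: positions in random_list whose value exists in A
indices_of_existing = [i for i, x in enumerate(random_list) if x in A]

# Second question: the B element belonging to each value that exists in A
corresponding_e_B = [B[A.index(x)] for x in random_list if x in A]

print(indices_of_existing)  # [0, 1, 3]
print(corresponding_e_B)    # [4, 5, 4]
```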

I think I have a reasonable solution for the first question:

indices_of_existing = numpy.nonzero(numpy.in1d(random_list, A))[0]

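The nice thing about this approach is that there is no need to map() single operations: numpy's in1d just returns an array like [True, True, False, True, ...]. Using nonzero() I can get the indices of the elements of random_list that exist in A. Perfect, I think.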

But for the second question, I'm stumped. I tried something like:

corresponding_e_B = map(lambda x: numpy.where(A==x)[0][0] + 1, random_list)

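This correctly gives me the indices, but it is not optimal: first I need a map(), second I need a lambda, and finally numpy.where() does not stop once the item has been found (remember, A has only unique elements), which means it scales horribly with huge datasets like mine.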

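I took a look at bisect, but it seems that bisect only works for single requests, not for lists, which means I would still have to use map() and build my list element by element (that's slow, isn't it?)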

Since I'm quite new to Python, I was hoping someone here might have an idea. Maybe a library I don't know about yet?

Answer 1

I think it would be advisable to use a hash table for your lookups instead of numpy.in1d, which uses an O(n log n) merge sort as a preprocessing step.

>>> A = [500, 300, 400, 200, 100]
>>> index = { k:i for i,k in enumerate(A, 1) }
>>> random_list = [200, 100, 50]
>>> [i for i,x in enumerate(random_list) if x in index]
[0, 1]
>>> map(index.get, random_list)
[4, 5, None]
>>> filter(None, map(index.get, random_list))
[4, 5]

This is Python 2. In Python 3, map and filter return generator-like objects, so you need a list() around the filter if you want the result as a list.
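As a minimal sketch, the same lookup can be written so that it behaves identically on Python 2 and Python 3. The names positions and found are only illustrative:

```python
A = [500, 300, 400, 200, 100]
index = {k: i for i, k in enumerate(A, 1)}  # value -> 1-based position in A
random_list = [200, 100, 50]

# list() forces evaluation on Python 3 and is harmless on Python 2
positions = list(map(index.get, random_list))

# an explicit None check instead of filter(None, ...)
found = [p for p in positions if p is not None]

print(positions)  # [4, 5, None]
print(found)      # [4, 5]
```

Note that filter(None, ...) drops every falsy value, not just None. That is safe here only because the positions start at 1; with 0-based positions, a hit at position 0 would be discarded, which is why this sketch checks for None explicitly.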

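I tried to use built-in functions as much as possible to push the computational burden to the C side (assuming you use CPython). All the names are resolved up front, so it should be pretty fast.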

In general, for maximum performance, you might want to consider using PyPy, a great alternative Python implementation with JIT compilation.

A benchmark comparing several approaches is never a bad idea:

import sys
is_pypy = '__pypy__' in sys.builtin_module_names

import timeit
import random
if not is_pypy:
  import numpy
import bisect

n = 10000
m = 10000
q = 100

A = set()
while len(A) < n: A.add(random.randint(0,2*n))
A = list(A)

queries = set()
while len(queries) < m: queries.add(random.randint(0,2*n))
queries = list(queries)

# these two solve question one (find indices of queries that exist in A)
if not is_pypy:
  def fun11():
    for _ in range(q):
      numpy.nonzero(numpy.in1d(queries, A))[0]

def fun12():
  index = set(A)
  for _ in range(q):
    [i for i,x in enumerate(queries) if x in index]

# these three solve question two (find according entries of B)
def fun21():
  index = { k:i for i,k in enumerate(A, 1) }
  for _ in range(q):
    [index[i] for i in queries if i in index]

def fun22():
  index = { k:i for i,k in enumerate(A, 1) }
  for _ in range(q):
    list(filter(None, map(index.get, queries)))

def findit(keys, values, key):
  # bisect_left returns the position of key itself when it is present
  i = bisect.bisect_left(keys, key)
  if i == len(keys) or keys[i] != key:
    return None
  return values[i]

def fun23():
  keys, values = zip(*sorted((k,i) for i,k in enumerate(A,1)))
  for _ in range(q):
    list(filter(None, [findit(keys, values, x) for x in queries]))

if not is_pypy:
  # note this does not filter out nonexisting elements
  def fun24():
    I = numpy.argsort(A)
    ss = numpy.searchsorted
    maxi = len(I)
    for _ in range(q):   
      a = ss(A, queries, sorter=I)
      I[a[a<maxi]]

tests = ("fun12", "fun21", "fun22", "fun23")
if not is_pypy: tests = ("fun11",) + tests + ("fun24",)

if is_pypy:
  # warmup
  for f in tests:
    timeit.timeit("%s()" % f, setup = "from __main__ import %s" % f, number=20)

# actual timing
for f in tests:
  print("%s: %.3f" % (f, timeit.timeit("%s()" % f, setup = "from __main__ import %s" % f, number=3)))

Results:

$ python2 -V
Python 2.7.6
$ python3 -V
Python 3.3.3
$ pypy -V
Python 2.7.3 (87aa9de10f9ca71da9ab4a3d53e0ba176b67d086, Dec 04 2013, 12:50:47)
[PyPy 2.2.1 with GCC 4.8.2]
$ python2 test.py
fun11: 1.016
fun12: 0.349
fun21: 0.302
fun22: 0.276
fun23: 2.432
fun24: 0.897
$ python3 test.py
fun11: 0.973
fun12: 0.382
fun21: 0.423
fun22: 0.341
fun23: 3.650
fun24: 0.894
$ pypy ~/tmp/test.py
fun12: 0.087
fun21: 0.073
fun22: 0.128
fun23: 1.131

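You can adjust n (the size of A), m (the size of random_list) and q (the number of queries) to your scenario. To my surprise, my attempt to be clever and use built-in functions instead of list comps did not pay off: fun22 is not much faster than fun21 (only ~10% in Python 2 and ~25% in Python 3, and it is almost 75% slower in PyPy). A case of premature optimization. I guess the difference comes from fun22 building an unnecessary temporary list per query in Python 2. We also see that binary search is pretty bad (look at fun23).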
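For quick experiments with other sizes, a scaled-down sketch along the same lines might look like this. The helper names build_inputs, dict_lookup and bisect_lookup are made up for this example, and unlike the benchmark above, the time to build the lookup structure is included in each measurement:

```python
import bisect
import random
import timeit

def build_inputs(n, m, seed=0):
    """Return n unique values (A) and m unique query values."""
    rng = random.Random(seed)
    A = rng.sample(range(2 * n + 1), n)
    queries = rng.sample(range(2 * n + 1), m)
    return A, queries

def dict_lookup(A, queries):
    # hash table: value -> 1-based position in A
    index = {k: i for i, k in enumerate(A, 1)}
    return [index[x] for x in queries if x in index]

def bisect_lookup(A, queries):
    # sorted keys plus a parallel list of 1-based positions
    pairs = sorted((k, i) for i, k in enumerate(A, 1))
    keys = [k for k, _ in pairs]
    values = [i for _, i in pairs]
    out = []
    for x in queries:
        pos = bisect.bisect_left(keys, x)
        if pos < len(keys) and keys[pos] == x:
            out.append(values[pos])
    return out

A, queries = build_inputs(n=1000, m=1000)
# both approaches must agree before their timings are worth comparing
assert dict_lookup(A, queries) == bisect_lookup(A, queries)

for fn in (dict_lookup, bisect_lookup):
    elapsed = timeit.timeit(lambda: fn(A, queries), number=20)
    print("%s: %.4f" % (fn.__name__, elapsed))
```

The absolute numbers depend on the machine and interpreter, so only the relative ordering is meaningful.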

Answer 2

import numpy as np

def numpy_optimized(index, values):
    I = np.argsort(values)
    Q = np.searchsorted(values, index, sorter=I)
    return I[Q]

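This computes the same thing as the OP's code, but with the indices in the order of the queried values, which I imagine is an improvement in functionality. It is twice as fast as the OP's solution on my machine; if I interpret your benchmarks correctly, that would put it slightly ahead of the non-PyPy solutions.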

Or, in case we cannot assume that every entry of index is present in values, and want missing queries to be silently dropped:

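def numpy_optimized_filtered(index, values):
    values = np.asarray(values)
    index = np.asarray(index)
    I = np.argsort(values)
    Q = np.searchsorted(values, index, sorter=I)
    # searchsorted returns len(values) for queries above the maximum; clip to stay in range
    Q = np.minimum(Q, len(values) - 1)
    Z = I[Q]
    return Z[values[Z] == index]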
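To show what the argsort and searchsorted steps do, here is a rough pure-Python analogue of the filtered version. It is a sketch for illustration only (the name argsort_lookup is made up) and it will be far slower than the numpy code:

```python
import bisect

def argsort_lookup(index, values):
    """Return 0-based positions in values for each query in index that exists."""
    # like np.argsort(values): positions of values in ascending order
    order = sorted(range(len(values)), key=values.__getitem__)
    sorted_values = [values[j] for j in order]
    result = []
    for x in index:
        # like np.searchsorted(values, x, sorter=order)
        q = bisect.bisect_left(sorted_values, x)
        if q < len(order) and sorted_values[q] == x:
            result.append(order[q])  # map back to the original position
    return result

values = [500, 300, 400, 200, 100]
print(argsort_lookup([200, 100, 50, 600], values))  # [3, 4]
```

The positions are 0-based, as in the numpy functions above; add 1 to get the matching element of B from the question.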

Python: fast mapping and lookup between two lists

2 answers: