How to improve performance for large lists in Python

Time: 2016-07-29 12:39:57

Tags: python performance list optimization

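I have a large sorted list, alist, with about 10 million integers. What I need is the minimum distance between some integers (taken from blist) and their neighbors in that list. I do this by finding the position of the integer I'm looking for, getting the items before and after it, and measuring the difference: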

alist=[1, 4, 30, 1000, 2000] #~10 million integers
blist=[4, 30, 1000] #~8 million integers

for b in blist:
    position=alist.index(b)
    distance=min([b-alist[position-1],alist[position+1]-b])

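This operation has to be repeated millions of times and, unfortunately, it takes ages on my machine. Is there any way to improve the performance of this code? I use Python 2.6, and Python 3 is not an option.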

2 answers:

Answer 0: (score: 4)

I suggest using binary search. It makes the code much faster, costs no extra memory, and needs only a small change: instead of alist.index(b), just use bisect_left(alist, b).
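A minimal sketch of that change, wrapped in a function so the distances are kept rather than discarded. The name min_neighbor_distances and the boundary handling (an item at either end of alist has only one neighbor) are assumptions added here, not part of the answer. It also assumes alist is sorted, has at least two elements, and contains every b.

```python
from bisect import bisect_left

def min_neighbor_distances(alist, blist):
    """Return, for each b in blist, the distance to its nearest neighbor in alist.

    Hypothetical helper: assumes alist is sorted and contains every b.
    """
    last = len(alist) - 1
    distances = []
    for b in blist:
        # Binary search: O(log n) per lookup, versus O(n) for alist.index(b)
        position = bisect_left(alist, b)
        candidates = []
        if position > 0:        # a left neighbor exists
            candidates.append(b - alist[position - 1])
        if position < last:     # a right neighbor exists
            candidates.append(alist[position + 1] - b)
        distances.append(min(candidates))
    return distances

print(min_neighbor_distances([1, 4, 30, 1000, 2000], [4, 30, 1000]))  # [3, 26, 970]
```

Nothing here goes beyond the standard library bisect module, so it should also run on Python 2.6.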

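If your blist is sorted as well, you could also use a very simple incremental search: search for the current b not from the start of alist, but from the index of the previous b.

Benchmarks with Python 2.7.11 and lists containing 10 million and 8 million ints:

389700.01 seconds Andy_original (time estimated)
377100.01 seconds Andy_no_lists (time estimated)
     6.30 seconds Stefan_binary_search
     2.15 seconds Stefan_incremental_search
     3.57 seconds Stefan_incremental_search2
     1.21 seconds Jacquot_NumPy
    (0.74 seconds Stefan_only_search_no_distance)

Andy's original would take about 4.5 days, so I only used every 100000th entry of blist and scaled the time up. Binary search is much faster, incremental search is faster still, and NumPy beats them all, though they all take only seconds.

The last entry, with 0.74 seconds, is my incremental search without the distance = min(...) line, so it is not comparable. But it shows that the search takes only 34% of the total 2.15 seconds. So there is not much more I can do, as the distance = min(...) calculation is now responsible for most of the time.

Results with Python 3.5.1 are similar:

509819.56 seconds Andy_original (time estimated)
505257.32 seconds Andy_no_lists (time estimated)
     8.35 seconds Stefan_binary_search
     4.61 seconds Stefan_incremental_search
     4.53 seconds Stefan_incremental_search2
     1.39 seconds Jacquot_NumPy
    (1.45 seconds Stefan_only_search_no_distance)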

def Andy_original(alist, blist):
    for b in blist:
        position = alist.index(b)
        distance = min([b-alist[position-1], alist[position+1]-b])

def Andy_no_lists(alist, blist):
    for b in blist:
        position = alist.index(b)
        distance = min(b-alist[position-1], alist[position+1]-b)

from bisect import bisect_left
def Stefan_binary_search(alist, blist):
    for b in blist:
        position = bisect_left(alist, b)
        distance = min(b-alist[position-1], alist[position+1]-b)

def Stefan_incremental_search(alist, blist):
    position = 0
    for b in blist:
        while alist[position] < b:
            position += 1
        distance = min(b-alist[position-1], alist[position+1]-b)

def Stefan_incremental_search2(alist, blist):
    position = 0
    for b in blist:
        position = alist.index(b, position)
        distance = min(b-alist[position-1], alist[position+1]-b)

import numpy as np
def Jacquot_NumPy(alist, blist):

    a_array = np.asarray(alist)
    b_array = np.asarray(blist)

    a_index = np.searchsorted(a_array, b_array) # gives the indexes of the elements of b_array in a_array

    a_array_left = a_array[a_index - 1]
    a_array_right = a_array[a_index + 1]

    distance_left = np.abs(b_array - a_array_left)
    distance_right = np.abs(a_array_right - b_array)

    min_distance = np.min([distance_left, distance_right], axis=0)

def Stefan_only_search_no_distance(alist, blist):
    position = 0
    for b in blist:
        while alist[position] < b:
            position += 1

from time import time
alist = list(range(10000000))
blist = [i for i in alist[1:-1] if i % 5]
blist_small = blist[::100000]

for func in Andy_original, Andy_no_lists:
    t0 = time()
    func(alist, blist_small)
    t = time() - t0
    print('%9.2f seconds %s (time estimated)' % (t * 100000, func.__name__))

for func in Stefan_binary_search, Stefan_incremental_search, Stefan_incremental_search2, Jacquot_NumPy, Stefan_only_search_no_distance:
    t0 = time()
    func(alist, blist)
    t = time() - t0
    print('%9.2f seconds %s' % (t, func.__name__))

The code above is the full code, with all versions and the tests.
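The benchmarked functions throw their distances away, so the timings alone do not show that the variants agree. The following is a small sketch of a consistency check, assuming (as in the benchmark) that blist is sorted, is a subset of alist, and excludes the first and last items of alist. The function names distances_binary and distances_incremental and the test data are hypothetical.

```python
from bisect import bisect_left

def distances_binary(alist, blist):
    """Binary-search variant that collects the distances."""
    result = []
    for b in blist:
        position = bisect_left(alist, b)
        result.append(min(b - alist[position - 1], alist[position + 1] - b))
    return result

def distances_incremental(alist, blist):
    """Incremental variant: resume scanning from the previous position."""
    result = []
    position = 0
    for b in blist:
        while alist[position] < b:
            position += 1
        result.append(min(b - alist[position - 1], alist[position + 1] - b))
    return result

# Illustrative data: squares, so that neighbor gaps differ from item to item
alist = [i * i for i in range(100)]
blist = [i for i in alist[1:-1] if i % 5]

assert distances_binary(alist, blist) == distances_incremental(alist, blist)
print(distances_binary(alist, blist)[:3])  # [1, 3, 5]
```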

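Answer 1: (score: 1)

I am a big fan of the NumPy module for this kind of computation.

In your case, that would be (this is a long answer that could be factorized to be more efficient):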

import numpy as np

alist = [1, 4, 30, 1000, 2000]
blist = [4, 30, 1000]

a_array = np.asarray(alist)
b_array = np.asarray(blist)

a_index = np.searchsorted(a_array, b_array) # gives the indexes of the elements of b_array in a_array

a_array_left = a_array[a_index - 1]
a_array_right = a_array[a_index + 1]

distance_left = np.abs(b_array - a_array_left)
distance_right = np.abs(a_array_right - b_array)

min_distance = np.min([distance_left, distance_right], axis=0)

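It won't work if the first element of blist is the first element of alist (a_index - 1 then becomes -1 and wraps around to the last item), and likewise if the last element of blist is the last element of alist (a_index + 1 then falls outside the array). I guess:

alist = [blist[0] - 1] + alist + [blist[-1] + 1]  # the original snippet used an undefined name b; blist is presumably what was meant

is a dirty workaround.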
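As an alternative to padding alist, the boundary cases can be handled inside the vectorized code. This is only a sketch under assumptions: the function name min_neighbor_distances_np is hypothetical, both lists are assumed to hold integers, alist is assumed sorted with at least two elements, and every item of blist is assumed to occur in alist. It clips the neighbor indexes into range and then masks out the side that does not exist.

```python
import numpy as np

def min_neighbor_distances_np(alist, blist):
    """Vectorized nearest-neighbor distance that tolerates items at either end."""
    a = np.asarray(alist)
    b = np.asarray(blist)
    idx = np.searchsorted(a, b)   # index of each b in a (assumes b occurs in a)
    last = len(a) - 1

    # Clip neighbor indexes so they never leave the array
    left = a[np.maximum(idx - 1, 0)]
    right = a[np.minimum(idx + 1, last)]

    # Where a neighbor does not exist, substitute a huge distance so it never wins
    big = np.iinfo(a.dtype).max
    dist_left = np.where(idx > 0, b - left, big)
    dist_right = np.where(idx < last, right - b, big)
    return np.minimum(dist_left, dist_right)

print(min_neighbor_distances_np([1, 4, 30, 1000, 2000], [1, 30, 2000]).tolist())
# [3, 26, 1000]
```

Unlike the padding workaround, this leaves alist untouched, and an end item is measured only against its one real neighbor.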

Benchmark
("Still running.." below may just be my computer's fault.)

alist = sorted(list(np.random.randint(0, 10000, 10000000)))
blist = sorted(list(alist[1000000:9000001]))
a_array = np.asarray(alist)
b_array = np.asarray(blist)

Vectorized solution

%%timeit
a_index = np.searchsorted(a_array, b_array)

a_array_left = a_array[a_index - 1]
a_array_right = a_array[a_index + 1]

min_distance = np.min([b_array - a_array_left, a_array_right - b_array], axis=0)
1 loop, best of 3: 591 ms per loop

Binary search solution

%%timeit
for b in blist:
    position = bisect.bisect_left(alist, b)
    distance = min([b-alist[position-1],alist[position+1]-b])
Still running..

OP's solution

%%timeit
for b in blist:
    position=alist.index(b)
    distance=min([b-alist[position-1],alist[position+1]-b])
Still running..

Smaller input

alist = sorted(list(np.random.randint(0, 10000, 1000000)))
blist = sorted(list(alist[100000:900001]))
a_array = np.asarray(alist)
b_array = np.asarray(blist)

Vectorized solution

%%timeit
a_index = np.searchsorted(a_array, b_array)

a_array_left = a_array[a_index - 1]
a_array_right = a_array[a_index + 1]

min_distance = np.min([b_array - a_array_left, a_array_right - b_array], axis=0)
10 loops, best of 3: 53.2 ms per loop

Binary search solution

%%timeit
for b in blist:
    position = bisect.bisect_left(alist, b)
    distance = min([b-alist[position-1],alist[position+1]-b])
1 loop, best of 3: 1.57 s per loop

OP's solution

%%timeit
for b in blist:
    position=alist.index(b)
    distance=min([b-alist[position-1],alist[position+1]-b])
Still running..
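The %%timeit cells above need IPython. For a plain Python session, roughly the same measurement can be sketched with the standard timeit module. The wrapper name binary_search_loop and the much smaller input are assumptions chosen so the example finishes quickly. They are not the data used in the benchmark above.

```python
import timeit
from bisect import bisect_left

def binary_search_loop(alist, blist):
    """Same loop body as the binary search cell, wrapped so timeit can call it."""
    for b in blist:
        position = bisect_left(alist, b)
        distance = min(b - alist[position - 1], alist[position + 1] - b)

# Hypothetical small input; blist skips both ends so each item has two neighbors
alist = list(range(100000))
blist = alist[1:-1]

# repeat=3, number=1 mirrors the "1 loop, best of 3" format of %%timeit
best = min(timeit.repeat(lambda: binary_search_loop(alist, blist), repeat=3, number=1))
print('%.3f s per loop (best of 3)' % best)
```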