Why is filtering a DataFrame by boolean mask so much faster than apply()?

Time: 2018-01-18 10:18:11

Tags: python pandas dataframe

I want to compare the performance of two different approaches to filtering pandas DataFrames. So I created a test set with n points in a plane, and I filter out all points that are not in the unit square. I am surprised that one approach is so much faster than the other. The larger n becomes, the bigger the difference. What would be the explanation for that?

This is my script:

    import numpy as np
    import time
    import pandas as pd

    # Test set with points
    n = 100000
    test_x_points = np.random.uniform(-10, 10, size=n)
    test_y_points = np.random.uniform(-10, 10, size=n)
    # list() keeps this working on Python 3, where zip() returns an iterator
    test_points = list(zip(test_x_points, test_y_points))
    df = pd.DataFrame(test_points, columns=['x', 'y'])

    # Method a: boolean mask
    start_time = time.time()
    result_a = df[(df['x'] < 1) & (df['x'] > -1) & (df['y'] < 1) & (df['y'] > -1)]
    end_time = time.time()
    elapsed_time_a = 1000 * abs(end_time - start_time)

    # Method b: row-wise apply
    start_time = time.time()
    result_b = df[df.apply(lambda row: -1 < row['x'] < 1 and -1 < row['y'] < 1, axis=1)]
    end_time = time.time()
    elapsed_time_b = 1000 * abs(end_time - start_time)

    # Print results
    print('For {0} points.'.format(n))
    print('Method a took {0} ms and leaves us with {1} elements.'.format(elapsed_time_a, len(result_a)))
    print('Method b took {0} ms and leaves us with {1} elements.'.format(elapsed_time_b, len(result_b)))
    print('Method a is {0} X faster than method b.'.format(elapsed_time_b / elapsed_time_a))

When I compare it with a native Python list comprehension approach, method a is still much faster.

Results for different values of n:

For 10 points.
Method a took 1.52087211609 ms and leaves us with 0 elements.
Method b took 0.456809997559 ms and leaves us with 0 elements.
Method a is 0.300360558081 X faster than method b.

For 100 points.
Method a took 1.55997276306 ms and leaves us with 1 elements.
Method b took 1.384973526 ms and leaves us with 1 elements.
Method a is 0.887819043252 X faster than method b.

For 1000 points.
Method a took 1.61004066467 ms and leaves us with 5 elements.
Method b took 10.448217392 ms and leaves us with 5 elements.
Method a is 6.48941211313 X faster than method b.

For 10000 points.
Method a took 1.59096717834 ms and leaves us with 115 elements.
Method b took 98.8278388977 ms and leaves us with 115 elements.
Method a is 62.1180878166 X faster than method b.

For 100000 points.
Method a took 2.14099884033 ms and leaves us with 1052 elements.
Method b took 995.483875275 ms and leaves us with 1052 elements.
Method a is 464.962360802 X faster than method b.

For 1000000 points.
Method a took 7.07101821899 ms and leaves us with 10045 elements.
Method b took 9613.26599121 ms and leaves us with 10045 elements.
Method a is 1359.5306494 X faster than method b.

Why is that?

1 Answer:

Answer 0 (score: 1)

If you follow the Pandas source code for apply, you will see that in the general case it ends up running a Python for __ in __ loop.
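To make the cost of that loop concrete, here is a rough sketch of what a row-wise apply amounts to. It is not pandas' actual implementation: `apply_rows_sketch` and `in_unit_square` are hypothetical names introduced for illustration, and the three-row frame is made-up data.

```python
import pandas as pd

# Small made-up frame for illustration only.
df = pd.DataFrame({"x": [0.5, 2.0, -0.3], "y": [0.1, 0.2, 5.0]})


def in_unit_square(row):
    """Same predicate as the lambda in method b of the question."""
    return -1 < row["x"] < 1 and -1 < row["y"] < 1


def apply_rows_sketch(frame, func):
    """Hypothetical stand-in for apply(axis=1): one Python-level call per row."""
    results = [bool(func(row)) for _, row in frame.iterrows()]
    return pd.Series(results, index=frame.index)


loop_mask = apply_rows_sketch(df, in_unit_square)
apply_mask = df.apply(in_unit_square, axis=1)

print(loop_mask.tolist())
print(apply_mask.tolist())
```

Each row is handed to the function as a Series object, so the interpreter does per-row work (object creation, label lookups, a function call) that grows linearly with n.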

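However, pandas DataFrames are made up of pandas Series, which in turn are backed by numpy arrays. Filtering with a mask uses the fast, vectorized methods that numpy arrays allow. For why this is faster than a plain Python loop (such as the one behind .apply), see Why are NumPy arrays so fast?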
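As a minimal sketch of the comparison, the snippet below checks that both approaches select the same rows and times them with `timeit`. The helper names `filter_mask` and `filter_apply`, the seed, and the value of n are assumptions chosen for illustration; actual timings depend on the machine and the pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Seeded generator so the example is reproducible; n is an arbitrary choice.
rng = np.random.RandomState(0)
n = 2000
df = pd.DataFrame({
    "x": rng.uniform(-10, 10, size=n),
    "y": rng.uniform(-10, 10, size=n),
})


def filter_mask(frame):
    """Method a: each comparison runs over a whole column at once."""
    mask = (frame["x"] > -1) & (frame["x"] < 1) & (frame["y"] > -1) & (frame["y"] < 1)
    return frame[mask]


def filter_apply(frame):
    """Method b: the predicate is called once per row from Python."""
    return frame[frame.apply(lambda row: -1 < row["x"] < 1 and -1 < row["y"] < 1, axis=1)]


# Both methods should keep exactly the same rows.
same = filter_mask(df).equals(filter_apply(df))

t_mask = timeit.timeit(lambda: filter_mask(df), number=5)
t_apply = timeit.timeit(lambda: filter_apply(df), number=5)

print("same rows selected:", same)
print("mask:  {:.4f} s for 5 runs".format(t_mask))
print("apply: {:.4f} s for 5 runs".format(t_apply))
```

Using `timeit` with several repetitions tends to give steadier numbers than a single `time.time()` difference, especially for small n where fixed overhead dominates.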

The top answer there says:

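"Numpy arrays are densely packed arrays of homogeneous type. Python lists, by contrast, are arrays of pointers to objects, even when all of them are of the same type. So, you get the benefits of locality of reference.

Also, many Numpy operations are implemented in C, avoiding the general cost of loops in Python, pointer indirection and per-element dynamic type checking. The speed boost depends on which operations you're performing, but a few orders of magnitude isn't uncommon in number crunching programs."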
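The points in that quote can be seen directly on a small example. This is only an illustrative sketch with made-up values: it shows that an array has a single dtype and a fixed element size, whereas a list can hold arbitrary objects, and that a vectorized comparison gives the same answer as an explicit Python loop.

```python
import numpy as np

values = [1.5, 2.5, 3.5]
arr = np.array(values)

# One dtype for the whole array, and a fixed number of bytes per element.
print(arr.dtype, arr.itemsize, arr.nbytes)

# A list stores references to objects, which may be of any type.
mixed = [1.5, "two", None]
print(sorted(type(v).__name__ for v in mixed))

# The comparison on the array runs in compiled code over the whole buffer;
# the list comprehension checks and compares each element in the interpreter.
mask_vectorized = arr > 2.0
mask_loop = [v > 2.0 for v in values]
print(mask_vectorized.tolist(), mask_loop)
```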