How can I speed up this pandas binary-search function?

Asked: 2017-11-01 18:34:51

Tags: python performance pandas dataframe

I am currently porting a Java/Hibernate system to Python, using pandas DataFrames to hold the data it operates on in memory. My code currently runs too slowly. Profiling shows that this function is a bottleneck:

def find_names_in_explode(id, normalized_name, data):
    exploded_names = utils.explode_to_matchable_names(normalized_name)
    # This is a trick to use binary search on a dataframe.
    # I found out about it from https://www.youtube.com/watch?v=R2LiVJLGAHE.
    # Make sure the full data frame is sorted before passing in.
    start_and_end_strings = [(name[0:-1] + chr(ord(name[-1]) - 1),
                              name[0:-1] + chr(ord(name[-1]) + 1))
                                 for name in exploded_names if name]
    all_chunks = []
    for start_and_end in start_and_end_strings:
        start_index, end_index = data['name'].searchsorted(start_and_end)
        if start_index > 0 or end_index < data['name'].size:
            # searchsorted will return the whole data frame if it doesn't
            # find any matches; we're assuming that's never what we want.
            all_chunks.append(data.iloc[start_index : end_index])
    all_rows = pd.concat(all_chunks)
    return all_rows[(all_rows['id'] != id) & (~all_rows['id'].duplicated())]

This function is run against every row of the main DataFrame. The rows of the main DataFrame all have a name and an id (along with other columns that aren't relevant here). For each row, this function generates a set of related names (for example, if the input name is "John R. Smith, MD", the set will contain "John Smith", "John R. Smith", "John Smith MD", and so on), finds all rows in the DataFrame whose name matches one of the names in the generated set, compiles those results into a new DataFrame, and hands it off for further processing. An earlier, simpler version using isin was too slow, so I tried the trick from the video linked in the code comment to do a binary search instead of a linear one. That made it faster, but it still isn't fast enough.
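For context, here is a minimal illustration of the bracketing trick on a toy sorted DataFrame (the names and ids below are made up for the example):

```python
import pandas as pd

# Toy DataFrame, pre-sorted by 'name' (a requirement of the trick).
data = pd.DataFrame({'name': ['Ann', 'Bob', 'Bob', 'Cara', 'Dan'],
                     'id': [1, 2, 3, 4, 5]})

name = 'Bob'
# Bracket the target with strings just below and just above it
# ('Boa' and 'Boc'), then binary-search for both bounds at once.
# Note that this slice can also pick up near neighbors such as 'Bobb'.
start, end = (name[:-1] + chr(ord(name[-1]) - 1),
              name[:-1] + chr(ord(name[-1]) + 1))
start_index, end_index = data['name'].searchsorted([start, end])
chunk = data.iloc[start_index:end_index]
print(chunk['name'].tolist())  # ['Bob', 'Bob']
```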

This function is the inner operation of a quadratic loop over the rows of the main DataFrame, and I can't figure out how to vectorize it. I profiled it on a DataFrame with one million rows, but the eventual goal is to run the system on a DataFrame with roughly 200 million rows (about 60 GB of data), so it needs to be very fast. Here is part of the profiling output from the one-million-row test:

         1877294411 function calls (1842146549 primitive calls) in 7907.774 seconds

   Ordered by: cumulative time
   List reduced from 3253 to 30 due to restriction <30>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        3    0.352    0.117 7909.577 2636.526 main.py:1(<module>)
  50516/1   23.000    0.000 7909.574 7909.574 {built-in method builtins.exec}
        1    8.653    8.653 7908.866 7908.866 main.py:40(main)
   660206    9.805    0.000 7761.765    0.012 name_matchers.py:100(match)
   651044   18.768    0.000 7549.588    0.012 name_matchers.py:80(find_names_in_explode)
1249038/998908    3.316    0.000 3754.752    0.004 indexing.py:1317(__getitem__)
  2050871    8.807    0.000 3730.645    0.002 internals.py:2779(__init__)
  1210664    2.580    0.000 3722.063    0.003 indexing.py:1720(_getitem_axis)
   960534    1.908    0.000 3699.325    0.004 indexing.py:1689(_get_slice_axis)
   710404    0.744    0.000 3688.891    0.005 indexing.py:141(_slice)
   710404    3.949    0.000 3688.148    0.005 generic.py:1742(_slice)
  2050871   34.184    0.000 3679.895    0.002 internals.py:2876(_rebuild_blknos_and_blklocs)
   710404    5.405    0.000 3674.365    0.005 internals.py:3384(get_slice)
  4147176 3598.437    0.001 3598.437    0.001 {method 'fill' of 'numpy.ndarray' objects}
   689418    7.679    0.000 2698.926    0.004 ops.py:809(wrapper)
   689418 2557.017    0.004 2598.529    0.004 ops.py:755(na_op)
  7787062   45.116    0.000  366.018    0.000 series.py:139(__init__)
   651044    2.530    0.000  338.349    0.001 concat.py:21(concat)
  2740288   11.580    0.000  327.501    0.000 frame.py:1940(__getitem__)
   651046    3.578    0.000  255.852    0.000 frame.py:1983(_getitem_array)
   689418    6.704    0.000  247.067    0.000 ops.py:909(wrapper)
   689420    6.476    0.000  241.191    0.000 generic.py:1909(take)
   651044    5.944    0.000  221.647    0.000 concat.py:356(get_result)
   651044    2.388    0.000  205.428    0.000 internals.py:4814(concatenate_block_managers)
   689421    7.032    0.000  202.054    0.000 internals.py:3990(take)
  2089239    3.209    0.000  199.700    0.000 _decorators.py:65(wrapper)
  1378836    4.680    0.000  198.079    0.000 ops.py:913(<lambda>)
   651044    1.358    0.000  170.559    0.000 name_matchers.py:59(name_match)
   689421    3.174    0.000  156.992    0.000 internals.py:3860(reindex_indexer)
  4089923   18.004    0.000  134.581    0.000 series.py:2894(_sanitize_array)

The function match calls find_names_in_explode, and as you can see, most of the cumulative run time is spent there. Is there a way to make better use of pandas or NumPy to speed this up?
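One direction I've been considering, based on the profile (most time is under `iloc` slicing, `internals`, and `pd.concat`), is to do the same search on raw NumPy arrays so that no intermediate DataFrames are built per chunk. A rough sketch of what that might look like, with a trivial stand-in for `utils.explode_to_matchable_names` (the real helper is more involved):

```python
import numpy as np
import pandas as pd

def explode_to_matchable_names(name):
    # Hypothetical stand-in for the real utils helper.
    return [name]

def find_matches_numpy(row_id, normalized_name, names, ids):
    """Same search as find_names_in_explode, but on raw NumPy arrays,
    returning positional indices instead of a sliced DataFrame."""
    hits = []
    for name in explode_to_matchable_names(normalized_name):
        if not name:
            continue
        lo = name[:-1] + chr(ord(name[-1]) - 1)
        hi = name[:-1] + chr(ord(name[-1]) + 1)
        start, end = np.searchsorted(names, [lo, hi])
        if start > 0 or end < names.size:
            hits.append(np.arange(start, end))
    if not hits:
        return np.empty(0, dtype=int)
    idx = np.concatenate(hits)
    # Drop the row's own id and keep the first occurrence of each id,
    # mirroring the original duplicated() filter.
    idx = idx[ids[idx] != row_id]
    _, first = np.unique(ids[idx], return_index=True)
    return idx[np.sort(first)]

data = pd.DataFrame({'name': ['Ann', 'Bob', 'Bob', 'Cara'],
                     'id': [1, 2, 3, 4]}).sort_values('name')
names = data['name'].to_numpy()
ids = data['id'].to_numpy()
print(find_matches_numpy(2, 'Bob', names, ids))
```

The idea is to collect integer indices per row and only materialize a DataFrame once at the end (if at all), but I don't know whether that removes enough overhead at 200 million rows.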

0 Answers:

No answers yet