How can I improve performance when iterating over a pandas DataFrame?

Asked: 2018-02-02 09:49:00

Tags: python performance pandas

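I have two pandas DataFrames. The first contains a list of unigrams extracted from a text, along with each unigram's count and its probability of occurring in the text. It is structured like this: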

unigram_df
    word            count       prob
0   we              109         0.003615
1   investigated    20          0.000663
2   the             1125        0.037315
3   potential       36          0.001194
4   of              1122        0.037215

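The second contains the list of skipgrams extracted from the same text, along with each skipgram's count and its probability of occurring in the text. It looks like this: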

skipgram_df
    word                      count         prob
0   (we, investigated)        5             0.000055
1   (we, the)                 31            0.000343
2   (we, potential)           2             0.000022
3   (investigated, the)       11            0.000122
4   (investigated, potential) 3             0.000033

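Now I want to compute the pointwise mutual information (PMI) of each skipgram, which is basically the logarithm of the skipgram's probability divided by the product of its unigrams' probabilities. I wrote a function for this that iterates over the skipgram DataFrame, and it works exactly the way I want, but performance is a big problem. Is there a way to improve my code so that it computes the PMI faster?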
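Written out, the quantity described above is the following, assuming the base-10 logarithm (which is what `math.log10` in the question's code uses):

```latex
\mathrm{PMI}(x, y) = \log_{10} \frac{P(x, y)}{P(x)\,P(y)}
```

As a worked example with the sample rows, for the skipgram (we, investigated): P(x, y) = 0.000055, P(we) = 0.003615 and P(investigated) = 0.000663, so the ratio is roughly 22.95 and the PMI is roughly 1.36.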

Here is my code:

import math

def calculate_pmi(row):
    # row comes from itertuples(): (index, word, count, prob)
    skipgram_prob = float(row[3])
    # each lookup below scans the whole unigram_df with a boolean mask
    x_unigram_prob = float(
        unigram_df.loc[unigram_df['word'] == row[1][0], 'prob'].iloc[0])
    y_unigram_prob = float(
        unigram_df.loc[unigram_df['word'] == row[1][1], 'prob'].iloc[0])
    pmi = math.log10(skipgram_prob / (x_unigram_prob * y_unigram_prob))
    result = str(row[1][0]) + ' ' + str(row[1][1]) + ' ' + str(pmi)
    return result

pmi_list = list(map(calculate_pmi, skipgram_df.itertuples()))

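The function currently runs at about 483.18 it/s, which is very slow given that I have hundreds of thousands of skipgrams to iterate over. Any suggestions are welcome. Thanks.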

1 Answer:

Answer 0 (score: 1)

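This is a good question, and a good exercise for anyone new to pandas. Use df.iterrows only as a last resort, and even then consider the alternatives first. Cases where it is the right choice are relatively rare.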
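One alternative that keeps a plain Python loop is to stop filtering the unigram DataFrame inside it. Each call in the question scans all of unigram_df twice with a boolean mask, which is likely where most of the time goes. The sketch below builds a word-to-probability dict once and looks words up in it; `unigram_prob` and `calculate_pmi_fast` are names made up for this illustration, and the data is the sample from the question.

```python
import math

import pandas as pd

# Sample data copied from the question.
unigram_df = pd.DataFrame(
    [['we', 109, 0.003615], ['investigated', 20, 0.000663],
     ['the', 1125, 0.037315], ['potential', 36, 0.001194],
     ['of', 1122, 0.037215]],
    columns=['word', 'count', 'prob'])

skipgram_df = pd.DataFrame(
    [[('we', 'investigated'), 5, 0.000055],
     [('we', 'the'), 31, 0.000343],
     [('we', 'potential'), 2, 0.000022],
     [('investigated', 'the'), 11, 0.000122],
     [('investigated', 'potential'), 3, 0.000033]],
    columns=['word', 'count', 'prob'])

# Build the word -> probability lookup once (hypothetical helper name).
unigram_prob = dict(zip(unigram_df['word'], unigram_df['prob']))


def calculate_pmi_fast(pair, skipgram_prob):
    """Return 'x y pmi' for one skipgram using constant-time dict lookups."""
    x, y = pair
    pmi = math.log10(skipgram_prob / (unigram_prob[x] * unigram_prob[y]))
    return f'{x} {y} {pmi}'


# Iterate over the two columns directly instead of over row tuples.
pmi_list = [calculate_pmi_fast(pair, prob)
            for pair, prob in zip(skipgram_df['word'], skipgram_df['prob'])]
```

A word that is missing from the unigram table raises a KeyError here, which may or may not be the behaviour you want.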

Below is an example of how to vectorize the calculation.

import pandas as pd
import numpy as np

uni = pd.DataFrame([['we', 109, 0.003615], ['investigated', 20, 0.000663],
                    ['the', 1125, 0.037315], ['potential', 36, 0.001194],
                    ['of', 1122, 0.037215]], columns=['word', 'count', 'prob'])

skip = pd.DataFrame([[('we', 'investigated'), 5, 0.000055],
                     [('we', 'the'), 31, 0.000343],
                     [('we', 'potential'), 2, 0.000022],
                     [('investigated', 'the'), 11, 0.000122],
                     [('investigated', 'potential'), 3, 0.000033]],
                    columns=['word', 'count', 'prob'])

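# first split the column of tuples in skip into two columns
# (building a DataFrame from a list avoids a slow row-wise apply)
skip[['word1', 'word2']] = pd.DataFrame(skip['word'].tolist(), index=skip.index)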

# set index of uni to 'word'
uni = uni.set_index('word')

# merge prob1 & prob2 from uni to skip
skip['prob1'] = skip['word1'].map(uni['prob'].get)
skip['prob2'] = skip['word2'].map(uni['prob'].get)

# perform calculation (base-10 log, matching math.log10 in the question)
# and filter columns
skip['result'] = np.log10(skip['prob'] / (skip['prob1'] * skip['prob2']))
skip = skip[['word', 'count', 'prob', 'result']]
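For reference, the same idea can be condensed into a self-contained sketch that also produces the "x y pmi" strings the function in the question returns. It uses `np.log10` so that the numbers match `math.log10` from the question; `prob_by_word` and `words` are names chosen for this illustration, not part of the answer above.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
uni = pd.DataFrame([['we', 109, 0.003615], ['investigated', 20, 0.000663],
                    ['the', 1125, 0.037315], ['potential', 36, 0.001194],
                    ['of', 1122, 0.037215]], columns=['word', 'count', 'prob'])

skip = pd.DataFrame([[('we', 'investigated'), 5, 0.000055],
                     [('we', 'the'), 31, 0.000343],
                     [('we', 'potential'), 2, 0.000022],
                     [('investigated', 'the'), 11, 0.000122],
                     [('investigated', 'potential'), 3, 0.000033]],
                    columns=['word', 'count', 'prob'])

# Series mapping each unigram to its probability.
prob_by_word = uni.set_index('word')['prob']

# Split the tuples into two columns without a row-wise apply.
words = pd.DataFrame(skip['word'].tolist(),
                     columns=['word1', 'word2'], index=skip.index)

# Look up both unigram probabilities and compute the PMI column-wise.
pmi = np.log10(skip['prob'] / (words['word1'].map(prob_by_word)
                               * words['word2'].map(prob_by_word)))

# Same "x y pmi" strings as the original function.
pmi_list = (words['word1'] + ' ' + words['word2'] + ' '
            + pmi.astype(str)).tolist()
```

With `map`, a word that does not appear in the unigram table yields NaN rather than an error, so it may be worth checking `pmi.isna()` on real data.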