Pandas: generating a score based on multiple columns

Time: 2019-05-03 11:01:53

Tags: python pandas

import random, string, time
import pandas as pd

random.seed(1)
toy_set = pd.DataFrame({'group': [str(i)+'_'+str(j) for i in range(40000) for j in range(25)],
                        'feature1': random.choices(string.ascii_letters, k = 1000000),
                        'feature2': random.choices(string.ascii_letters, k = 1000000),
                        'feature3': random.choices(range(10), k=1000000)
                        })

#create hypothetical scoring dict
eventScores = {}
for k in toy_set.groupby(['feature1', 'feature2','feature3']).groups.keys():
    if k[0] not in eventScores:
        eventScores[k[0]] = {}
    if k[1] not in eventScores[k[0]]:
        eventScores[k[0]][k[1]] = {}
    eventScores[k[0]][k[1]][k[2]] = random.randint(1,10)   

def calc_x(subset):
    # Row-wise lookup in the nested dict: feature1 -> feature2 -> feature3
    return subset.apply(lambda x: eventScores[x['feature1']][x['feature2']][x['feature3']],
                        axis=1)

t = time.time()
toy_set['x'] = calc_x(toy_set) 
print(round(time.time() - t))

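I have a df with 3 features, and I generate a score for each row based on them (here the score for each combination is assigned at random, purely for the sake of the example).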

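Is there a faster way to generate x than doing a nested dict lookup per row? (This toy set currently takes about 30 seconds on my Windows 10 i7 machine, and the real data set is 15 times larger.)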

1 Answer:

Answer 0: (score: 0)

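Try using a dict comprehension to flatten eventScores into a single-level dict keyed by the joined feature values, then use Series.map on the concatenated feature columns: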

d_map = {f"{k1}_{k2}_{k3}":v3 for k1, v1 in eventScores.items() for k2, v2 in v1.items() for k3, v3 in v2.items()}

toy_set['x'] = (toy_set['feature1'].astype(str) + '_' + 
                toy_set['feature2'].astype(str) + '_' + 
                toy_set['feature3'].astype(str)).map(d_map)
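
To see what the flattened dict and the concatenated key look like, here is a minimal sketch on a tiny, made-up frame. The names event_scores and df and their values are hypothetical stand-ins for eventScores and toy_set; only the flatten-then-map pattern comes from the answer.

```python
import pandas as pd

# Hypothetical miniature scoring dict, nested as feature1 -> feature2 -> feature3.
event_scores = {'a': {'b': {1: 5, 2: 7}}, 'c': {'d': {3: 2}}}

# Hypothetical three-row frame with the same column names as toy_set.
df = pd.DataFrame({'feature1': ['a', 'c', 'a'],
                   'feature2': ['b', 'd', 'b'],
                   'feature3': [2, 3, 1]})

# Flatten the nested dict into one level keyed by "f1_f2_f3".
d_map = {f"{k1}_{k2}_{k3}": score
         for k1, level2 in event_scores.items()
         for k2, level3 in level2.items()
         for k3, score in level3.items()}

# Build the same string key per row, then look every row up in one vectorized map.
key = (df['feature1'].astype(str) + '_' +
       df['feature2'].astype(str) + '_' +
       df['feature3'].astype(str))
df['x'] = key.map(d_map)

print(d_map)
print(df)
```

One caveat with string keys: if a feature value could itself contain the separator, two different combinations might produce the same key. In the question's data the features are single ASCII letters and single digits, so '_' is safe there.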

Timings

# This method
898 ms ± 9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# Original method
25.3 s ± 497 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
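
A possible variation, not part of the answer above and not benchmarked here: key the flattened dict by tuples instead of joined strings, which avoids building a string per row and sidesteps any separator collision. This is a sketch under the same hypothetical miniature data (event_scores, df), so whether it beats Series.map on the full data set would need to be measured.

```python
import pandas as pd

# Hypothetical miniature scoring dict, nested as feature1 -> feature2 -> feature3.
event_scores = {'a': {'b': {1: 5, 2: 7}}, 'c': {'d': {3: 2}}}

# Hypothetical three-row frame with the same column names as toy_set.
df = pd.DataFrame({'feature1': ['a', 'c', 'a'],
                   'feature2': ['b', 'd', 'b'],
                   'feature3': [2, 3, 1]})

# Flatten the nested dict into one level keyed by (feature1, feature2, feature3).
flat_scores = {(k1, k2, k3): score
               for k1, level2 in event_scores.items()
               for k2, level3 in level2.items()
               for k3, score in level3.items()}

# Walk the three columns in lockstep and do one flat dict lookup per row.
df['x'] = [flat_scores[key]
           for key in zip(df['feature1'], df['feature2'], df['feature3'])]

print(df)
```

Unlike Series.map, which yields NaN for a missing key, the list comprehension raises KeyError if a row's combination is absent from the dict; flat_scores.get(key) could be used instead if missing combinations are expected.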