Return the differences between two strings

Time: 2018-03-08 08:48:20

Tags: python string pandas

I am working with a dataset of roughly 400k rows of preprocessed strings.

[In]:
raw                                preprocessed
helpersstreet 46, second floor     helpersstreet 46
489 john doe route                 john doe route
at main street 49                  main street

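Every string in the "preprocessed" column is either identical to or shorter than the corresponding string in the "raw" column. Is there a fast way to compare these strings and return all the differences, putting them in a single column: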

[Out]:
raw                                preprocessed        difference
helpersstreet 46, second floor     helpersstreet 46    ,second floor
489 john doe route                 john doe route      489
at main street 49                  main street         at 49

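I am not sure how to do this, and I would also like to know whether it is feasible at all. I have access to the functions that perform the preprocessing, so I could either modify them to return these values directly, or find a scalable way to compute the differences afterwards. I would prefer the latter.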

1 answer:

Answer 0: (score: 4)

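Option 1
This looks like an in-order, row-by-row replacement. A list comprehension is probably the best way to do it:

df['difference'] = [i.replace(j, '') for i, j in zip(df.raw, df.preprocessed)]

Given the limitations of this problem (replace operations are difficult to vectorize), I think this is your fastest option.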
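For anyone who wants to run Option 1 outside the original session, here is a minimal self-contained sketch. The sample rows are copied from the question. The `difference_clean` column is an assumption added for illustration (it is not part of the answer); it collapses the leftover whitespace so the result looks like the output requested in the question.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "raw": [
        "helpersstreet 46, second floor",
        "489 john doe route",
        "at main street 49",
    ],
    "preprocessed": [
        "helpersstreet 46",
        "john doe route",
        "main street",
    ],
})

# Option 1: remove each preprocessed string from its raw counterpart, row by row.
df["difference"] = [
    raw.replace(pre, "") for raw, pre in zip(df["raw"], df["preprocessed"])
]

# Hypothetical clean-up step: collapse runs of whitespace and trim the ends.
df["difference_clean"] = [" ".join(d.split()) for d in df["difference"]]

print(df[["difference", "difference_clean"]])
```

Keep in mind that `str.replace` only removes the preprocessed text when it appears in the raw string as one contiguous substring, which is the case for all three sample rows. If it does not, the raw string comes back unchanged.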

Option 2
Alternatively, use np.vectorize with a lambda:

f = np.vectorize(lambda i, j: i.replace(j, ''))
df['difference'] = f(df.raw, df.preprocessed)

df
                              raw      preprocessed      difference
0  helpersstreet 46, second floor  helpersstreet 46  , second floor
1              489 john doe route    john doe route            489
2               at main street 49       main street          at  49

Note that this only hides the loop; it is about as fast or slow as Option 1, if not worse.
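To see that np.vectorize still runs a Python-level loop, the sketch below counts how often the wrapped function is called. The names `remove_substring` and `calls` are made up for this illustration, and `otypes` is passed explicitly (unlike in the answer) so that np.vectorize does not make one extra probing call on the first element to infer the output type.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "raw": [
        "helpersstreet 46, second floor",
        "489 john doe route",
        "at main street 49",
    ],
    "preprocessed": [
        "helpersstreet 46",
        "john doe route",
        "main street",
    ],
})

calls = {"n": 0}  # counts how often the Python function actually runs


def remove_substring(raw, pre):
    calls["n"] += 1
    return raw.replace(pre, "")


# With otypes given, the function is called exactly once per row.
f = np.vectorize(remove_substring, otypes=[object])
difference = f(df["raw"], df["preprocessed"])

print(calls["n"], "calls for", len(df), "rows")
print(difference.tolist())
```

One Python call per row is the same amount of work the list comprehension does, which fits the observation that this approach is not faster.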

Option 3
Using apply, which I do not recommend:

df['difference'] = df.apply(lambda x: x.raw.replace(x.preprocessed, ''), 1)

This one also hides the loop, but at the cost of more overhead than Option 2.

Timings
At the request of my friend, Mr. jezrael:

df = pd.concat([df] * 10000, ignore_index=True) # setup

# Option 1
%timeit df['difference'] = [i.replace(j, '') for i, j in zip(df.raw, df.preprocessed)]
186 ms ± 12.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# Option 2
%timeit df['difference'] = f(df.raw, df.preprocessed)
326 ms ± 14.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# Option 3
%timeit df['difference'] = df.apply(lambda x: x.raw.replace(x.preprocessed, ''), 1)
20.8 s ± 237 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
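`%timeit` is an IPython magic, so the timings cannot be run as-is in a plain script. A rough way to repeat the comparison with the standard `timeit` module might look like the sketch below. The function names and the factor of 1000 copies are assumptions chosen to keep the run short, so the absolute numbers will not match the figures quoted in the answer.

```python
import timeit

import numpy as np
import pandas as pd

# Sample data copied from the question.
base = pd.DataFrame({
    "raw": [
        "helpersstreet 46, second floor",
        "489 john doe route",
        "at main street 49",
    ],
    "preprocessed": [
        "helpersstreet 46",
        "john doe route",
        "main street",
    ],
})

# Repeat the rows; 1000 copies is an arbitrary size for a quick run.
df = pd.concat([base] * 1000, ignore_index=True)


def option_1():
    # List comprehension over the zipped columns.
    return [i.replace(j, "") for i, j in zip(df["raw"], df["preprocessed"])]


vectorized_replace = np.vectorize(lambda i, j: i.replace(j, ""))


def option_2():
    # np.vectorize wrapper around the same replacement.
    return vectorized_replace(df["raw"], df["preprocessed"])


def option_3():
    # Row-wise DataFrame.apply.
    return df.apply(lambda x: x["raw"].replace(x["preprocessed"], ""), axis=1)


timings = {}
for name, func in [("Option 1", option_1), ("Option 2", option_2), ("Option 3", option_3)]:
    # Best of 3 repeats, one call per repeat, in seconds.
    timings[name] = min(timeit.repeat(func, number=1, repeat=3))
    print(f"{name}: {timings[name] * 1000:.1f} ms")
```

Taking the minimum over a few repeats is a common way to reduce noise from other processes. It differs from the mean and standard deviation that `%timeit` reports.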