Question

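Using the pandas str.replace wrapper on a DataFrame column is about 5x slower than calling the built-in str.replace on each value. Does anyone know what causes this?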

import pandas as pd

df = pd.DataFrame({'word': ['bird'] * 100000})
%timeit df.word.str.replace('bird', 'theword')
%timeit df.word.map(lambda x: x.replace('bird', 'theword'))
1 loops, best of 3: 266 ms per loop
10 loops, best of 3: 55.4 ms per loop

Answer 1

The reason is that str.replace can handle NaN, whereas the custom replacement with a lambda raises an error:

In [17]: df.iloc[0,0] = np.nan

In [18]: df.word.str.replace('bird','theword').head()
Out[18]:
0        NaN
1    theword
2    theword
3    theword
4    theword
Name: word, dtype: object

In [19]: df.word.map(lambda x: x.replace('bird','theword'))

AttributeError: 'float' object has no attribute 'replace'

Internally, str.replace also uses lambda x: x.replace(pat, repl, n), unless you pass the case or flags keywords, in which case it uses a regular expression.
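If you want the speed of the plain lambda but still need missing values to survive, a minimal sketch along these lines may help. The helper name replace_skip_missing is made up for illustration and is not part of pandas; it only mimics the "skip NaN" behaviour described above, not the actual pandas implementation. The na_action='ignore' argument of Series.map is an existing pandas option that gives a similar effect.

```python
import numpy as np
import pandas as pd


def replace_skip_missing(series, old, new):
    """Hypothetical helper: literal replace on strings, non-strings pass through."""
    return series.map(lambda x: x.replace(old, new) if isinstance(x, str) else x)


# Small object-dtype Series with one missing value, like the example above.
words = pd.Series(['bird'] * 5, dtype=object)
words.iloc[0] = np.nan

# Variant 1: guard inside the lambda.
result = replace_skip_missing(words, 'bird', 'theword')

# Variant 2: let Series.map skip missing values itself.
result_ignore = words.map(lambda x: x.replace('bird', 'theword'),
                          na_action='ignore')

# Reference: the pandas string accessor with a literal (non-regex) pattern.
builtin = words.str.replace('bird', 'theword', regex=False)
```

All three keep the NaN in the first row and replace the remaining values. Whether either variant is actually faster than .str.replace on your data and pandas version is something to measure with %timeit rather than assume.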
