Question

I am creating a column that tags certain strings. Here is my code:

import pandas as pd
import numpy as np
import re

data=pd.DataFrame({'Lang':["Python", "Cython", "Scipy", "Numpy", "Pandas"], })
data['Type'] = ""


pat = [r"^P\w", r"^S\w"]  # raw strings so \w is passed to the regex engine unchanged

for i in range (len(data.Lang)):
    if re.search(pat[0],data.Lang.ix[i]):
        data.Type.ix[i] = "B"

    if re.search(pat[1],data.Lang.ix[i]):
        data.Type.ix[i]= "A"


print(data)

Is there a way to get rid of the for loop? If this were numpy, there would be a function like arange; I am looking for something similar.

Answer 1

This will be faster than the apply solution (and the looping solution).

FYI: this was run on 0.13. On 0.12 you need to create the Type column first.

In [36]: data.loc[data.Lang.str.match(pat[0]),'Type'] = 'B'

In [37]: data.loc[data.Lang.str.match(pat[1]),'Type'] = 'A'

In [38]: data
Out[38]: 
     Lang Type
0  Python    B
1  Cython  NaN
2   Scipy    A
3   Numpy  NaN
4  Pandas    B

[5 rows x 2 columns]

In [39]: data.fillna('')
Out[39]: 
     Lang Type
0  Python    B
1  Cython     
2   Scipy    A
3   Numpy     
4  Pandas    B

[5 rows x 2 columns]
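As a standalone script, the same boolean-mask assignment might look like the sketch below. It assumes a reasonably recent pandas, and it creates the Type column first (as was needed on 0.12) so that unmatched rows hold '' instead of NaN and no fillna is required.

```python
import pandas as pd

data = pd.DataFrame({"Lang": ["Python", "Cython", "Scipy", "Numpy", "Pandas"]})
pat = [r"^P\w", r"^S\w"]

# Start with an empty label so unmatched rows stay '' rather than NaN.
data["Type"] = ""

# str.match tests each string from its start and returns a boolean mask;
# .loc then assigns the label to every row where the mask is True.
data.loc[data["Lang"].str.match(pat[0]), "Type"] = "B"
data.loc[data["Lang"].str.match(pat[1]), "Type"] = "A"

print(data)
```

Each pattern costs one vectorised pass over the column, rather than one Python-level regex call per row.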

Here are some timings:

In [34]: bigdata = pd.concat([data]*2000,ignore_index=True)

In [35]: def f3(df):
    df = df.copy()
    df['Type'] = ''
    for i in range(len(df.Lang)):
        if re.search(pat[0],df.Lang.ix[i]):
            df.Type.ix[i] = 'B'
        if re.search(pat[1],df.Lang.ix[i]):
            df.Type.ix[i] = 'A'
   ....:             

In [36]: def f2(df):
    df = df.copy()
    df.loc[df.Lang.str.match(pat[0]),'Type'] = 'B'
    df.loc[df.Lang.str.match(pat[1]),'Type'] = 'A'
    df.fillna('')
   ....:     

In [37]: def f1(df):
    df = df.copy()
    f = lambda s: re.match(pat[0], s) and 'A' or re.match(pat[1], s) and 'B' or ''
    df['Type'] = df['Lang'].apply(f)
   ....:

Your original solution

In [41]: %timeit f3(bigdata)
1 loops, best of 3: 2.21 s per loop

Direct indexing

In [42]: %timeit f2(bigdata)
100 loops, best of 3: 17.3 ms per loop

Apply

In [43]: %timeit f1(bigdata)
10 loops, best of 3: 21.8 ms per loop
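Outside IPython, a comparable measurement can be made with the standard-library timeit module. The sketch below is illustrative only: the helper names by_mask and by_apply are made up here, and absolute numbers will differ from those quoted above depending on the machine and the pandas version.

```python
import re
import timeit

import pandas as pd

pat = [r"^P\w", r"^S\w"]
data = pd.DataFrame({"Lang": ["Python", "Cython", "Scipy", "Numpy", "Pandas"]})
bigdata = pd.concat([data] * 2000, ignore_index=True)


def by_mask(df):
    # Vectorised: one boolean mask per pattern.
    df = df.copy()
    df["Type"] = ""
    df.loc[df["Lang"].str.match(pat[0]), "Type"] = "B"
    df.loc[df["Lang"].str.match(pat[1]), "Type"] = "A"
    return df


def by_apply(df):
    # Element-wise: one Python function call per row.
    df = df.copy()

    def label(s):
        if re.match(pat[0], s):
            return "B"
        if re.match(pat[1], s):
            return "A"
        return ""

    df["Type"] = df["Lang"].apply(label)
    return df


timings = {}
for name, func in [("mask", by_mask), ("apply", by_apply)]:
    # Best of 3 repeats of 5 calls each, reported per call.
    best = min(timeit.repeat(lambda: func(bigdata), number=5, repeat=3))
    timings[name] = best / 5
    print(f"{name}: {timings[name] * 1e3:.2f} ms per call")
```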

Here is another, more general approach that is faster still, and probably more useful, because you can then combine the patterns as needed.

In [107]: pats
Out[107]: {'A': '^P\\w', 'B': '^S\\w'}

In [108]: concat([df,DataFrame(dict([ (c,Series(c,index=df.index)[df.Lang.str.match(p)].reindex(df.index)) for c,p in pats.items() ]))],axis=1)
Out[108]: 
      Lang    A    B
0   Python    A  NaN
1   Cython  NaN  NaN
2    Scipy  NaN    B
3    Numpy  NaN  NaN
4   Pandas    A  NaN
5   Python    A  NaN
6   Cython  NaN  NaN

45  Python    A  NaN
46  Cython  NaN  NaN
47   Scipy  NaN    B
48   Numpy  NaN  NaN
49  Pandas    A  NaN
50  Python    A  NaN
51  Cython  NaN  NaN
52   Scipy  NaN    B
53   Numpy  NaN  NaN
54  Pandas    A  NaN
55  Python    A  NaN
56  Cython  NaN  NaN
57   Scipy  NaN    B
58   Numpy  NaN  NaN
59  Pandas    A  NaN
       ...  ...  ...

[10000 rows x 3 columns]

In [106]: %timeit  concat([df,DataFrame(dict([ (c,Series(c,index=df.index)[df.Lang.str.match(p)].reindex(df.index)) for c,p in pats.items() ]))],axis=1)
100 loops, best of 3: 15.5 ms per loop

This builds a Series for each pattern, with the letter in the rows that match and NaN everywhere else.

Create a Series filled with that letter

Series(c,index=df.index)

Select the matching rows from it

Series(c,index=df.index)[df.Lang.str.match(p)]

Reindex, which puts NaN wherever a row is missing from the selection

Series(c,index=df.index)[df.Lang.str.match(p)].reindex(df.index)
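The three steps above can be put together as a small standalone script. This is a sketch under assumptions, not the original session: it uses explicit pd. prefixes, a five-row frame instead of the larger one, and intermediate variable names (letters, matched, columns) chosen here for readability.

```python
import pandas as pd

df = pd.DataFrame({"Lang": ["Python", "Cython", "Scipy", "Numpy", "Pandas"]})
pats = {"A": r"^P\w", "B": r"^S\w"}

columns = {}
for c, p in pats.items():
    letters = pd.Series(c, index=df.index)      # step 1: the letter in every row
    matched = letters[df["Lang"].str.match(p)]  # step 2: keep only the matching rows
    columns[c] = matched.reindex(df.index)      # step 3: dropped rows come back as NaN

# One column per pattern, aligned with the original frame.
result = pd.concat([df, pd.DataFrame(columns)], axis=1)
print(result)
```

Keeping one column per pattern means a row that matches several patterns keeps all of its labels, whereas the single Type column keeps only the last one assigned.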

Answer 2

You can use a single lambda to do both classifications:

f = lambda s: re.match(pat[0], s) and 'A' or re.match(pat[1], s) and 'B' or ''

Then use apply to get the Type column:

data.Type = data.Lang.apply(f)

Output:

     Lang Type
0  Python    A
1  Cython
2   Scipy    B
3   Numpy
4  Pandas    A
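The and/or chain in the lambda relies on short-circuit evaluation: re.match returns a match object (truthy) or None, so the expression yields the label of the first pattern that matches, and '' if neither does. Written out as a plain function it might look like the sketch below; the name classify is made up for illustration.

```python
import re

pat = [r"^P\w", r"^S\w"]


def classify(s):
    # Same result as:
    #   re.match(pat[0], s) and 'A' or re.match(pat[1], s) and 'B' or ''
    if re.match(pat[0], s):
        return "A"
    if re.match(pat[1], s):
        return "B"
    return ""


# The one-line form, kept for comparison.
f = lambda s: re.match(pat[0], s) and "A" or re.match(pat[1], s) and "B" or ""

langs = ["Python", "Cython", "Scipy", "Numpy", "Pandas"]
print([classify(s) for s in langs])
```

The one-liner only works because the labels 'A' and 'B' are truthy; with an empty or otherwise falsy label the chain would fall through to the next alternative, which the explicit function avoids.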

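Edit: after benchmarking, this is probably no better. If you want to speed it up, avoid recompiling the regular expressions on every call by compiling them once up front: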

def f1(df):
    df = df.copy()
    f = lambda s: re.match(pat[0], s) and 'A' or re.match(pat[1], s) and 'B' or ''
    df['Type'] = df['Lang'].apply(f)
    return df

def f1_1(df):
    df = df.copy()
    re1, re2 = re.compile(pat[0]), re.compile(pat[1])
    f = lambda s: re1.match(s) and 'A' or re2.match(s) and 'B' or ''
    df.Type = df.Lang.apply(f)
    return df

bigdata = pd.concat([data]*2000,ignore_index=True)

Original apply:

In [18]:  %timeit f1(bigdata)
10 loops, best of 3: 23.1 ms per loop

Modified apply:

In [19]: %timeit f1_1(bigdata)
100 loops, best of 3: 6.65 ms per loop

Iterating over pandas more efficiently in Python without a for loop
