Pandas: optimization, removing loops

Time: 2019-05-07 01:53:34

Tags: python pandas

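I am working with a dataset of about 400,000 rows that needs cleaning.

Two operations to perform:

  1. The resale invoice month is an object such as 'M201705'. I want to create a column named 'Year' that contains only the year, 2017.

  2. Some commercial products, which are also objects, end with 'TR'. I want to remove the TR from those products. For example, 'M23065TR' should become 'M23065'. The column also contains product names that are already fine, such as 'M340767' or 'M34TR32', and those should stay unchanged.

You can find my attempts below: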

#First case
for i in range(Ndata.shape[0]):    
    Ndata['Year'][i] = str(Ndata['Resale Invoice Month'][i])[1:5]
#A loop takes too much time
#Tried that also : 
NData['Year'] = Ndata.str['Resale Invoice Month'][1:5]
#Error : Str is not an attribute of dataframe

for i in range(Ndata.shape[0]):
    if (Ndata['Commercial Product Code'][i][-2:]=='TR')==True:
        Ndata.loc[i,'Commercial Product Code']=Ndata.loc[i,'Commercial Product Code'][:-2]
#same issue is a loop

#I was advice to do that : 
idx = Ndata[Ndata['Commercial Product Code'].str[-2:]=='TR']
Ndata.loc[idx, 'Commercial Product Code'] = Ndata[idx]['Commercial Product Code'].str[:-2]
#It doesn't work as well

3 Answers:

Answer 0 (score: 3)

To take the characters at positions 1 through 4 as the year, use Series.str[indices]:

Ndata['Year'] = Ndata['Resale Invoice Month'].str[1:5]

To remove 'TR' from the end of the string, use Series.str.replace. Here $ matches the end of the string:

Ndata['Commercial Product Code'] = Ndata['Commercial Product Code'].str.replace(r'TR$', '', regex=True)
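
A minimal sketch of both steps on a small frame, for anyone who wants to check the behavior before running it on the full data. The sample rows are made up for illustration; only the column names come from the question. `regex=True` is passed explicitly because the default of `Series.str.replace` changed to literal matching in pandas 2.0.

```python
import pandas as pd

# Hypothetical sample data; only the column names are taken from the question.
Ndata = pd.DataFrame({
    "Resale Invoice Month": ["M201705", "M201812", "M201901"],
    "Commercial Product Code": ["M23065TR", "M340767", "M34TR32"],
})

# Slice positions 1..4 (0-based, end-exclusive) of every string at once.
Ndata["Year"] = Ndata["Resale Invoice Month"].str[1:5]

# Strip 'TR' only when it sits at the very end of the string.
Ndata["Commercial Product Code"] = Ndata["Commercial Product Code"].str.replace(
    r"TR$", "", regex=True
)

print(Ndata)
```

Both calls operate on the whole column, so no Python-level loop over the rows is needed. 'M34TR32' is left alone because its 'TR' is not at the end.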

Answer 1 (score: 0)

I believe this is what you want:

# get the 2nd, 3rd, 4th and 5th characters of Ndata[Resale Invoice Month]

Ndata['Year'] = Ndata['Resale Invoice Month'].str[1:5].astype(int)

# remove the last two characters if they are TR

Ndata.loc[Ndata['Commercial Product Code'].str[-2:] == 'TR', 'Commercial Product Code'] = Ndata['Commercial Product Code'].str[:-2]
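
The same idea, sketched with a named boolean mask so the selection step is easier to inspect. The sample rows are invented for illustration. The `idx` attempt in the question most likely failed because `Ndata[condition]` returns the filtered rows as a DataFrame, whereas `.loc` expects the boolean Series itself as the row selector.

```python
import pandas as pd

# Hypothetical sample data; only the column name is taken from the question.
Ndata = pd.DataFrame(
    {"Commercial Product Code": ["M23065TR", "M340767", "M34TR32"]}
)
col = "Commercial Product Code"

# Boolean Series: True where the code ends with 'TR'.
mask = Ndata[col].str.endswith("TR")

# Trim the last two characters on the selected rows only.
Ndata.loc[mask, col] = Ndata.loc[mask, col].str[:-2]

print(Ndata)
```

If the column can contain missing values, passing `na=False` to `str.endswith` keeps the mask strictly boolean.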

Answer 2 (score: 0)

Or a one-liner using replace with regex=True:

Ndata['Year'] = Ndata['Resale Invoice Month'].str[1:5].replace('TR', '', regex=True)

Now:

print(Ndata)

will give the expected result.
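
For the product code column, a `Series.replace` call with `regex=True` could look like the sketch below. This is an illustration under assumptions, not the answerer's exact code: the sample rows are invented, and the pattern is anchored with `$` so that a 'TR' in the middle of a code is not touched.

```python
import pandas as pd

# Hypothetical sample data; only the column name is taken from the question.
Ndata = pd.DataFrame(
    {"Commercial Product Code": ["M23065TR", "M340767", "M34TR32"]}
)

# Series.replace with regex=True substitutes the matched part of each string;
# 'TR$' matches 'TR' only at the end of the value.
Ndata["Commercial Product Code"] = Ndata["Commercial Product Code"].replace(
    r"TR$", "", regex=True
)

print(Ndata)
```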