Struggling with iterrows in Python pandas

Time: 2017-02-05 12:48:43

Tags: python pandas

I have a DataFrame in which states and regions are mixed together in one column. The values that end in [edit] are US states.

    RegionName
0   Alabama[edit]
1   Auburn [1]
2   Florence
3   Jacksonville [2]
4   Livingston [2]
5   Montevallo [2]
6   Troy [2]
7   Tuscaloosa [3][4]
8   Tuskegee [5]
9   Alaska[edit]    

The result I want is

    State               RegionName
0   Alabama[edit]       Auburn[1]
1                       Florence
2                       Jacksonville [2]
3                          ...
4   Alaska[edit]           ...   

I tried the code below, but it failed

for row in df.iterrows():
    if row['RegionName'][-6:] == '[edit]':
        row['state'] = row[:-6]

The error message is

TypeError: tuple indices must be integers or slices, not str

Can anyone give me some advice? Thanks.

1 Answer:

Answer 0: (score: 3)

You can use mask, selecting the last six characters of each value by indexing with str:

mask = df.RegionName.str[-6:] != '[edit]'
print (mask)
0    False
1     True
2     True
3     True
4     True
5     True
6     True
7     True
8     True
9    False
Name: RegionName, dtype: bool

#filter by mask and replace NaN by forward filling
df['State'] = df.RegionName.mask(mask).ffill()
#remove same values in both columns
df = df[df.State != df.RegionName]
print (df)
          RegionName          State
1         Auburn [1]  Alabama[edit]
2           Florence  Alabama[edit]
3   Jacksonville [2]  Alabama[edit]
4     Livingston [2]  Alabama[edit]
5     Montevallo [2]  Alabama[edit]
6           Troy [2]  Alabama[edit]
7  Tuscaloosa [3][4]  Alabama[edit]
8       Tuskegee [5]  Alabama[edit]
#keep the first occurrence of each state, replace the repeats with an empty string
df['State'] = df.State.mask(df.State.duplicated(), '')
#change order of columns
df = df[['State','RegionName']].reset_index(drop=True)
print (df)
           State         RegionName
0  Alabama[edit]         Auburn [1]
1                          Florence
2                  Jacksonville [2]
3                    Livingston [2]
4                    Montevallo [2]
5                          Troy [2]
6                 Tuscaloosa [3][4]
7                      Tuskegee [5]
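
The steps above can be collected into one self-contained sketch. The sample rows are copied from the question; the helper name `split_states` is hypothetical, and `str.endswith` is swapped in for the `str[-6:]` slice as an equivalent test. The `.copy()` call is an addition that avoids pandas' chained-assignment warning when the State column is overwritten.

```python
import pandas as pd

# Sample data copied from the question (rows 0-9).
df = pd.DataFrame({'RegionName': [
    'Alabama[edit]', 'Auburn [1]', 'Florence', 'Jacksonville [2]',
    'Livingston [2]', 'Montevallo [2]', 'Troy [2]',
    'Tuscaloosa [3][4]', 'Tuskegee [5]', 'Alaska[edit]',
]})


def split_states(frame):
    """Hypothetical helper: split a mixed RegionName column into State/RegionName."""
    # True for region rows, False for the state header rows.
    is_region = ~frame['RegionName'].str.endswith('[edit]')
    # Blank out the region rows, then carry each state name down to its regions.
    state = frame['RegionName'].mask(is_region).ffill()
    # Keep only the region rows; copy so the assignment below is unambiguous.
    out = frame.assign(State=state).loc[is_region].copy()
    # Show each state only on its first row.
    out['State'] = out['State'].mask(out['State'].duplicated(), '')
    return out[['State', 'RegionName']].reset_index(drop=True)


result = split_states(df)
print(result)
```

With this sample, Alaska does not appear in the result because no region rows follow it.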

But if you also need to remove the [] and the numbers, you can use a slightly modified answer:

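df.insert(0, 'State', df['RegionName'].str.extract(r'(.*)\[edit\]', expand=False).ffill())
df = df[~df['RegionName'].str.contains(r'\[edit\]')].reset_index(drop=True)
#remove the trailing bracketed reference markers such as " [1]"
#(regex=True is needed because pandas 2.0 changed the default to False)
df['RegionName'] = df['RegionName'].str.replace(r' \[.+$', '', regex=True)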
print (df)
     State    RegionName
0  Alabama        Auburn
1  Alabama      Florence
2  Alabama  Jacksonville
3  Alabama    Livingston
4  Alabama    Montevallo
5  Alabama          Troy
6  Alabama    Tuscaloosa
7  Alabama      Tuskegee

df['State'] = df.State.mask(df.State.duplicated(), '')
print (df)
     State    RegionName
0  Alabama        Auburn
1               Florence
2           Jacksonville
3             Livingston
4             Montevallo
5                   Troy
6             Tuscaloosa
7               Tuskegee
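
As a runnable sketch of the regex variant, assuming a recent pandas: the patterns are written as raw strings so that `\[` is not treated as a string escape, and `regex=True` is passed explicitly because pandas 2.0 changed the default of `Series.str.replace` to a literal match. The sample rows are copied from the question; the names `state`, `is_state` and `clean` are illustrative only.

```python
import pandas as pd

# Sample data copied from the question (rows 0-9).
df = pd.DataFrame({'RegionName': [
    'Alabama[edit]', 'Auburn [1]', 'Florence', 'Jacksonville [2]',
    'Livingston [2]', 'Montevallo [2]', 'Troy [2]',
    'Tuscaloosa [3][4]', 'Tuskegee [5]', 'Alaska[edit]',
]})

# Capture the text before "[edit]"; region rows do not match and become NaN,
# so forward filling carries each state name down to its regions.
state = df['RegionName'].str.extract(r'(.*)\[edit\]', expand=False).ffill()

# Rows that are state headers rather than regions.
is_state = df['RegionName'].str.contains(r'\[edit\]', regex=True)

clean = df.assign(State=state)[~is_state].reset_index(drop=True)

# Strip everything from the first " [" to the end, e.g. " [3][4]".
clean['RegionName'] = clean['RegionName'].str.replace(r'\s*\[.+$', '', regex=True)
clean = clean[['State', 'RegionName']]
print(clean)
```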

Edit after comment:

If you really need the (very slow) loop solution, there are several problems in your code to fix:

#add i for index value else get tuples
for i, row in df.iterrows():
    print (row)
    if row['RegionName'][-6:] == '[edit]':
        #for appending new column with values use loc 
        df.loc[i, 'state'] = row['RegionName'][:-6]

print (df)
         RegionName    state
0     Alabama[edit]  Alabama
1        Auburn [1]      NaN
2          Florence      NaN
3  Jacksonville [2]      NaN
4    Livingston [2]      NaN
5    Montevallo [2]      NaN
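
To illustrate why the original loop raised the TypeError, here is a small sketch with a shortened, made-up sample of four rows. It shows that `iterrows()` yields `(index, row)` tuples, and then a variation on the loop (not the code above) that collects the values in a plain list and assigns the column once, instead of writing to the DataFrame while iterating over it. Unlike the loop above, this variation also carries each state down to its region rows.

```python
import pandas as pd

# Shortened sample for illustration only.
df = pd.DataFrame({'RegionName': ['Alabama[edit]', 'Auburn [1]', 'Florence', 'Alaska[edit]']})

# Each item from iterrows() is an (index, Series) pair, so indexing it
# with a column name fails unless the pair is unpacked first.
first = next(df.iterrows())
print(type(first))  # <class 'tuple'>

suffix = '[edit]'
states = []
current = None
for name in df['RegionName']:
    if name.endswith(suffix):
        # A state header row: remember its name without the suffix.
        current = name[:-len(suffix)]
    states.append(current)

# Assign the whole column in one step after the loop.
df['state'] = states
print(df)
```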