我有一个数据帧混合状态和区域在一起。 那些值[edit]意味着美国的州。
RegionName
0 Alabama[edit]
1 Auburn [1]
2 Florence
3 Jacksonville [2]
4 Livingston [2]
5 Montevallo [2]
6 Troy [2]
7 Tuscaloosa [3][4]
8 Tuskegee [5]
9 Alaska[edit]
我想要的结果是
State RegionName
0 Alabama[edit] Auburn[1]
1 Florence
2 Jacksonville [2]
3 ...
4 Alaska[edit] ...
我尝试使用下面的代码,但失败了
for row in df.iterrows():
if row['RegionName'][-6:] == '[edit]':
row['state'] = row[:-6]
错误消息是
TypeError: tuple indices must be integers or slices, not str
有人可以给我一些建议吗?感谢
答案 0 :(得分:3)
您可以使用mask
,选择最后六个字符indexing with str:
mask = df.RegionName.str[-6:] != '[edit]'
print (mask)
0 False
1 True
2 True
3 True
4 True
5 True
6 True
7 True
8 True
9 False
Name: RegionName, dtype: bool
#filter by mask and replace NaN by forward filling
df['State'] = df.RegionName.mask(mask).ffill()
#remove same values in both columns
df = df[df.State != df.RegionName]
print (df)
RegionName State
1 Auburn [1] Alabama[edit]
2 Florence Alabama[edit]
3 Jacksonville [2] Alabama[edit]
4 Livingston [2] Alabama[edit]
5 Montevallo [2] Alabama[edit]
6 Troy [2] Alabama[edit]
7 Tuscaloosa [3][4] Alabama[edit]
8 Tuskegee [5] Alabama[edit]
#keep only first duplicates, another replace by empty string
df['State'] = df.State.mask(df.State.duplicated(), '')
#change order of columns
df = df[['State','RegionName']].reset_index(drop=True)
print (df)
State RegionName
0 Alabama[edit] Auburn [1]
1 Florence
2 Jacksonville [2]
3 Livingston [2]
4 Montevallo [2]
5 Troy [2]
6 Tuscaloosa [3][4]
7 Tuskegee [5]
但是如果需要删除[]
并且数字可以使用位修改answer:
df.insert(0, 'State', df['RegionName'].str.extract('(.*)\[edit\]', expand=False).ffill())
df = df[~df['RegionName'].str.contains('\[edit\]')].reset_index(drop=True)
#change ( to [
df['RegionName'] = df['RegionName'].str.replace(r' \[.+$', '')
print (df)
State RegionName
0 Alabama Auburn
1 Alabama Florence
2 Alabama Jacksonville
3 Alabama Livingston
4 Alabama Montevallo
5 Alabama Troy
6 Alabama Tuscaloosa
7 Alabama Tuskegee
df['State'] = df.State.mask(df.State.duplicated(), '')
print (df)
State RegionName
0 Alabama Auburn
1 Florence
2 Jacksonville
3 Livingston
4 Montevallo
5 Troy
6 Tuscaloosa
7 Tuskegee
通过评论编辑:
如果需要非常慢的循环解决方案,则存在多个问题:
#add i for index value else get tuples
for i, row in df.iterrows():
print (row)
if row['RegionName'][-6:] == '[edit]':
#for appending new column with values use loc
df.loc[i, 'state'] = row['RegionName'][:-6]
print (df)
RegionName state
0 Alabama[edit] Alabama
1 Auburn [1] NaN
2 Florence NaN
3 Jacksonville [2] NaN
4 Livingston [2] NaN
5 Montevallo [2] NaN