Splitting according to a specific pattern

Time: 2018-04-30 19:51:09

Tags: python string pandas dataframe split

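Please forgive my pandas newbie question. I have a column of US towns and states, a truncated version of which is shown below. (For some strange reason, the column itself is named 'Alabama[edit]', which is the state that the town values in rows 0-7 belong to.)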

0                          Auburn (Auburn University)[1]
1                 Florence (University of North Alabama)
2        Jacksonville (Jacksonville State University)[2]
3             Livingston (University of West Alabama)[2]
4               Montevallo (University of Montevallo)[2]
5                              Troy (Troy University)[2]
6      Tuscaloosa (University of Alabama, Stillman Co...
7                      Tuskegee (Tuskegee University)[5]
8                                           Alaska[edit]
9          Fairbanks (University of Alaska Fairbanks)[2]
10                                         Arizona[edit]
11            Flagstaff (Northern Arizona University)[6]
12                      Tempe (Arizona State University)
13                        Tucson (University of Arizona)
14                                        Arkansas[edit]
15     Arkadelphia (Henderson State University, Ouach...
16     Conway (Central Baptist College, Hendrix Colle...
17              Fayetteville (University of Arkansas)[7]
18              Jonesboro (Arkansas State University)[8]
19            Magnolia (Southern Arkansas University)[2]
20     Monticello (University of Arkansas at Monticel...
21            Russellville (Arkansas Tech University)[2]
22                        Searcy (Harding University)[5]
23                                      California[edit]

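The towns of each state are listed below that state's name. For example, Fairbanks (row 9) is a town in Alaska.

What I want to do is split the town names by state name, so that I end up with two columns, 'State' and 'RegionName', where each town name is paired with its state name, like this: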

    RegionName                                         State
0   Auburn (Auburn University)[1]                      Alabama
1   Florence (University of North Alabama)             Alabama
2   Jacksonville (Jacksonville State University)[2]    Alabama
3   Livingston (University of West Alabama)[2]         Alabama
4   Montevallo (University of Montevallo)[2]           Alabama
5   Troy (Troy University)[2]                          Alabama
6   Tuscaloosa (University of Alabama, Stillman Co...  Alabama
7   Tuskegee (Tuskegee University)[5]                  Alabama
8   Fairbanks (University of Alaska Fairbanks)[2]      Alaska
9   Flagstaff (Northern Arizona University)[6]         Arizona
10  Tempe (Arizona State University)                   Arizona
11  Tucson (University of Arizona)                     Arizona
12  Arkadelphia (Henderson State University, Ouach...  Arkansas

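... and so on.

I know that each state name is followed by the string '[edit]', and I think I could use that to split the column and assign the town names, but I don't know how to do it.

I also know that I will need to do a lot of other data cleaning, such as removing the text inside the parentheses and the bracketed '[]' strings. That can be done later. The important part is to separate states from towns and to assign each town to its proper US state. Any advice would be greatly appreciated.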

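1 Answer:

Answer 0: (score: 2)

Without much context or access to your data, I would suggest something along these lines. First, modify the code that reads your data: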

import pandas as pd

df = pd.read_csv(..., header=None, names=['RegionName'])
# header=None makes pandas read the first row as data instead of
# treating it as the column name; names= supplies the column name
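
To see why the column ended up being named after the first state, here is a small sketch that reads the same text twice, once with the default header handling and once with header=None. The sample text and the variable names (text, default_df, df) are made up for illustration and are not the asker's actual file; the sample lines contain no commas so that each line stays a single field.

```python
import io

import pandas as pd

# Hypothetical sample standing in for the real file: one value per line.
text = (
    "Alabama[edit]\n"
    "Auburn (Auburn University)[1]\n"
    "Florence (University of North Alabama)\n"
    "Alaska[edit]\n"
    "Fairbanks (University of Alaska Fairbanks)[2]\n"
)

# Default behaviour: the first line is consumed as the column name.
default_df = pd.read_csv(io.StringIO(text))

# header=None keeps the first line as data; names= labels the column.
df = pd.read_csv(io.StringIO(text), header=None, names=['RegionName'])

print(default_df.columns.tolist())
print(df)
```

With the default settings the first state never appears as a row, so it could not be forward-filled onto its towns later.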

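Now use str.extract to pull out the state name, matching a name only when it is followed by the substring "[edit]". Rows without "[edit]" (the towns) get NaN, which you can then forward-fill with ffill so that each town inherits the state above it.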

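# expand=False returns a Series (rather than a one-column DataFrame),
# which can be assigned directly to the new column
df['State'] = df['RegionName'].str.extract(
    r'(?P<State>.*)(?=\s*\[edit\])', expand=False
).ffill()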
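
Note that after this step the state header rows (such as 'Alaska[edit]') are still present in the frame, whereas the desired output in the question lists only towns. The following is a minimal end-to-end sketch on a few sample rows, not necessarily how the answerer would finish it. The sample data and the names is_state and result are assumptions for illustration, and the pattern is a slightly simplified variant of the one above.

```python
import pandas as pd

# Hypothetical sample with the same layout as the question's column.
df = pd.DataFrame({'RegionName': [
    'Alabama[edit]',
    'Auburn (Auburn University)[1]',
    'Florence (University of North Alabama)',
    'Alaska[edit]',
    'Fairbanks (University of Alaska Fairbanks)[2]',
    'Arizona[edit]',
    'Tempe (Arizona State University)',
]})

# Mark the rows that are state headers (plain substring test, no regex).
is_state = df['RegionName'].str.contains('[edit]', regex=False)

# Capture the text before "[edit]"; town rows yield NaN, which ffill
# replaces with the most recent state name above them.
df['State'] = (
    df['RegionName']
    .str.extract(r'(.*?)\s*\[edit\]', expand=False)
    .ffill()
)

# Keep only the town rows and renumber them from 0.
result = df[~is_state].reset_index(drop=True)
print(result)
```

The remaining cleanup mentioned in the question (removing the parenthesised text and the bracketed markers) can then be applied to result['RegionName'] as a separate step.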