清理数据帧比loc更有效的方法

时间:2018-05-03 18:20:12

标签: python pandas dataframe

我的代码如下:

import pandas as pd
df = pd.read_excel("Energy Indicators.xls", header=None, footer=None)
c_df = df.copy()
c_df = c_df.iloc[18:245, 2:]
c_df = c_df.rename(columns={2: 'Country', 3: 'Energy Supply', 4:'Energy Supply per Capita', 5:'% Renewable'})
c_df['Energy Supply'] = c_df['Energy Supply'].apply(lambda x: x*1000000)
c_df.loc[c_df['Country'] == 'Korea, Rep.'] = 'South Korea'
c_df.loc[c_df['Country'] == 'United States of America20'] = 'United States'
c_df.loc[c_df['Country'] == 'United Kingdom of Great Britain and Northern Ireland'] = 'United Kingdom'
c_df.loc[c_df['Country'] == 'China, Hong Kong Special Administrative Region'] = 'Hong Kong'
c_df.loc[c_df['Country'] == 'Venezuela (Bolivarian Republic of)'] = 'Venezuela'
c_df.loc[c_df['Country'] == 'Bolivia (Plurinational State of)'] = 'Bolivia'
c_df.loc[c_df['Country'] == 'Switzerland17'] = 'Switzerland'
c_df.loc[c_df['Country'] == 'Australia1'] = 'Australia'
c_df.loc[c_df['Country'] == 'China2'] = 'China'
c_df.loc[c_df['Country'] == 'Falkland Islands (Malvinas)'] = 'Bolivia'
c_df.loc[c_df['Country'] == 'Greenland7'] = 'Greenland'
c_df.loc[c_df['Country'] == 'Iran (Islamic Republic of'] = 'Iran'
c_df.loc[c_df['Country'] == 'Italy9'] = 'Italy'
c_df.loc[c_df['Country'] == 'Japan10'] = 'Japan'
c_df.loc[c_df['Country'] == 'Kuwait11'] = 'Kuwait'
c_df.loc[c_df['Country'] == 'Micronesia (Federal States of)'] = 'Micronesia'
c_df.loc[c_df['Country'] == 'Netherlands12'] = 'Netherlands'
c_df.loc[c_df['Country'] == 'Portugal13'] = 'Portugal'
c_df.loc[c_df['Country'] == 'Saudi Arabia14'] = 'Saudi Arabia'
c_df.loc[c_df['Country'] == 'Serbia15'] = 'Serbia'
c_df.loc[c_df['Country'] == 'Sint Maarteen (Dutch part)'] = 'Sint Marteen'
c_df.loc[c_df['Country'] == 'Spain16'] = 'Spain'
c_df.loc[c_df['Country'] == 'Ukraine18'] = 'Ukraine'
c_df.loc[c_df['Country'] == 'Denmark5'] = 'Denmark'
c_df.loc[c_df['Country'] == 'France6'] = 'France'
c_df.loc[c_df['Country'] == 'Indonesia8'] = 'Indonesia'

我觉得必须有一种更简单的方法来更改名称中包含括号和数字的国家/地区的值。我可以使用哪种pandas方法在列中查找带括号数字的名称? isin

2 个答案:

答案 0 :(得分:5)

您可以从括号中删除数字和文字开始。之后,对于需要进行非平凡替换的所有其他内容,请声明地图并使用pd.Series.replace应用它。

mapper = {'Korea, Rep' : 'South Korea', 'Falkland Islands' : 'Bolivia', ...} 

df['Country'] = (
    df['Country'].str.replace(r'\d+|\s*\(.*\)', '').str.strip().replace(mapper)
)

足够简单,完成。

<强>详情

\d+     # one or more digits
|       # regex OR pipe
\s*     # zero or more whitespace characters
\(      # literal parentheses (opening brace)
.*      # match anything 
\)      # closing brace

答案 1 :(得分:3)

使用字典然后df.replace

dict_to_replace = {'Korea, Rep.':'South Korea',
                         'United States of America20':'United States',
                         'United Kingdom of Great Britain and Northern Ireland': 'United Kingdom'
                   ...}

df['c_df'] = df['c_df'].replace(dict_to_replace)