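I have a pandas DataFrame, and I want to create a new column that holds a substring of the strings in an existing column.
For example, the "race" column contains the value "2016_Lap_JAPANESE_Third_Times.csv", and I want to extract the word "JAPANESE".
My current approach is to check whether each word from a list occurs in the value and, if it does, write that word into the new column.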
race_names = ['japanese']  # in practice this is a long list, and the "race" column contains many different names
for i, row in df_fp2.iterrows():
    for name in race_names:
        if name in df_fp2.loc[i, 'race']:
            df_fp2.loc[i, 'name'] = str(name) + " Grand Prix"
The DataFrame converted to a dictionary:
{'driverRef': {151: 'button',
152: 'button',
153: 'button',
154: 'button',
155: 'button'},
'driver_no': {151: 22, 152: 22, 153: 22, 154: 22, 155: 22},
'milliseconds': {151: 1339994.0,
152: 692245.0,
153: 96286.0,
154: 94547.999999999985,
155: 114725.0},
'name': {151: 'J.BUTTON',
152: 'J.BUTTON',
153: 'J.BUTTON',
154: 'J.BUTTON',
155: 'J.BUTTON'},
'race': {151: '2016_Lap_JAPANESE_Third_Times.csv',
152: '2016_Lap_JAPANESE_Third_Times.csv',
153: '2016_Lap_JAPANESE_Third_Times.csv',
154: '2016_Lap_JAPANESE_Third_Times.csv',
155: '2016_Lap_JAPANESE_Third_Times.csv'},
'time': {151: 1339.9939999999999,
152: 692.245,
153: 96.286000000000001,
154: 94.547999999999988,
155: 114.72499999999999}}
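These are the unique values in the "race" column of the df. Because the words are arranged differently from one file name to the next, I cannot simply strip the words before and after each country name.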
array(['2016_Lap_ABU_Third_Times.csv', '2016_Lap_BRASIL_Third_Times.csv',
'2016_Lap_CHINESE_Third_Times.csv',
'2016_Lap_JAPANESE_Third_Times.csv',
'2016_Lap_MAGYAR_Third_Times.csv',
'2016_Lap_SINGAPORE_Third_Times.csv', '2016_Lap_Third_Times.csv',
'2016_Lap_UNITED_Third_Times.csv',
'AUSTRALIAN_2016_Lap_Third_Times.csv',
'BAHRAIN_2016_Lap_Third_Times.csv',
'BELGIAN_2016_Lap_Third_Times.csv',
'CANADA_2016_Lap_Third_Times.csv',
'ESPANA_2016_Lap_Third_Times.csv',
'EUROPE_2016_Lap_Third_Times.csv',
'MALAYSIA_2016_Lap_Third_Times.csv',
'Mexico_2016_Lap_Third_Times.csv',
'RUSSIAN_2016_Lap_Third_Times.csv'], dtype=object)
Answer 0 (score: 0)
If every word you might need to extract is listed in race_names, use str.extract:
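import re
race_names = ['japanese']
# Join the names into one alternation; re.escape guards against regex metacharacters in a name
pat = '|'.join(re.escape(x) for x in race_names)
# flags=re.I makes the match case-insensitive, so 'japanese' matches 'JAPANESE'
df['name'] = df['race'].str.extract('(' + pat + ')', expand=False, flags=re.I) + " Grand Prix"
print(df)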
driverRef driver_no milliseconds name \
151 button 22 1339994.0 JAPANESE Grand Prix
152 button 22 692245.0 JAPANESE Grand Prix
153 button 22 96286.0 JAPANESE Grand Prix
154 button 22 94548.0 JAPANESE Grand Prix
155 button 22 114725.0 JAPANESE Grand Prix
race time
151 2016_Lap_JAPANESE_Third_Times.csv 1339.994
152 2016_Lap_JAPANESE_Third_Times.csv 692.245
153 2016_Lap_JAPANESE_Third_Times.csv 96.286
154 2016_Lap_JAPANESE_Third_Times.csv 94.548
155 2016_Lap_JAPANESE_Third_Times.csv 114.725
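To show how the same idea behaves with more than one name, here is a minimal sketch. The three-entry race_names list and the four-row sample frame are assumptions made for illustration, not the asker's real data. Rows whose file name contains none of the listed words come back as NaN rather than raising an error.

```python
import re
import pandas as pd

# Hypothetical sample: a few of the file names listed in the question
df = pd.DataFrame({'race': ['2016_Lap_JAPANESE_Third_Times.csv',
                            'AUSTRALIAN_2016_Lap_Third_Times.csv',
                            'Mexico_2016_Lap_Third_Times.csv',
                            '2016_Lap_Third_Times.csv']})

# Assumed lookup list; the real one would hold every expected name
race_names = ['japanese', 'australian', 'mexico']

# One capture group containing all names as alternatives
pat = '(' + '|'.join(re.escape(x) for x in race_names) + ')'
found = df['race'].str.extract(pat, expand=False, flags=re.I)

# The match keeps the casing from the file name, so normalise it with str.title();
# rows without a match stay NaN and therefore get no suffix
df['name'] = found.str.title() + ' Grand Prix'
print(df)
```

Normalising with str.title() is optional; it only makes 'JAPANESE' and 'Mexico' come out in the same style.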
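If the names are not known in advance, another option is to strip the parts that every file name shares, using replace:
import pandas as pd
df = pd.DataFrame({'race':['2016_Lap_ABU_Third_Times.csv', '2016_Lap_BRASIL_Third_Times.csv',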
'2016_Lap_CHINESE_Third_Times.csv',
'2016_Lap_JAPANESE_Third_Times.csv',
'2016_Lap_MAGYAR_Third_Times.csv',
'2016_Lap_SINGAPORE_Third_Times.csv', '2016_Lap_Third_Times.csv',
'2016_Lap_UNITED_Third_Times.csv',
'AUSTRALIAN_2016_Lap_Third_Times.csv',
'BAHRAIN_2016_Lap_Third_Times.csv',
'BELGIAN_2016_Lap_Third_Times.csv',
'CANADA_2016_Lap_Third_Times.csv',
'ESPANA_2016_Lap_Third_Times.csv',
'EUROPE_2016_Lap_Third_Times.csv',
'MALAYSIA_2016_Lap_Third_Times.csv',
'Mexico_2016_Lap_Third_Times.csv',
'RUSSIAN_2016_Lap_Third_Times.csv']})
# Remove the shared suffix, the word "Lap" and all digits, then trim the leftover underscores
df['name'] = (df['race'].replace([r'_Third_Times\.csv', 'Lap', r'\d+'], '', regex=True)
                        .str.strip('_'))
print(df)
race name
0 2016_Lap_ABU_Third_Times.csv ABU
1 2016_Lap_BRASIL_Third_Times.csv BRASIL
2 2016_Lap_CHINESE_Third_Times.csv CHINESE
3 2016_Lap_JAPANESE_Third_Times.csv JAPANESE
4 2016_Lap_MAGYAR_Third_Times.csv MAGYAR
5 2016_Lap_SINGAPORE_Third_Times.csv SINGAPORE
6 2016_Lap_Third_Times.csv
7 2016_Lap_UNITED_Third_Times.csv UNITED
8 AUSTRALIAN_2016_Lap_Third_Times.csv AUSTRALIAN
9 BAHRAIN_2016_Lap_Third_Times.csv BAHRAIN
10 BELGIAN_2016_Lap_Third_Times.csv BELGIAN
11 CANADA_2016_Lap_Third_Times.csv CANADA
12 ESPANA_2016_Lap_Third_Times.csv ESPANA
13 EUROPE_2016_Lap_Third_Times.csv EUROPE
14 MALAYSIA_2016_Lap_Third_Times.csv MALAYSIA
15 Mexico_2016_Lap_Third_Times.csv Mexico
16 RUSSIAN_2016_Lap_Third_Times.csv RUSSIAN
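Row 6 in the output above ends up with an empty name because its file name contains no country word. A possible follow-up, sketched here on a hypothetical three-row sample, is to turn empty results into missing values before appending the " Grand Prix" suffix the question asks for. Using Series.where for that step is one choice among several.

```python
import pandas as pd

# Hypothetical sample covering a normal name, a file without a name, and a leading name
df = pd.DataFrame({'race': ['2016_Lap_ABU_Third_Times.csv',
                            '2016_Lap_Third_Times.csv',
                            'Mexico_2016_Lap_Third_Times.csv']})

# Same stripping step as above
stripped = (df['race']
            .replace([r'_Third_Times\.csv', 'Lap', r'\d+'], '', regex=True)
            .str.strip('_'))

# Keep non-empty results; empty strings become missing values,
# so no bare " Grand Prix" label is produced for them
df['name'] = stripped.where(stripped.ne('')) + ' Grand Prix'
print(df)
```

This leaves the second row without a name, which makes it easy to find with df['name'].isna() and handle separately.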