Question

I have a sample DataFrame with a single column:

import pandas as pd
d = {
    'question#': ['a1.2', 'a10', 'a10.1', 'b11.1a', 'k20.3d', 'b20c']
}
df = pd.DataFrame(d)

It looks like this:

Out[8]: 
question#
0       a1.2
1       a10
2       a10.1
3       b11.1a
4       k20.3d
5       b20c

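There doesn't seem to be any way to correctly sort a column that mixes letters and numbers, so I think the only option is to first split the column into 3 columns:

First column: a single letter: (a-z); the string always starts with one letter

Second column: two possible forms:

one or more digits: (1-9)+

or
digits + '.' + digits: (1-9)+(/.)(1-9)+

Third column: a single letter or nothing: (a-z)?

So for the sample DataFrame, I would like to split it into the following columns. Desired output: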

Out[8]: 
question#  firstcol   secondcol    thirdcol
0             a         1.2
1             a         10
2             a         10.1
3             b         11.1           a
4             k         20.3           d
5             b         20             c

Is the syntax something like this? I'm not sure how to write the regular expression:

https://chrisalbon.com/python/pandas_regex_to_create_columns.html

  df['firstcol'] = df['question#'].str.extract(not sure the syntax, expand=True)
  df['secondcol'] = df['question#'].str.extract(not sure the syntax, expand=True)
  df['thirdcol'] = df['question#'].str.extract(not sure the syntax, expand=True)

Answer 1

Try

df[['firstcol', 'secondcol', 'thirdcol']] = df['question#'].str.extract(r'([A-Za-z]+)(\d+\.?\d*)([A-Za-z]*)', expand=True)


    question#   firstcol    secondcol   thirdcol
0   a1.2        a           1.2 
1   a10         a           10  
2   a10.1       a           10.1    
3   b11.1a      b           11.1        a
4   k20.3d      k           20.3        d
5   b20c        b           20          c
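
Since the goal in the question is sorting, here is a minimal sketch of one way to go from the split columns to a sorted frame. It uses a slightly stricter variant of the pattern above with named groups, which str.extract turns into column names. The helper column secondnum is a hypothetical name introduced only for this example, and comparing the numeric part as a float is an assumption: it would treat '1.10' the same as '1.1', so it only fits if the data never relies on that distinction.

```python
import pandas as pd

df = pd.DataFrame({'question#': ['a1.2', 'a10', 'a10.1', 'b11.1a', 'k20.3d', 'b20c']})

# Named groups become the column names of the extracted DataFrame.
# One leading letter, digits with an optional '.digits' part, one optional trailing letter.
pattern = r'^(?P<firstcol>[A-Za-z])(?P<secondcol>\d+(?:\.\d+)?)(?P<thirdcol>[A-Za-z]?)$'
parts = df['question#'].str.extract(pattern)
df = df.join(parts)

# Assumption: the numeric part can be compared as a float.
# 'secondnum' is a hypothetical helper column used only for sorting.
df['secondnum'] = df['secondcol'].astype(float)
sorted_df = (
    df.sort_values(['firstcol', 'secondnum', 'thirdcol'])
      .drop(columns='secondnum')
)
print(sorted_df)
```

Rows that do not match the pattern come back as NaN in all three columns, so a quick check such as parts.isna().any() can flag unexpected values before sorting.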
