Python regex to split on words and capital letters, but excluding numbers

Date: 2018-03-08 16:30:48

Tags: regex python-3.x

I'm trying to analyze some CV data and need to tag the different sections. When I get the data (via Beautiful Soup), it comes out like this:

['Middlesex UniversityMA HRMMA HRM2012  –  2014', 'Ryerson UniversityBachelor of CommerceBachelor of Commerce1999  –  2003']


['Program Manager, Global Career DevelopmentHult International Business SchoolAugust 2014  –  January 2017 (2 years 6 months)', 'Director, Career ServicesHult International Business SchoolMarch 2012  –  August 2014 (2 years 6 months)', "Training & Development ManagerWalmartOctober 2006  –  February 2011 (4 years 5 months)• Built management's Leadership and Operations capability through the Retail Academy and field training.", 'Co-Owner/DirectorThai DelightFebruary 2003  –  July 2007 (4 years 6 months)• Developed and executed business strategy, marketing and sales initiatives • Managed all financial statements and reporting • Recruited and trained staff on food safety and customer service', 'Assistant Store ManagerWalmartJune 2003  –  October 2006 (3 years 5 months)• Drove profitable sales in a high volume store through the management of people, operations and merchandise.']

So I'm trying to split it with a regex. This is what I have so far, and it's where I'm really stuck:

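import re
# schools is the first list shown above
string = ''.join(schools)
# each match: a run of uppercase letters followed by a run of non-uppercase characters
split = re.findall('[A-Z]+[^A-Z]+', string)
split_string = ''.join(split)
print(split)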

Which gives me this:

['Middlesex ', 'University', 'MA ', 'HRMMA ', 'HRM2012  –  2014',
'Ryerson ', 'University', 'Bachelor of ', 'Commerce', 'Bachelor of ', 'Commerce1999  –  2003']

I'm trying to get to this:

['Middlesex ', 'University', 'MA ', 'HRMMA ', 'HRM', '2012', '2014', 'Ryerson ', 'University', 'Bachelor of ', 'Commerce', 'Bachelor of ', 'Commerce', '1999', '2003']

Or this output:

['Middlesex ', 'University', 'MA ', 'HRMMA ', 'HRM', 'Ryerson ', 'University', 'Bachelor of ', 'Commerce', 'Bachelor of ', 'Commerce']

Can anyone help me? Thanks in advance!

1 Answer:

Answer 0: (score: 2)

A re.findall() solution with a specific regex pattern:

import re

s = "Middlesex UniversityMA HRMMA HRM2012  –  2014', 'Ryerson UniversityBachelor of CommerceBachelor of Commerce1999  –  2003"

result = re.findall(r'([A-Z]{2,}|[A-Z][a-z]+(?: of)?|[0-9]+)', s)
print(result)

Output:

['Middlesex', 'University', 'MA', 'HRMMA', 'HRM', '2012', '2014', 'Ryerson', 'University', 'Bachelor of', 'Commerce', 'Bachelor of', 'Commerce', '1999', '2003']
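  • (...|...|...) - a regex alternation group
  • [A-Z]{2,} - matches a character in the range A (index 65) to Z (index 90), case sensitive, between 2 and unlimited times, as many times as possible
  • [A-Z][a-z]+(?: of)? - matches a single character in the range A to Z, followed by one or more characters in the range a to z and an optional " of" (the preposition)
  • [0-9]+ - matches one or more digits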
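
The question also asks for a second output without the years. A minimal sketch of one way to get both, assuming the same alternation as above: dropping the [0-9]+ branch leaves the digits unmatched. The tokenize helper and its keep_numbers flag are hypothetical names introduced for illustration, and the sketch runs the pattern on each scraped entry separately instead of on one joined string.

```python
import re

# Same alternation as in the answer; the second pattern omits the digit branch.
WITH_NUMBERS = re.compile(r'[A-Z]{2,}|[A-Z][a-z]+(?: of)?|[0-9]+')
WITHOUT_NUMBERS = re.compile(r'[A-Z]{2,}|[A-Z][a-z]+(?: of)?')


def tokenize(entries, keep_numbers=True):
    """Hypothetical helper: tokenize every entry and return one flat list."""
    pattern = WITH_NUMBERS if keep_numbers else WITHOUT_NUMBERS
    tokens = []
    for entry in entries:
        # No capture groups in the pattern, so findall returns whole matches.
        tokens.extend(pattern.findall(entry))
    return tokens


schools = [
    'Middlesex UniversityMA HRMMA HRM2012  –  2014',
    'Ryerson UniversityBachelor of CommerceBachelor of Commerce1999  –  2003',
]

print(tokenize(schools))                      # tokens including the years
print(tokenize(schools, keep_numbers=False))  # tokens without the years
```

Two caveats: the tokens come back without the trailing spaces shown in the question's desired output, and the pattern only matches tokens that start with an uppercase letter or a digit, so lowercase words other than the optional " of" (for example "years" or "months" in the job entries) are dropped.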