I'm trying to analyze some CV data and need to tag its different sections. When I scrape the data (with Beautiful Soup), it comes out like this:
['Middlesex UniversityMA HRMMA HRM2012 – 2014', 'Ryerson UniversityBachelor of CommerceBachelor of Commerce1999 – 2003']
['Program Manager, Global Career DevelopmentHult International Business SchoolAugust 2014 – January 2017 (2 years 6 months)', 'Director, Career ServicesHult International Business SchoolMarch 2012 – August 2014 (2 years 6 months)', "Training & Development ManagerWalmartOctober 2006 – February 2011 (4 years 5 months)• Built management's Leadership and Operations capability through the Retail Academy and field training.", 'Co-Owner/DirectorThai DelightFebruary 2003 – July 2007 (4 years 6 months)• Developed and executed business strategy, marketing and sales initiatives • Managed all financial statements and reporting • Recruited and trained staff on food safety and customer service', 'Assistant Store ManagerWalmartJune 2003 – October 2006 (3 years 5 months)• Drove profitable sales in a high volume store through the management of people, operations and merchandise.']
So I'm trying to split it up with a regular expression. This is what I have so far, and where I'm really stuck:
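import re

# schools is the first list shown above
string = ''.join(schools)
# an uppercase run followed by a run of non-uppercase characters
split = re.findall(r'[A-Z]+[^A-Z]+', string)
split_string = ''.join(split)
print(split)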
This gives me:
['Middlesex ', 'University', 'MA ', 'HRMMA ', 'HRM2012 – 2014',
'Ryerson ', 'University', 'Bachelor of ', 'Commerce', 'Bachelor of ', 'Commerce1999 – 2003']
What I'm trying to get is this:
['Middlesex ', 'University', 'MA ', 'HRMMA ', 'HRM', '2012', '2014', 'Ryerson ', 'University', 'Bachelor of ', 'Commerce', 'Bachelor of ', 'Commerce', '1999', '2003']
Or this output:
['Middlesex ', 'University', 'MA ', 'HRMMA ', 'HRM', 'Ryerson ', 'University', 'Bachelor of ', 'Commerce', 'Bachelor of ', 'Commerce']
Can anyone help me? Thanks in advance!
Answer 0 (score: 2)
A re.findall() solution:
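import re

# The stray ', ' between the two entries matches none of the alternatives, so it is skipped.
s = "Middlesex UniversityMA HRMMA HRM2012 – 2014', 'Ryerson UniversityBachelor of CommerceBachelor of Commerce1999 – 2003"
# acronyms | capitalized words (optionally followed by " of") | runs of digits
result = re.findall(r'([A-Z]{2,}|[A-Z][a-z]+(?: of)?|[0-9]+)', s)
print(result)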
Output:
['Middlesex', 'University', 'MA', 'HRMMA', 'HRM', '2012', '2014', 'Ryerson', 'University', 'Bachelor of', 'Commerce', 'Bachelor of', 'Commerce', '1999', '2003']
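(...|...|...) - a regex alternation group: the first alternative that matches at a given position wins
[A-Z]{2,} - matches a character in the range A (index 65) to Z (index 90), case sensitive, between 2 and unlimited times, as many times as possible
[A-Z][a-z]+(?: of)? - matches a single character in the range A to Z, followed by one or more characters in the range a to z and an optional preposition " of"
[0-9]+ - matches one or more digits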
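To get the second output from the question (the one without years), the digit tokens can be filtered out afterwards. The following is only a sketch under a few assumptions: `tokenize` and `keep_years` are hypothetical names that do not appear in the question, `schools` is assumed to be the first list from the question, and the pattern is the same three alternatives written with `re.VERBOSE` so each one can carry a comment. It also processes each entry separately instead of joining them first, which keeps the two schools apart.

```python
import re

# Assumed input: the first list shown in the question.
schools = [
    'Middlesex UniversityMA HRMMA HRM2012 – 2014',
    'Ryerson UniversityBachelor of CommerceBachelor of Commerce1999 – 2003',
]

# Same three alternatives as above. In VERBOSE mode unescaped whitespace
# is ignored, so the literal space before "of" is written as [ ].
TOKEN = re.compile(r"""
    [A-Z]{2,}                # two or more capitals in a row (MA, HRM)
  | [A-Z][a-z]+(?:[ ]of)?    # capitalized word, optionally followed by ' of'
  | [0-9]+                   # a run of digits (a year)
""", re.VERBOSE)


def tokenize(entry, keep_years=True):
    """Split one scraped entry into tokens (hypothetical helper)."""
    tokens = TOKEN.findall(entry)
    if not keep_years:
        # Drop the purely numeric tokens, i.e. the years.
        tokens = [t for t in tokens if not t.isdigit()]
    return tokens


for entry in schools:
    print(tokenize(entry))
    print(tokenize(entry, keep_years=False))
```

Note that the pattern is tuned to this sample: a degree or school name containing other lowercase connecting words (for example "and" or "in") would be split differently, so the alternatives may need extending for real data.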