Question

I'm trying to analyze some CV data and need to label the different sections. When I scrape the data (with Beautiful Soup), it comes out like this:

['Middlesex UniversityMA HRMMA HRM2012  –  2014', 'Ryerson UniversityBachelor of CommerceBachelor of Commerce1999  –  2003']


['Program Manager, Global Career DevelopmentHult International Business SchoolAugust 2014  –  January 2017 (2 years 6 months)', 'Director, Career ServicesHult International Business SchoolMarch 2012  –  August 2014 (2 years 6 months)', "Training & Development ManagerWalmartOctober 2006  –  February 2011 (4 years 5 months)• Built management's Leadership and Operations capability through the Retail Academy and field training.", 'Co-Owner/DirectorThai DelightFebruary 2003  –  July 2007 (4 years 6 months)• Developed and executed business strategy, marketing and sales initiatives • Managed all financial statements and reporting • Recruited and trained staff on food safety and customer service', 'Assistant Store ManagerWalmartJune 2003  –  October 2006 (3 years 5 months)• Drove profitable sales in a high volume store through the management of people, operations and merchandise.']

So I'm trying to split it with a regular expression. This is what I have so far, and it's where I'm really stuck:

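import re
# schools is the first list shown above
schools = ['Middlesex UniversityMA HRMMA HRM2012  –  2014', 'Ryerson UniversityBachelor of CommerceBachelor of Commerce1999  –  2003']
string = ''.join(schools)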
split = re.findall('[A-Z]+[^A-Z]+', string)
split_string = ''.join(split)
print(split)

That gives me this:

['Middlesex ', 'University', 'MA ', 'HRMMA ', 'HRM2012  –  2014',
'Ryerson ', 'University', 'Bachelor of ', 'Commerce', 'Bachelor of ', 'Commerce1999  –  2003']

I'm trying to get to this:

['Middlesex ', 'University', 'MA ', 'HRMMA ', 'HRM', '2012', '2014', 'Ryerson ', 'University', 'Bachelor of ', 'Commerce', 'Bachelor of ', 'Commerce', '1999', '2003']

Or this output:

['Middlesex ', 'University', 'MA ', 'HRMMA ', 'HRM', 'Ryerson ', 'University', 'Bachelor of ', 'Commerce', 'Bachelor of ', 'Commerce']

Can anyone help me? Thanks in advance!

Answer 1

A re.findall() solution with a specific regex pattern:

import re

s = "Middlesex UniversityMA HRMMA HRM2012  –  2014', 'Ryerson UniversityBachelor of CommerceBachelor of Commerce1999  –  2003"

result = re.findall(r'([A-Z]{2,}|[A-Z][a-z]+(?: of)?|[0-9]+)', s)
print(result)

Output:

['Middlesex', 'University', 'MA', 'HRMMA', 'HRM', '2012', '2014', 'Ryerson', 'University', 'Bachelor of', 'Commerce', 'Bachelor of', 'Commerce', '1999', '2003']
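The call above runs on a single hard-coded string. A minimal sketch of applying the same pattern to each entry of the scraped list follows, assuming the list is named `schools` as in the question; the names `PATTERN` and `tokenize` are illustrative and not part of the original answer. The outer capturing group is dropped here, which does not change what `findall` returns for this pattern.

```python
import re

# Same alternatives as in the answer, compiled once for reuse.
PATTERN = re.compile(r'[A-Z]{2,}|[A-Z][a-z]+(?: of)?|[0-9]+')


def tokenize(entry):
    """Return the tokens found in one scraped entry (hypothetical helper)."""
    return PATTERN.findall(entry)


schools = [
    'Middlesex UniversityMA HRMMA HRM2012  –  2014',
    'Ryerson UniversityBachelor of CommerceBachelor of Commerce1999  –  2003',
]

# One token list per CV entry, so tokens stay grouped by school.
tokens_per_school = [tokenize(entry) for entry in schools]
for tokens in tokens_per_school:
    print(tokens)
```

Keeping one token list per entry may make later labelling easier than joining everything into one string first, since each school's tokens stay together.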

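(...|...|...) - a capturing group containing the regex alternatives
[A-Z]{2,} - matches a character in the range A (code 65) to Z (code 90), case sensitive, two or more times, as many times as possible
[A-Z][a-z]+(?: of)? - matches a single character in the range A to Z, followed by one or more characters in the range a to z and an optional preposition " of"
[0-9]+ - matches one or more digits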
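For the second output requested in the question (no years), one option is to drop the `[0-9]+` alternative so that digit runs are never matched. This is a sketch under that assumption, using the two school entries joined into one string. Note that, as with the answer's pattern, the tokens come back without the trailing spaces shown in the question's desired output.

```python
import re

s = ("Middlesex UniversityMA HRMMA HRM2012  –  2014"
     "Ryerson UniversityBachelor of CommerceBachelor of Commerce1999  –  2003")

# Without the [0-9]+ alternative, digits are skipped between matches.
words_only = re.findall(r'[A-Z]{2,}|[A-Z][a-z]+(?: of)?', s)
print(words_only)
```

The pattern only recognizes capitalized words, all-caps runs and the word "of", so names with other lowercase connectors or internal capitals would likely need additional alternatives.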
