Question

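I am trying to use regex to parse resumes. I want to find the section labeled "Education" (or some variant of it) and then use rules to decide where that block ends.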

I currently have a working regex that finds the word "education" and gives me the rest of the document, which I then parse with rules.

Here is my full code, including the rules:

import re

# Headings that mark the end of the education block
headers = ['experience', 'projects', 'work experience', 'skills summary', 'skills/tools']
# resume_paths and getText() are defined elsewhere in my script
for item in resume_paths:
    resume = getText(item)
    resume = resume.replace('\n', ' \n ')
    education = re.findall(r'(?i)\w*Education\w*[^?]+', resume)[0].split('\n')
    paragraph = ''
    # Skip the heading line itself, then collect lines until the next heading
    for line in education[1:]:
        line = line.strip()
        if not line.isupper() and line.lower() not in headers:
            paragraph += line + '\n'
        else:
            break
    print(resume[:15], paragraph)

This is the regex I am using:

(?i)\w*Education\w*[^?]+

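I run into a problem when someone uses the word "education" more than once. I would like the regex to return a list of all matches, each running to the end of the document, and I will then use rules to decide which one is correct. I tried removing the + sign to get multiple matches, but that only gives me the matched words themselves, without the rest of the document.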

Thanks!

Answer 1

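Your regex r'(?i)\w*Education\w*[^?]+' will find 'Education', optionally with extra letters and digits on either side, and then extend the match up to the next question mark. \w does not match whitespace, punctuation, and so on.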

I doubt that's what you want. It will pick up things like:

XYZEducationismallly

but the match will not include the leading word in something like

Relevant Education
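To make that concrete, here is a small sketch that runs the pattern from the question on two made-up strings (both sample strings are my own, not taken from any real resume). It only illustrates where the match starts and how far it runs.

```python
import re

# The pattern from the question, unchanged
pattern = re.compile(r'(?i)\w*Education\w*[^?]+')

# Made-up sample strings, purely for illustration
m1 = pattern.search("XYZEducationismallly and more text")
m2 = pattern.search("Relevant Education\nBS, Some University")

# \w* absorbs the letters glued to "Education", then [^?]+ runs to the end
print(repr(m1.group(0)))  # 'XYZEducationismallly and more text'

# \w* cannot cross the space, so the match starts at "Education"
print(repr(m2.group(0)))  # 'Education\nBS, Some University'
print(m2.start())         # 9
```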

[^?] means anything that is not a '?', but I don't see why you would want to scan to the next question mark (or to the end of the string).

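Also, if there is no '?' anywhere (quite likely), the '+' will take everything up to the end of the entire source string, whereas you probably want to stop at the next heading (if there is one), such as 'Employment History' or whatever. That is also why findall gives you a single match: the first match consumes the rest of the document, so there is nothing left for a second one.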
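If what you want is one candidate per occurrence, each running to the end of the document, one option is to match only the word and slice the text yourself. This is a minimal sketch, not a tested resume parser: the function name education_candidates and the sample text are my own inventions.

```python
import re

def education_candidates(text):
    """Return one tail of `text` per word that contains 'education'."""
    # Match only the word itself, so one match cannot swallow the next
    word = re.compile(r'\w*education\w*', re.IGNORECASE)
    return [text[m.start():] for m in word.finditer(text)]

# Made-up sample resume text
sample = (
    "SUMMARY\n"
    "Passionate about education technology.\n"
    "EDUCATION\n"
    "BS, Example University\n"
    "EXPERIENCE\n"
    "Tutor\n"
)

# The original pattern yields a single match that runs to the end
single = re.findall(r'(?i)\w*Education\w*[^?]+', sample)

# The slicing approach yields one candidate per occurrence
tails = education_candidates(sample)
for tail in tails:
    print(repr(tail.splitlines()[0]))
```

Each element of tails can then go through the same line-by-line rules as before to decide which occurrence is the real section heading.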

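Doing this really well will be hard, because resumes can be converted to text in many different ways. One obvious example: each line of text might represent a single "visual" line of the original, or a whole "paragraph" block, or even a table cell (quite common if the author used a table for layout).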

However, if you are stuck working with the text, a simpler and more straightforward approach might be:

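import re

eduSection = []
inEducationSection = False
# Iterate over lines, not characters (assuming resume is one string, as in the question)
for line in resume.splitlines():
    if re.search(r'\bEducation', line):
        inEducationSection = True
    # Placeholder alternation: list whatever headings can follow the education block
    elif re.search(r'\b(History|Experience|other headingish things)', line):
        inEducationSection = False
    elif inEducationSection:
        eduSection.append(line)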

If you can pin down a bit more precisely what "headings" look like in your data, you will get better results. For example:

* headings might be all caps, or title caps;
* headings might be the only things that start in column 1
* headings might have no punctuation except final ':'
* headings might be really short compared to (most) other lines
* maybe there are only a few dozen distinct headings that show up often.
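
The hints above could be combined into a rough heading test along these lines. Treat it as a sketch under assumptions: the name looks_like_heading, the KNOWN_HEADINGS set, and the 30-character cutoff are all made up for illustration, and you would tune each rule (or drop it) based on your own data.

```python
import re

# Hypothetical list of common headings; extend it from your own resumes
KNOWN_HEADINGS = {
    "education", "experience", "work experience",
    "projects", "skills", "employment history",
}

def looks_like_heading(line, max_len=30):
    """Heuristic guess at whether `line` is a section heading."""
    text = line.strip()
    # Headings are short compared to most other lines
    if not text or len(text) > max_len:
        return False
    # Headings start in column 1 (no leading indentation)
    if line[:1].isspace():
        return False
    # No punctuation except a final ':' (letters, digits, spaces, / & - allowed)
    core = text.rstrip(':').strip()
    if re.search(r"[^\w\s/&-]", core):
        return False
    # A known heading, or all caps, or title caps
    if core.lower() in KNOWN_HEADINGS:
        return True
    return core.isupper() or core.istitle()

print(looks_like_heading("EDUCATION"))
print(looks_like_heading("BS, Example University"))
```

Expect false positives: a short title-case line such as a job title passes this test too, so the list of known headings will probably carry most of the weight.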

The first thing, I'd say, is to work out how to tell when a line is a heading. Once you have that, the rest is pretty easy.
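
As a sketch of "the rest", assuming you already have some heading test: the code below groups lines under the most recent heading. The function split_sections, the all-caps stand-in for the heading test, and the sample text are hypothetical; swap in whatever heading test fits your data.

```python
def split_sections(text, is_heading):
    """Group non-empty lines under the most recent heading line."""
    sections = {}
    current = None
    for line in text.splitlines():
        if is_heading(line):
            # Normalize the heading so 'Education:' and 'EDUCATION' collide
            current = line.strip().rstrip(':').lower()
            sections.setdefault(current, [])
        elif current is not None and line.strip():
            sections[current].append(line.strip())
    return sections

# Stand-in heading test for this example only: all-caps lines
def is_all_caps(line):
    return line.strip().isupper()

# Made-up sample resume text
sample = (
    "JANE DOE\n"
    "jane@example.com\n"
    "EDUCATION\n"
    "BS, Example University\n"
    "EXPERIENCE\n"
    "Tutor, Example School\n"
)

sections = split_sections(sample, is_all_caps)
print(sections.get("education"))
```

With the sections in a dict, picking out the education block is a lookup rather than a regex that has to know where the block ends.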

Matching multiple words until end of document
