Regular expression to extract titles from text

Time: 2016-09-28 15:04:09

Tags: python regex python-2.7 text-extraction

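Can anyone help with a regular expression that extracts the phrase following "Title:" from the text below? (I bolded the text just to show clearly which parts should be extracted.)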

Title: Anorectal Fistula (Fistula-in-Ano) Procedure Code(s): 



Effective date: 7/1/07

Title:

2003247 

or previous effective dates) 



Title: 

ST2 Assay for Chronic Heart Failure 

Description/Background 

Heart Failure 

HF is one among many cardiovascular diseases that comprises a major cause of morbidity 
and mortality worldwide. The term “heart failure” (HF) refers to a complex clinical syndrome .

I am using the regex: (?:Title: \n+(.*))|(?:Title:\n+(.*))|(?<=Title: )(.*)(?=Procedure)

However, it does not seem to capture these phrases correctly. I am using Python 2.7.12.

1 answer:

Answer 0: (score: 0)

I suggest using

 Title:\s*(.*?)\s*Procedure|Title:\s*(.*)

See the regex demo.

Details

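  • Title: - the literal string Title:
  • \s* - 0+ whitespace characters (line breaks included)
  • (.*?) - Group 1: any 0+ characters other than line break characters, as few as possible, up to the first occurrence of the subpattern that follows
  • \s*Procedure - 0+ whitespace characters followed by the string Procedure
  • | - or
  • Title:\s* - the string Title: followed by 0+ whitespace characters
  • (.*) - Group 2: any 0+ characters other than line break characters, as many as possible (i.e. the rest of the line).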
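The breakdown above can also be kept next to the pattern itself. The following is a minimal sketch, not part of the original answer: it writes the same expression with `re.VERBOSE` so that each piece carries its own comment. The name `TITLE_RE` is an assumption chosen for illustration.

```python
import re

# Same two-branch pattern as above; re.VERBOSE ignores the layout whitespace
# and treats everything after "#" as a comment.
TITLE_RE = re.compile(r"""
    Title:\s*      # literal Title: then optional whitespace, newlines included
    (.*?)          # group 1: as few non-newline characters as possible
    \s*Procedure   # optional whitespace, then the literal word Procedure
  |
    Title:\s*      # fallback branch: literal Title: and optional whitespace
    (.*)           # group 2: the rest of the line
""", re.VERBOSE)

for text in ("Title: Foo Bar Procedure Code(s):", "Title:\n\n2003247\n\nor previous"):
    m = TITLE_RE.search(text)
    # Exactly one of the two groups is filled for each match.
    print(m.groups())
```

Only one branch of the alternation can match at a time, so each match leaves the other group as `None`.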

Python code

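# -*- coding: utf-8 -*-
# The encoding line is needed on Python 2 because test_str contains non-ASCII quotes.
import re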
regex = r"Title:\s*(.*?)\s*Procedure|Title:\s*(.*)"
test_str = ("Title: Anorectal Fistula (Fistula-in-Ano) Procedure Code(s):\n\n"
    "Effective date: 7/1/07\n\n"
    "Title:\n\n"
    "2003247\n\n"
    "or previous effective dates)\n\n"
    "Title:\n\n"
    "ST2 Assay for Chronic Heart Failure\n\n"
    "Description/Background\n\n"
    "Heart Failure\n\n"
    "HF is one among many cardiovascular diseases that comprises a major cause of morbidity and mortality worldwide. The term “heart failure” (HF) refers to a complex clinical syndrome .")
res = []
for m in re.finditer(regex, test_str):
    # Group 1 is filled when "Procedure" follows the title; otherwise fall back to group 2.
    if m.group(1):
        res.append(m.group(1))
    else:
        res.append(m.group(2))
print(res)
# => ['Anorectal Fistula (Fistula-in-Ano)', '2003247', 'ST2 Assay for Chronic Heart Failure']
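One caveat with group 2: `(.*)` runs to the end of the line, so if the source text has trailing spaces after a title (as the sample pasted in the question appears to), they end up in the capture. Here is a minimal sketch of one way to handle that, assuming a hypothetical helper named `extract_titles` (not part of the original answer) that picks whichever group matched and strips surrounding whitespace:

```python
import re

TITLE_RE = re.compile(r"Title:\s*(.*?)\s*Procedure|Title:\s*(.*)")

def extract_titles(text):
    """Return every title found in text, with surrounding whitespace removed."""
    titles = []
    for m in TITLE_RE.finditer(text):
        # Group 1 is set when "Procedure" follows the title; otherwise group 2 is.
        title = m.group(1) if m.group(1) is not None else m.group(2)
        titles.append(title.strip())
    return titles

# Note the trailing spaces after "Title:" and after the title line.
sample = "Title: \n\nST2 Assay for Chronic Heart Failure \n\nDescription/Background \n"
print(extract_titles(sample))
# => ['ST2 Assay for Chronic Heart Failure']
```

Testing with `is not None` rather than truthiness keeps an empty group 1 match from falling through to group 2, which would be `None` in that case.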