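Parsing strings with regular expressions

Time: 2016-11-24 02:26:40

Tags: python regex regex-negation

I am struggling to parse some text correctly. The text varies a lot. Ideally I would like to do this in Python, but any language is fine.

Example strings: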

  • "if magic code is 543, and type is 5642, 912342, or 7425, type has to have a period. EX: 02-15-99"
  • "If Magic Code is 722, 43, or 643256 and types is 43234, 5356, and 2112, type has to start with period."
  • "if magic code is 4542 it is not valid in type."
  • "if magic code is 532 and date is within 10 years from current data, and the type is 43, the type must begin with law number."

The result I want:

  • [543] [5642, 912342, 7425][type has to have a period.]
  • [722, 43, 643256][43234, 5356, and 2112][type has to start with period.]
  • [4542][it is not valid in type.]
  • [532][43][the type must begin with law number.]

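There are other variations, but you get the idea. Sorry, I don't know much about regular expressions.

2 answers:

Answer 0: (score: 1)

Well... this does what you asked for. But it is very ugly and very specific to the examples you provided. I suspect it will fail on a real data file.

When faced with this kind of parsing job, one way to approach the problem is to run the input data through some preliminary cleanup, simplifying and rationalizing the text where possible. For example, handling the different flavors of integer lists is annoying and makes the regexes more complicated. If you can remove the unnecessary commas between integers and drop the terminal "or"/"and", the regexes can be much simpler. Once that cleanup is done, you can sometimes apply one or more regexes to extract the bits you need. In some cases the outliers that fail to satisfy the main regexes are few enough to be handled with specific lookups or hard-coded special-case rules.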

import re

lines = [
    "if magic code is 543, and type is 5642, 912342, or 7425, type has to have a period. EX: 02-15-99",
    "If Magic Code is 722, 43, or 643256 and types is 43234, 5356, and 2112, type has to start with period.",
    "if magic code is 4542 it is not valid in type.",
    "if magic code is 532 and date is within 10 years from current data, and the type is 43, the type must begin with law number.",
]

mcs_rgx = re.compile(r'magic code is (\d+ (or|and) \d+|\d+(, \d+)*,? (or|and) \d+|\d+)', re.IGNORECASE)
types_rgx = re.compile(r'types? is (\d+ (or|and) \d+|\d+(, \d+)*,? (or|and) \d+|\d+)', re.IGNORECASE)
rest_rgx1 = re.compile(r'(type (has|must).+)')
rest_rgx2 = re.compile(r'.+\d(.+)')
nums_rgx = re.compile(r'\d+')

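for line in lines:

    # Magic codes: take the matched number list and pull out the integers.
    m = mcs_rgx.search(line)
    if m:
        mcs_text = m.group(1)
        mcs = [int(n) for n in nums_rgx.findall(mcs_text)]
    else:
        mcs = []

    # Types: same approach as for the magic codes.
    m = types_rgx.search(line)
    if m:
        types_text = m.group(1)
        types = [int(n) for n in nums_rgx.findall(types_text)]
    else:
        types = []

    # Rule text: try the specific pattern first, then fall back to
    # everything after the last digit in the line.
    m = rest_rgx1.search(line)
    if m:
        rest = [m.group(1)]
    else:
        m = rest_rgx2.search(line)
        if m:
            rest = [m.group(1)]
        else:
            rest = ['']

    print(mcs, types, rest)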

Output:

[543] [5642, 912342, 7425] ['type has to have a period. EX: 02-15-99']
[722, 43, 643256] [43234, 5356, 2112] ['type has to start with period.']
[4542] [] [' it is not valid in type.']
[532] [43] ['type must begin with law number.']
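
The cleanup idea described in this answer could be sketched roughly as follows. This is only an illustration under stated assumptions, not part of the code above: `normalize` and `extract` are hypothetical helper names, and the two substitutions assume that number lists only ever use commas and a final "or"/"and" between integers, as in the four sample lines.

```python
import re

def normalize(text):
    """Hypothetical cleanup step: '5642, 912342, or 7425' -> '5642 912342 7425'."""
    # Drop a comma that sits between a number and either another number
    # or an "or"/"and" that introduces a number.
    text = re.sub(r'(?<=\d),(?=\s+(?:(?:or|and)\s+)?\d)', '', text,
                  flags=re.IGNORECASE)
    # Replace an "or"/"and" that joins two numbers with a single space.
    text = re.sub(r'(?<=\d)\s+(?:or|and)\s+(?=\d)', ' ', text,
                  flags=re.IGNORECASE)
    return text

# After cleanup, a number list is just integers separated by single spaces,
# so the patterns no longer need alternatives for each list style.
mcs_rgx = re.compile(r'magic code is (\d+(?: \d+)*)', re.IGNORECASE)
types_rgx = re.compile(r'types? is (\d+(?: \d+)*)', re.IGNORECASE)

def extract(line):
    """Return [magic_codes, types] as lists of ints (empty if not found)."""
    cleaned = normalize(line)
    found = []
    for rgx in (mcs_rgx, types_rgx):
        m = rgx.search(cleaned)
        found.append([int(n) for n in m.group(1).split()] if m else [])
    return found
```

The trade-off is that the cleanup rules now carry the assumptions: a sentence where "and" happens to sit between two unrelated numbers would be merged into one list, so real data would still need checking.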

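Answer 1: (score: 0)

Here is a solution with a single regular expression and a bit of cleanup afterwards. It works for all of your examples, but as noted in the comments, if your sentences vary much more than this you should explore options other than regular expressions.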

import re

sentences = ["if magic code is 543, and type is 5642, 912342, or 7425, type has to have a period. EX: 02-15-99",
             "If Magic Code is 722, 43, or 643256 and types is 43234, 5356, and 2112, type has to start with period.",
             "if magic code is 4542 it is not valid in type.",
             "if magic code is 532 and date is within 10 years from current data, and the type is 43, the type must begin with law number."]

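# Group 1: magic codes. Then either (group 2: types, group 3: rule text)
# or, when no "type is" clause exists, group 4: rule text.
pat = r'(?i)^if\smagic\scode\sis\s(\d+(?:,?\s(?:\d+|or))*)(?:.*types?\sis\s(\d+(?:,?\s(?:\d+|or|and))*,)(.*\.)|(.*\.))'

find_ints = lambda s: [int(d) for d in re.findall(r'\d+', s)]

# Keep only the groups that matched (re.match returns None for a sentence
# that does not fit the pattern, which would raise an AttributeError here).
matches = [[g for g in re.match(pat, s).groups() if g] for s in sentences]

# Every group but the last is a number list; the last is the rule text.
results = [[find_ints(m) for m in match[:-1]] + [[match[-1].strip()]] for match in matches]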

If you need something that prints nicely, like in your example:

for r in results:
    print(*r, sep='')
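
Since real sentences may not all fit the pattern, a slightly more defensive variant might collect the non-matching sentences instead of failing on them. The sketch below reuses the same pattern; `parse` and `parse_all` are hypothetical names introduced only for illustration.

```python
import re

# Same pattern as in the answer above, written as a raw string.
pat = r'(?i)^if\smagic\scode\sis\s(\d+(?:,?\s(?:\d+|or))*)(?:.*types?\sis\s(\d+(?:,?\s(?:\d+|or|and))*,)(.*\.)|(.*\.))'

def parse(sentence):
    """Return the parsed lists, or None when the pattern does not match."""
    m = re.match(pat, sentence)
    if m is None:
        return None
    groups = [g for g in m.groups() if g]
    # Every group but the last holds numbers; the last is the rule text.
    numbers = [[int(d) for d in re.findall(r'\d+', g)] for g in groups[:-1]]
    return numbers + [[groups[-1].strip()]]

def parse_all(sentences):
    """Split sentences into parsed results and ones needing other handling."""
    parsed, unmatched = [], []
    for s in sentences:
        result = parse(s)
        if result is None:
            unmatched.append(s)
        else:
            parsed.append(result)
    return parsed, unmatched
```

The `unmatched` list is where special-case rules or a different parsing approach could take over.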