Regular expression to split punctuation from a string

Time: 2018-01-29 21:56:44

Tags: python regex string nlp

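I want to use regular expressions (re.sub() or re.findall()) to separate the punctuation in a Python string with spaces. So "I like dog, and I like cat." should become "I like dog , and I like cat . "

I want to do this for a whole set of punctuation characters (Python's string.punctuation = "!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~"), but I also have a list of specific abbreviations that I do not want touched (say list1 = ["e.g." , "Miss."]). I also do not want to break up runs of punctuation (any two or more punctuation characters in a row, such as ... or ,") or any apostrophes, as in I'm, you're, he's, we're.

So say I have list1 = ["e.g." , "Miss."] and string.punctuation = "!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~". Given the string "I'm a cat, you're a dog, e.g. a cat... really?, non-dog!!", it should become "I'm a cat , you're a dog , e.g. a cat ... really ?, non-dog !! "

Is there a regular expression that splits punctuation from a string while leaving my abbreviation list, runs of punctuation, and apostrophes alone?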

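2 answers:

Answer 0: (score: 1)

The general algorithm is to process the input string from start to end, checking at each position whether the next "word" is in the exception list (if so, copy it through unchanged) or is a punctuation character (if so, put spaces around it).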

This leads to the following function:

def preprocess(string, punctuation, exceptions):
    result = ''
    i = 0
    while i < len(string):
        foundException = False
        # Only look for an exception at the start of a word.
        if i == 0 or not(string[i-1].isalpha()):
            for e in exceptions:
                # Case-insensitive match that must not be followed by a letter.
                if string[i:].lower().startswith(e.lower()) and (i+len(e) == len(string) or not(string[i+len(e)].isalpha())):
                    # Copy the exception through unchanged and skip past it.
                    result += string[i:i+len(e)]
                    i += len(e)
                    foundException = True
                    break
        if not(foundException):
            if string[i] in punctuation:
                # Surround the whole run of punctuation with single spaces.
                result += ' '
                while i < len(string) and string[i] in punctuation:
                    result += string[i]
                    i += 1
                result += ' '
            else:
                result += string[i]
                i += 1

    # Collapse the double spaces created next to spaces already in the input.
    return result.replace('  ', ' ')

When run in a test harness

examples = """
I like dog, and I like cat.
I'm a cat, you're a dog, e.g. a cat... really?, non-dog!!
"""

for line in examples.split('\n'):
    result = preprocess(line, "!\"#$%&'()*+,\\-./:;<=>?@[\\]^_{|}~", ["I'm", "you're", "e.g.", "he's", "we're", "Miss."])
    print(result)

you get the expected result for the first sentence

I like dog , and I like cat . 

but the second sentence has non-dog split apart:

I'm a cat , you're a dog , e.g. a cat ... really ?, non - dog !! 

which shows that your specification is imprecise (unless non-dog is in the exception list; then it behaves as expected).
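Since the question asked for a regular expression, here is a rough sketch of how the same scan might be written as a single re.sub() call. It is not part of the function above: the name split_punctuation is made up for illustration, and it assumes that "word" means a run of ASCII letters and that exceptions are matched case-insensitively, as preprocess does. Exceptions are tried before punctuation runs, so a match on an exception is copied through unchanged.

```python
import re
import string


def split_punctuation(text, punctuation, exceptions):
    """Hypothetical regex counterpart of preprocess(); a sketch, not a drop-in."""
    parts = []
    if exceptions:
        # Longest exceptions first, so a longer entry wins over a shorter prefix.
        alternatives = "|".join(
            re.escape(e) for e in sorted(exceptions, key=len, reverse=True)
        )
        # Assumption: an exception may not touch an ASCII letter on either side.
        parts.append(r"(?<![A-Za-z])(?P<keep>" + alternatives + r")(?![A-Za-z])")
    # A run of one or more punctuation characters.
    parts.append(r"(?P<run>[" + re.escape(punctuation) + r"]+)")
    pattern = re.compile("|".join(parts), re.IGNORECASE)

    def repl(match):
        if match.lastgroup == "keep":
            return match.group(0)            # exception: leave untouched
        return " " + match.group(0) + " "    # punctuation run: pad with spaces

    # Same clean-up as preprocess(): collapse the double spaces that padding creates.
    return pattern.sub(repl, text).replace("  ", " ")


exceptions = ["I'm", "you're", "e.g.", "he's", "we're", "Miss."]
line = "I'm a cat, you're a dog, e.g. a cat... really?, non-dog!!"
print(split_punctuation(line, string.punctuation, exceptions))
print(split_punctuation(line, string.punctuation, exceptions + ["non-dog"]))
```

The first call splits non-dog just like preprocess does; the second shows that adding "non-dog" to the exception list keeps it in one piece.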

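Answer 1: (score: 0)

I would use a regular expression pattern like [\.\,\:\;\?\(\)] to find a list of all the punctuation matches in the string. Then loop over the matches and replace each one with itself, with a space added in front of it.

Example:

import re

data = "this is, the data."
myre = re.compile(r"[\.\,\:\;\?\(\)]")
matches = myre.findall(data)
# Replace each distinct punctuation mark once; str.replace returns a new string.
for m in set(matches):
    data = data.replace(m, " " + m)
print(data)  # this is , the data .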
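As a possible shortcut, the find-then-loop steps can probably be folded into one re.sub() call that uses a backreference to reinsert the matched character. This is only a sketch along the lines of the answer: the function name space_before_punctuation is invented here, and, like the answer, it ignores the abbreviation and apostrophe exceptions from the question.

```python
import re


def space_before_punctuation(text):
    # \1 refers back to the punctuation character captured by the group,
    # so each match is replaced by a space followed by itself.
    return re.sub(r"([.,:;?()])", r" \1", text)


print(space_before_punctuation("this is, the data."))  # this is , the data .
```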