Question

I want to split punctuation off from the words in a Python string by inserting spaces, using a regular expression (re.sub() or re.findall()). So "I like dog, and I like cat." should become "I like dog , and I like cat . "

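I have a string of punctuation marks I want to replace (Python's string.punctuation = !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~), but I also have a list of specific abbreviations that I don't want to replace (say list1 = ["e.g.", "Miss."]). I also don't want to split runs of multiple punctuation marks (any two or more punctuation marks, such as ... or ,") or any apostrophes, as in I'm, you're, he's, we're.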

So say I have list1 = ["e.g.", "Miss."] and string.punctuation as above. Given the string "I'm a cat, you're a dog, e.g. a cat... really?, non-dog!!", the result should be "I'm a cat , you're a dog , e.g. a cat ... really ?, non-dog !! "

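Is there a regular expression that splits punctuation from a string, except for my specific list of abbreviations, runs of multiple punctuation marks, and apostrophes?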

Answer 1

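The general algorithm is to process the input string from start to end, checking whether the next "word" is in the exceptions list (if so, skip over it) or is a punctuation character (if so, add spaces around it).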

This leads to the following function:

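def preprocess(string, punctuation, exceptions):
    # Walk the input once; copy exceptions verbatim and pad punctuation runs with spaces.
    result = ''
    i = 0
    while i < len(string):
        foundException = False
        # Only look for an exception at the start of a word (not in the middle of one).
        if i == 0 or not(string[i-1].isalpha()):
            for e in exceptions:
                # Case-insensitive match that must also end at a word boundary.
                if string[i:].lower().startswith(e.lower()) and (i+len(e) == len(string) or not(string[i+len(e)].isalpha())):
                    result += string[i:i+len(e)]
                    i += len(e)
                    foundException = True
                    break
        if not(foundException):
            if string[i] in punctuation:
                # Keep a run of consecutive punctuation together, with one space on each side.
                result += ' '
                while i < len(string) and string[i] in punctuation:
                    result += string[i]
                    i += 1
                result += ' '
            else:
                result += string[i]
                i += 1

    # Collapse the double spaces created where punctuation was already next to a space.
    return result.replace('  ', ' ')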

When run in this test framework:

examples = """
I like dog, and I like cat.
I'm a cat, you're a dog, e.g. a cat... really?, non-dog!!
"""

# strip() drops the empty first and last lines of the triple-quoted block
for line in examples.strip().split('\n'):
    result = preprocess(line, "!\"#$%&'()*+,\\-./:;<=>?@[\\]^_{|}~", ["I'm", "you're", "e.g.", "he's", "we're", "Miss."])
    print(result)

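The first sentence comes out as expected:

I like dog , and I like cat .

but the second sentence splits non-dog:

I'm a cat , you're a dog , e.g. a cat ... really ?, non - dog !!

which shows that your specification is imprecise (unless non-dog is in the exceptions list; then it behaves as expected).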
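
Since the question asked for a regular expression, the same scan-and-skip idea can probably also be written as a single re.sub() call. The following is only a rough sketch under a few assumptions: the names split_punctuation, keep and punct are made up for illustration, word boundaries are checked against ASCII letters only, and the exceptions list is assumed to be non-empty. Exceptions are tried first in the alternation, so they are copied through unchanged, and anything else that is a run of punctuation gets a space on each side.

```python
import re
import string


def split_punctuation(text, exceptions, punctuation=string.punctuation):
    """Pad runs of punctuation with spaces, leaving the listed exceptions intact.

    Hypothetical helper; assumes `exceptions` contains at least one entry.
    """
    # Try longer exceptions first so a short entry cannot shadow a longer one.
    ordered = sorted(exceptions, key=len, reverse=True)
    keep_alternatives = "|".join(re.escape(e) for e in ordered)
    pattern = re.compile(
        # An exception must start and end at a word boundary (ASCII letters only).
        r"(?<![A-Za-z])(?P<keep>" + keep_alternatives + r")(?![A-Za-z])"
        # Otherwise match one or more consecutive punctuation characters.
        r"|(?P<punct>[" + re.escape(punctuation) + r"]+)",
        re.IGNORECASE,
    )

    def pad(match):
        # Exceptions are returned as they appeared in the text.
        if match.group("keep") is not None:
            return match.group("keep")
        return " " + match.group("punct") + " "

    spaced = pattern.sub(pad, text)
    # Collapse the extra spaces created next to existing whitespace.
    return re.sub(r" {2,}", " ", spaced)


exceptions = ["I'm", "you're", "e.g.", "he's", "we're", "Miss.", "non-dog"]
print(split_punctuation("I'm a cat, you're a dog, e.g. a cat... really?, non-dog!!", exceptions))
```

With non-dog added to the exceptions list as above, the hyphenated word stays in one piece; without it, the hyphen would be padded just like in the loop-based function.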

Answer 2

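I would use a regular expression pattern like [\.\,\:\;\?] to find a list of matches for all the punctuation marks in the string. Then loop over the matches, replacing each one with itself with a space prepended.

Example:

import re

data = "this is, the data."
myre = re.compile(r"[\.\,\:\;\?]")
matches = myre.findall(data)
# str.replace changes every occurrence at once, so visit each distinct mark only once
for mark in set(matches):
    data = data.replace(mark, " " + mark)
print(data)  # this is , the data .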
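
The find-then-replace loop can likely be folded into one substitution with a backreference. This is a minimal sketch, not the answerer's own code: the names PUNCT and space_before_punctuation are hypothetical, and the character class is the same small set of marks used above. Like the loop, it knows nothing about the exceptions list, so an abbreviation such as e.g. is still split.

```python
import re

# Same marks as in the pattern above, captured so the replacement can reuse them.
PUNCT = re.compile(r"([.,:;?])")


def space_before_punctuation(data):
    # \1 is the matched mark; each mark is re-emitted with a space in front of it.
    return PUNCT.sub(r" \1", data)


print(space_before_punctuation("this is, the data."))  # this is , the data .
```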
