Question

之前已经多次询问和回答过这个问题。一些示例：[1]，[2]。但似乎没有更普遍的东西。我正在寻找的是一种用逗号分隔字符串的方法，这些逗号不在引号或分隔符对中。例如：

s1 = 'obj<1, 2, 3>, x(4, 5), "msg, with comma"'

应该拆分成三个元素的列表

['obj<1, 2, 3>', 'x(4, 5)', '"msg, with comma"']

现在的问题是，由于我们可以查看成对的<>和()，因此会变得更加复杂。

s2 = 'obj<1, sub<6, 7>, 3>, x(4, y(8, 9), 5), "msg, with comma"'

应分为：

['obj<1, sub<6, 7>, 3>', 'x(4, y(8, 9), 5)', '"msg, with comma"']

不使用正则表达式的天真解决方案是通过查找字符,<(来解析字符串。如果找到<或(，我们就开始计算奇偶校验。如果奇偶校验为零，我们只能以逗号分割。例如，我们想要分割s2，我们可以从parity = 0开始，当我们到达s2[3]时，我们会遇到<，这会使奇偶校验增加1.奇偶校验只会减少当遇到>或)时会遇到<或(。虽然奇偶校验不是0，但我们可以简单地忽略逗号而不进行任何拆分。

这里的问题是，有没有办法快速使用正则表达式？我真的在研究这个solution，但这似乎并没有涵盖我给出的例子。

更通用的功能是这样的：

def split_at(text, delimiter, exceptions):
    """Split text at the specified delimiter if the delimiter is not
    within the exceptions"""

有些用途是这样的：

split_at('obj<1, 2, 3>, x(4, 5), "msg, with comma"', ',', [('<', '>'), ('(', ')'), ('"', '"')]

regex是否能够处理这个问题，还是有必要创建一个专门的解析器？

Answer 1

虽然无法使用正则表达式，但以下简单代码将实现所需的结果：

def split_at(text, delimiter, opens='<([', closes='>)]', quotes='"\''):
    result = []
    buff = ""
    level = 0
    is_quoted = False

    for char in text:
        if char in delimiter and level == 0 and not is_quoted:
            result.append(buff)
            buff = ""
        else:
            buff += char

            if char in opens:
                level += 1
            if char in closes:
                level -= 1
            if char in quotes:
                is_quoted = not is_quoted

    if not buff == "":
        result.append(buff)

    return result

在解释器中运行它：

>>> split_at('obj<1, 2, 3>, x(4, 5), "msg, with comma"', ',')                                                                                                                                 
#=>['obj<1, 2, 3>', ' x(4, 5)', ' "msg with comma"']

Answer 2

使用迭代器和生成器：

def tokenize(txt, delim=',', pairs={'"':'"', '<':'>', '(':')'}):
    fst, snd = set(pairs.keys()), set(pairs.values())
    it = txt.__iter__()

    def loop():
        from collections import defaultdict
        cnt = defaultdict(int)

        while True:
            ch = it.__next__()
            if ch == delim and not any (cnt[x] for x in snd):
                return
            elif ch in fst:
                cnt[pairs[ch]] += 1
            elif ch in snd:
                cnt[ch] -= 1
            yield ch

    while it.__length_hint__():
        yield ''.join(loop())

和

>>> txt = 'obj<1, sub<6, 7>, 3>,x(4, y(8, 9), 5),"msg, with comma"'
>>> [x for x in tokenize(txt)]
['obj<1, sub<6, 7>, 3>', 'x(4, y(8, 9), 5)', '"msg, with comma"']

Answer 3

如果您有递归嵌套表达式，则可以在逗号上拆分并验证它们是否与pyparsing匹配：

import pyparsing as pp

def CommaSplit(txt):
    ''' Replicate the function of str.split(',') but do not split on nested expressions or in quoted strings'''
    com_lok=[]
    comma = pp.Suppress(',')
    # note the location of each comma outside an ignored expression:
    comma.setParseAction(lambda s, lok, toks: com_lok.append(lok))
    ident = pp.Word(pp.alphas+"_", pp.alphanums+"_")  # python identifier
    ex1=(ident+pp.nestedExpr(opener='<', closer='>'))   # Ignore everthing inside nested '< >'
    ex2=(ident+pp.nestedExpr())                       # Ignore everthing inside nested '( )'
    ex3=pp.Regex(r'("|\').*?\1')                      # Ignore everything inside "'" or '"'
    atom = ex1 | ex2 | ex3 | comma
    expr = pp.OneOrMore(atom) + pp.ZeroOrMore(comma  + atom )
    try:
        result=expr.parseString(txt)
    except pp.ParseException:
        return [txt]
    else:    
        return [txt[st:end] for st,end in zip([0]+[e+1 for e in com_lok],com_lok+[len(txt)])]             


tests='''\
obj<1, 2, 3>, x(4, 5), "msg, with comma"
nesteobj<1, sub<6, 7>, 3>, nestedx(4, y(8, 9), 5), "msg, with comma"
nestedobj<1, sub<6, 7>, 3>, nestedx(4, y(8, 9), 5), 'msg, with comma', additional<1, sub<6, 7>, 3>
bare_comma<1, sub(6, 7), 3>, x(4, y(8, 9), 5),  , 'msg, with comma', obj<1, sub<6, 7>, 3>
bad_close<1, sub<6, 7>, 3), x(4, y(8, 9), 5), 'msg, with comma', obj<1, sub<6, 7>, 3)
'''

for te in tests.splitlines():
    result=CommaSplit(te)
    print(te,'==>\n\t',result)

打印：

obj<1, 2, 3>, x(4, 5), "msg, with comma" ==>
     ['obj<1, 2, 3>', ' x(4, 5)', ' "msg, with comma"']
nesteobj<1, sub<6, 7>, 3>, nestedx(4, y(8, 9), 5), "msg, with comma" ==>
     ['nesteobj<1, sub<6, 7>, 3>', ' nestedx(4, y(8, 9), 5)', ' "msg, with comma"']
nestedobj<1, sub<6, 7>, 3>, nestedx(4, y(8, 9), 5), 'msg, with comma', additional<1, sub<6, 7>, 3> ==>
     ['nestedobj<1, sub<6, 7>, 3>', ' nestedx(4, y(8, 9), 5)', " 'msg, with comma'", ' additional<1, sub<6, 7>, 3>']
bare_comma<1, sub(6, 7), 3>, x(4, y(8, 9), 5),  , 'msg, with comma', obj<1, sub<6, 7>, 3> ==>
     ['bare_comma<1, sub(6, 7), 3>', ' x(4, y(8, 9), 5)', '  ', " 'msg, with comma'", ' obj<1, sub<6, 7>, 3>']
bad_close<1, sub<6, 7>, 3), x(4, y(8, 9), 5), 'msg, with comma', obj<1, sub<6, 7>, 3) ==>
     ["bad_close<1, sub<6, 7>, 3), x(4, y(8, 9), 5), 'msg, with comma', obj<1, sub<6, 7>, 3)"]

当前行为与'(something does not split), b, "in quotes", c'.split',')类似，包括保留前导空格和引号。从字段中删除引号和前导空格是微不足道的。

将else下的try更改为：

else:
    rtr = [txt[st:end] for st,end in zip([0]+[e+1 for e in com_lok],com_lok+[len(txt)])]
    if strip_fields:
        rtr=[e.strip().strip('\'"') for e in rtr]
    return rtr

在python中拆分逗号分隔的字符串

3 个答案: