Split a string by multiple delimiters?

Time: 2012-05-03 05:58:23

Tags: python

  

Possible duplicate:
  Python: Split string with multiple delimiters

Can I do something similar in Python?

The Split method in VB.NET:

Dim line As String = "Tech ID: xxxxxxxxxx Name: DOE, JOHN Account #: xxxxxxxx"
Dim separators() As String = {"Tech ID:", "Name:", "Account #:"}
Dim result() As String
result = line.Split(separators, StringSplitOptions.RemoveEmptyEntries)

3 answers:

Answer 0 (score: 2)

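Given that this data is poorly formatted, you could try re.split(). Because the pattern is wrapped in a capturing group, the matched field labels are kept in the result: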

>>> import re
>>> mystring = "Field 1: Data 1 Field 2: Data 2 Field 3: Data 3"
>>> a = re.split(r"(Field 1:|Field 2:|Field 3:)", mystring)
>>> a
['', 'Field 1:', ' Data 1 ', 'Field 2:', ' Data 2 ', 'Field 3:', ' Data 3']

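If the data were well formatted, with quoted strings and comma-separated records, your job would be easier: you could then use the csv module to parse the comma-separated values file.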
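
For comparison, here is a minimal sketch of what parsing could look like if the record were comma-separated with quoted fields. The sample line is hypothetical (it is not the asker's actual data); csv.reader and io.StringIO come from the standard library.

```python
import csv
import io

# Hypothetical well-formed record: quoted, comma-separated fields.
sample = '"xxxxxxxxxx","DOE, JOHN","xxxxxxxx"\n'

# csv.reader keeps the comma inside the quoted name field intact.
rows = list(csv.reader(io.StringIO(sample)))
tech_id, name, account = rows[0]
print(rows[0])  # ['xxxxxxxxxx', 'DOE, JOHN', 'xxxxxxxx']
```

Note that the comma in "DOE, JOHN" does not split the field, which is the main thing a plain str.split(",") would get wrong.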

Edit:

You can filter out the empty entries with a list comprehension.

>>> a_non_empty = [s for s in a if s]
>>> a_non_empty
['Field 1:', ' Data 1 ', 'Field 2:', ' Data 2 ', 'Field 3:', ' Data 3']

Answer 1 (score: 1)

>>> import re
>>> line = "Tech ID: xxxxxxxxxx Name: DOE, JOHN Account #: xxxxxxxx"
>>> re.split("Tech ID:|Name:|Account #:", line)
['', ' xxxxxxxxxx ', ' DOE, JOHN ', ' xxxxxxxx']
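
To get closer to VB.NET's StringSplitOptions.RemoveEmptyEntries, one option is to build the pattern from a list of separators with re.escape and drop the empty pieces afterwards. This is only a sketch: the helper name split_on is made up for illustration, and stripping the surrounding whitespace is an extra step that VB.NET's Split does not perform.

```python
import re

def split_on(text, separators):
    """Split text on any of the literal separators, dropping empty pieces."""
    # Try longer separators first so that a separator which is a prefix
    # of another one does not win the alternation.
    ordered = sorted(separators, key=len, reverse=True)
    # re.escape makes every separator match literally.
    pattern = "|".join(re.escape(s) for s in ordered)
    pieces = (piece.strip() for piece in re.split(pattern, text))
    return [piece for piece in pieces if piece]

line = "Tech ID: xxxxxxxxxx Name: DOE, JOHN Account #: xxxxxxxx"
result = split_on(line, ["Tech ID:", "Name:", "Account #:"])
print(result)  # ['xxxxxxxxxx', 'DOE, JOHN', 'xxxxxxxx']
```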

Answer 2 (score: 0)

I'd suggest a different approach:

>>> import re
>>> subject = "Tech ID: xxxxxxxxxx Name: DOE, JOHN Account #: xxxxxxxx"
>>> regex = re.compile(r"(Tech ID|Name|Account #):\s*(.*?)\s*(?=Tech ID:|Name:|Account #:|$)")
>>> dict(regex.findall(subject))
{'Tech ID': 'xxxxxxxxxx', 'Name': 'DOE, JOHN', 'Account #': 'xxxxxxxx'}

This way you get a data structure that suits this kind of data: a dictionary.

The regex, with comments:

regex = re.compile(
    r"""(?x)                         # Verbose regex:
    (Tech\ ID|Name|Account\ \#)      # Match identifier
    :                                # Match a colon
    \s*                              # Match optional whitespace
    (.*?)                            # Match any number of characters, as few as possible
    \s*                              # Match optional whitespace
    (?=                              # Assert that the following can be matched:
     Tech\ ID:|Name:|Account\ \#:    # The next identifier
     |$                              # or the end of the string
    )                                # End of lookahead assertion""")
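
The same idea can be generalized so that the field labels are not written out twice. A sketch, assuming every label is followed by a colon and the record sits on a single line; parse_fields is a hypothetical helper name, not part of any library.

```python
import re

def parse_fields(text, labels):
    """Return a dict mapping each label to the text that follows it."""
    # Escape the labels so characters like '#' and ' ' match literally.
    alternatives = "|".join(re.escape(label) for label in labels)
    regex = re.compile(
        r"({0}):\s*(.*?)\s*(?=(?:{0}):|$)".format(alternatives)
    )
    # findall yields (label, value) tuples, which dict() accepts directly.
    return dict(regex.findall(text))

subject = "Tech ID: xxxxxxxxxx Name: DOE, JOHN Account #: xxxxxxxx"
fields = parse_fields(subject, ["Tech ID", "Name", "Account #"])
print(fields)
```

The lookahead uses a non-capturing group, so findall still returns exactly two groups per match.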