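Split a string based on predefined character types

Time: 2018-03-06 16:00:51

Tags: python string parsing split

I have a predefined dictionary that maps characters to types. For example, 'a' is a lowercase letter, '1' is a digit, ')' is a punctuation mark, and so on. With the following script, I label every character in a given string: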

labels=''
for ch in list(example):
    try:
        l = character_type_dict[ch]
        print(l)
        labels = labels+l
    except KeyError:
        labels = labels+'o'
        print('o')
labels

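For example, given "1,234.45kg (in metric system)" as input, the code produces dpdddpddwllwpllwllllllwllllllp as output.

Now I would like to split the string according to these groups. The output should look like this: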

['1',',','234','.','45','kg',' ','(','in',' ','metric',' ','system',')']

That is, it should be split at the character-type boundaries. Any ideas on how to do this efficiently?

5 answers:

Answer 0: (score: 3)

labels is wrong (in your example it is 'dpdddpddwllwpllwllllllwllllllp', but I think it should be 'dpdddpddllwpllwllllllwllllllp').

Either way, you can use (or abuse) itertools.groupby for this.
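
A minimal sketch of how the groupby idea could look, assuming the corrected label string is already available as labels. This is an illustration, not necessarily the exact code of this answer:

```python
from itertools import groupby

example = "1,234.45kg (in metric system)"
# Corrected label string: exactly one label per character of `example`.
labels = "dpdddpddllwpllwllllllwllllllp"

# Pair every character with its label, group consecutive pairs that share
# a label, then glue the characters of each group back together.
pairs = zip(example, labels)
output = ["".join(ch for ch, _ in group)
          for _, group in groupby(pairs, key=lambda pair: pair[1])]
print(output)
```

groupby only merges adjacent items with equal keys, which is exactly the "split at type boundaries" behaviour the question asks for.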

Answer 1: (score: 1)

Treating this as an algorithmic puzzle gives the result:

['1', ',', '234', '.', '45', 'kg', ' ', '(', 'in', ' ', 'metric', ' ', 'system', ')']
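
One possible way to approach the puzzle, sketched under the assumption that a label string of the same length as the text is available. The helper name split_by_labels is hypothetical and is not taken from this answer:

```python
def split_by_labels(text, labels):
    """Cut `text` wherever two neighbouring labels differ."""
    if len(text) != len(labels):
        raise ValueError("text and labels must have the same length")
    pieces = []
    start = 0  # index where the current run begins
    for i in range(1, len(text)):
        if labels[i] != labels[i - 1]:
            # A type boundary lies between positions i-1 and i.
            pieces.append(text[start:i])
            start = i
    if text:
        # Flush the final run (skipped for an empty input).
        pieces.append(text[start:])
    return pieces


example = "1,234.45kg (in metric system)"
labels = "dpdddpddllwpllwllllllwllllllp"
print(split_by_labels(example, labels))
```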

Answer 2: (score: 1)

Remember the class of the previous character:

import string
character_type = {c: "l" for c in string.ascii_letters}
character_type.update({c: "d" for c in string.digits})
character_type.update({c: "p" for c in string.punctuation})
character_type.update({c: "w" for c in string.whitespace})

example = "1,234.45kg (in metric system)"

x = []
prev = None
for ch in example:
    try:
        l = character_type[ch]
        if l == prev:
            x[-1].append(ch)
        else:
            x.append([ch])
    except KeyError:
        print(ch)
    else:
        prev = l
x = map(''.join, x)
print(list(x))
# ['1', ',', '234', '.', '45', 'kg', ' ', '(', 'in', ' ', 'metric', ' ', 'system', ')']
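
Note that the loop above prints and then drops any character missing from character_type, and because prev is left unchanged in that case, the runs on either side of such a character are merged when they share a type. A minimal sketch of a variant that keeps unknown characters under the question's fallback label 'o' could look like this; the function name split_keep_unknown and its parameters are illustrative, not part of the answer:

```python
import string

character_type = {c: "l" for c in string.ascii_letters}
character_type.update({c: "d" for c in string.digits})
character_type.update({c: "p" for c in string.punctuation})
character_type.update({c: "w" for c in string.whitespace})


def split_keep_unknown(text, types=character_type, default="o"):
    """Group consecutive characters of the same type, keeping unknown ones."""
    groups = []
    prev = None
    for ch in text:
        label = types.get(ch, default)  # unknown characters get `default`
        if groups and label == prev:
            groups[-1] += ch            # extend the current run
        else:
            groups.append(ch)           # start a new run
        prev = label
    return groups


print(split_keep_unknown("3µm ok"))
```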

Answer 3: (score: 1)

Another algorithmic approach. Rather than try: except:, it is better to use the dictionary's get(key, default) method.

import string

character_type_dict = {}
for ch in string.ascii_lowercase:
    character_type_dict[ch] = 'l'
for ch in string.digits:
    character_type_dict[ch] = 'd'
for ch in string.punctuation:
    character_type_dict[ch] = 'p'
for ch in string.whitespace:
    character_type_dict[ch] = 'w'

example = "1,234.45kg (in metric system)"

split_list = []
split_start = 0
for i in range(len(example) - 1):
    if character_type_dict.get(example[i], 'o') != character_type_dict.get(example[i + 1], 'o'):
        split_list.append(example[split_start: i + 1])
        split_start = i + 1
split_list.append(example[split_start:])

print(split_list)

Answer 4: (score: 1)

You can compute the labels more concisely (and possibly faster):

labels = ''.join(character_type_dict.get(ch, 'o') for ch in example)

Or, using a helper function:

character_type = lambda ch: character_type_dict.get(ch, 'o')
labels = ''.join(map(character_type, example))

But you don't need the labels to split the string; with the help of itertools.groupby, you can split it directly:

import itertools

splits = list(''.join(g)
              for _, g in itertools.groupby(example, key=character_type))

A possibly more interesting result is a list of tuples pairing each group with its type:

 >>> list((''.join(g), code)
 ...      for code, g in itertools.groupby(example, key=character_type))
 [('1', 'd'), (',', 'p'), ('234', 'd'), ('.', 'p'), ('45', 'd'), ('kg', 'l'),
  (' ', 'w'), ('(', 'p'), ('in', 'l'), (' ', 'w'), ('metric', 'l'), (' ', 'w'),
  ('system', 'l'), (')', 'p')]
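
As a hedged illustration of why the tuples can be handy, the sketch below rebuilds them and then filters by type code. The follow-up step (keeping only letter runs in a list called words) is a hypothetical use and not part of the answer:

```python
import itertools
import string

# Same four classes as in this answer; anything else falls back to 'o'.
character_type_dict = {}
for code, chars in (('w', string.whitespace), ('l', string.ascii_letters),
                    ('d', string.digits), ('p', string.punctuation)):
    for char in chars:
        character_type_dict[char] = code


def character_type(ch):
    return character_type_dict.get(ch, 'o')


example = "1,234.45kg (in metric system)"
tokens = [(''.join(g), code)
          for code, g in itertools.groupby(example, key=character_type)]

# Hypothetical follow-up: keep only the runs of letters.
words = [text for text, code in tokens if code == 'l']
print(words)  # ['kg', 'in', 'metric', 'system']
```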

I computed character_type_dict as follows:

character_type_dict = {}
for code, chars in (('w', string.whitespace),
                    ('l', string.ascii_letters),
                    ('d', string.digits),
                    ('p', string.punctuation)):
  for char in chars: character_type_dict[char] = code

But I could also have done it like this (as occurred to me afterwards):

import string
from collections import ChainMap
character_type_dict = dict(ChainMap(*({c: t for c in getattr(string, n)}
                                      for t, n in (('w', 'whitespace'),
                                                   ('d', 'digits'),
                                                   ('l', 'ascii_letters'),
                                                   ('p', 'punctuation')))))
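
For comparison, a plain nested dict comprehension builds the same mapping without ChainMap. This is a swapped-in alternative rather than the answer's own code, and the name classes is only illustrative:

```python
import string

# (type code, characters belonging to that type)
classes = (('w', string.whitespace),
           ('l', string.ascii_letters),
           ('d', string.digits),
           ('p', string.punctuation))

# One entry per character; the four classes are disjoint, so no key
# is overwritten.
character_type_dict = {char: code for code, chars in classes for char in chars}

print(character_type_dict['a'], character_type_dict['7'],
      character_type_dict[')'], character_type_dict[' '])
```

If the classes did overlap, the two constructions would differ: in a ChainMap the earliest mapping wins, whereas in the comprehension the last assignment wins.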