From a Python string

Time: 2017-06-02 05:38:01

Tags: python string list

I am trying to get the hierarchy of sections, subsections, and sub-subsections in a Wikipedia page.

I have a string like this:

mystr = 'a = b = = c = == d == == e == === f === === g === ==== h ==== === i === == j == == k == = l ='

In this case, the page name is 'a' and the structure is as follows:

= b =
= c =
  == d ==
  == e ==
     === f ===
     === g ===
         ==== h ====
     === i ===
  == j ==
  == k ==
= l =

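The number of equals signs indicates whether a heading is a section, a subsection, and so on. I need a Python list containing the full hierarchy of relationships, like this: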

mylist = ['a', 'a/b', 'a/c', 'a/c/d', 'a/c/e', 'a/c/e/f', 'a/c/e/g', 
          'a/c/e/g/h', 'a/c/e/i', 'a/c/j', 'a/c/k', 'a/l']

So far, I have been able to find the sections, subsections, etc. by doing this:

sections = re.findall(r' = (.*?)\ =', mystr)
subsections = re.findall(r' == (.*?)\ ==', mystr)
...

But I don't know how to proceed from here to obtain the desired mylist.

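1 answer:

Answer 0: (score: 0)

You can do it like this:
  - The first function parses your string and yields (level, name) tokens, such as (0, 'a'), (1, 'b').
  - The second one builds the tree from there.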

import re

def tokens(string):
    # The root name doesn't respect the '= name =' convention,
    # so we cut the string on the first " = " and yield the root name
    root_end = string.index(' = ') 
    root, rest = string[:root_end], string[root_end:]
    yield 0, root

    # We use a regex for the remaining tokens, which consist of the following groups:
    # - one or more "=" followed by an optional space,
    # - the name, not containing any "=",
    # - and again, the first group of "=...", matched by the backreference \1

    tokens_re = re.compile(r'(=+ ?)([^=]+)\1')
    # findall will return a list:
    # [('= ', 'b '), ('= ', 'c '), ('== ', 'd '), ('== ', 'e '), ('=== ', 'f '), ...]
    for token in tokens_re.findall(rest):
        level = token[0].count('=')
        name = token[1].strip()
        yield level, name


def tree(token_list):    
    out = []
    # We keep track of the current position in the hierarchy:
    hierarchy = []
    for token in token_list:
        level, name = token
        # We cut the hierarchy below the level of our token
        hierarchy = hierarchy[:level]
        # and append the current one
        hierarchy.append(name)
        out.append('/'.join(hierarchy))
    return out


mystr = 'a = b = = c = == d == == e == === f === === g === ==== h ==== === i === == j == == k == = l ='
out = tree(tokens(mystr))
# Check that this is your expected output
assert out == ['a', 'a/b', 'a/c', 'a/c/d', 'a/c/e', 'a/c/e/f', 'a/c/e/g', 
          'a/c/e/g/h', 'a/c/e/i', 'a/c/j', 'a/c/k', 'a/l']

print(out)
# ['a', 'a/b', 'a/c', 'a/c/d', 'a/c/e', 'a/c/e/f', 'a/c/e/g', 'a/c/e/g/h', 'a/c/e/i', 'a/c/j', 'a/c/k', 'a/l']
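
If you would rather have a single function, the same idea can be sketched more compactly. This is only an illustrative variation, not a replacement for the code above: the name `section_paths` is made up, it assumes (like the code above) that the page name is everything before the first " = ", and it uses `str.partition` instead of `str.index`, so a string with no headings returns just the root instead of raising `ValueError`.

```python
import re


def section_paths(text):
    """Return 'root/section/subsection' paths for a flattened heading string."""
    # Assumption: the page name is everything before the first " = ".
    root, sep, rest = text.partition(' = ')
    rest = sep + rest
    root = root.strip()

    paths = [root]
    stack = [root]  # current position in the hierarchy, root at index 0

    # A heading is a run of "=", a name without "=", and the same run of "=".
    # The (?!=) lookahead stops a short run from matching inside a longer one.
    heading_re = re.compile(r'(=+)\s*([^=]+?)\s*\1(?!=)')
    for match in heading_re.finditer(rest):
        level = len(match.group(1))
        # Drop everything at this level or deeper, then add the new heading.
        del stack[level:]
        stack.append(match.group(2))
        paths.append('/'.join(stack))
    return paths


mystr = 'a = b = = c = == d == == e == === f === === g === ==== h ==== === i === == j == == k == = l ='
print(section_paths(mystr))
```

As in the code above, a heading that skips a level (for example, going from "=" straight to "===") is simply attached to the deepest heading currently open, since truncating the stack past its end is a no-op.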