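Splitting a parenthesized "nested" string into a nested list

Asked: 2014-12-22 23:55:32

Tags: python list recursion

I have a string representation of a tree and I want to convert it into a nested list. Is there a way to do this recursively, so that I end up with nested lists?

The example string looks like this: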

(TOP (S (NP (PRP I)) (VP (VBP need) (NP (NP (DT a) (NN flight)) (PP
(IN from) (NP (NNP Atlanta))) (PP (TO to) (NP (NP (NNP Charlotte)) (NP
(NNP North) (NNP Carolina)))) (NP (JJ next) (NNP Monday))))))

This is what I have so far, but it does not give me what I want.

import sys
import re

for tree_str in sys.stdin:
    print([", ".join(x.split()) for x in re.split(r'[()]', tree_str) if x.strip()])
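
To illustrate the problem, here is a minimal sketch that applies the same expression to a short fragment. The helper name `flatten_attempt` and the fragment are hypothetical and used only for illustration. Splitting on parentheses throws the structure away, so the result is a flat list:

```python
import re


def flatten_attempt(tree_str):
    # Same expression as in the loop above: split on parentheses,
    # drop empty pieces, and join the words of each piece with ", ".
    return [", ".join(x.split()) for x in re.split(r"[()]", tree_str) if x.strip()]


print(flatten_attempt("(NP (PRP I))"))
# prints ['NP', 'PRP, I'] rather than the desired ['NP', ['PRP', 'I']]
```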

4 answers:

Answer 0: (score: 2)

My approach looks like this:

import re


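def make_tree(data):
    # Tokenize into "(", ")" and words.
    items = re.findall(r"\(|\)|\w+", data)

    def req(index):
        # Parse from items[index] up to the matching ")".
        # Returns the subtree and the index of that ")".
        result = []
        item = items[index]
        while item != ")":
            if item == "(":
                # Recurse into the nested list; index comes back
                # pointing at the ")" that closes it.
                subtree, index = req(index + 1)
                result.append(subtree)
            else:
                result.append(item)
            index += 1
            item = items[index]
        return result, index

    # Start at 1 to skip the outermost "(".
    return req(1)[0]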


string = "(TOP (S (NP (PRP I))..." # omitted for readability
tree = make_tree(string)

print(tree)
# Output: ['TOP', ['S', ['NP', ['PRP', 'I']]...
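
One thing to watch: `\w+` matches only letters, digits and underscores, so a label or word containing other characters (a hyphen, for example) would be split up. A possible variation is sketched below, assuming a token is simply any run of characters that is neither whitespace nor a parenthesis. The name `make_tree_any_token` is hypothetical, and this version raises a `ValueError` on unbalanced input instead of running off the end of the token list.

```python
import re


def make_tree_any_token(data):
    # Assumption: a token is "(", ")", or any run of characters that is
    # neither whitespace nor a parenthesis.
    items = re.findall(r"\(|\)|[^\s()]+", data)

    def req(index):
        # Parse items starting just after an opening parenthesis; return
        # the subtree and the index of its closing parenthesis.
        result = []
        while index < len(items) and items[index] != ")":
            if items[index] == "(":
                subtree, index = req(index + 1)
                result.append(subtree)
            else:
                result.append(items[index])
            index += 1
        if index >= len(items):
            raise ValueError("unbalanced parentheses")
        return result, index

    if not items or items[0] != "(":
        raise ValueError("expected '(' at the start")
    return req(1)[0]


print(make_tree_any_token("(NP (NNP New-York) (PRP$ my))"))
# prints ['NP', ['NNP', 'New-York'], ['PRP$', 'my']]
```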

Answer 1: (score: 1)

A bit hacky, but it does the trick anyway :) You definitely end up with your nested list.

import re
import ast

# named "text" rather than "input" to avoid shadowing the built-in input()
text = "(TOP (S (NP (PRP I)) (VP (VBP need) (NP (NP (DT a) (NN flight)) (PP (IN from) (NP (NNP Atlanta))) (PP (TO to) (NP (NP (NNP Charlotte)) (NP (NNP North) (NNP Carolina)))) (NP (JJ next) (NNP Monday))))))"

# replaces all parentheses with square brackets
# and adds commas between adjacent lists
text = text.replace("(", "[")\
           .replace(")", "]")\
           .replace("] [", "], [")

# wraps every word in double quotes
# and appends a comma after each
text = re.sub(r'(\w+)', r'"\1",', text)

# safely evaluates the resulting string
output = ast.literal_eval(text)

print(output)
print(type(output))

# ['TOP', ['S', ['NP', ['PRP', 'I']], ['VP', ['VBP', 'need'], ['NP', ['NP', ['DT', 'a'], ['NN', 'flight']], ['PP', ['IN', 'from'], ['NP', ['NNP', 'Atlanta']]], ['PP', ['TO', 'to'], ['NP', ['NP', ['NNP', 'Charlotte']], ['NP', ['NNP', 'North'], ['NNP', 'Carolina']]]], ['NP', ['JJ', 'next'], ['NNP', 'Monday']]]]]]
# <class 'list'>

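Note: for safety reasons, ast.literal_eval() only accepts literal structures and raises an error if the expression contains function calls, names or other logic, so you can use it without first checking the string for malicious code.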
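
As a small illustration of that note, the sketch below wraps `ast.literal_eval()` in a hypothetical helper, `safe_parse`, that returns `None` when the string is not a plain literal. The helper and the sample strings are assumptions made for this example only.

```python
import ast


def safe_parse(source):
    # Return the evaluated literal, or None if the string is not a pure literal.
    try:
        return ast.literal_eval(source)
    except (ValueError, SyntaxError):
        return None


print(safe_parse('["TOP", ["S", "I",],]'))      # a plain nested-list literal
print(safe_parse('__import__("os").getcwd()'))  # a function call: rejected
```

The first call returns the nested list; the second returns `None` because a function call is not a literal.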

Answer 2: (score: 0)

Writing a simple parser for S-expressions is not hard:

import pprint
import re

pattern = r'''
    (?P<open_paren> \( ) |
    (?P<close_paren> \) ) |
    (?P<word> \w+) |
    (?P<whitespace> \s+) |
    (?P<eof> $) |
    (?P<error> \S)
'''

scan = re.compile(pattern=pattern, flags=re.VERBOSE).finditer

text = '''
(TOP (S (NP (PRP I)) (VP (VBP need) (NP (NP (DT a) (NN flight))
 (PP (IN from) (NP (NNP Atlanta))) (PP (TO to) (NP (NP (NNP Charlotte))
 (NP (NNP North) (NNP Carolina)))) (NP (JJ next) (NNP Monday))))))
'''

ERR_MSG = 'input string kaputt!!'

stack = [[]]

for match in scan(text):
    token_type = match.lastgroup
    token = match.group(0)
    if token_type == 'open_paren':
        stack.append([])
    elif token_type == 'close_paren':
        top = stack.pop()
        stack[-1].append(top)
    elif token_type == 'word':
        stack[-1].append(token)
    elif token_type == 'whitespace':
        pass
    elif token_type == 'eof':
        break
    else:
        raise Exception(ERR_MSG)

if 1 == len(stack) == len(stack[0]):
    pprint.pprint(stack[0][0])
else:
    raise Exception(ERR_MSG)

Result:

['TOP',
 ['S',
  ['NP', ['PRP', 'I']],
  ['VP',
   ['VBP', 'need'],
   ['NP',
    ['NP', ['DT', 'a'], ['NN', 'flight']],
    ['PP', ['IN', 'from'], ['NP', ['NNP', 'Atlanta']]],
    ['PP',
     ['TO', 'to'],
     ['NP',
      ['NP', ['NNP', 'Charlotte']],
      ['NP', ['NNP', 'North'], ['NNP', 'Carolina']]]],
    ['NP', ['JJ', 'next'], ['NNP', 'Monday']]]]]]
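
If the parser is needed in more than one place, the same stack-based loop can be wrapped in a function. The following is only a sketch under a few assumptions: the name `parse_sexpr` is hypothetical, the explicit end-of-input group is dropped because `finditer` simply stops at the end, and errors are reported as `ValueError` with a more specific message.

```python
import re

# Same token classes as above, minus the explicit end-of-input group.
SCANNER = re.compile(r"""
    (?P<open_paren> \( ) |
    (?P<close_paren> \) ) |
    (?P<word> \w+ ) |
    (?P<whitespace> \s+ ) |
    (?P<error> \S )
""", re.VERBOSE)


def parse_sexpr(text):
    stack = [[]]
    for match in SCANNER.finditer(text):
        kind = match.lastgroup
        if kind == "open_paren":
            stack.append([])
        elif kind == "close_paren":
            if len(stack) == 1:
                raise ValueError("unexpected ')'")
            top = stack.pop()
            stack[-1].append(top)
        elif kind == "word":
            stack[-1].append(match.group(0))
        elif kind == "error":
            raise ValueError("unexpected character %r" % match.group(0))
        # whitespace is skipped
    if len(stack) != 1 or len(stack[0]) != 1:
        raise ValueError("unbalanced input or not exactly one top-level item")
    return stack[0][0]


print(parse_sexpr("(NP (DT a) (NN flight))"))
# prints ['NP', ['DT', 'a'], ['NN', 'flight']]
```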

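Answer 3: (score: -1)

This is called "parsing". One parser generator for Python appears to be Yapps. The Yapps documentation even shows how to write a Lisp parser, of which your application seems to need only a subset.

The subset you need seems to be: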

parser Sublisp:
    ignore:      '\\s+'
    token ID:    '[-+*/!@%^&=.a-zA-Z0-9_]+' 

    rule expr:   ID     {{ return ('id', ID) }}
               | list   {{ return list }}
    rule list: "\\("    {{ result = [] }} 
               ( expr   {{ result.append(expr) }}
               )*  
               "\\)"    {{ return result }}

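Once compiled, this parses your string into a tree of nested lists whose leaves are tuples of the form ('id', 'FOO'). To get the tree in the form you want, you can either modify the generated Python code (it is quite readable) or transform the tree afterwards.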
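
Transforming the tree afterwards could look roughly like the sketch below. It assumes the parser output has the shape the grammar suggests (nested lists with `('id', NAME)` tuples as leaves); the function name `strip_ids` and the sample input are hypothetical, and the sample was written by hand rather than produced by Yapps.

```python
def strip_ids(node):
    # A leaf is assumed to be a tuple such as ('id', 'FOO'); keep only the name.
    if isinstance(node, tuple):
        return node[1]
    # Anything else is assumed to be a list of child nodes.
    return [strip_ids(child) for child in node]


# Hand-written sample in the assumed output shape.
parsed = [('id', 'NP'), [('id', 'DT'), ('id', 'a')], [('id', 'NN'), ('id', 'flight')]]
print(strip_ids(parsed))
# prints ['NP', ['DT', 'a'], ['NN', 'flight']]
```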