How do I convert an nltk tree (Stanford) to Newick format in Python?

Asked: 2016-09-25 20:08:37

Tags: python tree nltk

I have this Stanford parse tree and I want to convert it to Newick format.

    (ROOT
     (S
        (NP (DT A) (NN friend))
        (VP
         (VBZ comes)
         (NP
           (NP (JJ early))
           (, ,)
           (NP
             (NP (NNS others))
             (SBAR
                (WHADVP (WRB when))
                (S (NP (PRP they)) (VP (VBP have) (NP (NN time))))))))))

1 answer:

Answer 0 (score: 1)

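There may be a way to do this with string processing alone, but I would parse the tree and then print it recursively in Newick format. A fairly minimal implementation: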

import re

class Tree(object):
    def __init__(self, label):
        self.label = label
        self.children = []

    @staticmethod
    def _tokenize(string):
        return list(reversed(re.findall(r'\(|\)|[^ \n\t()]+', string)))

    @classmethod
    def from_string(cls, string):
        tokens = cls._tokenize(string)
        return cls._tree(tokens)

    @classmethod
    def _tree(cls, tokens):
        t = tokens.pop()
        if t == '(':
            tree = cls(tokens.pop())
            for subtree in cls._trees(tokens):
                tree.children.append(subtree)
            return tree
        else:
            return cls(t)

    @classmethod
    def _trees(cls, tokens):
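        # Yield sibling subtrees until the matching ')' has been consumed.
        # A plain return ends the generator; raising StopIteration inside a
        # generator becomes a RuntimeError on Python 3.7+ (PEP 479).
        while True:
            if not tokens:
                return
            if tokens[-1] == ')':
                tokens.pop()
                return
            yield cls._tree(tokens)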

    def to_newick(self):
        # A unary node (e.g. a POS tag over a word) collapses into its child.
        if len(self.children) == 1:
            return self.children[0].to_newick()
        elif self.children:
            return '(' + ','.join(child.to_newick() for child in self.children) + ')'
        else:
            # Leaf: emit the word itself.
            return self.label

Note that information is lost in the conversion, of course, because only the terminal nodes (the words) are kept. Usage:

>>> s = """(ROOT (..."""
>>> Tree.from_string(s).to_newick()
...
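
If the loss of the category labels is a problem, one option is to keep them: Newick allows a label after the closing parenthesis of an internal node. The following is only a sketch under a few assumptions, not part of the implementation above. The function names (`parse`, `quote`, `to_newick`, `bracketed_to_newick`) are made up for illustration, the input is assumed to be well-formed, labels containing Newick punctuation (such as the `,` token in the example tree) are wrapped in single quotes, and a terminating `;` is appended.

```python
import re

def parse(tokens):
    """Parse one subtree from a reversed token list into (label, children)."""
    token = tokens.pop()
    if token != '(':
        return (token, [])          # a bare word is a leaf
    label = tokens.pop()            # the category label follows '('
    children = []
    while tokens[-1] != ')':        # assumes balanced parentheses
        children.append(parse(tokens))
    tokens.pop()                    # discard the closing ')'
    return (label, children)

def quote(label):
    """Single-quote a label that contains characters special in Newick."""
    if re.search(r"[\s()\[\]:;,']", label):
        return "'" + label.replace("'", "''") + "'"
    return label

def to_newick(node):
    """Serialize recursively, writing internal labels after the ')'."""
    label, children = node
    if not children:
        return quote(label)
    inner = ','.join(to_newick(child) for child in children)
    return '(' + inner + ')' + quote(label)

def bracketed_to_newick(string):
    """Convert a Penn-style bracketed string to a labelled Newick string."""
    tokens = re.findall(r'\(|\)|[^\s()]+', string)[::-1]
    return to_newick(parse(tokens)) + ';'

print(bracketed_to_newick("(S (NP (DT A) (NN friend)) (VP (VBZ comes)))"))
# (((A)DT,(friend)NN)NP,((comes)VBZ)VP)S;
```

Unlike `to_newick` in the class above, this keeps unary nodes such as `(DT A)` as single-child groups, so the output is longer but the original tree can be reconstructed from it. Whether a given Newick reader accepts single-child nodes and quoted labels depends on the tool.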