Question

After running the Stanford Parser, my output files contain parses in Penn Treebank bracketed format. Each file contains something like the following.

(ROOT
  (S
    (S
      (NP
        (NP (DT The) (JJS strongest) (NN rain))
        (VP
          (ADVP (RB ever))
          (VBN recorded)
          (PP (IN in)
            (NP (NNP India)))))
      (VP
        (VP (VBD shut)
          (PRT (RP down))
          (NP
            (NP (DT the) (JJ financial) (NN hub))
            (PP (IN of)
              (NP (NNP Mumbai)))))
        (, ,)
        (VP (VBD snapped)
          (NP (NN communication) (NNS lines)))
        (, ,)
        (VP (VBD closed)
          (NP (NNS airports)))
        (CC and)
        (VP (VBD forced)
          (NP
            (NP (NNS thousands))
            (PP (IN of)
              (NP (NNS people))))
          (S
            (VP (TO to)
              (VP
                (VP (VB sleep)
                  (PP (IN in)
                    (NP (PRP$ their) (NNS offices))))
                (CC or)
                (VP (VB walk)
                  (NP (NN home))
                  (PP (IN during)
                    (NP (DT the) (NN night))))))))))
    (, ,)
    (NP (NNS officials))
    (VP (VBD said)
      (NP-TMP (NN today)))
    (. .)))

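Now I want to extract all the noun phrases using a bash script. I know there is a way to do this in Java, but I don't know how to make it work by reading the text file into a tree.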

Answer 1

Here is a quick and dirty Awk script that extracts the outermost NP subtrees. If you also want the inner ones, you need a proper recursive solution.

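# Assumes an awk that treats a multi-character RS as a regex (e.g. gawk, mawk);
# each whitespace-separated token is then one record. Reads the parse on stdin.
awk -v RS='[ \t\n]+' '
    # Outermost "(NP": remember the depth at which it opens.
    !np && /^\(NP$/ { np=paren }
    /^\(/ { ++paren }
    # Token with closing parens: pop one level per paren, print when the NP closes.
    /\)/ { b=$0; c=""; while (sub(/\)$/, "", b)) {paren--; c=c ")"
        if (np && paren == np) {
            d=b; gsub(/\)+$/, "", d); print a " " d c; np=0; a=c="" } } }
    # While inside an NP, accumulate its tokens.
    np { a=a (a ? " " : "") $0 }'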
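If the inner NPs are wanted as well, one possible approach is an explicit stack of open-paren positions rather than true recursion. The following is only a sketch and not part of the original answer: the function names `extract_all_np` and `np_words` are made up for illustration, and it assumes the input consists of well-formed bracketed trees in which literal parentheses in the text are escaped (for example as -LRB- and -RRB-). It matches only the plain `NP` label, so `NP-TMP` nodes are skipped.

```bash
# Print every NP subtree (nested ones included), one per line.
# Inner NPs are printed before the NP that contains them.
extract_all_np() {
    awk '
        { text = text " " $0 }                  # slurp the whole input
        END {
            gsub(/[ \t\r\n]+/, " ", text)       # collapse runs of whitespace
            n = length(text)
            for (i = 1; i <= n; i++) {
                ch = substr(text, i, 1)
                if (ch == "(") {
                    start[++depth] = i          # push position of this "("
                } else if (ch == ")" && depth > 0) {
                    s = start[depth--]          # pop the matching "("
                    subtree = substr(text, s, i - s + 1)
                    if (subtree ~ /^\(NP /) print subtree
                }
            }
        }'
}

# Optional helper: strip the labels and parens, leaving only the words.
np_words() {
    sed -E 's/\([^ ()]+ //g; s/\)//g'
}

# Usage (parsed.txt is a hypothetical file name):
#   extract_all_np < parsed.txt
#   extract_all_np < parsed.txt | np_words
```

Because the scan works character by character, it does not depend on a regex RS, at the cost of holding the whole file in memory.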

Answer 2

Stanford provides another tool called Tregex, which operates on parse trees and extracts subtrees based on a query language similar to regular expressions.

http://nlp.stanford.edu/software/tregex.shtml http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/trees/tregex/TregexPattern.html

This tool can be run from the command line.

Extract all noun phrases from Stanford Parser output text files using bash
