Chunking some text with stanford-nlp

Time: 2011-11-28 17:35:59

Tags: stanford-nlp

I am using Stanford CoreNLP, and I use this line to load some modules to process my text:

props.put("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");

Is there a module I can load to chunk the text?

Or does anyone have a suggestion for an alternative way to chunk some text with Stanford CoreNLP?

Thanks

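4 answers:

Answer 0: (score: 5)

I think the parser output can be used to get NP chunks. Take a look at the context-free representation on the Stanford Parser website, which provides example output.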

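Answer 1: (score: 5)

For chunking alongside Stanford NLP, you can use the following packages:

  • YamCha: SVM-based NP chunker, also usable for POS tagging, NER, etc. C/C++, open source. Won the CoNLL 2000 shared task. (Less automatic for an end user than a specialized POS tagger.)
  • Mark Greenwood's Noun Phrase Chunker: a Java reimplementation of Ramshaw and Marcus (1995).
  • fnTBL: a fast and flexible implementation of Transformation-Based Learning in C++. Includes a POS tagger, as well as NP chunking and general chunking models.

Source: http://www-nlp.stanford.edu/links/statnlp.html#NPchunk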

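Answer 2: (score: 0)

What you need is the output of constituency parsing in CoreNLP, which gives you information about the chunks, such as verb phrases (VPs), noun phrases (NPs), and so on. As far as I know, there is no method in CoreNLP that gives you a list of chunks. This means you have to parse the actual output of the constituency parse to extract the chunks.

For example, this is the output of the CoreNLP constituency parser for a sample sentence: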

(ROOT (S ("" "") (NP (NNP Anarchism)) (VP (VBZ is) (NP (NP (DT a) (JJ political) (NN philosophy)) (SBAR (WHNP (WDT that)) (S (VP (VBZ advocates) (NP (NP (JJ self-governed) (NNS societies)) (VP (VBN based) (PP (IN on) (NP (JJ voluntary) (, ,) (JJ cooperative) (NNS institutions))))))))) (, ,) (S (VP (VBG rejecting) (NP (JJ unjust) (NN hierarchy))))) (. .)))

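As you can see, the string contains NP and VP labels, so you now have to extract the actual text of the chunks by parsing this string. Let me know if you find a method that returns the list of chunks directly.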
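As a minimal sketch of the string parsing described above: the code below reads a Penn Treebank style bracketed string into nested tuples and collects the text under every node with a wanted label. It assumes the input is well formed and that every opening bracket is followed by a label. The names parse_bracketed, leaves and extract_chunks are made up for this illustration and are not part of CoreNLP, and the sample string is a shortened, hand-written parse rather than real parser output.

```python
import re


def parse_bracketed(text):
    """Parse a bracketed parse string into nested (label, children) tuples.

    Leaves (the words) are kept as plain strings.
    Assumes well-formed input where each "(" is followed by a label.
    """
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read_node():
        nonlocal pos
        pos += 1                      # consume "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1                      # consume ")"
        return (label, children)

    return read_node()


def leaves(node):
    """Return the words under a node, left to right."""
    if isinstance(node, str):
        return [node]
    words = []
    for child in node[1]:
        words.extend(leaves(child))
    return words


def extract_chunks(node, labels):
    """Collect the text of every node whose label is in `labels` (pre-order)."""
    if isinstance(node, str):
        return []
    label, children = node
    found = [" ".join(leaves(node))] if label in labels else []
    for child in children:
        found.extend(extract_chunks(child, labels))
    return found


# Shortened, hand-written parse used only for illustration.
sample = ("(ROOT (S (NP (NNP Anarchism)) (VP (VBZ is) "
          "(NP (DT a) (JJ political) (NN philosophy))) (. .)))")
tree = parse_bracketed(sample)
print(extract_chunks(tree, {"NP"}))  # ['Anarchism', 'a political philosophy']
print(extract_chunks(tree, {"VP"}))  # ['is a political philosophy']
```

Because nested phrases are reported as well as their parents, a longer sentence will produce overlapping chunks; whether that is desirable depends on what the chunks are used for.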

Answer 3: (score: 0)

Expanding on Pedram's answer, the following code can be used:

from nltk.parse.corenlp import CoreNLPParser
nlp = CoreNLPParser('http://localhost:9000')  # Assuming CoreNLP server is running locally at port 9000


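def extract_phrase(trees, labels):
    """Return the text of every subtree whose label is in `labels`."""
    phrases = []
    for tree in trees:
        for subtree in tree.subtrees():
            if subtree.label() in labels:
                phrases.append(' '.join(subtree.leaves()))
    return phrases


def get_chunks(sentence):
    # raw_parse returns an iterator of parses; take the first one (a tree rooted at ROOT).
    trees = next(nlp.raw_parse(sentence))
    # Iterating over the ROOT tree yields its children, which extract_phrase then walks.
    nps = extract_phrase(trees, ['NP', 'CC'])
    vps = extract_phrase(trees, ['VP'])
    return trees, nps, vps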


if __name__ == '__main__':
    dialog = [
        "Anarchism is a political philosophy that advocates self-governed societies based on voluntary cooperative institutions rejecting unjust hierarchy"
    ]
    for sentence in dialog:
        trees, nps, vps = get_chunks(sentence)
        print("\n\n")
        print("Sentence: ", sentence)
        print("Tree:\n", trees)
        print("Noun Phrases: ", nps)
        print("Verb Phrases: ", vps)

"""
Sentence:  Anarchism is a political philosophy that advocates self-governed societies based on voluntary cooperative institutions rejecting unjust hierarchy
Tree:
 (ROOT
  (S
    (NP (NN Anarchism))
    (VP
      (VBZ is)
      (NP
        (NP (DT a) (JJ political) (NN philosophy))
        (SBAR
          (WHNP (WDT that))
          (S
            (VP
              (VBZ advocates)
              (NP
                (ADJP (NN self) (HYPH -) (VBN governed))
                (NNS societies))
              (PP
                (VBN based)
                (PP
                  (IN on)
                  (NP
                    (NP
                      (JJ voluntary)
                      (JJ cooperative)
                      (NNS institutions))
                    (VP
                      (VBG rejecting)
                      (NP (JJ unjust) (NN hierarchy)))))))))))))
Noun Phrases:  ['Anarchism', 'a political philosophy that advocates self - governed societies based on voluntary cooperative institutions rejecting unjust hierarchy', 'a political philosophy', 'self - governed societies', 'voluntary cooperative institutions rejecting unjust hierarchy', 'voluntary cooperative institutions', 'unjust hierarchy']
Verb Phrases:  ['is a political philosophy that advocates self - governed societies based on voluntary cooperative institutions rejecting unjust hierarchy', 'advocates self - governed societies based on voluntary cooperative institutions rejecting unjust hierarchy', 'rejecting unjust hierarchy']

"""
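
Note that the lists above contain nested phrases, so the same words show up in several entries. If non-overlapping noun chunks are wanted instead, one option is to keep only the innermost NPs, that is, NPs with no other NP below them. Here is a rough sketch of that idea. It assumes the parse has already been converted into nested (label, children) tuples, which is a representation chosen for this example and not an NLTK or CoreNLP data structure; base_chunks and leaves are hypothetical helper names, and the tree is a small hand-built fragment.

```python
def leaves(node):
    """Return the words under a node, left to right."""
    if isinstance(node, str):
        return [node]
    words = []
    for child in node[1]:
        words.extend(leaves(child))
    return words


def base_chunks(node, label="NP"):
    """Return the text of innermost nodes with the given label.

    A node is reported only if none of its descendants carries the same label.
    """
    if isinstance(node, str):
        return []
    node_label, children = node
    nested = []
    for child in children:
        nested.extend(base_chunks(child, label))
    if node_label == label and not nested:
        return [" ".join(leaves(node))]
    return nested


# Hand-built fragment for illustration only (not real parser output).
tree = ("NP", [
    ("NP", [("DT", ["a"]), ("JJ", ["political"]), ("NN", ["philosophy"])]),
    ("SBAR", [
        ("WHNP", [("WDT", ["that"])]),
        ("S", [("VP", [
            ("VBZ", ["rejects"]),
            ("NP", [("JJ", ["unjust"]), ("NN", ["hierarchy"])]),
        ])]),
    ]),
])

print(base_chunks(tree))  # ['a political philosophy', 'unjust hierarchy']
```

The comparison is on the exact label, so WHNP is not treated as an NP here. The same filter could be written against NLTK trees by checking each NP subtree for nested NP subtrees.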