Extracting text from Python RegexpParser output

Time: 2018-05-25 06:27:36

Tags: python regex python-3.x nltk

I am new to NLTK.

This is the code I used:

from nltk import RegexpParser, pos_tag, sent_tokenize

text = "The pizza was 66 and brilliant"
pattern = r"""
P: {<NN>+<VBD>+<CD>+}
"""
for sent in sent_tokenize(text):
    sentence = sent.split()
    PChunker = RegexpParser(pattern)
    output = PChunker.parse(pos_tag(sentence))
    print(output)

The output I got was:

(S The/DT (P pizza/NN was/VBD 66/CD) and/CC brilliant/VB)

The output I need is:

pizza was 66

How can I get this?

1 answer:

Answer 0: (score: 0)

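The output of RegexpParser.parse is a tree, which you can iterate over with tree.subtrees. Try the following, which filters directly for the non-terminal nodes you are interested in (P in your case):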

from nltk import sent_tokenize
from nltk import RegexpParser
from nltk import pos_tag

text = "The pizza was 66 and brilliant"
pattern = r"""
P: {<NN>+<VBD>+<CD>+}
"""
# The chunker does not depend on the sentence, so build it once.
PChunker = RegexpParser(pattern)
for sent in sent_tokenize(text):
    sentence = sent.split()
    output = PChunker.parse(pos_tag(sentence))
    print(output)
    # Visit only the subtrees labelled 'P' and skip the root 'S' node.
    for subtree in output.subtrees(filter=lambda t: t.label() == 'P'):
        print(subtree)
        # Each child of a flat chunk is a (word, tag) pair; keep the words.
        print(' '.join([word for word, tag in subtree]))
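If you need the chunk text in more than one place, the same subtree filtering can be wrapped in a small helper. The following is only a sketch: `chunk_texts` is a hypothetical name, not part of NLTK, and the tagged tokens are written by hand (copied from the output shown in the question) so that the example does not depend on the tagger models being downloaded. It uses `leaves()` rather than indexing the children directly, which should also cope with chunks that contain nested subtrees.

```python
from nltk import RegexpParser


def chunk_texts(tree, label):
    """Return the joined words of every subtree of `tree` labelled `label`.

    Hypothetical helper for illustration; not an NLTK function.
    """
    return [
        " ".join(word for word, tag in subtree.leaves())
        for subtree in tree.subtrees(filter=lambda t: t.label() == label)
    ]


# Assumption: tags are hand-written here, taken from the question's output,
# instead of being produced by pos_tag.
tagged = [
    ("The", "DT"), ("pizza", "NN"), ("was", "VBD"),
    ("66", "CD"), ("and", "CC"), ("brilliant", "VB"),
]

chunker = RegexpParser(r"P: {<NN>+<VBD>+<CD>+}")
tree = chunker.parse(tagged)

result = chunk_texts(tree, "P")
print(result)  # ['pizza was 66']
```

With real input you would pass `pos_tag(sentence)` to `chunker.parse` in place of the hand-written list, as in the answer above.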