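I am using the Stanford Parser to parse Chinese text. I want to extract context-free grammar production rules from the input Chinese text.
My environment is set up with the Stanford Parser and NLTK.
My code is as follows: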
from nltk.parse import stanford
parser = stanford.StanfordParser(
    path_to_jar='/home/stanford-parser-full-2013-11-12/stanford-parser.jar',
    path_to_models_jar='/home/stanford-parser-full-2013-11-12/stanford-parser-3.3.0-models.jar',
    model_path="/home/stanford-parser-full-2013-11-12/chinesePCFG.ser.gz",
    encoding='utf8')
text = '我 对 这个 游戏 有 一 点 上瘾。'
sentences = parser.raw_parse_sents(unicode(text, encoding='utf8'))
However, when I run
print sentences
I get
[Tree('ROOT', [Tree('IP', [Tree('NP', [Tree('PN', ['\u6211'])])])]), Tree('ROOT', [Tree('IP', [Tree('VP', [Tree('VA', ['\u5bf9'])])])]), Tree('ROOT', [Tree('IP', [Tree('NP', [Tree('PN', ['\u8fd9'])])])]), Tree('ROOT', [Tree('IP', [Tree('VP', [Tree('QP', [Tree('CLP', [Tree('M', ['\u4e2a'])])])])])]), Tree('ROOT', [Tree('IP', [Tree('VP', [Tree('VV', ['\u6e38'])])])]), Tree('ROOT', [Tree('FRAG', [Tree('NP', [Tree('NN', ['\u620f'])])])]), Tree('ROOT', [Tree('IP', [Tree('VP', [Tree('VE', ['\u6709'])])])]), Tree('ROOT', [Tree('FRAG', [Tree('QP', [Tree('CD', ['\u4e00'])])])]), Tree('ROOT', [Tree('IP', [Tree('VP', [Tree('VV', ['\u70b9'])])])]), Tree('ROOT', [Tree('IP', [Tree('VP', [Tree('VV', ['\u4e0a'])])])]), Tree('ROOT', [Tree('FRAG', [Tree('NP', [Tree('NN', ['\u763e'])])])]), Tree('ROOT', [Tree('IP', [Tree('NP', [Tree('PU', ['\u3002'])])])])]
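Here the Chinese words have been split into individual characters. There should be 9 subtrees, but 12 were returned. Can anyone tell me what the problem is?
Continuing, I tried to collect all the context-free grammar production rules from the result.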
lst = []
for subtree in sentences:
    for production in subtree.productions():
        lst.append(production)
print lst
[ROOT -> IP, IP -> NP, NP -> PN, PN -> '\u6211', ROOT -> IP, IP -> VP, VP -> VA, VA -> '\u5bf9', ROOT -> IP, IP -> NP, NP -> PN, PN -> '\u8fd9', ROOT -> IP, IP -> VP, VP -> QP, QP -> CLP, CLP -> M, M -> '\u4e2a', ROOT -> IP, IP -> VP, VP -> VV, VV -> '\u6e38', ROOT -> FRAG, FRAG -> NP, NP -> NN, NN -> '\u620f', ROOT -> IP, IP -> VP, VP -> VE, VE -> '\u6709', ROOT -> FRAG, FRAG -> QP, QP -> CD, CD -> '\u4e00', ROOT -> IP, IP -> VP, VP -> VV, VV -> '\u70b9', ROOT -> IP, IP -> VP, VP -> VV, VV -> '\u4e0a', ROOT -> FRAG, FRAG -> NP, NP -> NN, NN -> '\u763e', ROOT -> IP, IP -> NP, NP -> PU, PU -> '\u3002']
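But the Chinese words are still split into single characters.
Since I know very little Java, I have to use the Python interface for this task. I would really appreciate help from the Stack Overflow community. Can anyone help me?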
Answer 0 (score: 0)
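I found the solution:
using parser.raw_parse instead of parser.raw_parse_sents solves the problem, because parser.raw_parse_sents expects a list of sentences. When it is given a single string, each character ends up being treated as a separate sentence, which is why 12 one-character trees came back.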
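To see why a bare string goes wrong, here is a minimal sketch that needs neither NLTK nor the Stanford jars. The function count_inputs is a hypothetical stand-in, not part of NLTK; it only mimics an API that produces one result per item of its argument, and it assumes whitespace-only items are skipped (the question's output suggests this, since no trees were returned for the spaces).

```python
# -*- coding: utf-8 -*-
# Hypothetical stand-in for a "*_sents"-style API: one result per input item.
# This is NOT NLTK code; it only illustrates how the argument is iterated.

text = u'我 对 这个 游戏 有 一 点 上瘾。'


def count_inputs(sentences):
    """Return the items that would each be parsed as a separate sentence.

    Assumption: whitespace-only items are skipped.
    """
    return [item for item in sentences if item.strip()]


# Iterating over a string yields single characters, so every
# non-space character becomes its own "sentence".
as_string = count_inputs(text)

# Wrapping the string in a list yields exactly one sentence.
as_list = count_inputs([text])

print(len(as_string))  # 12
print(len(as_list))    # 1
```

In the original code, the equivalent fix would presumably be either parser.raw_parse(text) for a single sentence, or parser.raw_parse_sents([text]) if the list-based call is kept.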