Question

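I want to identify the subject and object of each sentence in a set of sentences. My actual task is to determine cause-and-effect relationships from a set of review data.

I am using the spaCy package to chunk and parse the data, but I have not managed to reach my goal. Is there a way to do this?

E.g.: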

 I thought it was the complete set

Output:

subject  object
I        complete set

Answer 1

The simplest way is to import spaCy and read each token's dependency label through token.dep_:

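import spacy

# 'en' was a shortcut name in old spaCy releases; newer releases need the
# full model name (install with: python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")
parsed_text = nlp("I thought it was the complete set")

# collect every match in a list: a sentence can contain several clauses,
# and a label may not occur at all (which would leave a variable undefined)
subjects, direct_objects, indirect_objects = [], [], []

# get token dependencies
for token in parsed_text:
    # nsubj marks a nominal subject
    if token.dep_ == "nsubj":
        subjects.append(token.text)
    # iobj marks an indirect object
    elif token.dep_ == "iobj":
        indirect_objects.append(token.text)
    # dobj marks a direct object
    elif token.dep_ == "dobj":
        direct_objects.append(token.text)

print(subjects)
print(direct_objects)
print(indirect_objects)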
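
One caveat for the example sentence: "set" is the complement of the copula "was", so a parser will usually not label it dobj (spaCy's English models typically use attr here, but that is an assumption worth checking against your own model). A minimal sketch that groups subjects and objects by their governing verb, so each clause yields its own pair. It only uses the token attributes i, text, dep_ and head; the Tok class and the parse are hand-written stand-ins so the snippet runs without a model.

```python
SUBJECT_DEPS = {"nsubj", "nsubjpass"}
# "attr" is included as an assumption, to cover copular clauses ("it was the set")
OBJECT_DEPS = {"dobj", "iobj", "attr"}


def subject_object_pairs(tokens):
    """Group subject and object tokens by the index of their head token."""
    clauses = {}
    for tok in tokens:
        if tok.dep_ in SUBJECT_DEPS:
            role = "subjects"
        elif tok.dep_ in OBJECT_DEPS:
            role = "objects"
        else:
            continue
        clause = clauses.setdefault(
            tok.head.i, {"verb": tok.head.text, "subjects": [], "objects": []}
        )
        clause[role].append(tok.text)
    # one entry per governing verb, in sentence order
    return [clauses[i] for i in sorted(clauses)]


class Tok:
    """Hypothetical stand-in exposing the spaCy Token attributes used above."""

    def __init__(self, i, text, dep_):
        self.i, self.text, self.dep_ = i, text, dep_
        self.head = self  # as in spaCy, the root is its own head


def hand_parse(words, deps, heads):
    """Build stand-in tokens from a hand-written parse (not real parser output)."""
    toks = [Tok(i, w, d) for i, (w, d) in enumerate(zip(words, deps))]
    for tok, h in zip(toks, heads):
        tok.head = toks[h]
    return toks


tokens = hand_parse(
    "I thought it was the complete set".split(),
    ["nsubj", "ROOT", "nsubj", "ccomp", "det", "amod", "attr"],
    [1, 1, 3, 1, 6, 6, 3],
)
pairs = subject_object_pairs(tokens)
for clause in pairs:
    print(clause)
```

With a real pipeline, subject_object_pairs(nlp("...")) should work unchanged, since a spaCy Doc iterates over tokens that carry the same attributes.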

Answer 2

You can use noun chunks.

Code:

doc = nlp("I thought it was the complete set")
for nc in doc.noun_chunks:
    print(nc.text)

Result:

I
it
the complete set

To select only "I" rather than both "I" and "it", you can first write a test that picks the nsubj to the left of the ROOT.
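
Such a test might look like the sketch below: keep a noun chunk only if its root token is an nsubj attached directly to the ROOT and sits to its left. Span.root, Token.dep_, Token.head and Token.i are spaCy attributes; the SimpleNamespace objects are hypothetical stand-ins carrying a hand-written parse, so the snippet runs without a model.

```python
from types import SimpleNamespace


def main_subject_chunks(noun_chunks):
    """Return the text of noun chunks that act as subject of the ROOT verb."""
    return [
        nc.text
        for nc in noun_chunks
        if nc.root.dep_ == "nsubj"          # the chunk's head word is a subject
        and nc.root.head.dep_ == "ROOT"     # attached directly to the main verb
        and nc.root.i < nc.root.head.i      # and positioned to its left
    ]


# Hand-written parse of "I thought it was the complete set" (an assumption,
# not real parser output): "thought" is ROOT, "was" is its clausal complement.
root = SimpleNamespace(i=1, text="thought", dep_="ROOT")
root.head = root
was = SimpleNamespace(i=3, text="was", dep_="ccomp", head=root)

chunks = [
    SimpleNamespace(text="I", root=SimpleNamespace(i=0, dep_="nsubj", head=root)),
    SimpleNamespace(text="it", root=SimpleNamespace(i=2, dep_="nsubj", head=was)),
    SimpleNamespace(
        text="the complete set", root=SimpleNamespace(i=6, dep_="attr", head=was)
    ),
]

result = main_subject_chunks(chunks)
print(result)  # ['I']
```

With a real pipeline the call would be main_subject_chunks(doc.noun_chunks).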

Answer 3

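Stanza is built with highly accurate neural network components that also enable efficient training and evaluation with your own annotated data. The modules are built on top of the PyTorch library.

Stanza is a Python natural language analysis package. It contains tools that can be used in a pipeline to convert a string of human language text into lists of sentences and words, to generate the base forms of those words along with their parts of speech and morphological features, to produce a syntactic dependency parse, and to recognize named entities.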

def find_Subject_Object(text):
    # import required packages
    import stanza
    nlp = stanza.Pipeline(lang='en', processors='tokenize,mwt,pos,lemma,depparse')
    doc = nlp(text)
    clausal_subject = []
    nominal_subject = []
    indirect_object = []
    Object          = []
    for sent in doc.sentences:
        for word in sent.words:
            if word.deprel  == "nsubj":
                nominal_subject.append({word.text:"nominal_subject nsubj"})
            elif word.deprel  == "csubj":
                clausal_subject.append({word.text:"clausal_subject csubj"})
            elif word.deprel  == "iobj":
                indirect_object.append({word.text:"indirect_object iobj"})
            elif word.deprel  == "obj":
                Object.append({word.text:"object obj"})
    return indirect_object, Object, clausal_subject,nominal_subject

text ="""John F. Kennedy International Airport is an international airport in Queens, New York, USA, and one of the primary airports serving New York City."""

find_Subject_Object(text)
# output #
([], [{'City': 'object obj'}], [], [{'John': 'nominal_subject nsubj'}, {'Airport': 'nominal_subject nsubj'}])
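
The function above returns only the head word of each subject or object (for example "Airport" rather than the full airport name). One way to recover the whole phrase is to collect every word whose chain of heads leads back to that head word. A minimal sketch, relying on Stanza's Word.id (1-based) and Word.head (id of the governor, 0 for the root); the Word namedtuple is a hypothetical stand-in and the parse is hand-written in UD style, not actual Stanza output.

```python
from collections import namedtuple

# Hypothetical stand-in with the same field names as a Stanza Word
Word = namedtuple("Word", "id text head deprel")

ROLES = {"nsubj": "subject", "obj": "object", "iobj": "indirect_object"}


def phrase_for(words, head_id):
    """Return the text of head_id plus all of its descendants, in sentence order."""
    keep = {head_id}
    changed = True
    while changed:  # repeat until no new dependent is found
        changed = False
        for w in words:
            if w.head in keep and w.id not in keep:
                keep.add(w.id)
                changed = True
    return " ".join(w.text for w in words if w.id in keep)


def subject_object_phrases(words):
    """Pair each subject/object head word with its full phrase."""
    return [
        (ROLES[w.deprel], phrase_for(words, w.id))
        for w in words
        if w.deprel in ROLES
    ]


# Hand-written parse of "The old dog chased a cat"
words = [
    Word(1, "The", 3, "det"),
    Word(2, "old", 3, "amod"),
    Word(3, "dog", 4, "nsubj"),
    Word(4, "chased", 0, "root"),
    Word(5, "a", 6, "det"),
    Word(6, "cat", 4, "obj"),
]

phrases = subject_object_phrases(words)
print(phrases)  # [('subject', 'The old dog'), ('object', 'a cat')]
```

With a real pipeline, passing sent.words for each sentence in doc.sentences should work, since the function only reads id, head, deprel and text.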

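Stanza also includes a Python interface to the CoreNLP Java package and inherits additional functionality from it, such as constituency parsing, coreference resolution, and linguistic pattern matching.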

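In summary, Stanza features:

A native Python implementation requiring minimal setup;
A full neural network pipeline for robust text analysis, including tokenization, multi-word token (MWT) expansion, lemmatization, part-of-speech (POS) and morphological feature tagging, dependency parsing, and named entity recognition;
Pretrained neural models supporting 66 (human) languages;
A stable, officially maintained Python interface to CoreNLP.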
