NLTK - Replace chunks with a specific word

Time: 2018-06-20 06:16:33

Tags: python nltk text-chunking

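I am using nltk for NLP, and I use chunking to extract person names. After chunking, I want to replace each of these chunks with a specific string, "Male" or "Female".

My code is: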

import nltk

with open('male_names.txt') as f1:
    male = [line.rstrip('\n') for line in f1]
with open('female_names.txt') as f2:
    female = [line.rstrip('\n') for line in f2]

with open("input.txt") as f:
    text = f.read()

words = nltk.word_tokenize(text)
tagged = nltk.pos_tag(words)
chunkregex = r"""Name: {<NNP>+}"""
chunkParser = nltk.RegexpParser(chunkregex)
chunked = chunkParser.parse(tagged)

for subtree in chunked.subtrees(filter=lambda t: t.label() == 'Name'):
    chunk = []
    for word, pos in subtree:
        chunk.append(word)
        temp = " ".join(chunk)
    # --- the part that does not work as intended (bold in the original post) ---
    if temp in male:
        subtree = ('Male', pos)
    if temp in female:
        subtree = ('Female', pos)
    # --- end of the part in question ---
    print(subtree)

print(chunked)

My input data is:

Captain Jack Sparrow arrives in Port Royal in Jamaica to commandeer a ship. Despite rescuing Elizabeth Swann, the daughter of Governor Weatherby Swann, from drowning, he is jailed for piracy.

The current output is:

(S (Name Captain/NNP Jack/NNP Sparrow/NNP) arrives/VBZ in/IN (Name Port/NNP Royal/NNP) in/IN (Name Jamaica/NNP) to/TO commandeer/VB a/DT ship/NN ./. Despite/IN rescuing/VBG (Name Elizabeth/NNP Swann/NNP) ,/, the/DT daughter/NN of/IN (Name Governor/NNP Weatherby/NNP Swann/NNP) ,/, from/IN drowning/VBG ,/, he/PRP is/VBZ jailed/VBN for/IN piracy/NN ./.)

I want to replace these chunks with "Male" or "Female", so the output should be:

(S Male/NNP arrives/VBZ in/IN (Name Port/NNP Royal/NNP) in/IN (Name Jamaica/NNP) to/TO commandeer/VB a/DT ship/NN ./. Despite/IN rescuing/VBG Female/NNP ,/, the/DT daughter/NN of/IN Male/NNP ,/, from/IN drowning/VBG ,/, he/PRP is/VBZ jailed/VBN for/IN piracy/NN ./.)

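The part of the code marked in bold (the two "if temp in ..." checks) does not do what it should. Printing subtree inside the loop shows the change, but printing chunked afterwards shows the tree unchanged.

What am I doing wrong, or is there another way to do this?
I am new to Python and nltk. Any help is appreciated.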

male and female contain lists of names:

male: ["Captain Jack Sparrow", "Governor Weatherby Swann", "Robin"]

female: ["Elizabeth Swann", "Jenny"]

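1 answer:

Answer 0 (score: 2)

I am not sure I have understood your question correctly. NLTK subtrees are ordinary Python lists (nltk.Tree is a subclass of list), so regular list operations work on them too. Try this snippet in place of the for loop in your code.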

for subtree in chunked.subtrees(filter=lambda t: t.label() == 'Name'):
    full_name = []
    for word, pos in subtree:
        full_name.append(word)
        # The tokenizer splits a name into words, so rebuild it one token at a time.
        st = " ".join(full_name)
        if st in male:
            subtree[:] = [("Male", pos)]  # slice assignment changes the subtree in place
            break  # the subtree was just modified, so stop iterating over it
        elif st in female:
            subtree[:] = [("Female", pos)]
            break
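Why the slice assignment takes effect while the question's `subtree = ('Male', pos)` did not can be seen with plain nested lists. This is a minimal sketch, not NLTK code: `tree`, `rebind`, and `mutate` are hypothetical names used only for illustration, and the nested lists merely stand in for `nltk.Tree` objects.

```python
import copy

# A nested list standing in for a chunked tree; the inner lists play the
# role of the Name subtrees (illustrative data, not real NLTK output).
tree = [["Captain", "Jack", "Sparrow"], "arrives", ["Elizabeth", "Swann"]]


def rebind(nested):
    """Assign a new value to the loop variable; the nested list is untouched."""
    for sub in nested:
        if isinstance(sub, list):
            sub = ["Male"]  # only rebinds the local name `sub`
    return nested


def mutate(nested):
    """Slice-assign into each inner list; the change is visible through `nested`."""
    for sub in nested:
        if isinstance(sub, list):
            sub[:] = ["Male"]  # replaces the contents of the same list object
    return nested


unchanged = rebind(copy.deepcopy(tree))
changed = mutate(copy.deepcopy(tree))
print(unchanged)
print(changed)
```

Plain assignment points the loop variable at a new object and leaves the object stored inside the outer list alone, which matches the symptom in the question: the loop variable prints the new value, the tree does not.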

Output of the replacement loop:

(S (Name Male/NNP) arrives/VBZ in/IN (Name Port/NNP Royal/NNP) in/IN (Name Jamaica/NNP) to/TO commandeer/VB a/DT ship/NN ./. Despite/IN rescuing/VBG (Name Female/NNP) ,/, the/DT daughter/NN of/IN (Name Male/NNP) ,/, from/IN drowning/VBG ,/, he/PRP is/VBZ jailed/VB for/IN piracy/NN ./.)
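The output above still wraps each replacement in a `Name` node, whereas the question asked for a bare `Male/NNP` leaf. One possible way to get that, sketched here under assumptions, is to overwrite the chunk in its parent instead of editing the chunk's contents. The helper name `replace_name_chunks` and the hand-built, shortened tree are hypothetical; the tree is constructed directly with `nltk.tree.Tree` so the sketch does not need tagger models, the name lists are stored as sets for lookup, and only `Name` chunks that are direct children of the root are handled, as in the trees shown in this question.

```python
from nltk.tree import Tree

# Name lists from the question, stored as sets (an assumption made for lookup).
male = {"Captain Jack Sparrow", "Governor Weatherby Swann", "Robin"}
female = {"Elizabeth Swann", "Jenny"}


def replace_name_chunks(chunked, male, female):
    """Replace top-level Name chunks whose full text is a known name
    with a single (label, tag) leaf. Hypothetical helper for illustration."""
    for i, child in enumerate(chunked):
        if isinstance(child, Tree) and child.label() == "Name":
            full_name = " ".join(word for word, _tag in child)
            tag = child[-1][1]  # reuse the POS tag of the chunk's last token
            if full_name in male:
                chunked[i] = ("Male", tag)  # overwrite the chunk in its parent
            elif full_name in female:
                chunked[i] = ("Female", tag)
    return chunked


# Shortened, hand-built stand-in for the chunked sentence in the question.
chunked = Tree("S", [
    Tree("Name", [("Captain", "NNP"), ("Jack", "NNP"), ("Sparrow", "NNP")]),
    ("arrives", "VBZ"),
    ("in", "IN"),
    Tree("Name", [("Port", "NNP"), ("Royal", "NNP")]),
    ("Despite", "IN"),
    ("rescuing", "VBG"),
    Tree("Name", [("Elizabeth", "NNP"), ("Swann", "NNP")]),
])

replace_name_chunks(chunked, male, female)
print(chunked)
```

Chunks that match neither list, such as Port Royal here, are left as `Name` subtrees, which is also what the desired output in the question shows.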