Extracting/parsing pronoun-pronoun and verb-noun/pronoun combinations from a sentence

Time: 2018-09-28 13:31:26

Tags: python python-3.x nlp nltk spacy

Question:
I am trying to extract a series of proper nouns from a job description, as shown below.

text = "Civil, Mechanical, and Industrial Engineering majors are preferred."

I would like to extract the following from this text:

Civil Engineering
Mechanical Engineering
Industrial Engineering

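This is only one case of the problem, so I cannot rely on application-specific information. For example, I cannot build a list of majors and then check whether parts of those major names appear in a sentence together with the word "major", because I need the same extraction to work on other kinds of sentences as well.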

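Attempts
1. I looked into spaCy dependency parsing, but no parent-child relationship shows up between each engineering type (Civil, Mechanical, Industrial) and the word "Engineering".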

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(u"Civil, Mechanical, and Industrial Engineering majors are preferred.")

row_format = "%-15s%-15s%-15s%-15s%-30s"
print(row_format % ("TEXT", "DEP", "HEAD TEXT", "HEAD POS", "CHILDREN"))
for token in doc:
    # Skip punctuation rows to keep the table short.
    if token.text not in (',', '.'):
        print(row_format % (
            token.text,
            token.dep_,
            token.head.text,
            token.head.pos_,
            ','.join(str(c) for c in token.children),
        ))

...outputs...

TEXT           DEP            HEAD TEXT      HEAD POS       CHILDREN                      
Civil          amod           majors         NOUN           ,,Mechanical                  
Mechanical     conj           Civil          ADJ            ,,and                         
and            cc             Mechanical     PROPN                                        
Industrial     compound       Engineering    PROPN                                        
Engineering    compound       majors         NOUN           Industrial                    
majors         nsubjpass      preferred      VERB           Civil,Engineering             
are            auxpass        preferred      VERB                                         
preferred      ROOT           preferred      VERB           majors,are,.                  

2. I also tried NLTK POS tagging, but I only get the following...

import nltk
nltk.pos_tag(nltk.word_tokenize('Civil, Mechanical, and Industrial Engineering majors are preferred.'))

[('Civil', 'NNP'),
 (',', ','),
 ('Mechanical', 'NNP'),
 (',', ','),
 ('and', 'CC'),
 ('Industrial', 'NNP'),
 ('Engineering', 'NNP'),
 ('majors', 'NNS'),
 ('are', 'VBP'),
 ('preferred', 'VBN'),
 ('.', '.')]

Both the types of engineering and the word "Engineering" are tagged NNP (proper noun), so none of the RegexpParser patterns I can think of work.
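
For illustration, the tags above do still show where the run of proper nouns ends, so one possible workaround is to walk the tagged tokens directly instead of writing a chunk grammar. This is only a minimal sketch under two assumptions: the shared head word is always the last NNP of the run, and commas and CC tokens act purely as separators. `distribute_head` is a hypothetical helper name, not part of NLTK.

```python
# Tagged tokens copied from the nltk.pos_tag output shown above.
tagged = [('Civil', 'NNP'), (',', ','), ('Mechanical', 'NNP'), (',', ','),
          ('and', 'CC'), ('Industrial', 'NNP'), ('Engineering', 'NNP'),
          ('majors', 'NNS'), ('are', 'VBP'), ('preferred', 'VBN'), ('.', '.')]


def distribute_head(tagged_tokens):
    """Pair each NNP in a coordinated run with the last NNP of that run.

    Assumption: the last NNP before a non-NNP, non-separator token is the
    head shared by every earlier NNP in the run.
    """
    phrases = []
    run = []  # NNP words collected for the current coordinated list
    # A sentinel token flushes a run that reaches the end of the sentence.
    for word, tag in list(tagged_tokens) + [('', 'END')]:
        if tag == 'NNP':
            run.append(word)
        elif tag in (',', 'CC'):
            continue  # separators keep the current run open
        else:
            if len(run) >= 2:
                head = run[-1]
                phrases.extend(f"{modifier} {head}" for modifier in run[:-1])
            run = []
    return phrases


print(distribute_head(tagged))
```

This would misfire on sentences where the head is repeated for each item (for example "Civil Engineering, Mechanical Engineering"), so it only covers the elided-head case described here.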

Question:
Does anyone know of a way to extract these noun phrase pairs in Python 3?

EDIT: Additional examples

The following examples are similar to the first one, except that they are verb-noun / verb-proper-noun versions.

text="Experience with testing and automating API’s/GUI’s for desktop and native iOS/Android"

Extract:

testing API’s/GUI’s
automation API’s/GUI’s

text="Design, build, test, deploy and maintain effective test automation solutions"

Extract:

Design test automation solutions
build test automation solutions
test test automation solutions
deploy test automation solutions
maintain test automation solutions

1 Answer:

Answer 0: (score: 0)

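Without any external imports, and assuming the list is always comma-separated with an optional "and" after the last comma, you can write a regular expression and do a bit of string manipulation to get the desired output: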

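import re

test_string = "Civil, Mechanical, and Industrial Engineering majors are preferred."
# One or more "Word, " items, an optional "and ", the last item, then the shared head word.
result = re.search(r"((?:[A-Z][a-z]+, )+)(?:and )?([A-Z][a-z]+) ([A-Z][a-z]+)", test_string)
group_type = result.group(3)
# Split the leading items on ", " instead of stripping character sets,
# which could eat letters from words that start or end with a, n or d.
modifiers = result.group(1).rstrip(", ").split(", ") + [result.group(2)]
items = [modifier + ' ' + group_type for modifier in modifiers]

print(items)  # ['Civil Engineering', 'Mechanical Engineering', 'Industrial Engineering']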

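Again, all of this rests on a narrow assumption about the list format. If more variations are possible, the regular expression pattern may need to be modified.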
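
As one example of such a modification, here is a rough sketch that wraps the idea in a function and also accepts two-item lists, a missing Oxford comma, and "or" as the conjunction. The names `LIST_PATTERN` and `expand_list` are made up for this illustration, and the pattern still assumes single capitalized words and a one-word head; unlike the snippet above, it requires a conjunction before the last item.

```python
import re

# Hypothetical pattern: comma-separated capitalized words, an optional
# comma before the conjunction, "and" or "or", the last item, then one
# shared head word.
LIST_PATTERN = re.compile(
    r"((?:[A-Z][a-z]+, )*[A-Z][a-z]+),? (?:and|or) ([A-Z][a-z]+) ([A-Z][a-z]+)"
)


def expand_list(sentence):
    """Return 'modifier head' strings for the first list found, else []."""
    match = LIST_PATTERN.search(sentence)
    if match is None:
        return []
    leading, last, head = match.groups()
    modifiers = leading.split(", ") + [last]
    return [f"{modifier} {head}" for modifier in modifiers]


print(expand_list("Civil, Mechanical, and Industrial Engineering majors are preferred."))
print(expand_list("Civil and Mechanical Engineering majors are preferred."))
```

Returning an empty list when nothing matches avoids the AttributeError that calling `.group()` on a failed `re.search` would raise.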