NLTK RegexpParser: chunking consecutive overlapping nouns

Asked: 2017-12-01 18:39:35

Tags: python regex parsing nlp nltk

I want to use RegexpParser to chunk all consecutive, overlapping nouns from a text. For example, given the following tagged text:

[('APPLE', 'NN'), ('BANANA', 'NN'), ('GRAPE', 'NN'), ('PEAR', 'NN')]

I want to extract:

['APPLE BANANA', 'BANANA GRAPE', 'GRAPE PEAR']

I tried the following grammar, using a lookahead so that the matched consecutive nouns are not consumed, but it does not work:

"CONSEC_NOUNS: {(?=(<NN>{2}))}"

Is there a way to do this?

Edit: code

import nltk

extract = []
grammar = "CONSEC_NOUNS: {(?=(<NN>{2}))}"
cp = nltk.RegexpParser(grammar)
result = cp.parse([('APPLE', 'NN'), ('BANANA', 'NN'), ('GRAPE', 'NN'), ('PEAR', 'NN')])

for elem in result:
    if type(elem) == nltk.tree.Tree:
        extract.append(' '.join([pair[0] for pair in elem.leaves()]))

print(extract)  # prints []

# but I want to get ['APPLE BANANA', 'BANANA GRAPE', 'GRAPE PEAR']

2 Answers:

Answer 0 (score: 0)

Code

(?<=\()'([^']*)'(?=.*?\('([^']*)')

Usage

import re

r = re.compile(r"(?<=\()'([^']*)'(?=.*?\('([^']*)')")
s = "[('APPLE', 'NN'), ('BANANA', 'NN'), ('GRAPE', 'NN'), ('PEAR', 'NN')]"

# Each match consumes only the first word; the second is captured inside the lookahead
for m in r.finditer(s):
    print(m.group(1) + ' ' + m.group(2))

Explanation

  • (?<=\() Positive lookbehind ensuring that what precedes is a literal (
  • ' Matches this character literally
  • ([^']*) Captures any character except ' any number of times into capture group 1
  • ' Matches this character literally
  • (?=.*?\('([^']*)') Positive lookahead ensuring that what follows matches:
    • .*? Matches any character any number of times, but as few as possible
    • \( Matches ( literally
    • ' Matches this character literally
    • ([^']*) Captures any character except ' any number of times into capture group 2
    • ' Matches this character literally
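
Note that the pattern above runs on the string representation of the list and never looks at the tags, so it would also pair words that are not nouns. A minimal sketch of the same lookahead idea with a tag check follows. It assumes the tokens are first joined into a `WORD/TAG` string and that words contain no whitespace; the helper name `noun_pairs_with_tags` is hypothetical, not part of NLTK.

```python
import re

def noun_pairs_with_tags(tagged):
    """Pair each NN word with the NN word that immediately follows it."""
    # Flatten to "APPLE/NN BANANA/NN ..." so one regex can see words and tags
    text = " ".join(f"{word}/{tag}" for word, tag in tagged)
    # Consume one WORD/NN token; the lookahead captures the next token
    # without consuming it, which is what lets the pairs overlap.
    # (?<!\S) and (?!\S) keep matches aligned to whole tokens and whole tags.
    pattern = re.compile(r"(?<!\S)(\S+)/NN(?= (\S+)/NN(?!\S))")
    return [f"{m.group(1)} {m.group(2)}" for m in pattern.finditer(text)]

tagged = [('APPLE', 'NN'), ('BANANA', 'NN'), ('GRAPE', 'NN'), ('PEAR', 'NN'),
          ('GO', 'VB'), ('ORANGE', 'NN'), ('STRAWBERRY', 'NN'), ('MELON', 'NN')]
print(noun_pairs_with_tags(tagged))
# ['APPLE BANANA', 'BANANA GRAPE', 'GRAPE PEAR', 'ORANGE STRAWBERRY', 'STRAWBERRY MELON']
```

Because the verb `GO/VB` fails the tag check, `PEAR` is not paired with `ORANGE`.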

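Answer 1 (score: 0)

RegexpParser only produces non-overlapping chunks. I arrived at the following solution using NLTK's bigrams. First, I changed the grammar to match any run of 2 or more consecutive nouns. Then I built bigrams from each resulting chunk.

Code: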

import nltk

grammar = "CONSEC_NOUNS: {<NN>{2,}}" # match 2 or more nouns
cp = nltk.RegexpParser(grammar)
result = cp.parse([('APPLE', 'NN'), ('BANANA', 'NN'), ('GRAPE', 'NN'), ('PEAR', 'NN'), ('GO', 'VB'), 
                        ('ORANGE', 'NN'), ('STRAWBERRY', 'NN'), ('MELON', 'NN')])

leaves = [chunk.leaves() for chunk in result if ((type(chunk) == nltk.tree.Tree) and chunk.label()=='CONSEC_NOUNS')]
noun_bigram_groups = [list(nltk.bigrams([w for w, t in leaf])) for leaf in leaves]

extract = [' '.join(nouns) for group in noun_bigram_groups for nouns in group]

print(extract)

Output:

['APPLE BANANA', 'BANANA GRAPE', 'GRAPE PEAR', 'ORANGE STRAWBERRY', 'STRAWBERRY MELON']
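
The two steps (find runs of consecutive nouns, then take bigrams within each run) can also be sketched without NLTK, which may help to see what the chunker and `nltk.bigrams` contribute. This is only an illustration, assuming the input is already a list of `(word, tag)` pairs; the function name `consecutive_noun_bigrams` is hypothetical.

```python
from itertools import groupby

def consecutive_noun_bigrams(tagged, tag="NN"):
    """Return overlapping 'WORD WORD' pairs from each run of adjacent `tag` tokens."""
    pairs = []
    # groupby splits the sequence into maximal runs of noun / non-noun tokens,
    # playing the role of the CONSEC_NOUNS chunk rule
    for is_noun, run in groupby(tagged, key=lambda pair: pair[1] == tag):
        if not is_noun:
            continue
        words = [word for word, _ in run]
        # zip(words, words[1:]) yields every adjacent pair, like nltk.bigrams
        pairs.extend(f"{a} {b}" for a, b in zip(words, words[1:]))
    return pairs

tagged = [('APPLE', 'NN'), ('BANANA', 'NN'), ('GRAPE', 'NN'), ('PEAR', 'NN'),
          ('GO', 'VB'), ('ORANGE', 'NN'), ('STRAWBERRY', 'NN'), ('MELON', 'NN')]
print(consecutive_noun_bigrams(tagged))
# ['APPLE BANANA', 'BANANA GRAPE', 'GRAPE PEAR', 'ORANGE STRAWBERRY', 'STRAWBERRY MELON']
```

A run containing a single noun yields no pairs, which matches the `{2,}` quantifier in the grammar.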