NLTK tags proper nouns as adjectives after web scraping

Asked: 2015-09-08 17:01:26

Tags: python-3.x nltk

Background: I'm scraping the front page of Google News to find all the proper-noun bigrams (names and organizations). I filter out irrelevant bigrams by keeping only those that occur frequently and whose words are all capitalized. However, NLTK incorrectly tags many of the proper nouns as adjectives, and I'm not sure why. Any help would be appreciated!

Code:

import re

import nltk
import pandas as pd
import requests
from bs4 import BeautifulSoup

wikis = ["https://news.google.com/"]
for wiki in wikis:
    website = requests.get(wiki)
    soup = BeautifulSoup(website.content, "lxml")
    # Concatenate the text of every top-level tag except <script>
    text = ''.join([element.text for element in soup.body.find_all(lambda tag: tag.name != 'script', recursive=False)])
    # Strip everything except letters, spaces, and newlines
    new = re.sub(r'[^a-zA-Z \n]', '', text)

words = nltk.word_tokenize(new)
finder = nltk.bigrams(words)
dist = nltk.FreqDist(finder)
# Join each bigram tuple back into a single "Word Word" string
parse = [[" ".join(k), v] for k, v in dist.items()]
createcolumns = pd.DataFrame(parse, columns=['Bigrams', 'Frequency'])
# Keep only bigrams that occur at least three times
webbigrams = createcolumns[createcolumns['Frequency'] >= 3]
x = webbigrams['Bigrams']
# Keep only bigrams in which every word starts with a capital letter
y = [s for s in x if all(w[0].isupper() for w in s.split())]
tags = nltk.pos_tag(y)
res = nltk.ne_chunk(tags)
print(res)

Examples of the incorrect output:

Jennifer Westfeldt/JJ
New England/NN
US Open/JJ
Tom Brady/JJ
York Times/NNS
Stephen Colbert/JJ
Pittsburgh Steelers/NNS
Queen Elizabeth/JJ
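
The mis-tagging seems reproducible in isolation. Here is a minimal sketch of what I believe pos_tag is seeing: each joined bigram is passed in as a single token, so the tagger gets an unknown "word" with no sentence context (the exact tags may vary by NLTK version and tagger model):

import nltk  # assumes the default tagger model has been downloaded

# Each joined bigram reaches pos_tag as one token with no context.
print(nltk.pos_tag(["Tom Brady"]))     # may print [('Tom Brady', 'JJ')]

# Tagging the two words separately, for comparison.
print(nltk.pos_tag(["Tom", "Brady"]))  # may print [('Tom', 'NNP'), ('Brady', 'NNP')]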

1 answer:

Answer 0 (score: 0)

Take a look at spaCy. In This post he also compares some POS-tagging accuracy numbers.
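
A minimal sketch of the spaCy approach (assuming a current spaCy install and the small English model en_core_web_sm, neither of which is named in the original answer):

import spacy

# Assumption: the model was installed via
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("Tom Brady and the Pittsburgh Steelers face New England.")

# Per-token part-of-speech tags
for token in doc:
    print(token.text, token.pos_)  # e.g. Tom PROPN, Brady PROPN

# Built-in named-entity spans
for ent in doc.ents:
    print(ent.text, ent.label_)    # e.g. Tom Brady PERSON, New England GPE

Running the tagger over whole sentences, rather than over isolated bigram strings, gives the model the context it needs to recognize proper nouns.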