Background: I'm scraping the Google News homepage to pull out the common proper-noun bigrams (names and organizations). I filter out irrelevant bigrams by keeping only those that occur frequently and are fully capitalized. However, nltk incorrectly tags many of the proper nouns as adjectives, and I'm not sure why. Any help would be appreciated!
Code:
import re

import nltk
import pandas as pd
import requests
from bs4 import BeautifulSoup

wikis = ["https://news.google.com/"]
for wiki in wikis:
    website = requests.get(wiki)
    soup = BeautifulSoup(website.content, "lxml")
    # Gather the page text, skipping <script> tags
    text = ''.join(element.text for element in soup.body.find_all(lambda tag: tag.name != 'script', recursive=False))
    # Strip everything except letters, spaces, and newlines
    new = re.sub(r'[^a-zA-Z \n]', '', text)
    words = nltk.word_tokenize(new)
    finder = nltk.bigrams(words)
    dist = nltk.FreqDist(finder)
    parse = [[" ".join(k), v] for k, v in dist.items()]
    createcolumns = pd.DataFrame(parse, columns=['Bigrams', 'Frequency'])
    # Keep bigrams that appear at least three times
    webbigrams = createcolumns[createcolumns['Frequency'] >= 3]
    x = webbigrams['Bigrams']
    # Keep bigrams in which every word starts with a capital letter
    y = [s for s in x if all(w[0].isupper() for w in s.split())]
    # POS-tag the surviving bigrams (each two-word string is passed as one token)
    tags = nltk.pos_tag(y)
    res = nltk.ne_chunk(tags)
    print(res)
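For completeness, this assumes the NLTK model data is already installed. If not, a one-time setup along these lines should cover the calls above (these resource names are for the classic NLTK models; newer NLTK releases may prompt for differently named packages):

import nltk

nltk.download('punkt')                       # used by word_tokenize
nltk.download('averaged_perceptron_tagger')  # default tagger behind pos_tag
nltk.download('maxent_ne_chunker')           # used by ne_chunk
nltk.download('words')                       # word list the chunker relies on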
Sample of the incorrect output:
Jennifer Westfeldt/JJ
New England/NN
US Open/JJ
Tom Brady/JJ
York Times/NNS
Stephen Colbert/JJ
Pittsburgh Steelers/NNS
Queen Elizabeth/JJ
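For what it's worth, I can reproduce the odd tags outside the pipeline by calling pos_tag on the bigram strings directly. A stripped-down sketch (the exact tags can vary with the tagger version and data):

import nltk

# Each bigram string goes in as a single token, just as in the code above
print(nltk.pos_tag(["Tom Brady", "New England"]))
# e.g. [('Tom Brady', 'JJ'), ('New England', 'NN')], matching the output above

# Tokenizing first gives the proper-noun tags I expected
print(nltk.pos_tag(nltk.word_tokenize("Tom Brady")))
# e.g. [('Tom', 'NNP'), ('Brady', 'NNP')]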