Background: I'm scraping the Google News homepage to pull out the common proper-noun bigrams (names and organizations). I filter out irrelevant bigrams by keeping only those that occur frequently and are fully capitalized. However, nltk incorrectly tags many of the proper nouns as adjectives, and I'm not sure why. Any help would be appreciated!
Code:
import re

import nltk
import pandas as pd
import requests
from bs4 import BeautifulSoup

wikis = ["https://news.google.com/"]
for wiki in wikis:
    website = requests.get(wiki)
    soup = BeautifulSoup(website.content, "lxml")
    # Gather the page text, skipping <script> tags
    text = ''.join(element.text for element in soup.body.find_all(lambda tag: tag.name != 'script', recursive=False))
    # Strip everything except letters, spaces, and newlines
    new = re.sub(r'[^a-zA-Z \n]', '', text)
    words = nltk.word_tokenize(new)
    finder = nltk.bigrams(words)
    dist = nltk.FreqDist(finder)
    parse = [[" ".join(k), v] for k, v in dist.items()]
    createcolumns = pd.DataFrame(parse, columns=['Bigrams', 'Frequency'])
    # Keep bigrams that appear at least three times
    webbigrams = createcolumns[createcolumns['Frequency'] >= 3]
    x = webbigrams['Bigrams']
    # Keep bigrams in which every word starts with a capital letter
    y = [s for s in x if all(w[0].isupper() for w in s.split())]
    # POS-tag the surviving bigrams (each two-word string is passed as one token)
    tags = nltk.pos_tag(y)
    res = nltk.ne_chunk(tags)
    print(res)
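For completeness, this assumes the NLTK model data is already installed. If not, a one-time setup along these lines should cover the calls above (these resource names are for the classic NLTK models; newer NLTK releases may prompt for differently named packages):

import nltk

nltk.download('punkt')                       # used by word_tokenize
nltk.download('averaged_perceptron_tagger')  # default tagger behind pos_tag
nltk.download('maxent_ne_chunker')           # used by ne_chunk
nltk.download('words')                       # word list the chunker relies on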
Sample of the incorrect output:
Jennifer Westfeldt/JJ
New England/NN
US Open/JJ
Tom Brady/JJ
York Times/NNS
Stephen Colbert/JJ
Pittsburgh Steelers/NNS
Queen Elizabeth/JJ
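For what it's worth, I can reproduce the odd tags outside the pipeline by calling pos_tag on the bigram strings directly. A stripped-down sketch (the exact tags can vary with the tagger version and data):

import nltk

# Each bigram string goes in as a single token, just as in the code above
print(nltk.pos_tag(["Tom Brady", "New England"]))
# e.g. [('Tom Brady', 'JJ'), ('New England', 'NN')], matching the output above

# Tokenizing first gives the proper-noun tags I expected
print(nltk.pos_tag(nltk.word_tokenize("Tom Brady")))
# e.g. [('Tom', 'NNP'), ('Brady', 'NNP')]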