Question

app.run()

Output: python floating-point rounding python google-colaboratory python flask python beautifulsoup python nonetype python ubuntu etc.

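While extracting data from the Stack Overflow site, we ran into a problem scraping the tags of questions. We are able to scrape the tags, but they are not grouped by question. The problem is that the class is different for every question. For example, if a question has python as its only tag, the class is "tags t-python". If it has more tags, the class continues like "tags t-python t-python-3.x" and so on, depending on the number of tags in each question. Could you tell us how we should handle this? Thanks.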

Answer 1

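You only need to change your approach from searching by HTML class name to searching by href link. For example, scraping this question yields: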

# Written for Python 2; in Python 3, urlopen lives in urllib.request.
from bs4 import BeautifulSoup as soup
import urllib
import re
# Download the question page.
question_html = str(urllib.urlopen('https://stackoverflow.com/questions/49332852/how-to-web-scrape-tags-for-stack-overflow-questions-using-beautifulsoup').read())
# Keep the text of every link whose href points at a tag page.
tags = {i.text for i in soup(question_html, 'lxml').find_all('a', href=True) if re.findall(r'questions/tagged/[\w\W]+$', i['href'])}

Output:

set([u'python', u'beautifulsoup'])

Scraping a question with more tags yields:

question_html = str(urllib.urlopen('https://stackoverflow.com/questions/49337964/returning-removed-elements-in-a-doubly-linked-list').read())
tags = {i.text for i in soup(question_html, 'lxml').find_all('a', href=True) if re.findall(r'questions/tagged/[\w\W]+$', i['href'])}

Output:

set([u'python', u'doubly-linked-list', u'return-value', u'python-3.x'])
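
For readers on Python 3, the same idea (select links by href instead of by class name) might look roughly like the sketch below. It is not the answerer's code. The function name extract_tags, the stricter pattern TAG_HREF, and the SAMPLE_HTML markup are all assumptions made for illustration, and the real page structure may differ. The built-in 'html.parser' backend is used so that lxml is not required.

```python
import re

from bs4 import BeautifulSoup

# Assumed pattern: tag links end in /questions/tagged/<tag-name>.
TAG_HREF = re.compile(r"/questions/tagged/[^/?#]+$")


def extract_tags(html):
    """Return the set of tag names linked from the given HTML (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    # Passing a compiled regex as href keeps only anchors whose href matches it,
    # so the varying class names on the page never need to be inspected.
    return {a.get_text(strip=True) for a in soup.find_all("a", href=TAG_HREF)}


# Hypothetical markup, loosely modelled on a question page.
SAMPLE_HTML = """
<div class="post-taglist">
  <a href="/questions/tagged/python" class="post-tag">python</a>
  <a href="/questions/tagged/python-3.x" class="post-tag">python-3.x</a>
  <a href="/questions/tagged/doubly-linked-list" class="post-tag">doubly-linked-list</a>
</div>
<a href="/questions/49337964/some-question">not a tag link</a>
"""

tags = extract_tags(SAMPLE_HTML)
print(sorted(tags))  # ['doubly-linked-list', 'python', 'python-3.x']
```

To run this against a live page in Python 3, the HTML could be fetched with urllib.request.urlopen(url).read() and passed to extract_tags. Whether the site serves the same markup to a script as to a browser is not something this sketch checks.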

How to web scrape tags for Stack Overflow questions using BeautifulSoup?
