I found this link (and a few others) that discusses using BeautifulSoup to read HTML. It does most of what I want: it gets the title of a web page.
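
    import requests
    from bs4 import BeautifulSoup

    def get_title(url):
        html = requests.get(url).text
        if len(html) > 0:
            contents = BeautifulSoup(html, "html.parser")
            # contents.title is None when the page has no <title> tag
            if contents.title is not None:
                return contents.title.string
        return None
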
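The problem I'm running into is that titles sometimes come back with metadata appended in the form " - some_data". A good example is this link to a BBC Sport article, whose title comes back as
Jack Charlton: 1966 England World Cup winner dies aged 85 - BBC Sport
I could do something simple, such as cutting off everything after the last '-' character: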

    title = title.rsplit(' - ', 1)[0]

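However, this assumes that whatever follows the last " - " is always metadata. I don't want to assume there will never be an article whose real title ends in " - part_of_title".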
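
To make the concern concrete, here is a minimal sketch of the naive approach applied to two titles. The helper name `strip_after_last_dash` and the second title are made up for illustration.

```python
def strip_after_last_dash(title):
    """Naively drop everything after the last ' - ' separator."""
    return title.rsplit(" - ", 1)[0]


# A title with a trailing site name is cleaned up as intended.
with_site = "Jack Charlton: 1966 England World Cup winner dies aged 85 - BBC Sport"
print(strip_after_last_dash(with_site))

# Hypothetical title with no site suffix: the real ending is lost.
no_site = "Match report - second half"
print(strip_after_last_dash(no_site))
```

The first call removes " - BBC Sport" as hoped, but the second silently drops " - second half", which is part of the actual title.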
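I found the Newspaper3k library, but it is definitely more than I need. All I need is to grab a title and make sure it matches what the user posted. The friend who pointed me to Newspaper3k also mentioned that it can be buggy and sometimes fails to find the title correctly, so I'd prefer to use something else if possible.
My current thinking is to keep using BeautifulSoup and just add fuzzywuzzy, which would also help with minor spelling or punctuation differences. Still, I'd much rather start from a title that is as accurate as possible before comparing.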
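
A rough sketch of that comparison step is below. It uses `difflib.SequenceMatcher` from the standard library as a stand-in for fuzzywuzzy so that it runs without extra packages. The function name `titles_match` and the 0.9 threshold are assumptions for illustration, not tuned values.

```python
from difflib import SequenceMatcher


def titles_match(page_title, user_title, threshold=0.9):
    """Return True when the two titles are similar enough.

    threshold is an illustrative value, not a tuned one.
    """
    # Normalise case and whitespace so trivial differences are ignored.
    a = " ".join(page_title.lower().split())
    b = " ".join(user_title.lower().split())
    return SequenceMatcher(None, a, b).ratio() >= threshold
```

With this, a stray comma or different capitalisation should still count as a match, while an unrelated headline should not. The threshold would need adjusting against real submissions.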
Answer 0 (score: 1)
This is how reddit handles getting the title data.

    # Written for Python 2 and BeautifulSoup 3 (note the convertEntities argument)
    import re
    from BeautifulSoup import BeautifulSoup

    def extract_title(data):
        """Try to extract the page title from a string of HTML.

        An og:title meta tag is preferred, but will fall back to using
        the <title> tag instead if one is not found. If using <title>,
        also attempts to trim off the site's name from the end.
        """
        bs = BeautifulSoup(data, convertEntities=BeautifulSoup.HTML_ENTITIES)
        if not bs or not bs.html.head:
            return
        head_soup = bs.html.head

        title = None

        # try to find an og:title meta tag to use
        og_title = (head_soup.find("meta", attrs={"property": "og:title"}) or
                    head_soup.find("meta", attrs={"name": "og:title"}))
        if og_title:
            title = og_title.get("content")

        # if that failed, look for a <title> tag to use instead
        if not title and head_soup.title and head_soup.title.string:
            title = head_soup.title.string

            # remove end part that's likely to be the site's name
            # looks for last delimiter char between spaces in strings
            # delimiters: |, -, emdash, endash,
            #             left- and right-pointing double angle quotation marks
            reverse_title = title[::-1]
            to_trim = re.search(u'\s[\u00ab\u00bb\u2013\u2014|-]\s',
                                reverse_title,
                                flags=re.UNICODE)

            # only trim if it won't take off over half the title
            if to_trim and to_trim.end() < len(title) / 2:
                title = title[:-(to_trim.end())]

        if not title:
            return

        # get rid of extraneous whitespace in the title
        title = re.sub(r'\s+', ' ', title, flags=re.UNICODE)

        return title.encode('utf-8').strip()
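
The trimming step is the part that addresses the question directly, so here is a minimal Python 3 sketch of just that step, using only the standard library. The name `trim_site_name` is made up for illustration. Unlike the code above, it returns a `str` rather than UTF-8 bytes, and it leaves out the og:title lookup entirely.

```python
import re

# Delimiters: « » en dash, em dash, | and -, each surrounded by whitespace.
_DELIMITER = re.compile(r"\s[\u00ab\u00bb\u2013\u2014|-]\s")


def trim_site_name(title):
    """Trim a trailing site name from a <title> string, if it looks safe."""
    # Searching the reversed string finds the LAST delimiter first.
    to_trim = _DELIMITER.search(title[::-1])
    # Only trim if that removes less than half of the title.
    if to_trim and to_trim.end() < len(title) / 2:
        title = title[:-to_trim.end()]
    # Collapse runs of whitespace, as the code above does.
    return re.sub(r"\s+", " ", title).strip()
```

For the BBC example, the match on the reversed string ends well before the halfway point, so " - BBC Sport" is removed. A title where the part after the last delimiter makes up more than half of the string is left untouched, which is the safeguard against cutting off a real " - part_of_title" ending. It is a heuristic, so a short real suffix would still be trimmed.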