Question

I found this link [and a few others] that talk a bit about using BeautifulSoup to read HTML. It mostly does what I want, which is to grab the title of a web page.

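import requests
from bs4 import BeautifulSoup

def get_title(url):
    html = requests.get(url).text
    if len(html) > 0:
        # name the parser explicitly so bs4 does not have to guess one
        contents = BeautifulSoup(html, "html.parser")
        # contents.title is None when the page has no <title> tag
        if contents.title is not None:
            return contents.title.string
    return None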

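The problem I'm running into is that sometimes a title comes back with metadata attached at the end, in the form " - some_data". A good example is this link to a BBC Sport article, which reports the title as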

Jack Charlton: 1966 England World Cup winner dies aged 85 - BBC Sport

I could do something simple, like cutting off everything after the last " - ":

title = title.rsplit(' - ', 1)[0]

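However, this assumes that there is always metadata after the last " - ". I don't want to assume that there will never be an article whose own title ends in " - part_of_title".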
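To make that concern concrete, here is a minimal sketch of the naive approach. The helper name naive_trim and the second title are made up for illustration; only the BBC title comes from the question.

```python
def naive_trim(title):
    # Cut everything after the last " - ", whether or not it is a site name.
    return title.rsplit(" - ", 1)[0]


# Works when the suffix really is the site name.
print(naive_trim("Jack Charlton: 1966 England World Cup winner dies aged 85 - BBC Sport"))

# Hypothetical title with no site suffix: the real ending is lost.
print(naive_trim("Review: Spider-Man - Far From Home"))
```

The second call drops " - Far From Home" even though it is part of the title, which is exactly the case a plain split cannot tell apart.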

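I found the Newspaper3k library, but it is definitely more than I need. All I need is to grab a title and make sure it matches what the user posted. The friend who pointed me to Newspaper3k also mentioned that it can be buggy and doesn't always find the title correctly, so I would prefer to use something else if possible.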

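My current thinking is to keep using BeautifulSoup and just add fuzzywuzzy, which would also help with minor misspellings or punctuation differences. Still, I would much prefer to start from a title that is already accurate before comparing.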
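As a rough sketch of that comparison step, the snippet below uses the standard library's difflib.SequenceMatcher as a stand-in for fuzzywuzzy, so it runs without extra packages. The function name titles_match and the 0.9 threshold are assumptions chosen for illustration, not values from any library.

```python
from difflib import SequenceMatcher


def titles_match(posted, fetched, threshold=0.9):
    """Return True when two titles are similar enough to count as the same.

    The threshold is an arbitrary illustrative value; tune it on real data.
    """
    # Normalise case and outer whitespace before comparing.
    a = posted.strip().lower()
    b = fetched.strip().lower()
    return SequenceMatcher(None, a, b).ratio() >= threshold


fetched = "Jack Charlton: 1966 England World Cup winner dies aged 85"
# A one-letter typo in the posted title should still count as a match.
print(titles_match("Jack Charlton: 1966 England World Cup winer dies aged 85", fetched))
```

A similarity check like this tolerates small typos, but a long site suffix still lowers the score, which is why trimming the title first matters.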

Answer 1

This is how reddit handles fetching title data.

https://github.com/reddit-archive/reddit/blob/40625dcc070155588d33754ef5b15712c254864b/r2/r2/lib/utils/utils.py#L255

def extract_title(data):
    """Try to extract the page title from a string of HTML.
    An og:title meta tag is preferred, but will fall back to using
    the <title> tag instead if one is not found. If using <title>,
    also attempts to trim off the site's name from the end.
    """
    bs = BeautifulSoup(data, convertEntities=BeautifulSoup.HTML_ENTITIES)
    if not bs or not bs.html.head:
        return
    head_soup = bs.html.head

    title = None

    # try to find an og:title meta tag to use
    og_title = (head_soup.find("meta", attrs={"property": "og:title"}) or
                head_soup.find("meta", attrs={"name": "og:title"}))
    if og_title:
        title = og_title.get("content")

    # if that failed, look for a <title> tag to use instead
    if not title and head_soup.title and head_soup.title.string:
        title = head_soup.title.string

        # remove end part that's likely to be the site's name
        # looks for last delimiter char between spaces in strings
        # delimiters: |, -, emdash, endash,
        #             left- and right-pointing double angle quotation marks
        reverse_title = title[::-1]
        to_trim = re.search(u'\s[\u00ab\u00bb\u2013\u2014|-]\s',
                            reverse_title,
                            flags=re.UNICODE)

        # only trim if it won't take off over half the title
        if to_trim and to_trim.end() < len(title) / 2:
            title = title[:-(to_trim.end())]

    if not title:
        return

    # get rid of extraneous whitespace in the title
    title = re.sub(r'\s+', ' ', title, flags=re.UNICODE)

    return title.encode('utf-8').strip()

How to get an accurate title from a web page without including site data
