Get the title and description of an external URL using Django

Asked: 2014-03-06 08:25:52

Tags: python django

I would like to know how to extract the title and meta description of an external website given its URL. I have found some solutions, but none for Django/Python.

Currently my code adds a link to the database; after the link is added I want to go to that link and then update the entry with the corresponding title and meta description.

Being able to retrieve Open Graph tags such as `meta property="og:url"` would also be nice.

Thanks.

4 Answers:

Answer 0 (score: 3)

To access the title or description of an external website you have to do two things:

1) Fetch the external site's HTML.
2) Parse that HTML and pick out the title element and the meta elements.

The first part is easy:

import urllib2
opener = urllib2.build_opener()
external_sites_html = opener.open(external_sites_url).read()
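Note that urllib2 is Python 2 only; on Python 3 the same fetch goes through urllib.request. A minimal sketch of the equivalent (the example URL here is a placeholder):

```python
# Python 3 equivalent of the urllib2 snippet above:
# urllib2 was split into urllib.request and urllib.error.
from urllib.request import urlopen

external_sites_url = "https://example.com"  # placeholder URL
with urlopen(external_sites_url, timeout=10) as response:
    external_sites_html = response.read()  # raw bytes of the page
```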

The second part is harder, because we need an external library to parse the HTML. I like a library called BeautifulSoup because it has a very nice API (it is easy for programmers to use).

from bs4 import BeautifulSoup
soup = BeautifulSoup(external_sites_html, "html.parser")  # naming a parser avoids a bs4 warning
# Now we can get the tags of the external site from the soup variable.
title = soup.title.string
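The same soup can also supply the meta description and Open Graph tags the question mentions; a short sketch, using an inline HTML snippet so it runs standalone:

```python
from bs4 import BeautifulSoup

# Inline HTML stands in for the fetched page.
html = """<html><head>
<title>Example page</title>
<meta name="description" content="A short summary.">
<meta property="og:url" content="https://example.com/page">
</head><body></body></html>"""

soup = BeautifulSoup(html, "html.parser")
title = soup.title.string

# The classic description uses name="description" ...
desc_tag = soup.find("meta", attrs={"name": "description"})
description = desc_tag["content"] if desc_tag else ""

# ... while Open Graph tags use property="og:...".
og_tag = soup.find("meta", attrs={"property": "og:url"})
og_url = og_tag["content"] if og_tag else ""
```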

However, it is important to remember that the external site may respond slowly when we fetch it, so it may be wise to create the record for the external site in the database first and return a response to the user, and then fetch the URL in some other process and add the extra information to the database. If returning the extra information in the response matters, you cannot do this in the background and will have to make your users wait.

Answer 1 (score: 1)

I took @ryan-pergent's answer and improved on it; here is metadata.py:

import re
import subprocess
from subprocess import TimeoutExpired
from bs4 import BeautifulSoup, Comment
from urllib.parse import urljoin

class Metadata:
    url = ""
    type = "" # https://ogp.me/#types
    title = ""
    description = ""
    image = ""

    def __str__(self):
        return "{url: " + self.url + ", type: " + self.type + ", title: " + self.title + ", description: " + self.description + ", image: " + self.image + "}"

class Metadatareader:

    @staticmethod
    def get_metadata_from_url_in_text(text):
        # look for the first url in the text
        # and extract the url metadata
        urls_in_text = Metadatareader.get_urls_from_text(text)
        if len(urls_in_text) > 0:
            return Metadatareader.get_url_metadata(urls_in_text[0])
        return Metadata()

    @staticmethod
    def get_urls_from_text(text):
        # look for all urls in text
        # and convert it to an array of urls
        regex = r"(?:(?:https?|ftp):\/\/|\b(?:[a-z\d]+\.))(?:(?:[^\s()<>]+|\((?:[^\s()<>]+|(?:\([^\s()<>]+\)))?\))+(?:\((?:[^\s()<>]+|(?:\([^\s()<>]+\)))?\)|[^\s`!()\[\]{};:\'\".,<>?«»“”‘’]))?"
        return re.findall(regex, text)

    @staticmethod
    def get_url_metadata(url):
        # get final url after all redirections
        # then get html of the final url
        # fill the meta data with the info available
        url = Metadatareader.get_final_url(url)
        url_content = Metadatareader.get_url_content(url)
        soup = BeautifulSoup(url_content, "html.parser")
        metadata = Metadata()

        metadata.url = url
        metadata.type = "website"

        for meta in soup.findAll("meta"):
            # prioritize Open Graph Protocol values
            # https://ogp.me/
            metadata.type = Metadatareader.get_meta_property(meta, "og:type", metadata.type)
            metadata.title = Metadatareader.get_meta_property(meta, "og:title", metadata.title)
            metadata.description = Metadatareader.get_meta_property(meta, "og:description", metadata.description)
            metadata.image = Metadatareader.get_meta_property(meta, "og:image", metadata.image)
            if metadata.image:
                metadata.image = urljoin(url, metadata.image)

        if not metadata.title and soup.title:
            # use page title
            metadata.title = soup.title.text

        if not metadata.image:
            # use first img element
            images = soup.find_all('img')
            if len(images) > 0:
                metadata.image = urljoin(url, images[0].get('src'))

        if not metadata.description and soup.body:
            # use text from body
            for text in soup.body.find_all(string=True):
                if text.parent.name != 'script' and text.parent.name != 'style' and not isinstance(text, Comment):
                    metadata.description += text

        if metadata.description:
            # remove white spaces and break lines
            metadata.description = re.sub('\n|\r|\t', ' ', metadata.description)
            metadata.description = re.sub(' +', ' ', metadata.description)
            metadata.description = metadata.description.strip()

        return metadata

    @staticmethod
    def get_final_url(url, timeout=5):
        # get final url after all redirections
        # get http response header
        # look for the "Location: " header
        proc = subprocess.Popen([
                    "curl",
                    "-Ls",  # follow redirects (-L) in silent mode (-s)
                    "-I",  # fetch headers only, don't download the body
                    url
                ], stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        try:
            out, err = proc.communicate(timeout=timeout)
        except TimeoutExpired:
            proc.kill()
            out, err = proc.communicate()
        header = out.decode(errors="ignore").split("\r\n")
        for line in header:
            # header names may be lowercase (e.g. in HTTP/2 responses)
            if line.lower().startswith("location:"):
                return line.split(":", 1)[1].strip()
        return url

    @staticmethod
    def get_url_content(url, timeout=5):
        # get url html
        proc = subprocess.Popen([
                    "curl",
                    "-s",  # silent mode ("-i" would prepend headers to the HTML)
                    "-k",  # ignore ssl certificate errors
                    "-L",  # follow redirects
                    url
                ], stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        try:
            out, err = proc.communicate(timeout=timeout)
        except TimeoutExpired:
            proc.kill()
            out, err = proc.communicate()
        return out

    @staticmethod
    def get_meta_property(meta, property_name, default_value=""):
        if 'property' in meta.attrs and meta.attrs['property'] == property_name:
            return meta.attrs['content']
        return default_value
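As an aside, the two curl subprocesses above could be replaced by the third-party requests library, which follows redirects itself and exposes the final URL; a sketch under that assumption (requests must be installed):

```python
import requests

def get_final_url_and_content(url, timeout=5):
    # allow_redirects=True mirrors curl's -L flag; response.url is the
    # URL after all redirects, response.text the decoded HTML.
    response = requests.get(url, timeout=timeout, allow_redirects=True)
    return response.url, response.text
```

This also avoids shelling out, and keeps certificate verification on (the `-k` flag above disables it).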

Here is my usage:

from metadatareader import Metadata, Metadatareader

content = "YOUR TEXT CONTAINING URLS GOES HERE, LIKE google.com"
metadata = Metadatareader.get_metadata_from_url_in_text(content)
print(metadata)

Answer 2 (score: 0)

Are you asking about pulling the title and meta tags out of an external web page? I am a fan of mechanize and BeautifulSoup. An example that extracts the title follows.

from mechanize import Browser
from BeautifulSoup import BeautifulSoup
def get_title(url):
    br = Browser()
    r = br.open(url)
    soup = BeautifulSoup(r)
    return soup.find("title").text

To get the meta tags, I would use something like:
for meta in soup.findAll("meta"):
    print (meta['name'], meta['content'])

Of course you will probably want to do something other than just print them.
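In current bs4, not every meta tag carries both attributes (a `<meta charset="utf-8">` has neither name nor content), so indexing as above can raise KeyError; a safer variant of that loop uses .get():

```python
from bs4 import BeautifulSoup

html = """<head>
<meta charset="utf-8">
<meta name="description" content="Summary text">
<meta property="og:title" content="OG title">
</head>"""

soup = BeautifulSoup(html, "html.parser")
meta_pairs = []
for meta in soup.find_all("meta"):
    # .get() returns None instead of raising when an attribute is absent
    name = meta.get("name") or meta.get("property")
    content = meta.get("content")
    if name and content:
        meta_pairs.append((name, content))
```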

Answer 3 (score: 0)

This is how I did it:


Feel free to handle your own default values if there is no metadata with og:title, og:description, or og:image :)
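A minimal sketch of that fallback idea (the helper name and default values here are illustrative, not taken from any answer above):

```python
from bs4 import BeautifulSoup

# Hypothetical defaults used when an Open Graph tag is missing.
DEFAULTS = {
    "og:title": "Untitled",
    "og:description": "",
    "og:image": "/static/placeholder.png",  # illustrative path
}

def og_value(soup, prop):
    # Return the tag's content if present and non-empty, else the default.
    tag = soup.find("meta", attrs={"property": prop})
    if tag and tag.get("content"):
        return tag["content"]
    return DEFAULTS[prop]

html = '<head><meta property="og:title" content="My page"></head>'
soup = BeautifulSoup(html, "html.parser")
title = og_value(soup, "og:title")  # present in the HTML
image = og_value(soup, "og:image")  # falls back to the default
```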

More about BeautifulSoup: https://www.crummy.com/software/BeautifulSoup/bs4/doc/