有没有Facebook的Linter的开源版本?

时间:2012-11-24 03:23:02

标签: screen-scraping web-scraping

当您发布Facebook链接时,它会抓取文章标题,说明和相关图片。大多数主要网站都有所需的OG标签,因此很容易获取此信息,但FB也能够处理没有它们的网站(您可以尝试here)。

显然,他们已经建立了一个系统,可以在没有OG标签的情况下抓取这些信息。有谁知道是否有开源版本?

我认为它需要(按照每个部分的优先顺序):

名称:

  1. 检查og:title tag。
  2. 检查常规元“标题”标签。
  3. 检查h1标签。
  4. 说明

    1. 检查og:description tag。
    2. 检查常规元“描述标签”
    3. 检查div或p标签是否有足够的内容来指示正文段落
    4. 图片:

      1. 检查og:image tags
      2. 检查超过一定大小的图像(比如100x100)并优先考虑那些最先出现的图像。
      3. 非常感谢!

1 个答案:

答案 0 :(得分:0)

https://github.com/Anonyfox/node-htmlcarve

Node.js的htmlcarve模块完成了你所追求的大部分内容,这里是从this page生成的输出:

htmlcarve = require('htmlcarve');

htmlcarve.fromUrl('https://scotch.io/tutorials/using-mongoosejs-in-node-js-and-mongodb-applications', function(error, data) {
  console.log(JSON.stringify(data, null, 2));
});

这会产生:

{
  "source": {
    "html_meta": {
      "title": "Easily Develop Node.js and MongoDB Apps with Mongoose ⥠Scotch",
      "summary": "",
      "image": "/wp-content/themes/thirty/img/scotch-logo.png",
      "language": "en-US",
      "feed": "https://scotch.io/feed",
      "favicon": "https://scotch.io/wp-content/themes/thirty/img/icons/favicon-57.png",
      "author": "Chris Sevilleja"
    },
    "open_graph": {
      "title": "Easily Develop Node.js and MongoDB Apps with Mongoose",
      "summary": "",
      "image": "https://scotch.io/wp-content/uploads/2014/11/mongoosejs-node-mongodb-applications.png"
    },
    "twitter_card": {
      "title": "Easily Develop Node.js and MongoDB Apps with Mongoose",
      "summary": "",
      "author": "sevilayha"
    }
  },
  "result": {
    "title": "Easily Develop Node.js and MongoDB Apps with Mongoose",
    "summary": "",
    "image": "https://scotch.io/wp-content/uploads/2014/11/mongoosejs-node-mongodb-applications.png",
    "author": "sevilayha",
    "language": "en-US",
    "feed": "https://scotch.io/feed",
    "favicon": "https://scotch.io/wp-content/themes/thirty/img/icons/favicon-57.png"
  },
  "links": {
    "deep": "https://scotch.io/tutorials/using-mongoosejs-in-node-js-and-mongodb-applications",
    "shallow": "https://scotch.io/tutorials/using-mongoosejs-in-node-js-and-mongodb-applications",
    "base": "https://scotch.io"
  }
}

如果您安装了Node.js,请使用

进行安装
npm i -g htmlcarve

您可以直接从命令行运行它。