Getting images from an RSS feed

Time: 2013-08-12 15:28:05

Tags: rss

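First of all, this is not strictly a programming question, and I'm sorry for posting it here, but I really need to understand it. I am building an RSS reader application, and I want to know where the information about an item's featured image lives in an RSS XML feed. Below is an excerpt of the XML I got from the CNN RSS feed; it contains no information about images.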

<item><title>Ice melt speeding up, study finds</title><guid>http://edition.cnn.com/2012/11/29/world/europe/climate-ice-sheets/index.html</guid><link>http://edition.cnn.com/2012/11/29/world/europe/climate-ice-sheets/index.html?eref=edition</link><description>Two decades of satellite readings back up what dramatic pictures have suggested in recent years: The mile-thick ice sheets that cover Greenland and most of Antarctica are melting at a faster rate in a warming world.</description><pubDate>Thu, 27 Jun 2013 08:59:27 EDT</pubDate></item>
<item><title>Twins 'stolen' from hospital rescued</title><guid>http://edition.cnn.com/2013/08/10/world/asia/china-baby-trafficking-twin-girls/index.html</guid><link>http://edition.cnn.com/2013/08/10/world/asia/china-baby-trafficking-twin-girls/index.html?eref=edition</link><description>Police in China have rescued twin baby girls allegedly sold by a maternity doctor, bringing the number of infants recovered from the suspected trafficking ring to three, state media reported. </description><pubDate>Sun, 11 Aug 2013 19:31:43 EDT</pubDate></item>
<item><title>HK makes $5M ivory bust</title><guid>http://edition.cnn.com/2013/08/08/world/hong-kong-ivory-tusk-seizure-august/index.html</guid><link>http://edition.cnn.com/2013/08/08/world/hong-kong-ivory-tusk-seizure-august/index.html?eref=edition</link><description>In one of the biggest busts of its kind in Hong Kong, customs authorities this week seized more than 1,100 ivory tusks, 13 rhino horns and five leopard pelts. The haul, found in a container shipped from Nigeria, is valued at more than $5.3 million.</description><pubDate>Sun, 11 Aug 2013 19:31:58 EDT</pubDate></item>
<item><title>Human transmission of H7N9</title><guid>http://edition.cnn.com/2013/08/07/health/china-bird-flu-transmission/index.html</guid><link>http://edition.cnn.com/2013/08/07/health/china-bird-flu-transmission/index.html?eref=edition</link><description>Until this week, no cases of human-to-human transmission of the deadly bird flu virus that broke out in China this year had been reported.</description><pubDate>Wed, 07 Aug 2013 22:16:18 EDT</pubDate></item>
<item><title>Doctor accused of taking newborns</title><guid>http://edition.cnn.com/2013/08/07/world/asia/china-baby-trafficking-shaanxi/index.html</guid><link>http://edition.cnn.com/2013/08/07/world/asia/china-baby-trafficking-shaanxi/index.html?eref=edition</link><description>Chinese health authorities have promised an overhaul in hospitals across the country following the arrest of an obstetrician for allegedly selling newborns to human traffickers, state media reports.</description><pubDate>Wed, 07 Aug 2013 03:38:22 EDT</pubDate></item>
<item><title>Chinese tourists targeted in Paris</title><guid>http://edition.cnn.com/2013/08/07/travel/chinese-tourists-paris-pickpockets/index.html</guid><link>http://edition.cnn.com/2013/08/07/travel/chinese-tourists-paris-pickpockets/index.html?eref=edition</link><description>It's known as the City of Light, but it risks becoming known as the city of the light-fingered.</description><pubDate>Wed, 07 Aug 2013 22:16:33 EDT</pubDate></item>

Do I have to write a web scraper that follows the feed links and scrapes images and text from the target pages? I just need to know how professional RSS readers work.

FYI, I have searched a lot for information on this without success, which is why I am asking here. Please help.

1 answer:

Answer 0: (score: 1)

Since the information about the image is not stored in the XML, you will have to crawl for it somehow.
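Before falling back to crawling, it may be worth checking whether a given feed item declares an image at all. The sketch below is a minimal illustration, not a complete feed parser: it looks for an RSS 2.0 `enclosure` element with an image MIME type and for the Media RSS extension elements `media:thumbnail` and `media:content`. The function name `find_item_image` and the sample item `MEDIA_ITEM` are hypothetical; `PLAIN_ITEM` is a shortened version of the first CNN item from the question.

```python
import xml.etree.ElementTree as ET

# Namespace URI of the Media RSS extension (media:thumbnail, media:content).
MEDIA_NS = "http://search.yahoo.com/mrss/"


def find_item_image(item):
    """Return an image URL declared inside an RSS <item>, or None if there is none."""
    # Media RSS elements, present only if the feed uses that extension.
    for tag in ("thumbnail", "content"):
        for el in item.iter(f"{{{MEDIA_NS}}}{tag}"):
            url = el.get("url")
            if not url:
                continue
            is_image = (
                tag == "thumbnail"
                or el.get("medium") == "image"
                or el.get("type", "").startswith("image/")
            )
            if is_image:
                return url
    # Plain RSS 2.0 <enclosure> carrying an image MIME type.
    for el in item.iter("enclosure"):
        if el.get("url") and el.get("type", "").startswith("image/"):
            return el.get("url")
    return None


# Shortened copy of an item from the question: no image elements at all.
PLAIN_ITEM = (
    "<item><title>Ice melt speeding up, study finds</title>"
    "<link>http://edition.cnn.com/2012/11/29/world/europe/climate-ice-sheets/"
    "index.html?eref=edition</link></item>"
)

# Hypothetical item showing what a feed with Media RSS data could look like.
MEDIA_ITEM = (
    '<item xmlns:media="http://search.yahoo.com/mrss/">'
    "<title>Hypothetical item</title>"
    '<media:thumbnail url="http://example.com/thumb.jpg" width="120" height="90"/>'
    "</item>"
)

if __name__ == "__main__":
    print(find_item_image(ET.fromstring(PLAIN_ITEM)))
    print(find_item_image(ET.fromstring(MEDIA_ITEM)))
```

For items like the CNN ones in the question, this returns `None`, which is the case where crawling the linked page becomes necessary.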

"Do I have to write a web scraper that follows the feed links and scrapes images and text from the target pages?"

Yes. For the CNN stories you linked, the header image is always inside a div with the class "cnn_stryimg640captioned".

You will have to handle videos and image galleries (when they are used as the header) separately.
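As a rough sketch of that approach, the following uses only Python's standard `html.parser` module to pull the first `img` found inside a div carrying that class. The class name comes from this answer and CNN's markup may well differ today; `HeaderImageParser`, `extract_header_image`, and the `SAMPLE_HTML` snippet are made up for illustration, and videos and galleries are not handled.

```python
from html.parser import HTMLParser

# Class name taken from the answer; treat it as an assumption about the page markup.
HEADER_DIV_CLASS = "cnn_stryimg640captioned"


class HeaderImageParser(HTMLParser):
    """Records the src of the first <img> nested inside the header div."""

    def __init__(self):
        super().__init__()
        self.div_depth = 0      # greater than 0 while inside the target div
        self.image_url = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            classes = (attrs.get("class") or "").split()
            # Start counting at the target div, then track nested divs too.
            if self.div_depth > 0 or HEADER_DIV_CLASS in classes:
                self.div_depth += 1
        elif tag == "img" and self.div_depth > 0 and self.image_url is None:
            self.image_url = attrs.get("src")

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth > 0:
            self.div_depth -= 1


def extract_header_image(html):
    """Return the header image URL found in the HTML text, or None."""
    parser = HeaderImageParser()
    parser.feed(html)
    return parser.image_url


# Hypothetical page fragment, not real CNN markup.
SAMPLE_HTML = """
<html><body>
  <img src="http://example.com/logo.png">
  <div class="cnn_stryimg640captioned">
    <div><img src="http://example.com/story-header.jpg" alt=""></div>
  </div>
  <img src="http://example.com/ad.gif">
</body></html>
"""

if __name__ == "__main__":
    print(extract_header_image(SAMPLE_HTML))
```

In a real reader, the HTML would come from downloading each item's `link` URL (for example with `urllib.request.urlopen`), and a different selector would be needed for every site.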

"I just need to know how professional RSS readers work."

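Professional RSS readers use some fancy algorithms that help them determine which images are the relevant ones for an article. They are not always right, though.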
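What those algorithms actually are is not stated here. Purely as an illustration of the kind of heuristic one could start with (an assumption, not a description of any particular reader), the sketch below prefers the page's Open Graph `og:image` meta tag when present and otherwise picks the `img` with the largest declared width times height. All names in it (`ImageCandidateParser`, `guess_main_image`) are hypothetical.

```python
from html.parser import HTMLParser


def _declared_area(attrs):
    """Width times height from the tag's attributes, or 0 if missing or non-numeric."""
    try:
        return int(attrs.get("width") or 0) * int(attrs.get("height") or 0)
    except ValueError:
        return 0  # sizes such as "100%" cannot be compared


class ImageCandidateParser(HTMLParser):
    """Collects the og:image value and every <img> with its declared area."""

    def __init__(self):
        super().__init__()
        self.og_image = None
        self.candidates = []  # list of (area, url) tuples in document order

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("property") == "og:image":
            if self.og_image is None:
                self.og_image = attrs.get("content")
        elif tag == "img" and attrs.get("src"):
            self.candidates.append((_declared_area(attrs), attrs["src"]))


def guess_main_image(html):
    """Guess the article's main image URL; this heuristic can easily be wrong."""
    parser = ImageCandidateParser()
    parser.feed(html)
    if parser.og_image:
        return parser.og_image
    if not parser.candidates:
        return None
    # Largest declared area wins; on ties the earliest image in the page is kept.
    return max(parser.candidates, key=lambda c: c[0])[1]
```

A heuristic like this fails in the same way the answer describes: a large banner or advertisement can outrank the real story image, so the result is a guess rather than a guarantee.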