Extracting titles + links from a home page

Time: 2013-11-20 14:39:01

Tags: python rss

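I want to build my own RSS feed with Python.

Is it possible to extract the titles and the download links ("Uploaded") from hd-area.org?

Here is a code example.

This is what I have so far.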

import urllib2
from BeautifulSoup import BeautifulSoup
import re

page = urllib2.urlopen("http://hd-area.org").read()
soup = BeautifulSoup(page)

for title in soup.findAll("div", {"class" : "title"}):
    print (title.getText())
for a in soup.findAll('a'):
  if 'Uploaded.net' in a:
    print a['href']

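It already extracts the titles.

But I am stuck at the part that should extract the links.

It does extract something, but not the right links...

Any suggestions on how to make sure the script first finds the div and then checks whether the link is inside that div class, "<div class="topbox">"?

Edit

I have got it working now.

This is the final code.

Thanks, everyone, for pointing me in the right direction.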

import urllib2
from BeautifulSoup import BeautifulSoup 
import datetime
import PyRSS2Gen

print "top_rls"
page = urllib2.urlopen("http://hd-area.org/index.php?s=Cinedubs").read()
soup = BeautifulSoup(page)
movieTit = []
movieLink = []
for title in soup.findAll("div", {"class" : "title"}):
    movieTit.append(title.getText())

for span in soup.findAll('span', attrs={"style":"display:inline;"},recursive=True):
    for a in span.findAll('a'):            
        if 'ploaded' in a.getText():
            movieLink.append(a['href'])
        elif 'cloudzer' in a.getText():
            movieLink.append(a['href'])

for i in range(len(movieTit)):
    print movieTit[i]
    print movieLink[i]

rss = PyRSS2Gen.RSS2(
    title = "HD-Area Cinedubs",
    link = "http://hd-area.org/index.php?s=Cinedubs",
    description = " "
                  " ",

    lastBuildDate = datetime.datetime.now(),
    items = [
        # one RSS item per (title, link) pair, limited to the first ten entries
        PyRSS2Gen.RSSItem(title=t, link=l)
        for t, l in zip(movieTit, movieLink)[:10]
    ])

rss.write_xml(open("cinedubs.xml", "w"))
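
For readers who do not have PyRSS2Gen installed, roughly the same feed can be produced with the standard library. This is only a sketch under stated assumptions: it is written for Python 3 (unlike the Python 2 code above), it swaps PyRSS2Gen for xml.etree.ElementTree, and the helper name build_rss and the sample entries are made up for illustration.

```python
import xml.etree.ElementTree as ET
from email.utils import formatdate


def build_rss(feed_title, feed_link, entries, description=""):
    """Return an RSS 2.0 document as a string.

    entries is a list of (title, link) tuples. build_rss is a hypothetical
    helper for this sketch, not part of PyRSS2Gen.
    """
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = feed_title
    ET.SubElement(channel, "link").text = feed_link
    ET.SubElement(channel, "description").text = description
    # RSS uses RFC 822 style dates; formatdate() returns one for the current time
    ET.SubElement(channel, "lastBuildDate").text = formatdate(usegmt=True)
    for title, link in entries:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = title
        ET.SubElement(item, "link").text = link
    return ET.tostring(rss, encoding="unicode")


# Made-up sample data standing in for the scraped titles and links
sample_entries = [
    ("Movie A", "http://example.com/a"),
    ("Movie B", "http://example.com/b"),
]
print(build_rss("HD-Area Cinedubs",
                "http://hd-area.org/index.php?s=Cinedubs",
                sample_entries))
```

The returned string can be written to cinedubs.xml inside a with open(...) block, which also closes the file reliably.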

2 answers:

Answer 0: (score: 0)

Something like this:

movieTit = []
movieLink = []

for title in soup.findAll("div", {"class" : "title"}):
    movieTit.append(title.getText())
for a in soup.findAll('a'):
    if 'ploaded' in a.getText():
        movieLink.append(a['href'])

for i in range(0,len(movieTit)/2,2):
    print movieTit[i]
    print movieTit[i+1]
    print movieLink[i]
    print movieLink[i+1]
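
If the two lists are meant to line up one-to-one, pairing them with zip avoids the index arithmetic altogether. A minimal sketch, written for Python 3 rather than the Python 2 used above; the helper name pair_entries and the sample data are hypothetical and do not come from hd-area.org.

```python
def pair_entries(titles, links):
    """Pair each title with the link at the same position.

    zip stops at the shorter list, so a missing link cannot raise IndexError.
    """
    return list(zip(titles, links))


# Made-up sample data standing in for movieTit and movieLink
movieTit = ["Movie A", "Movie B", "Movie C"]
movieLink = ["http://example.com/a", "http://example.com/b"]

for title, link in pair_entries(movieTit, movieLink):
    print(title)
    print(link)
```

Note that this silently drops titles without a matching link, so it only helps when both lists were collected in the same order.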

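Answer 1: (score: 0)

One suggestion is to first find all the <div class="topbox"> elements, in case there is more than one of them on the page. You can use the find_all function, or find, like this: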

from bs4 import BeautifulSoup

soup = BeautifulSoup(page, "html.parser")

# in case you want to find all of them
# (the keyword is class_, with a trailing underscore, because class is reserved in Python)
for item in soup.find_all('div', class_='topbox'):
    # check where the title lives: <span>, <a> or another tag,
    # and make sure the tag exists before reading it
    if item.span is not None:
        title = item.span.text

    # the same for the link
    if item.a is not None:
        link = item.a['href']

I could not find the div you mean on the page. If you need more help, please tell me exactly what you are looking for.
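
To illustrate the idea of looking inside each box first and only then picking the link, here is a rough sketch that uses only the standard library html.parser instead of BeautifulSoup, written for Python 3. The page layout it assumes (a div with class "title" and the download links nested inside each div with class "topbox") is a guess, and the class name TopboxParser, the link_marker parameter, and the sample HTML are all made up for illustration.

```python
from html.parser import HTMLParser


class TopboxParser(HTMLParser):
    """Collect one title and one matching link per <div class="topbox">."""

    def __init__(self, link_marker):
        super().__init__()
        self.link_marker = link_marker  # text a download link must contain
        self.depth = 0         # div nesting depth inside the current topbox
        self.in_title = False  # True while inside <div class="title">
        self.title_depth = 0
        self.href = None       # href of the <a> currently being read
        self.text = []         # text pieces of the current title or link
        self.current = None    # entry being filled for the current topbox
        self.entries = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "div":
            if self.depth:
                self.depth += 1
                if "title" in classes and not self.in_title:
                    self.in_title = True
                    self.title_depth = self.depth
                    self.text = []
            elif "topbox" in classes:
                self.depth = 1
                self.current = {"title": None, "link": None}
        elif tag == "a" and self.depth and not self.in_title:
            # only links inside a topbox (and outside the title) are candidates
            self.href = attrs.get("href")
            self.text = []

    def handle_data(self, data):
        if self.in_title or self.href is not None:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.href is not None:
            label = "".join(self.text)
            if (self.current is not None and self.current["link"] is None
                    and self.link_marker in label):
                self.current["link"] = self.href
            self.href = None
        elif tag == "div" and self.depth:
            if self.in_title and self.depth == self.title_depth:
                self.current["title"] = "".join(self.text).strip()
                self.in_title = False
            self.depth -= 1
            if self.depth == 0:  # the topbox itself just closed
                self.entries.append(self.current)
                self.current = None
                self.href = None


# Made-up markup; the real page may be structured differently
SAMPLE_HTML = """
<div class="topbox">
  <div class="title"><a href="/post/1">Movie A</a></div>
  <span style="display:inline;">
    <a href="http://example.com/other">Mirror</a>
    <a href="http://example.com/ul/a">Uploaded.net</a>
  </span>
</div>
<div class="topbox">
  <div class="title">Movie B</div>
  <a href="http://example.com/ul/b">Uploaded.net</a>
</div>
<a href="http://example.com/ul/stray">Uploaded.net</a>
"""

parser = TopboxParser(link_marker="ploaded")
parser.feed(SAMPLE_HTML)
for entry in parser.entries:
    print(entry["title"], entry["link"])
```

Because each title and link is stored per box, a box without a matching link ends up with link set to None instead of shifting every later pair by one, and links outside any topbox are ignored.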