Beautiful Soup and statements

Time: 2011-04-29 05:37:03

Tags: python beautifulsoup

I'm trying to find the first 30 TED videos (video name and URL) with the following BeautifulSoup script:

import urllib2
from BeautifulSoup import BeautifulSoup

total_pages = 3
page_count = 1
count = 1

url = 'http://www.ted.com/talks?page='

while page_count < total_pages:

    page = urllib2.urlopen("%s%d") %(url, page_count)

    soup = BeautifulSoup(page)

    link = soup.findAll(lambda tag: tag.name == 'a' and tag.findParent('dt', 'thumbnail'))

    outfile = open("test.html", "w")

    print >> outfile, """<head>
            <head>
                    <title>TED Talks Index</title>
            </head>

            <body>

            <br><br><center>

            <table cellpadding=15 cellspacing=0 style='border:1px solid #000;'>"""

    print >> outfile, "<tr><th style='border-bottom:2px solid #E16543; border-right:1px solid #000;'><b>###</b></th><th style='border-bottom:2px solid #E16543; border-right:1px solid #000;'>Name</th><th style='border-bottom:2px solid #E16543;'>URL</th></tr>"

    ted_link = 'http://www.ted.com/'

    for anchor in link:
            print >> outfile, "<tr style='border-bottom:1px solid #000;'><td style='border-right:1px solid #000;'>%s</td><td style='border-right:1px solid #000;'>%s</td><td>http://www.ted.com%s</td></tr>" % (count, anchor['title'], anchor['href'])

    count = count + 1

    print >> outfile, """</table>
                    </body>
                    </html>"""

    page_count = page_count + 1

The code looks fine, except for two things:

  1. The count doesn't seem to increment. The script only goes through and finds the contents of the first page, i.e. the first ten videos rather than thirty. Why?

  2. This code gives me a lot of errors. I don't know how to logically implement what I want (using `urlopen("%s%d")`):

    total_pages = 3
    page_count = 1
    count = 1
    
    url = 'http://www.ted.com/talks?page='
    
    while page_count < total_pages:
    
        page = urllib2.urlopen("%s%d") %(url, page_count)
    

1 Answer:

Answer 0 (score: 1)

First, simplify the loop and eliminate some variables that here amount to boilerplate:

for pagenum in xrange(1, 4):  # The 4 is annoying, write it as 3+1 if you like.
  url = "http://www.ted.com/talks?page=%d" % pagenum
  # do stuff with url

But let's open the file outside the loop rather than reopening it on every iteration. That's why you only see 10 results: following your logic, talks 11–20 are written over the first 10. (It would have been 21–30, except your loop condition is page_count < total_pages, which only processes the first two pages.)
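The off-by-one in that loop bound is easy to see with a minimal sketch:

```python
# With `while page_count < total_pages` and total_pages = 3,
# only pages 1 and 2 are ever visited; page 3 is skipped.
visited = []
page_count = 1
total_pages = 3
while page_count < total_pages:
    visited.append(page_count)
    page_count = page_count + 1
print(visited)  # [1, 2] -- page 3 is never fetched
```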

Collect all the links first, then write the output. I've removed the HTML styling, which also makes the code easier to follow; use CSS instead, perhaps in an inline &lt;style&gt; element, and add it back if you like.
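For instance, the stripped-out borders could come back as a single &lt;style&gt; block printed into the head instead of per-cell style attributes (a hypothetical sketch, not part of the answer's code):

```python
# Hypothetical <style> block recreating the borders that were stripped
# from the original inline style attributes; print it inside <head>.
STYLE_BLOCK = """<style>
table { border: 1px solid #000; border-collapse: collapse; }
th { border-bottom: 2px solid #E16543; border-right: 1px solid #000; }
td { border-right: 1px solid #000; }
</style>"""
```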

import urllib2
from cgi import escape  # Important!
from BeautifulSoup import BeautifulSoup

def is_talk_anchor(tag):
  return tag.name == "a" and tag.findParent("dt", "thumbnail")
links = []
for pagenum in xrange(1, 4):
  soup = BeautifulSoup(urllib2.urlopen("http://www.ted.com/talks?page=%d" % pagenum))
  links.extend(soup.findAll(is_talk_anchor))

out = open("test.html", "w")

print >>out, """<html><head><title>TED Talks Index</title></head>
<body>
<table>
<tr><th>#</th><th>Name</th><th>URL</th></tr>"""

for x, a in enumerate(links):
  print >>out, "<tr><td>%d</td><td>%s</td><td>http://www.ted.com%s</td></tr>" % (x + 1, escape(a["title"]), escape(a["href"]))

print >>out, "</table>"

# Or, as an ordered list:
print >>out, "<ol>"
for a in links:
  print >>out, """<li><a href="http://www.ted.com%s">%s</a></li>""" % (escape(a["href"], True), escape(a["title"]))
print >>out, "</ol>"

print >>out, "</body></html>"
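As the `# Important!` comment hints, the escaping is what keeps talk titles containing `<`, `>`, `&`, or quotes from breaking your HTML. A quick illustration (shown with `html.escape`, which in Python 3 replaced the since-removed `cgi.escape`; the title string is made up):

```python
from html import escape  # Python 3 replacement for the removed cgi.escape

# A hypothetical title with characters that are special in HTML.
title = 'Let\'s use <video> & "flip" it'
safe = escape(title, quote=True)  # quote=True also escapes " and '
print(safe)
```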