Beautiful Soup and statements

Time: 2011-04-29 05:37:03

Tags: python beautifulsoup

I'm trying to find the first 30 TED videos (video name and URL) with the following BeautifulSoup script:

import urllib2
from BeautifulSoup import BeautifulSoup

total_pages = 3
page_count = 1
count = 1

url = 'http://www.ted.com/talks?page='

while page_count < total_pages:

    page = urllib2.urlopen("%s%d") %(url, page_count)

    soup = BeautifulSoup(page)

    link = soup.findAll(lambda tag: tag.name == 'a' and tag.findParent('dt', 'thumbnail'))

    outfile = open("test.html", "w")

    print >> outfile, """<head>
            <head>
                    <title>TED Talks Index</title>
            </head>

            <body>

            <br><br><center>

            <table cellpadding=15 cellspacing=0 style='border:1px solid #000;'>"""

    print >> outfile, "<tr><th style='border-bottom:2px solid #E16543; border-right:1px solid #000;'><b>###</b></th><th style='border-bottom:2px solid #E16543; border-right:1px solid #000;'>Name</th><th style='border-bottom:2px solid #E16543;'>URL</th></tr>"

    ted_link = 'http://www.ted.com/'

    for anchor in link:
            print >> outfile, "<tr style='border-bottom:1px solid #000;'><td style='border-right:1px solid #000;'>%s</td><td style='border-right:1px solid #000;'>%s</td><td>http://www.ted.com%s</td></tr>" % (count, anchor['title'], anchor['href'])

    count = count + 1

    print >> outfile, """</table>
                    </body>
                    </html>"""

    page_count = page_count + 1

The code looks fine, except for two things:

  1. The count doesn't seem to increment. The script only goes through and finds the contents of the first page, i.e. the first ten videos rather than thirty. Why?

  2. This code gives me a lot of errors. I don't know how to logically implement what I want (using `urlopen("%s%d")`):

    total_pages = 3
    page_count = 1
    count = 1
    
    url = 'http://www.ted.com/talks?page='
    
    while page_count < total_pages:
    
        page = urllib2.urlopen("%s%d") %(url, page_count)
    

1 Answer:

Answer 0 (score: 1)

First, simplify the loop and eliminate some variables that here amount to boilerplate:

for pagenum in xrange(1, 4):  # The 4 is annoying, write it as 3+1 if you like.
  url = "http://www.ted.com/talks?page=%d" % pagenum
  # do stuff with url

But let's open the file outside the loop rather than reopening it on every iteration. That's why you only see 10 results: following your logic, talks 11–20 are written over the first 10. (It would have been 21–30, except your loop condition is page_count < total_pages, which only processes the first two pages.)
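The off-by-one in that loop bound is easy to see with a minimal sketch:

```python
# With `while page_count < total_pages` and total_pages = 3,
# only pages 1 and 2 are ever visited; page 3 is skipped.
visited = []
page_count = 1
total_pages = 3
while page_count < total_pages:
    visited.append(page_count)
    page_count = page_count + 1
print(visited)  # [1, 2] -- page 3 is never fetched
```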

Collect all the links first, then write the output. I've removed the HTML styling, which also makes the code easier to follow; use CSS instead, perhaps in an inline &lt;style&gt; element, and add it back if you like.
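For instance, the stripped-out borders could come back as a single &lt;style&gt; block printed into the head instead of per-cell style attributes (a hypothetical sketch, not part of the answer's code):

```python
# Hypothetical <style> block recreating the borders that were stripped
# from the original inline style attributes; print it inside <head>.
STYLE_BLOCK = """<style>
table { border: 1px solid #000; border-collapse: collapse; }
th { border-bottom: 2px solid #E16543; border-right: 1px solid #000; }
td { border-right: 1px solid #000; }
</style>"""
```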

import urllib2
from cgi import escape  # Important!
from BeautifulSoup import BeautifulSoup

def is_talk_anchor(tag):
  return tag.name == "a" and tag.findParent("dt", "thumbnail")
links = []
for pagenum in xrange(1, 4):
  soup = BeautifulSoup(urllib2.urlopen("http://www.ted.com/talks?page=%d" % pagenum))
  links.extend(soup.findAll(is_talk_anchor))

out = open("test.html", "w")

print >>out, """<html><head><title>TED Talks Index</title></head>
<body>
<table>
<tr><th>#</th><th>Name</th><th>URL</th></tr>"""

for x, a in enumerate(links):
  print >>out, "<tr><td>%d</td><td>%s</td><td>http://www.ted.com%s</td></tr>" % (x + 1, escape(a["title"]), escape(a["href"]))

print >>out, "</table>"

# Or, as an ordered list:
print >>out, "<ol>"
for a in links:
  print >>out, """<li><a href="http://www.ted.com%s">%s</a></li>""" % (escape(a["href"], True), escape(a["title"]))
print >>out, "</ol>"

print >>out, "</body></html>"
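As the `# Important!` comment hints, the escaping is what keeps talk titles containing `<`, `>`, `&`, or quotes from breaking your HTML. A quick illustration (shown with `html.escape`, which in Python 3 replaced the since-removed `cgi.escape`; the title string is made up):

```python
from html import escape  # Python 3 replacement for the removed cgi.escape

# A hypothetical title with characters that are special in HTML.
title = 'Let\'s use <video> & "flip" it'
safe = escape(title, quote=True)  # quote=True also escapes " and '
print(safe)
```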