Question

I'm using Python 2.7 and Beautiful Soup 3.2, and I have the following scraper to get the stream URLs:

# Import the classes that are needed
import urllib2
from BeautifulSoup import BeautifulSoup

# URL to scrape, opened with urllib2
url = 'http://www.wiziwig.tv/broadcast.php?matchid=219751&part=sports'
source = urllib2.urlopen(url)

# Turn the saved source into a BeautifulSoup object
soup = BeautifulSoup(source)

for tr in soup.findAll('tr', {'class': ['broadcast']}):
    stationName = tr.findAll('td')[1].text

    for trBelow in tr.findAllNext('tr'):
        curClass = trBelow['class']
        if curClass == 'broadcast':
            break

        kindStream = trBelow.findAll('td')[0].text
        streamUrl = trBelow.find('a', {'class': 'broadcast go'})['href']
        streamQuality = trBelow.findAll('td')[2].text
        streamRating = trBelow.find('div', {'class': 'rating'})['rel']

        print stationName, kindStream, streamQuality, streamRating, streamUrl

This works perfectly and gives the following output:

BWIN Flash 650 Kbps 100 http://forum.wiziwig.eu/threads/1847-BWIN-Info
BWIN Flash 675 Kbps 100 https://sports.bwin.com/en/sports?wm=3448325&zoneId=1068792
Bet365 Flash 650 Kbps 100 http://forum.wiziwig.eu/threads/6258-Bet365
Bet365 Flash 675 Kbps 100 http://www.bet365.com/?affiliate=365_014110
TRK Ukraine+ AceStream 1250 Kbps 100 acestream://94879770520f2e9db2146d0eca59204bfbd72cbe
TRK Ukraine+ AceStream 1251 Kbps 75 http://aviatortv.org/football_ua_plus/
Arenavision1 Sopcast 2000 Kbps 75 sop://broker.sopcast.com:3912/143876
Arenavision3 AceStream 2000 Kbps 75 acestream://a53a380706846bfc6667e21a1485dedb78b9674b
Arenavision3 AceStream 2001 Kbps 75 http://avod.me/play/a53a380706846bfc6667e21a1485dedb78b9674b
Dazsports Ace2 AceStream 850 Kbps 100 acestream://d293c82146aa6c2904e45ff305ae0f38dc5b329d
Dazsports Ace2 AceStream 851 Kbps 75 http://dazsports.org/ace2.html
Digi Sport1 [RO] Sopcast 1500 Kbps 100 sop://broker.sopcast.com:3912/146141
Digi Sport1 [RO] Sopcast 1500 Kbps 100 sop://broker.sopcast.com:3912/124992
Digi Sport1 [RO] Sopcast 1501 Kbps 100 sop://broker.sopcast.com:3912/139777
Digi Sport1 [RO] Sopcast 1501 Kbps 100 sop://broker.sopcast.com:3912/110152
Pole Position1 [NL] AceStream 1000 Kbps 100 acestream://86fd521d30e9319198b75121761eccf260fef0cb
Pole Position1 [NL] AceStream 1001 Kbps 75 http://polepositionweb.org/?page_id=6 popup
Solodeportes Veetle Veetle 850 Kbps 100 http://veetle.com/index.php/widget/index/E47CFF6CB6A770852515B8B30C2E30F6/0/true/default/false
Livesports4u4 Flash 225 Kbps 75 http://livesport4u.com/stream4.html
Cricfree Flash2 Flash 175 Kbps 75 http://cricfree.tv/live-golf-streaming-ch2.php
Njtvx9 Flash 175 Kbps 75 http://nutjob.eu/njtvx9.html
Igoal C+ Liga Flash 175 Kbps 75 http://ana1.me/liga+.html
Soccertoall2 [PT] Flash 175 Kbps 75 http://soccertoall.net/index.php?channel=2
Tugalive1 Flash 175 Kbps 75 http://www.tugalive.eu/p/live-1.html
Diresport1 Flash 175 Kbps 75 http://diresportt.blogspot.com.es/
Footstream11 Flash 175 Kbps 75 http://www.footstream.tv/channel11.html
Lag10 (8) Flash 150 Kbps 50 http://lag10.com/channel8
ANA STV2 Flash 400 Kbps 75 http://ana1.me/STV2.html
ANA STV2 Flash 400 Kbps 75 http://bliner.tv/sporttv2pt.html
Livesoccerhd4 Flash 225 Kbps 75 http://livesoccerhd.tv/l4.html
Stvstreams Ace HD1 AceStream 1500 Kbps 100 acestream://750acfc788e12220dbd57188505eae08f566281e
Stvstreams Ace HD1 AceStream 1500 Kbps 100 http://stvstreams.com/acestreams/stv-hd/
Btsportshd12 Flash 200 Kbps 75 http://www.btsportshd.com/stream12.php
Ana Stream1 Flash 175 Kbps 75 http://ana3.me/STREAM1.html
Onlinesoccer2all (13) Flash 175 Kbps 75 http://online--soccer.eu/channel13.html
Hdfoots6 Flash 175 Kbps 75 http://hdfoots.com/stream6.html

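But I'm wondering whether I should be doing it this way, or whether there is a better approach that avoids the inner loop `for trBelow in tr.findAllNext('tr'):` and then breaking out of it when a particular class is encountered?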

Answer 1

I would probably just iterate over the <tr> items:

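station_name = ''
for tr in soup.findAll('tr'):
    if tr['class'] == 'broadcast':
        # Header row: remember the station for the stream rows that follow
        station_name = tr.findAll('td')[1].text
    else:
        # Stream row: the same extraction code as in the question
        kind_stream = tr.findAll('td')[0].text
        stream_url = tr.find('a', {'class': 'broadcast go'})['href']
        stream_quality = tr.findAll('td')[2].text
        stream_rating = tr.find('div', {'class': 'rating'})['rel']
        print station_name, kind_stream, stream_quality, stream_rating, stream_url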

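That way the code is a bit clearer, I think.

On the other hand... it looks like you have a quick script that works. It is more likely to break because the HTML of the actual page changes than because of a bug in the code. So if it works, it works, I'd say.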
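
The single-pass idea above does not depend on the parser, so here is a minimal sketch of just the pattern. It assumes each row has already been reduced to a `(css_class, cells)` tuple; the helper name `iter_streams`, the sample rows and the `'streamrow'` class name are made up for illustration and are not taken from the actual page.

```python
def iter_streams(rows):
    """Yield (station, kind, quality) for each stream row, in document order."""
    station_name = ''
    for css_class, cells in rows:
        if css_class == 'broadcast':
            # Header row: remember the station for the stream rows that follow.
            station_name = cells[1]
        else:
            # Stream row: pair it with the most recent station header.
            yield station_name, cells[0], cells[2]


# Hypothetical sample rows, loosely modelled on the output in the question.
rows = [
    ('broadcast', ['', 'BWIN']),
    ('streamrow', ['Flash', '', '650 Kbps']),
    ('streamrow', ['Flash', '', '675 Kbps']),
    ('broadcast', ['', 'Arenavision1']),
    ('streamrow', ['Sopcast', '', '2000 Kbps']),
]

for record in iter_streams(rows):
    print(' '.join(record))
```

Because the station name is carried along as state, each row is visited exactly once and no `break` is needed.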

Answer 2

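I think your implementation is already fine. Just one quick question: what if I want to reuse some of the content I retrieved? I'd claim that Beautiful Soup has no built-in cache, so if I re-ran this loop it would traverse the nodes all over again.

Here's my take: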

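# Collect everything once so the results can be reused without traversing the tree again
tr_elements = soup.findAll('tr', {'class': ['broadcast']})
tr_belows = [tr.findAllNext('tr') for tr in tr_elements]
collection = {}
collection['station_names'] = [tr.findAll('td')[1].text for tr in tr_elements]
# tr_belows is a list of lists, so flatten it while reading the first cell of each row.
# Note: findAllNext does not stop at the next 'broadcast' row, so rows repeat here.
collection['kind_streams'] = [trb.findAll('td')[0].text
                              for rows in tr_belows for trb in rows]
## and so forth.
print collection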

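This still needs some work, since it cannot scan "broadcast" nodes that are nested inside other nodes. The complexity of my approach could also use some work.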
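
For what it's worth, the "collect once, reuse later" idea could also be sketched like this, grouping the stream rows under their station so each group ends where the next "broadcast" row begins. As an assumption for illustration, the rows are plain `(css_class, cells)` tuples rather than Beautiful Soup tags; `collect_streams`, the sample rows and the `'streamrow'` class name are hypothetical.

```python
from collections import OrderedDict


def collect_streams(rows):
    """Group stream rows under their station name, preserving page order."""
    collection = OrderedDict()
    current = None  # list of streams for the most recent station header
    for css_class, cells in rows:
        if css_class == 'broadcast':
            # New station header: start (or reopen) its list of streams.
            current = collection.setdefault(cells[1], [])
        elif current is not None:
            # Stream row: attach it to the most recent station.
            current.append({'kind': cells[0], 'quality': cells[2]})
    return collection


# Hypothetical sample rows, loosely modelled on the output in the question.
rows = [
    ('broadcast', ['', 'BWIN']),
    ('streamrow', ['Flash', '', '650 Kbps']),
    ('streamrow', ['Flash', '', '675 Kbps']),
    ('broadcast', ['', 'Arenavision1']),
    ('streamrow', ['Sopcast', '', '2000 Kbps']),
]

streams = collect_streams(rows)
for station in streams:
    print('%s: %d stream(s)' % (station, len(streams[station])))
```

The returned mapping can then be looped over as often as needed without touching the parsed tree again.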

Loop within a loop: can I do this better?
