Question

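I was recently recommended Beautiful Soup for a Python project. I have been reading the documentation on the Beautiful Soup site, but I can't work out how to make it do what I want. I have a page with a whole bunch of links. It is a directory listing that contains the links, file sizes, and so on. Let's say it looks like this: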


Parent Directory/       -   Directory
game1.tar.gz    2010-May-24 06:51:39    8.2K    application/octet-stream
game2.tar.gz    2010-Jun-19 09:09:34    542.4K  application/octet-stream
game3.tar.gz    2011-Nov-13 11:53:01    5.5M    application/octet-stream

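What I want to do is supply a search string, say game2, and have it download game2.tar.gz. I thought about using regular expressions, but I have heard that Beautiful Soup is much better for this. Can anyone show and explain how I would do this?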

Answer 1

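# Python 2 with BeautifulSoup 3
from BeautifulSoup import BeautifulSoup
import urllib2

def searchLinks(url, query_string):
    f = urllib2.urlopen(url)
    soup = BeautifulSoup(f, convertEntities='html')
    for a in soup.findAll('a'):
        # Skip anchors without an href or without any child nodes
        if a.has_key('href') and a.contents:
            # If the first child is a string, find() returns an index (-1 if absent);
            # if it is a nested tag, find() returns None
            idx = a.contents[0].find(query_string)
            if idx is not None and idx > -1:
                yield a['href']

res = list(searchLinks('http://example.com', 'game2'))
print res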
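
For readers on Python 3, here is a rough sketch of the same idea using Beautiful Soup 4 (the `bs4` package). It assumes the directory page exposes each file as an ordinary `<a href=...>` link; `SAMPLE_HTML`, `search_links` and the base URL are hypothetical stand-ins so the example runs without network access.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def search_links(html, query, base_url=""):
    """Return the URLs of links whose visible text contains `query`."""
    soup = BeautifulSoup(html, "html.parser")
    matches = []
    # href=True restricts the search to <a> tags that actually have an href
    for a in soup.find_all("a", href=True):
        if query in a.get_text():
            # Resolve relative hrefs against the page URL
            matches.append(urljoin(base_url, a["href"]))
    return matches


# Hypothetical markup standing in for the real directory page
SAMPLE_HTML = """
<a href="../">Parent Directory/</a>
<a href="game1.tar.gz">game1.tar.gz</a>
<a href="game2.tar.gz">game2.tar.gz</a>
<a href="game3.tar.gz">game3.tar.gz</a>
"""

print(search_links(SAMPLE_HTML, "game2", "http://example.com/files/"))
```

With a real page, the HTML would come from fetching the URL first; the matching logic stays the same.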

Answer 2

Your question is not very clear.

Based on the data you have given, I think this is all you need:

content = '''Parent Directory/       -   Directory
game1.tar.gz    2010-May-24 06:51:39    8.2K    application/octet-stream
game2.tar.gz    2010-Jun-19 09:09:34    542.4K  application/octet-stream
game3.tar.gz    2011-Nov-13 11:53:01    5.5M    application/octet-stream'''


def what_dir(x, content):
    for line in content.splitlines():
        if x in line.split(None,1)[0]:
            return line.split(None,1)[0]
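
To show how the lookup behaves, here is a small Python 3 sketch of the same approach with a usage line. The name `find_filename` and the empty-line guard are additions for illustration; the listing text is the one from the question.

```python
CONTENT = """Parent Directory/       -   Directory
game1.tar.gz    2010-May-24 06:51:39    8.2K    application/octet-stream
game2.tar.gz    2010-Jun-19 09:09:34    542.4K  application/octet-stream
game3.tar.gz    2011-Nov-13 11:53:01    5.5M    application/octet-stream"""


def find_filename(query, content):
    """Return the first column of the first line whose first column contains `query`."""
    for line in content.splitlines():
        fields = line.split(None, 1)  # split on the first run of whitespace only
        if fields and query in fields[0]:
            return fields[0]
    return None  # no line matched


print(find_filename("game2", CONTENT))  # game2.tar.gz
```

This only works on the plain-text rendering of the listing; it does not give you the link target if the href differs from the displayed name.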

Edit

Does this help?

import urllib
import re

sock = urllib.urlopen('http://pastie.org/pastes/1801547/reply')
content = sock.read()
sock.close()

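# Start and end offsets of the <textarea> that holds the pasted listing
spa = re.search(r'<textarea class="pastebox".+?</textarea>',content,re.DOTALL).span()

# Inside the textarea the HTML is entity-escaped; \1 requires the link text to equal the href
regx = re.compile(r'href=&quot;(.+?)&quot;&gt;\1&lt;')

# Search only between the two offsets
print regx.findall(content,*spa)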
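
The snippet above depends on a live page, so here is an offline Python 3 sketch of the same technique: locate the textarea, then run a backreference pattern only inside that span. The `content` string is a made-up stand-in for the fetched page, not the actual pastie markup.

```python
import re

# Hypothetical page: one escaped link outside the textarea, two inside it
content = (
    '&lt;a href=&quot;outside.tar.gz&quot;&gt;outside.tar.gz&lt;/a&gt;\n'
    '<textarea class="pastebox">'
    '&lt;a href=&quot;game1.tar.gz&quot;&gt;game1.tar.gz&lt;/a&gt;\n'
    '&lt;a href=&quot;game2.tar.gz&quot;&gt;game2.tar.gz&lt;/a&gt;\n'
    '</textarea>'
)

# Offsets of the textarea block within the page
start, end = re.search(
    r'<textarea class="pastebox".+?</textarea>', content, re.DOTALL
).span()

# \1 must repeat the captured href, so only links whose text equals the href match
link_re = re.compile(r'href=&quot;(.+?)&quot;&gt;\1&lt;')

# pos/endpos limit the search to the textarea, so the outside link is skipped
names = link_re.findall(content, start, end)
print(names)
```

Passing the span as `pos` and `endpos` avoids slicing the string and keeps links elsewhere on the page out of the result.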

Edit 2

Or is this what you want?

import urllib
import re

sock = urllib.urlopen('http://pastie.org/pastes/1801547/reply')
content = sock.read()
sock.close()

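# Limit the search to the <textarea> that holds the pasted listing
spa = re.search(r'<textarea class="pastebox".+?</textarea>',content,re.DOTALL).span()
regx = re.compile(r'href=&quot;(.+?)&quot;&gt;\1&lt;')
# Map the part of each name before the first dot (e.g. 'game2') to a full URL
dic = dict((name.split('.')[0],'http://pastie.org/pastes/1801547/'+name)
           for name in regx.findall(content,*spa))
print dic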

Result

{'game3': 'http://pastie.org/pastes/1801547/game3.tar.gz',
 'game2': 'http://pastie.org/pastes/1801547/game2.tar.gz',
 'game1': 'http://pastie.org/pastes/1801547/game1.tar.gz'}
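
The question also asks for the file to be downloaded. Below is a hedged Python 3 sketch of that last step, using a dictionary shaped like the result above. `pick_url` and `download` are hypothetical helper names, and the download call is left commented out because it needs network access and assumes the URL really serves the file.

```python
import shutil
import urllib.request


def pick_url(links, query):
    """Return the URL of the first key (in sorted order) containing `query`, else None."""
    for name in sorted(links):
        if query in name:
            return links[name]
    return None


def download(url, dest):
    """Stream the response body at `url` into the local file `dest`."""
    with urllib.request.urlopen(url) as resp, open(dest, "wb") as out:
        shutil.copyfileobj(resp, out)


links = {
    "game3": "http://pastie.org/pastes/1801547/game3.tar.gz",
    "game2": "http://pastie.org/pastes/1801547/game2.tar.gz",
    "game1": "http://pastie.org/pastes/1801547/game1.tar.gz",
}

url = pick_url(links, "game2")
print(url)
# download(url, "game2.tar.gz")  # uncomment to fetch the file
```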

Answer 3

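There are plenty of videos on YouTube about installing and using Beautiful Soup 4 for scraping. They are very detailed. I am still slowly working through them, but the first one got me installed and running.
Search YouTube for "Beautiful Soup".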
