Question

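I need to parse an HTML page and get all the URLs that match my requirements.

Now I need to parse each extracted URL to get the data I want, but only if the page title matches certain content, and then save the results into multiple files according to their names. I have done part 1 in the following way.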

pattern=re.compile(r'''class="topline"><A href="(.*?)"''')
da = pattern.search(web_page)
da = pattern.findall(soup1)
col_width = max(len(word) for row in da for word in row)
for row in da:
    if "some string" in row.upper():
        bb = "".join(row.ljust(col_width))
        print >> links, bb

I would really appreciate any help. Thanks.

Answer 1

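First, do not parse HTML with regex. You tagged the question with BeautifulSoup, yet you are still using a regular expression here.

Here is how you can get the links, follow them, and check the title: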

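from urllib2 import urlopen  # Python 2; on Python 3 use urllib.request
from bs4 import BeautifulSoup

URL = "url here"

# Name the parser explicitly so the result does not depend on what is installed.
soup = BeautifulSoup(urlopen(URL), "html.parser")
links = soup.select('.topline > a')
for a in links:
    link = a.get('href')  # read href from the current <a> tag
    if link:
        # follow link
        link_soup = BeautifulSoup(urlopen(link), "html.parser")
        title = link_soup.find('title')
        # check title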
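
The `# check title` step is left open in the snippet. Here is a minimal sketch of one way to fill it in and write each matching page to its own file, written for Python 3 with the standard library only. It assumes the title text has already been extracted as a string (for example with `title.get_text()`). The helper names `safe_filename` and `save_if_match`, the keyword, and the output directory are hypothetical and not part of the original answer.

```python
import os
import re
import tempfile


def safe_filename(title):
    """Turn a page title into a filesystem-friendly name (hypothetical helper)."""
    # Collapse every run of characters other than ASCII letters, digits,
    # '-' or '_' into a single underscore.
    name = re.sub(r"[^A-Za-z0-9_-]+", "_", title.strip())
    return name.strip("_") or "untitled"


def save_if_match(title, data, keyword, out_dir):
    """Write data to <out_dir>/<title>.txt when keyword occurs in the title.

    Returns the path written, or None when the title does not match.
    """
    if keyword.lower() not in title.lower():
        return None
    path = os.path.join(out_dir, safe_filename(title) + ".txt")
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(data)
    return path


if __name__ == "__main__":
    out_dir = tempfile.mkdtemp()
    print(save_if_match("Quarterly Report: Q1", "some data", "report", out_dir))
    print(save_if_match("Contact us", "ignored", "report", out_dir))
```

The match here is a case-insensitive substring test. If the titles need a stricter rule, a compiled regular expression could be used on the title string instead, since that is plain text rather than HTML.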

The .topline > a CSS selector matches every a tag that is a direct child of an element with the topline class.

Hope that helps.
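
To make the difference between a direct child and a deeper descendant concrete, here is a small sketch against a made-up HTML fragment. The markup and the `href` values are invented for illustration, and it assumes `bs4` is installed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one direct child link, one nested link, one unrelated link.
HTML = """
<div class="topline"><a href="/direct">direct child</a></div>
<div class="topline"><span><a href="/nested">nested deeper</a></span></div>
<div class="other"><a href="/elsewhere">different class</a></div>
"""

soup = BeautifulSoup(HTML, "html.parser")

# '>' is the child combinator: only <a> tags directly under .topline match.
child_links = [a.get("href") for a in soup.select(".topline > a")]

# A space is the descendant combinator: <a> tags at any depth under .topline match.
descendant_links = [a.get("href") for a in soup.select(".topline a")]

print(child_links)       # ['/direct']
print(descendant_links)  # ['/direct', '/nested']
```

If the links on the real page are wrapped in another tag inside the topline element, the descendant form `.topline a` is the one to use.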

Parse multiple URLs and extract data
