Python RE returns nothing after /ref=

Time: 2014-07-14 18:45:06

Tags: python regex python-2.7

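I am trying to retrieve the URLs and category names from Amazon's Best Sellers list. For some reason, the RE I am using stops when it reaches /ref=, and I really don't understand why. I am using Python 2.7 on a Windows 7 box.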

A typical record is

<li><a href="http://www.amazon.ca/Best-Sellers-Appstore-Android/zgbs/mobile-apps/ref=zg_bs_nav_0">Appstore for Android</a></li>

My RE is

Regex = "<li><a href='(http://www.amazon.ca/Best-Sellers.*?)'>(.*?)</a></li>"
Category = re.compile(Regex)

which returns a tuple

[][0] http://www.amazon.ca/Best-Sellers-Appstore-Android/zgbs/mobile-apps
[][1] Appstore for Android

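I do get all the right records, but as you can see, the URL is missing /ref=zg_bs_nav_0

Other levels of the category hierarchy show the same problem: everything in the URL from /ref= onward is missing.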

After taking Martijn's advice, here is my code snippet

import urllib
from bs4 import BeautifulSoup

# First page of the list of Best Sellers categories
URL = "http://www.amazon.ca/gp/bestsellers"

# Retrieve the page source
HTMLFile = urllib.urlopen(URL)
HTMLText = HTMLFile.read()

soup = BeautifulSoup(HTMLText)
for link in soup.select('li > a[href^=http://www.amazon.ca/Best-Sellers]'):
    print link['href']
    print link.get_text()

1 Answer:

Answer 0 (score: 4)

You are using a regular expression, but matching HTML or XML with such expressions gets too complicated, too fast. Don't do it.
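To make the fragility concrete, here is a minimal sketch using only the `re` module (written for Python 3). It runs the question's pattern against the sample record from the question, then tries two variants. The double-quoted pattern and the extra `class` attribute are assumptions made up for illustration; they do not come from the original post.

```python
import re

# Sample record copied from the question.
RECORD = ('<li><a href="http://www.amazon.ca/Best-Sellers-Appstore-Android'
          '/zgbs/mobile-apps/ref=zg_bs_nav_0">Appstore for Android</a></li>')

# Pattern from the question: expects single quotes around the href value.
SINGLE_QUOTED = re.compile(
    "<li><a href='(http://www.amazon.ca/Best-Sellers.*?)'>(.*?)</a></li>")

# Hypothetical variant: the same pattern with double quotes instead.
DOUBLE_QUOTED = re.compile(
    '<li><a href="(http://www.amazon.ca/Best-Sellers.*?)">(.*?)</a></li>')

# Hypothetical markup change: the same link with one extra attribute.
REORDERED = RECORD.replace('<a href=', '<a class="nav" href=')

single_hits = SINGLE_QUOTED.findall(RECORD)        # quote style differs
double_hits = DOUBLE_QUOTED.findall(RECORD)        # matches, /ref= included
reordered_hits = DOUBLE_QUOTED.findall(REORDERED)  # extra attribute breaks it

print(single_hits)
print(double_hits)
print(reordered_hits)
```

When the pattern does match, the captured URL includes the /ref= part, so the regular expression is not what truncates it. The quote style and the extra attribute, on the other hand, are enough to make the match fail outright.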

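Use an HTML parser instead; Python has several to choose from:

- ElementTree is part of the standard library
- BeautifulSoup is a popular third-party library
- lxml is a fast and feature-rich C-based library for XML and HTML

The latter two also handle malformed HTML quite gracefully, making decent sense of many a botched website. In fact, BeautifulSoup 4 uses lxml as its parser of choice if it is installed.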

BeautifulSoup example:

from bs4 import BeautifulSoup

soup = BeautifulSoup(htmlsource)
for link in soup.select('li > a[href^=http://www.amazon.ca/Best-Sellers]'):
    print link['href'], link.get_text()

This uses a CSS selector to find all <a> elements contained directly in a <li> element, where the href attribute starts with the text http://www.amazon.ca/Best-Sellers.

Demo:

>>> from bs4 import BeautifulSoup
>>> htmlsource = '<li><a href="http://www.amazon.ca/Best-Sellers-Appstore-Android/zgbs/mobile-apps/ref=zg_bs_nav_0">Appstore for Android</a></li>'
>>> soup = BeautifulSoup(htmlsource)
>>> for link in soup.select('li > a[href^=http://www.amazon.ca/Best-Sellers]'):
...     print link['href'], link.get_text()
... 
http://www.amazon.ca/Best-Sellers-Appstore-Android/zgbs/mobile-apps/ref=zg_bs_nav_0 Appstore for Android
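For readers without BeautifulSoup at hand, the logic of that selector can also be spelled out by hand. The following is a rough sketch, not part of the original answer, using the standard library's `html.parser` in Python 3. The class name `BestSellerLinks` and the `VOID_TAGS` set are hypothetical helpers introduced here, and the open-tag tracking is deliberately simplistic.

```python
from html.parser import HTMLParser

PREFIX = "http://www.amazon.ca/Best-Sellers"
# Tags that never get a closing tag; kept off the open-tag stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class BestSellerLinks(HTMLParser):
    """Collect (href, text) for <a> tags that are direct children of <li>."""

    def __init__(self):
        super().__init__()
        self.stack = []    # names of the currently open tags
        self.links = []    # finished (href, text) pairs
        self._href = None  # href of the <a> being read, if any
        self._text = []    # text chunks seen inside that <a>

    def handle_starttag(self, tag, attrs):
        parent = self.stack[-1] if self.stack else None
        if tag not in VOID_TAGS:
            self.stack.append(tag)
        href = dict(attrs).get("href") or ""
        # Same condition as the selector: li > a[href^=PREFIX]
        if tag == "a" and parent == "li" and href.startswith(PREFIX):
            self._href = href
            self._text = []

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text)))
            self._href = None
        # Pop back to the matching open tag; tolerates unclosed tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)


htmlsource = ('<li><a href="http://www.amazon.ca/Best-Sellers-Appstore-Android'
              '/zgbs/mobile-apps/ref=zg_bs_nav_0">Appstore for Android</a></li>')
parser = BestSellerLinks()
parser.feed(htmlsource)
for href, text in parser.links:
    print(href, text)
```

This is considerably more code than the one-line selector, which is a fair argument for using BeautifulSoup or lxml when they are available.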

Note that Amazon alters the response based on the request headers:

>>> import requests
>>> from bs4 import BeautifulSoup
>>> r = requests.get('http://www.amazon.ca/gp/bestsellers')
>>> soup = BeautifulSoup(r.content)
>>> soup.select('li > a[href^=http://www.amazon.ca/Best-Sellers]')[0]
<a href="http://www.amazon.ca/Best-Sellers-Appstore-Android/zgbs/mobile-apps">Appstore for Android</a>
>>> r = requests.get('http://www.amazon.ca/gp/bestsellers', headers={
...     'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36'})
>>> soup = BeautifulSoup(r.content)
>>> soup.select('li > a[href^=http://www.amazon.ca/Best-Sellers]')[0]
<a href="http://www.amazon.ca/Best-Sellers-Appstore-Android/zgbs/mobile-apps/ref=zg_bs_nav_0/185-3312534-9864113">Appstore for Android</a>
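The question's snippet fetches the page with `urllib.urlopen`, which sends Python's own default User-Agent rather than a browser's. As a sketch of sending a browser-style header without `requests`, the following uses Python 3's `urllib.request` (in Python 2.7 the comparable `Request` class lives in `urllib2`). The helper names `build_request` and `fetch` are made up for illustration, and whether Amazon still varies its markup this way is an assumption based on the demo above.

```python
import urllib.request

URL = "http://www.amazon.ca/gp/bestsellers"
# Browser User-Agent string copied from the demo above.
USER_AGENT = ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) "
              "AppleWebKit/537.36 (KHTML, like Gecko) "
              "Chrome/35.0.1916.153 Safari/537.36")


def build_request(url=URL, user_agent=USER_AGENT):
    """Return a Request carrying a browser-style User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url=URL, user_agent=USER_AGENT, timeout=10):
    """Download the page body as bytes (performs a network request)."""
    request = build_request(url, user_agent)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()
```

Calling `fetch()` returns the raw page bytes, which can then be handed to `BeautifulSoup(...)` exactly as in the examples above.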