How to get the href link from an onclick function in Python

Asked: 2016-09-02 09:38:08

Tags: python html python-3.x beautifulsoup

I want to get the href link from an onclick function on a website. This is the HTML, where the onclick function opens a website:

<div class="fl">
  <span class="taLnk" onclick="ta.trackEventOnPage('Eatery_Listing', 'Website', 594024, 1); ta.util.cookie.setPIDCookie(15190); ta.call('ta.util.link.targetBlank', event, this, {'aHref':'LqMWJQzZYUWJQpEcYGII26XombQQoqnQQQQoqnqgoqnQQQQoqnQQQQoqnQQQQoqnqgoqnQQQQoqnQQuuuQQoqnQQQQoqnxioqnQQQQoqnQQ2EisSMVCnVcJQQoqnQQQQoqnxioqnQQQQoqnQQniaQQoqnQQQQoqnqgoqnQQQQoqnQQWJQzhYMJkH3KHVAdJJH3VVdB', 'isAsdf':true})">Website</span> 
</div>

Normally I use this code to get the href link from any span or element:

geturl = soup.find_all("span", {"class": "taLnk"})
for link in geturl:
  hreflink = link.get("href")
  print(hreflink)

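In this case, however, there is no way to read href directly, because the link only exists inside the onclick function.

Please help, what should I do now?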

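2 answers:

Answer 0: (score: 0)

You cannot read the aHref value directly as an attribute. You need to extract the onclick attribute first and then pull the value out of it with a regular expression: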

>>> import re
>>> data = soup.select('.taLnk')[0].get('onclick')
>>> href = re.search(r"(?is)'aHref':'(.*?)'", str(data)).group(1)
>>> href
'LqMWJQzZYUWJQpEcYGII26XombQQoqnQQQQoqnqgoqnQQQQoqnQQQQoqnQQQQoqnqgoqnQQQQoqnQQuuuQQoqnQQQQoqnxioqnQQQQoqnQQ2EisSMVCnVcJQQoqnQQQQoqnxioqnQQQQoqnQQniaQQoqnQQQQoqnqgoqnQQQQoqnQQWJQzhYMJkH3KHVAdJJH3VVdB'
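
Note that `.group(1)` raises `AttributeError` when the pattern does not match, for example on a span whose onclick has no `aHref`. A minimal sketch of a more defensive version is below. The helper name `extract_ahref`, the constant `AHREF_RE`, and the shortened sample string are made up for illustration and are not part of the page or of any library:

```python
import re
from typing import Optional

# Matches 'aHref':'...' or "aHref":"...", allowing spaces around the colon.
AHREF_RE = re.compile(r"""['"]aHref['"]\s*:\s*['"]([^'"]*)['"]""")


def extract_ahref(onclick: Optional[str]) -> Optional[str]:
    """Return the aHref value embedded in an onclick string, or None."""
    if not onclick:
        return None
    match = AHREF_RE.search(onclick)
    return match.group(1) if match else None


# Shortened, hypothetical onclick value in the same shape as the one in the question.
sample = "ta.call('ta.util.link.targetBlank', event, this, {'aHref':'abc123', 'isAsdf':true})"
print(extract_ahref(sample))               # the captured value
print(extract_ahref("doSomethingElse()"))  # no aHref present
```

With this, `extract_ahref(soup.select('.taLnk')[0].get('onclick'))` should return the encoded string, or `None` instead of raising when the attribute is missing or has a different shape.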

Answer 1: (score: 0)

You can combine a regular expression with bs4: select the span that has the class taLnk and an onclick attribute starting with ta.trackEventOnPage:

h = """<div class="fl">
  <span class="taLnk" onclick="ta.trackEventOnPage('Eatery_Listing', 'Website', 594024, 1); ta.util.cookie.setPIDCookie(15190); ta.call('ta.util.link.targetBlank', event, this, {'aHref':'LqMWJQzZYUWJQpEcYGII26XombQQoqnQQQQoqnqgoqnQQQQoqnQQQQoqnQQQQoqnqgoqnQQQQoqnQQuuuQQoqnQQQQoqnxioqnQQQQoqnQQ2EisSMVCnVcJQQoqnQQQQoqnxioqnQQQQoqnQQniaQQoqnQQQQoqnqgoqnQQQQoqnQQWJQzhYMJkH3KHVAdJJH3VVdB', 'isAsdf':true})">Website</span>
</div>"""

from bs4 import BeautifulSoup
import re
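# Name the parser explicitly to avoid the "no parser was explicitly specified" warning.
soup = BeautifulSoup(h, "html.parser")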

# The attribute value contains dots, so it has to be quoted inside the selector.
data = soup.select_one('span.taLnk[onclick^="ta.trackEventOnPage"]')["onclick"]
print(re.search("'aHref':'(.*?)'", data).group(1))
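
If the page contains several such spans, the same idea can be applied in a loop. This is only a sketch under assumptions: the markup below is a shortened, made-up stand-in for the real listing page, and `collect_ahrefs` is a hypothetical helper name.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: two spans with an aHref and one without an onclick handler.
html = """
<div><span class="taLnk" onclick="ta.call('x', event, this, {'aHref':'first', 'isAsdf':true})">Website</span></div>
<div><span class="taLnk">No handler</span></div>
<div><span class="taLnk" onclick="ta.call('x', event, this, {'aHref':'second', 'isAsdf':true})">Website</span></div>
"""


def collect_ahrefs(markup):
    """Return every aHref value found in the onclick of span.taLnk elements."""
    soup = BeautifulSoup(markup, "html.parser")
    found = []
    # [onclick] keeps only the spans that actually carry the attribute.
    for span in soup.select("span.taLnk[onclick]"):
        match = re.search(r"'aHref':'(.*?)'", span["onclick"])
        if match:  # skip handlers that do not contain an aHref
            found.append(match.group(1))
    return found


print(collect_ahrefs(html))
```

The values extracted this way are still the site's encoded strings, not plain URLs. Decoding them is a separate problem that neither answer covers.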