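How do I get the <a> tags inside a div tag in Python?

Date: 2016-02-24 12:40:32

Tags: python regex beautifulsoup

I am scraping the website http://i.cantonfair.org.cn/en/expexhibitorlist.aspx?categoryno=411 with Python. I want to get the links inside a div tag that contains <a> tags like these: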

<div id="main_category">
  <div class="tit1"><a href="#" onclick="ExpandStage(1);"><strong>Phase 1</strong><br />April 15 - 19</a></div>
  <ul id="phase1">   
    <li><a href="expexhibitorlist.aspx?categoryno=411">Consumer Electronics and Information Products</a></li>
    <li><a href="expexhibitorlist.aspx?categoryno=412">Electronic and Electrical Products</a></li>

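I only want the <a> tags inside the list items, such as

<a href="expexhibitorlist.aspx?categoryno=411">Consumer Electronics and Information Products</a>

Also, how can I find these URLs with a regular expression?

This is what I have tried so far: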

from bs4 import BeautifulSoup
import re
import urllib.request

r = urllib.request.urlopen('http://i.cantonfair.org.cn/en/expexhibitorlist.aspx?categoryno=410').read()
soup = BeautifulSoup(r, "html.parser")
letters = soup.find_all("div", {"id": "main_category"})
for element in letters:
    categories = element.a.get_text()
    print(categories)

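1 Answer:

Answer 0: (score: 0)

I am using Python 2.7, and the following works for me. The same approach works in Python 3 if you import urlopen from urllib.request instead of urllib2 and call print as a function. Hope it helps: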

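from bs4 import BeautifulSoup as bs
from urllib2 import urlopen  # Python 2.7 only

# Download the page and parse it (the "lxml" parser requires the lxml package)
r = urlopen('http://i.cantonfair.org.cn/en/expexhibitorlist.aspx?categoryno=410').read()
soup = bs(r, "lxml")

# Take the href of the first <a> in every <li>, skipping <li> elements without a link
lis = soup.find_all("li")
hrefs = [c.a['href'] for c in lis if c.a is not None]
print hrefs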
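
Note that the snippet above collects links from every <li> on the page, not only those inside the main_category div, and it does not cover the regular-expression part of the question. A minimal Python 3 sketch that addresses both might look like the following. It runs on the markup quoted in the question rather than the live site, and the function names and the URL pattern are assumptions made for illustration (the pattern is inferred from the two links shown in the question).

```python
import re
from bs4 import BeautifulSoup

# Sample markup copied from the question; the closing </ul> and </div> are
# added here only so the snippet is well-formed.
SAMPLE_HTML = """
<div id="main_category">
  <div class="tit1"><a href="#" onclick="ExpandStage(1);"><strong>Phase 1</strong><br />April 15 - 19</a></div>
  <ul id="phase1">
    <li><a href="expexhibitorlist.aspx?categoryno=411">Consumer Electronics and Information Products</a></li>
    <li><a href="expexhibitorlist.aspx?categoryno=412">Electronic and Electrical Products</a></li>
  </ul>
</div>
"""

# Assumed URL pattern, inferred from the two links shown in the question.
CATEGORY_HREF = re.compile(r"expexhibitorlist\.aspx\?categoryno=\d+")


def category_links(html):
    """Return (href, text) pairs for the category links inside div#main_category."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", id="main_category")
    if container is None:
        return []
    # Passing a compiled regex as href keeps only <a> tags whose href matches,
    # which skips the href="#" phase header link.
    anchors = container.find_all("a", href=CATEGORY_HREF)
    return [(a["href"], a.get_text(strip=True)) for a in anchors]


def category_hrefs_regex(html):
    """Regex-only variant: pull matching href values straight from the raw HTML."""
    return re.findall(r'href="(expexhibitorlist\.aspx\?categoryno=\d+)"', html)


if __name__ == "__main__":
    for href, text in category_links(SAMPLE_HTML):
        print(href, text)
    print(category_hrefs_regex(SAMPLE_HTML))
```

To use it on the real page, pass the bytes returned by urllib.request.urlopen(...).read() to category_links instead of SAMPLE_HTML. The regex-only variant is fragile if the site changes its quoting or attribute order, so the BeautifulSoup version is probably the safer default.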