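Regular expression to find all "href" attributes in <a> tags

Asked: 2013-12-30 14:19:17

Tags: python regex

I have a regular expression that searches for the "href" attribute in <a> tags, but it currently does not give me what I want: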

<a[^>]* href="([^"]*)"

Given this input:

<a href="http://something" title="Development of the Python language and website">Core Development</a>

it matches this:

<a href="http://something"

But I only need to find:

http://something

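5 answers:

Answer 0 (score: 7)

This works for me. You can check the working demo yourself. With a single capture group in the pattern, re.findall returns only the captured text, not the whole match: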

matches = re.findall(r'<a[^>]* href="([^"]*)"', html)
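A minimal sketch of why the question's pattern seems to return too much: the whole match (group 0) includes the tag prefix, while the capture group (group 1) holds only the URL. The variable names below are illustrative and not part of the original answer.

```python
import re

html = '<a href="http://something" title="Development of the Python language and website">Core Development</a>'
pattern = re.compile(r'<a[^>]* href="([^"]*)"')

m = pattern.search(html)
whole_match = m.group(0)  # everything the pattern consumed
url_only = m.group(1)     # only the text inside the parentheses

print(whole_match)
print(url_only)

# With exactly one capture group, findall returns the group contents.
urls = pattern.findall(html)
print(urls)
```

So the pattern itself is fine; what matters is reading group 1 (or using findall) instead of the full match.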

However, I would use Beautiful Soup for this instead:

from bs4 import BeautifulSoup

html = '''
<a href="http://something" title="Development of the Python language and website">Core Development</a>
<a href="http://something.com" title="Development of the Python language and website">Core Development</a>
'''

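# Name the parser explicitly; otherwise bs4 guesses one and emits a warning.
soup = BeautifulSoup(html, 'html.parser')

# href=True keeps only the <a> tags that actually have an href attribute.
for a in soup.find_all('a', href=True):
    print(a['href'])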

Note: if you are using an older version of Beautiful Soup (version 3), use the following instead:

for a in soup.findAll('a', href=True):

Answer 1 (score: 3)

Try this:

re.findall(r'(?<=<a href=")[^"]*',yourStr)
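One caveat worth sketching: the lookbehind only matches when href comes immediately after "<a ", so links with another attribute first are skipped. The sample string below is hypothetical, and the second pattern is just one possible alternative built on a capture group.

```python
import re

# Hypothetical sample: the second link has another attribute before href.
your_str = (
    '<a href="http://something">first</a>\n'
    '<a class="nav" href="http://something.com">second</a>'
)

# Fixed-width lookbehind: only matches when href immediately follows "<a ".
lookbehind_urls = re.findall(r'(?<=<a href=")[^"]*', your_str)

# Capture group: tolerates other attributes between "<a" and href.
group_urls = re.findall(r'<a\b[^>]*\shref="([^"]*)"', your_str)

print(lookbehind_urls)
print(group_urls)
```

The lookbehind version finds only the first link here, while the capture-group version finds both.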

Answer 2 (score: 1)

Rather than reinventing the wheel, you can use http://www.crummy.com/software/BeautifulSoup/

$ sudo pip install beautifulsoup4
$ python
>>> html_doc = """
... <html><head><title>The Dormouse's story</title></head>
... <body>
... <p class="title"><b>The Dormouse's story</b></p>
... 
... <p class="story">Once upon a time there were three little sisters; and their names were
... <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
... <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
... <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
... and they lived at the bottom of a well.</p>
... 
... <p class="story">...</p>
... """
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html_doc, 'html.parser')
>>> href = [i.get('href') for i in soup.find_all('a')]
>>> href
['http://example.com/elsie', 'http://example.com/lacie', 'http://example.com/tillie']

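If you cannot install the beautifulsoup4 package, you can simply download the older version (Beautiful Soup 3, which runs on Python 2 only) from http://www.crummy.com/software/BeautifulSoup/bs3/download/3.x/BeautifulSoup-3.2.1.tar.gz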

$ wget http://www.crummy.com/software/BeautifulSoup/bs3/download/3.x/BeautifulSoup-3.2.1.tar.gz
$ tar xvzf BeautifulSoup-3.2.1.tar.gz
$ cp BeautifulSoup-3.2.1/BeautifulSoup.py .
$ python
>>> import BeautifulSoup
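If neither route is available, a rough alternative on Python 3 is the standard library's html.parser module. This is only a sketch, not something the answer proposed; the HrefCollector class is a hypothetical helper, and the sample reuses part of the html_doc above.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects the href value of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value is not None:
                    self.hrefs.append(value)


html_doc = """
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
"""

collector = HrefCollector()
collector.feed(html_doc)
hrefs = collector.hrefs
print(hrefs)
```

It needs no installation, though Beautiful Soup is generally more forgiving with messy markup.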

Answer 3 (score: 0)

You can also use (http[s]?:[^"\s]*)
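A quick sketch of that pattern with re.findall. Note that it is not tied to <a> tags or to href at all, so it also picks up URLs from other attributes; the <img> line below is a made-up example to show this.

```python
import re

html = '''
<a href="http://something" title="Development of the Python language and website">Core Development</a>
<img src="https://example.com/logo.png">
'''

# Matches "http:" or "https:" followed by anything up to a quote or whitespace.
urls = re.findall(r'(http[s]?:[^"\s]*)', html)
print(urls)
```

Here the result contains both the link target and the image source, which may or may not be what you want.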

Answer 4 (score: 0)

You can try the match method in the re module, and then use group to select your match.

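    import re

    str1 = '''<a href="http://something" title="Development of the Python language and website">Core Development</a>'''
    # [^"]* stops at the closing quote of href, unlike a greedy .*
    pattern = re.compile(r'<a[^>]*\shref="([^"]*)"')
    m = pattern.match(str1)  # match() only succeeds at the start of the string
    if m:
        print(m.group(1))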
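Because match only tries the pattern at the beginning of the string, it returns None as soon as anything precedes the tag. A small sketch of the difference, using a hypothetical input and a capture-group pattern like the one discussed in this thread:

```python
import re

# Hypothetical input: the link is not at the very start of the string.
text = '<p>See <a href="http://something" title="Development">Core Development</a></p>'
pattern = re.compile(r'<a\b[^>]*\shref="([^"]*)"')

anchored = pattern.match(text)  # None: the string starts with "<p>"
found = pattern.search(text)    # scans forward until the pattern fits

first_url = found.group(1) if found else None
# finditer yields one match object per link, in document order.
all_urls = [m.group(1) for m in pattern.finditer(text)]

print(anchored, first_url, all_urls)
```

For whole documents, search or finditer is usually the better fit than match.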