I have an anchor tag like this:
<a class="gsc_a_at" href="/citations?view_op=view_citation&hl=en&user=11JgipcAAAAJ&pagesize=100&citation_for_view=11JgipcAAAAJ:j3f4tGmQtD8C">
I am parsing it with BeautifulSoup and want to extract the value that follows citation_for_view in the href. How can I do this without regular expressions?
Here is my attempt.
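#!/usr/bin/python
from bs4 import BeautifulSoup

# The anchor tag from above, as a string
input_data = '''<a class="gsc_a_at" href="/citations?view_op=view_citation&hl=en&user=11JgipcAAAAJ&pagesize=100&citation_for_view=11JgipcAAAAJ:j3f4tGmQtD8C">'''

# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(input_data, 'html.parser')
for href_tags in soup.find_all('a', href=True):
    print(href_tags['href'])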
Output:
/citations?view_op=view_citation&hl=en&user=11JgipcAAAAJ&pagesize=100&citation_for_view=11JgipcAAAAJ:j3f4tGmQtD8C
How can I extract the citation_for_view value from the href and output only 11JgipcAAAAJ:j3f4tGmQtD8C?
Answer 0 (score: 2)
You can use urlparse (the same functions live in urllib.parse in Python 3):
>>> import urlparse
>>> url = '/citations?view_op=view_citation&hl=en&user=11JgipcAAAAJ&pagesize=100&citation_for_view=11JgipcAAAAJ:j3f4tGmQtD8C'
>>> query = urlparse.urlparse(url).query  # keep only the part after the "?"
>>> vals = urlparse.parse_qs(query)
>>> print(vals.get('citation_for_view'))
['11JgipcAAAAJ:j3f4tGmQtD8C']
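
Note that parse_qs returns a list for each key, so the result above is still wrapped in brackets. To print only the bare value the question asks for, something like the following Python 3 sketch should work. The helper name extract_citation_id is made up for illustration and is not part of any library; it returns None when the parameter is missing.

```python
from urllib.parse import urlsplit, parse_qs


def extract_citation_id(href):
    """Return the citation_for_view value from an href, or None if absent.

    Hypothetical helper for illustration only.
    """
    query = urlsplit(href).query  # the text after the "?"
    values = parse_qs(query).get("citation_for_view")
    # parse_qs maps each key to a list of values; take the first one
    return values[0] if values else None


href = ("/citations?view_op=view_citation&hl=en&user=11JgipcAAAAJ"
        "&pagesize=100&citation_for_view=11JgipcAAAAJ:j3f4tGmQtD8C")
print(extract_citation_id(href))  # prints 11JgipcAAAAJ:j3f4tGmQtD8C
```

In the loop from the question, you would call extract_citation_id(href_tags['href']) instead of printing the whole href.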