Question

I am trying to get all the hrefs from some HTML and store them in a list for later processing, for example:

Example URL: www.example-page-xl.com

 <body>
    <section>
    <a href="/helloworld/index.php"> Hello World </a>
    </section>
 </body>

I am using the following code to list the hrefs:

import bs4 as bs
import urllib.request

sauce = urllib.request.urlopen('https://www.example-page-xl.com').read()
soup = bs.BeautifulSoup(sauce, 'lxml')

section = soup.section

for url in section.find_all('a'):
    print(url.get('href'))

However, I want to store the URL as www.example-page-xl.com/helloworld/index.php rather than just the relative path /helloworld/index.php.

I would rather not append/join the URL with the relative path, because the dynamic links may differ once I join the URL and the relative path.

In short, I want to scrape the absolute URL rather than the relative path alone (and without joining).

Answer 1

In this case, urlparse.urljoin can help you (in Python 3 it lives at urllib.parse.urljoin). You should modify your code like this:

import bs4 as bs
import urllib.request
from urllib.parse import urljoin  # Python 3; in Python 2 this was urlparse.urljoin

web_url = 'https://www.example-page-xl.com'
sauce = urllib.request.urlopen(web_url).read()
soup = bs.BeautifulSoup(sauce, 'lxml')

section = soup.section

for url in section.find_all('a'):
    # urljoin resolves relative hrefs against web_url and leaves absolute ones unchanged
    print(urljoin(web_url, url.get('href')))

Here urljoin handles both absolute and relative paths.
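
Since the question asks for the links to end up in a list, here is a minimal offline sketch of the same idea. It is only an illustration: it swaps BeautifulSoup for the standard library's html.parser so it runs without extra packages, and the HrefCollector class, the sample markup, and the second link to other.example.org are all made up for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class HrefCollector(HTMLParser):
    """Collect every <a href> value, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:  # skip anchors that have no href attribute
            self.links.append(urljoin(self.base_url, href))


# Assumed inputs: the markup from the question plus one already-absolute link.
page_url = "https://www.example-page-xl.com"
html = """
<body>
  <section>
    <a href="/helloworld/index.php"> Hello World </a>
    <a href="https://other.example.org/about"> About </a>
  </section>
</body>
"""

collector = HrefCollector(page_url)
collector.feed(html)
links = collector.links
print(links)
```

The relative href comes back prefixed with the page's scheme and host, while the already-absolute one is stored unchanged.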

Answer 2

urllib.parse.urljoin() may help. It does perform a join, but it is smart about it and handles both relative and absolute paths. Note that this is Python 3 code.

>>> import urllib.parse
>>> base = 'https://www.example-page-xl.com'

>>> urllib.parse.urljoin(base, '/helloworld/index.php') 
'https://www.example-page-xl.com/helloworld/index.php'

>>> urllib.parse.urljoin(base, 'https://www.example-page-xl.com/helloworld/index.php')
'https://www.example-page-xl.com/helloworld/index.php'
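
To make the "smart" part a bit more concrete, the sketch below resolves a few different kinds of href against one page address. The page URL with its /helloworld/docs/ path and the cdn.example.org and other.example.org hosts are assumptions invented for the illustration; only urljoin itself comes from the answer.

```python
from urllib.parse import urljoin

# Hypothetical address of the page the links were found on.
page_url = "https://www.example-page-xl.com/helloworld/docs/page.php"

hrefs = [
    "/index.php",                  # root-relative
    "index.php",                   # relative to the page's directory
    "../index.php",                # one directory up
    "//cdn.example.org/app.js",    # scheme-relative
    "https://other.example.org/",  # already absolute
]

# Resolve each href the way a browser would, using the page URL as the base.
resolved = [urljoin(page_url, href) for href in hrefs]

for href, url in zip(hrefs, resolved):
    print(href, "->", url)
```

Because the result depends on the base, it is probably safest to pass the address the page was actually served from (for example the value returned by geturl() on the urlopen response, which reflects redirects) rather than a hard-coded site root.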

Answer 3

The solution mentioned here looks like the most robust one to me.

import urllib.parse

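def base_url(url, with_path=False):
    """Return the scheme and host of url, optionally followed by its parent path."""
    parsed = urllib.parse.urlparse(url)
    # Drop the last path segment (typically the file name) when the path is wanted
    path   = '/'.join(parsed.path.split('/')[:-1]) if with_path else ''
    parsed = parsed._replace(path=path)
    # Strip parameters, query string, and fragment
    parsed = parsed._replace(params='')
    parsed = parsed._replace(query='')
    parsed = parsed._replace(fragment='')
    return parsed.geturl()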
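
A short usage sketch for the helper, under the assumption that the page was fetched from the made-up address below. The function is repeated in condensed form so the snippet runs on its own, and the final urljoin call is my addition, not part of the linked solution.

```python
import urllib.parse


def base_url(url, with_path=False):
    # Same helper as above, condensed so this snippet is self-contained.
    parsed = urllib.parse.urlparse(url)
    path = '/'.join(parsed.path.split('/')[:-1]) if with_path else ''
    parsed = parsed._replace(path=path, params='', query='', fragment='')
    return parsed.geturl()


# Hypothetical page address, including a query string and a fragment.
page_url = 'https://www.example-page-xl.com/helloworld/index.php?lang=en#top'

site_root = base_url(page_url)                    # scheme + host only
parent_dir = base_url(page_url, with_path=True)   # scheme + host + parent path
print(site_root)
print(parent_dir)

# Combine the derived base with a relative href taken from the page.
absolute = urllib.parse.urljoin(site_root, '/helloworld/index.php')
print(absolute)

# Caveat: parent_dir has no trailing slash, so a slash-less href replaces
# its last segment instead of being appended to it.
pitfall = urllib.parse.urljoin(parent_dir, 'index.php')
print(pitfall)
```

If the with_path=True result is meant to serve as a base for slash-less relative hrefs, appending a '/' to it first avoids the behaviour shown in the last line.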

Scraping absolute URLs instead of relative paths in Python
