Extracting href URLs with Python requests

Time: 2015-11-20 01:10:34

Tags: python python-3.x xpath python-requests lxml

I want to extract a URL from an XPath using the requests package in Python. I can get the text, but nothing I have tried returns the URL. Can anyone help?

ipdb> webpage.xpath(xpath_url + '/text()')
['Text of the URL']
ipdb> webpage.xpath(xpath_url + '/a()')
*** lxml.etree.XPathEvalError: Invalid expression
ipdb> webpage.xpath(xpath_url + '/href()')
*** lxml.etree.XPathEvalError: Invalid expression
ipdb> webpage.xpath(xpath_url + '/url()')
*** lxml.etree.XPathEvalError: Invalid expression

I started from this tutorial: http://docs.python-guide.org/en/latest/scenarios/scrape/

It seems like it should be easy, but nothing turned up in my searches.

Thanks.

5 answers:

Answer 0 (score: 5)

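Have you tried webpage.xpath(xpath_url + '/@href')? In XPath, attributes are selected with the @ prefix, which is why '/href()' is rejected as an invalid expression.

Here is the full code: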

from lxml import html
import requests

page = requests.get('http://econpy.pythonanywhere.com/ex/001.html')
webpage = html.fromstring(page.content)

webpage.xpath('//a/@href')

The result should be:

[
  'http://econpy.pythonanywhere.com/ex/002.html',
  'http://econpy.pythonanywhere.com/ex/003.html', 
  'http://econpy.pythonanywhere.com/ex/004.html',
  'http://econpy.pythonanywhere.com/ex/005.html'
]
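
The same `/@href` step also works when appended to a narrower XPath, which is the situation in the question. The following is a minimal offline sketch: the HTML snippet, the `xpath_url` value, and the base URL passed to `make_links_absolute` are assumptions made for illustration and do not come from the original page.

```python
from lxml import html

# Hypothetical markup standing in for the downloaded page.
SAMPLE = """
<html><body>
  <div id="nav"><a href="/ex/002.html">Text of the URL</a></div>
  <div id="other"><a href="http://example.com/">Elsewhere</a></div>
</body></html>
"""

webpage = html.fromstring(SAMPLE)

# Assumed XPath pointing at the <a> element of interest.
xpath_url = '//div[@id="nav"]/a'

link_text = webpage.xpath(xpath_url + '/text()')  # text nodes of the element
link_href = webpage.xpath(xpath_url + '/@href')   # value of its href attribute

# Relative links can be resolved against a base URL (assumed here).
webpage.make_links_absolute('http://econpy.pythonanywhere.com/')
absolute_href = webpage.xpath(xpath_url + '/@href')

print(link_text, link_href, absolute_href)
```

Both calls return lists, so take the first item (after checking the list is not empty) when a single URL is expected.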

Answer 1 (score: 1)

You may be better served by BeautifulSoup.

You can print each link, add it to a list, and so on. To iterate over the links, use:

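import requests
from bs4 import BeautifulSoup

html = requests.get('http://testurl.com')  # placeholder URL; requests requires the scheme
soup = BeautifulSoup(html.text, "lxml")  # lxml is just the parser for reading the html
for a in soup.find_all('a', href=True):  # every <a> tag that has an href attribute
    print(a['href'])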
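
To try the BeautifulSoup approach without a network call, here is a small sketch that parses an inline string. The markup is invented for illustration, and it assumes the standard-library `html.parser` backend is an acceptable substitute for lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; in practice this would be requests.get(...).text
SAMPLE = """
<ul>
  <li><a href="http://econpy.pythonanywhere.com/ex/002.html">Next</a></li>
  <li><a name="anchor-without-href">No link</a></li>
  <li><a href="http://econpy.pythonanywhere.com/ex/003.html">Later</a></li>
</ul>
"""

# "html.parser" ships with Python, so lxml is not required here.
soup = BeautifulSoup(SAMPLE, "html.parser")

# href=True keeps only <a> tags that actually carry an href attribute.
hrefs = [a["href"] for a in soup.find_all("a", href=True)]

# The same selection written as a CSS selector.
hrefs_css = [a["href"] for a in soup.select("a[href]")]

print(hrefs)
```

Filtering with `href=True` avoids a `KeyError` on anchors that have no `href`, such as the second item above.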

Answer 2 (score: 0)

Requests-HTML

Answer 3 (score: 0)

This has the advantage of using a context manager:

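import requests
import requests_html

with requests_html.HTMLSession() as s:
    try:
        r = s.get('http://econpy.pythonanywhere.com/ex/001.html')
        links = r.html.links  # set of link targets found on the page
        for link in links:
            print(link)
    except requests.exceptions.RequestException as exc:
        # Report network errors instead of silently swallowing them
        print('Request failed:', exc)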

Answer 4 (score: 0)

You can easily do this with Selenium.

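from selenium.webdriver.common.by import By
# here `webpage` must be a Selenium WebDriver, and xpath_url the XPath of the element with the link
link = webpage.find_element(By.XPATH, xpath_url)
url = link.get_attribute('href')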