How to get link and text when <a> wraps another element using XPath?

Asked: 2015-07-02 18:17:36

Tags: python html xml xpath

The example below reflects data similar to what I'm using (I can't show my live data due to company policy). It is pulled from this answer and this answer.

My goal is to extract the text of each <a> element along with the link itself.

from lxml import html

post1 = """<p><code>Integer.parseInt</code> <em>couldn't</em> do the job, unless you were happy to lose data. Think about what you're asking for here.</p>&#xA;&#xA;<p>Try <a href="http://docs.oracle.com/javase/7/docs/api/java/lang/Long.html#parseLong%28java.lang.String%29"><code>Long.parseLong(String)</code></a> or <a href="http://docs.oracle.com/javase/7/docs/api/java/math/BigInteger.html#BigInteger%28java.lang.String%29"><code>new BigInteger(String)</code></a> for really big integers.</p>&#xA;
"""

post2 = """
<p><code>Integer.parseInt</code> <em>couldn't</em> do the job, unless you were happy to lose data. Think about what you're asking for here.</p>&#xA;&#xA;<p>Try <a href="http://docs.oracle.com/javase/7/docs/api/java/lang/Long.html#parseLong%28java.lang.String%29"><code>Long.parseLong(String)</code></a> or <a href="http://docs.oracle.com/javase/7/docs/api/java/math/BigInteger.html#BigInteger%28java.lang.String%29"><code>new BigInteger(String)</code></a> for really big integers.</p>&#xA;
"""
doc = html.fromstring(post1)
for link in doc.xpath('//a'):
    print(link.text, link.get('href'))

Unfortunately, this returns the following:

None http://docs.oracle.com/javase/7/docs/api/java/lang/Long.html#parseLong%28java.lang.String%29
None http://docs.oracle.com/javase/7/docs/api/java/math/BigInteger.html#BigInteger%28java.lang.String%29

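Note that link.text is None. This is because each link wraps a <code> block, so the visible text belongs to the child element rather than to the <a> itself.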
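
To illustrate why link.text comes back as None, here is a minimal sketch. It uses the standard library's xml.etree.ElementTree instead of lxml (an assumption, made so it runs without extra installs; lxml elements follow the same .text/.tail model). The snippet and the example.com URLs are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed snippet: one link wraps <code>, one holds plain text.
snippet = (
    '<p>Try <a href="http://example.com/wrapped"><code>Long.parseLong(String)</code></a>'
    ' or <a href="http://example.com/plain">a plain link</a>.</p>'
)
root = ET.fromstring(snippet)
wrapped, plain = root.findall("a")  # the two <a> children of <p>

# .text only covers the text that precedes an element's first child.
wrapped_text = wrapped.text   # nothing precedes <code>, so this is None
code_text = wrapped[0].text   # the visible text lives on the <code> child
plain_text = plain.text       # no children, so .text holds the whole label

print(repr(wrapped_text), repr(code_text), repr(plain_text))
```

So the loop above prints None for any link whose label sits entirely inside a nested element.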

If I use post2, it returns the correct results:

PROJ.4 http://trac.osgeo.org/proj/
OpenSceneGraph http://www.openscenegraph.org/

How can I modify the loop to handle both standard links (post2) and links that contain other elements (post1)?

1 Answer:

Answer 0: (score: 1)

Change

print(link.text, link.get('href'))

to

print(link.text_content(), link.get('href'))

Then your output will be

Long.parseLong(String) http://docs.oracle.com/javase/7/docs/api/java/lang/Long.html#parseLong%28java.lang.String%29
new BigInteger(String) http://docs.oracle.com/javase/7/docs/api/java/math/BigInteger.html#BigInteger%28java.lang.String%29

for post1, and post2 works as requested too.
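
As a rough sketch of the fix on both kinds of link at once, the snippet below puts a wrapped link and a plain link side by side. The markup and the example.com URLs are hypothetical stand-ins, not the posts from the question. It also shows the XPath string() function, which should give the same text as text_content() here.

```python
from lxml import html

# Hypothetical snippet: the first link wraps <code>, the second is plain text.
snippet = """<p>Try <a href="http://example.com/wrapped"><code>Long.parseLong(String)</code></a>
or <a href="http://example.com/plain">a plain link</a>.</p>"""

doc = html.fromstring(snippet)

# text_content() concatenates the text of the element and all its descendants,
# so it works whether or not the <a> wraps another element.
results = [(link.text_content(), link.get("href")) for link in doc.xpath("//a")]

# Alternative: evaluate XPath string() relative to each link.
xpath_texts = [link.xpath("string()") for link in doc.xpath("//a")]

for text, href in results:
    print(text, href)
```

Either approach keeps the loop itself unchanged; only the expression used to read the link label differs.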