How to find element text in an XHTML document using lxml

Date: 2011-01-23 01:31:55

Tags: python xpath lxml

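I have been banging my head against this for ages, so I must be doing something stupid.

I am trying to retrieve all of the languages Wikipedia supports and write them to a text file by traversing the tables on List_of_Wikipedias.

Here is my Python code so far, which only tries to retrieve one of the tables: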

import httplib
from lxml import etree

def main():
    conn = httplib.HTTPConnection("meta.wikimedia.org")
    conn.request("GET","/wiki/List_of_Wikipedias")
    res = conn.getresponse()
    root = etree.fromstring(res.read())
    table = root.xpath('//table')
    print table

main()

On my machine this just prints an empty list. To speed things up, I cached the page locally and used:

wikipage = open("wikipage.html")
root = etree.parse(wikipage)

But this makes no difference (apart from the obvious speed-up). I have also tried

lxml.find('table')

for element in root.iter():
    print("%s - %s" % (element.tag, element.text))

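which successfully prints out all of the elements, so I know the tree is being created.

What am I doing wrong?

Any help would be greatly appreciated. Thanks.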

3 Answers:

Answer 0: (score: 3)

I am trying to retrieve all of the possible Wikipedia supported languages and output them to a text file by traversing the tables on List_of_Wikipedias

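Your problem is that the element names in the document are in a default namespace. How to write XPath expressions that involve such element names is the most frequently asked question about XPath, and there are many good answers under the SO xpath tag. Just search for them.

Here is a complete solution:

Use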

(//x:table)[1]/x:tr[not(x:th)]/x:td[2]//text()

where you have registered the prefix "x" as bound to the XHTML namespace ("http://www.w3.org/1999/xhtml").

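I evaluated this XPath expression against the document obtained from http://s23.org/wikistats/wikipedias_html

I needed to add the following at the start of the document, because I am working locally and do not have the DTD for XHTML - maybe you will not need these: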

<!DOCTYPE html [
<!ENTITY uarr "&#8593;">
<!ENTITY darr "&#8595;">
<!ENTITY ccedil "&#231;">
<!ENTITY oslash "&#248;">
<!ENTITY aacute "&#225;">
<!ENTITY aring "&#229;">
<!ENTITY agrave "&#224;">
<!ENTITY egrave "&#232;">
<!ENTITY ograve "&#242;">
<!ENTITY ocirc "&#244;">
]>

The result of applying the above XPath expression to this document is:

                    English

                    German

                    French

                    Polish

                    Italian

                    Japanese

                    Spanish

                    Portuguese

                    Dutch

                    Russian

                    Swedish

                    Chinese

                    Catalan

                    Norwegian (Bokmål)

                    Finnish

                    Ukrainian

                    Czech

                    Hungarian

                    Romanian

                    Korean

                    Turkish

                    Vietnamese

                    Indonesian

                    Danish

                    Arabic

                    Esperanto

                    Serbian

                    Lithuanian

                    Slovak

                    Volapük

                    Persian

                    Hebrew

                    Bulgarian

                    Slovenian

                    Malay

                    Waray-Waray

                    Croatian

                    Estonian

                    Newar / Nepal Bhasa

                    Simple English

                    Hindi

                    Galician

                    Thai

                    Basque

                    Norwegian (Nynorsk)

                    Aromanian

                    Greek

                    Haitian

                    Azerbaijani

                    Tagalog

                    Latin

                    Telugu

                    Georgian

                    Macedonian

                    Cebuano

                    Serbo-Croatian

                    Breton

                    Piedmontese

                    Marathi

                    Latvian

                    Luxembourgish

                    Javanese

                    Belarusian (Taraškievica)

                    Welsh

                    Icelandic

                    Bosnian

                    Albanian

                    Tamil

                    Belarusian

                    Bishnupriya Manipuri

                    Aragonese

                    Occitan

                    Bengali

                    Swahili

                    Ido

                    Lombard

                    West Frisian

                    Gujarati

                    Afrikaans

                    Low Saxon

                    Malayalam

                    Quechua

                    Sicilian

                    Urdu

                    Kurdish

                    Cantonese

                    Sundanese

                    Asturian

                    Neapolitan

                    Samogitian

                    Armenian

                    Yoruba

                    Irish

                    Chuvash

                    Walloon

                    Nepali

                    Ripuarian

                    Western Panjabi

                    Kannada

                    Tajik

                    Tarantino

                    Venetian

                    Yiddish

                    Scottish Gaelic

                    Tatar

                    Min Nan

                    Ossetian

                    Uzbek

                    Alemannic

                    Kapampangan

                    Sakha

                    Egyptian Arabic

                    Kazakh

                    Maori

                    Limburgian

                    Amharic

                    Nahuatl

                    Upper Sorbian

                    Gilaki

                    Corsican

                    Gan

                    Mongolian

                    Scots

                    Interlingua

                    Central_Bicolano

                    Burmese

                    Faroese

                    Võro

                    Dutch Low Saxon

                    Sinhalese

                    Turkmen

                    West Flemish

                    Sanskrit

                    Bavarian

                    Malagasy

                    Manx

                    Ilokano

                    Divehi

                    Norman

                    Pangasinan

                    Banyumasan

                    Sorani

                    Romansh

                    Northern Sami

                    Zazaki

                    Mazandarani

                    Wu

                    Friulian

                    Uyghur

                    Ligurian

                    Maltese

                    Bihari

                    Novial

                    Tibetan

                    Anglo-Saxon

                    Kashubian

                    Sardinian

                    Classical Chinese

                    Fiji Hindi

                    Khmer

                    Ladino

                    Zamboanga Chavacano

                    Pali

                    Franco-Provençal/Arpitan

                    Pashto

                    Hakka

                    Cornish

                    Punjabi

                    Navajo

                    Silesian

                    Kalmyk

                    Pennsylvania German

                    Hawaiian

                    Saterland Frisian

                    Interlingue

                    Somali

                    Komi

                    Karachay-Balkar

                    Crimean Tatar

                    Tongan

                    Acehnese

                    Meadow Mari

                    Picard

                    Erzya

                    Lingala

                    Kinyarwanda

                    Extremaduran

                    Guarani

                    Kirghiz

                    Emilian-Romagnol

                    Assyrian Neo-Aramaic

                    Papiamentu

                    Aymara

                    Chechen

                    Lojban

                    Wolof

                    Banjar

                    Bashkir

                    North Frisian

                    Greenlandic

                    Tok Pisin

                    Udmurt

                    Kabyle

                    Tahitian

                    Sranan

                    Zealandic

                    Hill Mari

                    Komi-Permyak

                    Lower Sorbian

                    Abkhazian

                    Gagauz

                    Igbo

                    Oriya

                    Lao

                    Kongo

                    Avar

                    Moksha

                    Mirandese

                    Romani

                    Old Church Slavonic

                    Karakalpak

                    Samoan

                    Moldovan

                    Tetum

                    Gothic

                    Kashmiri

                    Bambara

                    Inupiak

                    Sindhi

                    Bislama

                    Lak

                    Nauruan

                    Norfolk

                    Inuktitut

                    Pontic

                    Assamese

                    Cherokee

                    Min Dong

                    Swati

                    Palatinate German

                    Hausa

                    Ewe

                    Tigrinya

                    Oromo

                    Zulu

                    Zhuang

                    Venda

                    Tsonga

                    Kirundi

                    Dzongkha

                    Sango

                    Cree

                    Chamorro

                    Luganda

                    Buginese

                    Buryat (Russia)

                    Fijian

                    Chichewa

                    Akan

                    Sesotho

                    Xhosa

                    Fula

                    Tswana

                    Kikuyu

                    Tumbuka

                    Shona

                    Twi

                    Cheyenne

                    Ndonga

                    Sichuan Yi

                    Choctaw

                    Marshallese

                    Afar

                    Kuanyama

                    Hiri Motu

                    Muscogee

                    Kanuri

                    Herero

Note: every other selected node is a whitespace-only text node. If you do not want to select these, use:

(//x:table)[1]/x:tr[not(x:th)]/x:td[2]//text()[normalize-space()]
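
As a rough sketch of how this expression might be evaluated with lxml in Python: the `namespaces` argument of `xpath()` maps a prefix to a namespace URI. The inline sample document and the variable names below are assumptions made for illustration; they stand in for the real page.

```python
from lxml import etree

XHTML_NS = "http://www.w3.org/1999/xhtml"

# A tiny stand-in for the real page (hypothetical sample data).
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
<body>
<table>
<tr><th>No</th><th>Language</th></tr>
<tr><td>1</td><td>
    <a href="#en">English</a>
</td></tr>
<tr><td>2</td><td>
    <a href="#de">German</a>
</td></tr>
</table>
</body>
</html>"""

root = etree.fromstring(SAMPLE)

# An unprefixed name in XPath 1.0 means "no namespace", so this matches nothing.
print(root.xpath("//table"))  # []

# Bind the prefix "x" to the XHTML namespace for this one evaluation.
expr = "(//x:table)[1]/x:tr[not(x:th)]/x:td[2]//text()[normalize-space()]"
languages = root.xpath(expr, namespaces={"x": XHTML_NS})
print(languages)
```

The prefix itself is arbitrary; only the URI it is bound to has to match the one declared in the document.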

Answer 1: (score: 3)

Parse it as HTML.

from lxml import html

url = 'http://meta.wikimedia.org/wiki/List_of_Wikipedias'
tree = html.parse(url)
languages = tree.xpath('//table/tr/td[2]/a/text()')
print('\n'.join(languages))

Output

English
German
French
Polish
Italian
Japanese
Spanish
Portuguese
Dutch
Russian
Swedish
Chinese
Catalan
Norwegian (Bokmål)
Finnish
Ukrainian
Czech
Hungarian
Romanian
Korean
Turkish
Vietnamese
Indonesian
Danish
Arabic
Esperanto
Serbian
Lithuanian
Slovak
Volapük
Persian
Hebrew
Bulgarian
Slovenian
Malay
Waray-Waray
Croatian
Estonian
Newar / Nepal Bhasa
Simple English
Hindi
Galician
Thai
Basque
Norwegian (Nynorsk)
Aromanian
Greek
Haitian
Azerbaijani
Tagalog
Latin
Telugu
Georgian
Macedonian
Cebuano
Serbo-Croatian
Breton
Piedmontese
Marathi
Latvian
Luxembourgish
Javanese
Belarusian (Taraškievica)
Welsh
Icelandic
Bosnian
Albanian
Tamil
Belarusian
Bishnupriya Manipuri
Aragonese
Occitan
Bengali
Swahili
Ido
Lombard
West Frisian
Gujarati
Afrikaans
Low Saxon
Malayalam
Quechua
Sicilian
Urdu
Kurdish
Cantonese
Sundanese
Asturian
Neapolitan
Samogitian
Armenian
Yoruba
Irish
Chuvash
Walloon
Nepali
Ripuarian
Western Panjabi
Kannada
Tajik
Tarantino
Venetian
Yiddish
Scottish Gaelic
Tatar
Min Nan
Ossetian
Uzbek
Alemannic
Kapampangan
Sakha
Kazakh
Egyptian Arabic
Maori
Amharic
Limburgian
Nahuatl
Upper Sorbian
Gilaki
Corsican
Gan
Mongolian
Scots
Interlingua
Central_Bicolano
Burmese
Faroese
Võro
Dutch Low Saxon
Sinhalese
Turkmen
West Flemish
Sanskrit
Bavarian
Malagasy
Manx
Ilokano
Divehi
Norman
Pangasinan
Banyumasan
Sorani
Romansh
Northern Sami
Zazaki
Mazandarani
Wu
Friulian
Uyghur
Ligurian
Maltese
Bihari
Novial
Tibetan
Anglo-Saxon
Kashubian
Sardinian
Classical Chinese
Fiji Hindi
Khmer
Ladino
Zamboanga Chavacano
Pali
Franco-Provençal/Arpitan
Pashto
Hakka
Cornish
Punjabi
Navajo
Silesian
Kalmyk
Pennsylvania German
Hawaiian
Saterland Frisian
Interlingue
Somali
Komi
Karachay-Balkar
Crimean Tatar
Tongan
Acehnese
Meadow Mari
Picard
Kinyarwanda
Erzya
Lingala
Extremaduran
Guarani
Kirghiz
Emilian-Romagnol
Assyrian Neo-Aramaic
Papiamentu
Aymara
Chechen
Lojban
Wolof
Banjar
Bashkir
North Frisian
Greenlandic
Tok Pisin
Udmurt
Kabyle
Tahitian
Sranan
Zealandic
Hill Mari
Komi-Permyak
Lower Sorbian
Abkhazian
Gagauz
Igbo
Oriya
Lao
Kongo
Avar
Moksha
Mirandese
Romani
Old Church Slavonic
Karakalpak
Samoan
Moldovan
Tetum
Gothic
Kashmiri
Bambara
Inupiak
Sindhi
Bislama
Lak
Nauruan
Norfolk
Inuktitut
Pontic
Assamese
Cherokee
Min Dong
Palatinate German
Swati
Hausa
Ewe
Tigrinya
Oromo
Zulu
Zhuang
Venda
Tsonga
Kirundi
Cree
Dzongkha
Sango
Chamorro
Luganda
Buginese
Buryat (Russia)
Fijian
Chichewa
Akan
Sesotho
Xhosa
Fula
Tswana
Kikuyu
Tumbuka
Shona
Twi
Cheyenne
Ndonga
Sichuan Yi
Choctaw
Marshallese
Afar
Kuanyama
Hiri Motu
Muscogee
Kanuri
Herero
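
A small sketch of why parsing as HTML sidesteps the problem: the XML parser puts every element in the XHTML namespace, while lxml's HTML parser leaves the tag names unqualified, so a plain `//table` matches. The inline sample document is an assumption made for illustration, not the real page.

```python
from lxml import etree, html

# Hypothetical miniature XHTML page.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
<body><table><tr><td>1</td><td><a href="#en">English</a></td></tr></table></body>
</html>"""

xml_root = etree.fromstring(SAMPLE)   # XML parser: namespace-aware
html_root = html.fromstring(SAMPLE)   # HTML parser: tag names stay unqualified

print(xml_root.tag)   # {http://www.w3.org/1999/xhtml}html
print(html_root.tag)  # html

# The unprefixed expression only matches in the HTML-parsed tree.
names = html_root.xpath("//table/tr/td[2]/a/text()")
print(names)
```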

Answer 2: (score: 0)

XPath needs the namespace. The page you downloaded begins:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" dir="ltr">

So you really want

xpath('//html:table')

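where html is a prefix bound to "http://www.w3.org/1999/xhtml".

You will have to find out how to bind namespaces in lxml - I am not a Python expert.

If this is your problem, I sympathise - it has caught me and many others out!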
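
For what it is worth, one possible way to bind the prefix in lxml is sketched below. `xpath()` and `find()` both accept a `namespaces` mapping, and `find()` also understands the `{uri}name` spelling. The sample document and the names `NSMAP` and `SAMPLE` are assumptions made for illustration.

```python
from lxml import etree

XHTML_NS = "http://www.w3.org/1999/xhtml"
NSMAP = {"html": XHTML_NS}  # prefix -> namespace URI (the prefix is arbitrary)

# Hypothetical miniature page using the same root element as above.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml" lang="en" dir="ltr">
<body><table><tr><td>cell</td></tr></table></body>
</html>"""

root = etree.fromstring(SAMPLE)

# Full XPath with a bound prefix.
tables = root.xpath("//html:table", namespaces=NSMAP)
print(len(tables))  # 1

# find() with the "{uri}name" spelling, no prefix needed.
first = root.find(".//{%s}table" % XHTML_NS)
print(first.tag)

# find() with the same prefix mapping.
same = root.find(".//html:table", namespaces=NSMAP)
print(same.tag == first.tag)
```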