Extracting the head tag

Date: 2012-11-10 16:39:11

Tags: python beautifulsoup

I have tried many things, but I cannot extract the contents of the <head> tags. Can anyone help?

Original XML: https://dl.dropbox.com/u/3482709/English_sense_induction.xml.zip

Here is an excerpt:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE corpus SYSTEM "sense-induction.dtd">
<corpus lang="en">
  <lexelt item="explain.v">
    <instance id="explain.v.4" corpus="wsj">
For OPEC , that 's ideal . The resulting firm prices and stability `` will allow both producers and consumers to plan confidently , '' says Saudi Arabian Oil Minister Hisham Nazer . OPEC Secretary-General Subroto <head> explains </head> : Consumers offer security of markets , while OPEC provides security of supply . `` This is an opportune time to find mutual ways { to prevent } price shocks from happening again , '' he says . To promote this balance , OPEC now is finally confronting a long-simmering internal problem .
</instance>
    <instance id="explain.v.10" corpus="wsj">
and given the right conditions , sympathetic to creating some form of life . Surely at some other cosmic address a Gouldoid creature would have risen out of the ooze to <head> explain </head> why , paleontologically speaking , `` it is , indeed , a wonderful life . '' Mr. Holt is a columnist for the Literary Review in London .
</instance>
    <instance id="explain.v.76" corpus="wsj">
`` You ca n't build on your hit-and-miss five-seventeen '' . `` What are you playing '' ? ? Owen asked . `` I 'm just logging '' , the cowboy <head> explained </head> . `` I keep all these plays in this little black book , and I watch over a twelve-hour period to find out what numbers are repeating . But roulette 's not my game .
</instance>
  </lexelt>
  <lexelt item="position.n">
    <instance id="position.n.288" corpus="wsj">
But not everybody was making money . The carnage on the Chicago Board Options Exchange , the nation 's major options market , was heavy after the trading in S&amp;P 100 stock-index options was halted Friday . Many market makers in the S&amp;P 100 index options contract had bullish <head> positions </head> Friday , 
</instance>
    <instance id="position.n.123" corpus="wsj">
An explosion at the Microbiology and Virology Institute in Sverdlovsk released anthrax germs that caused a significant number of deaths . Since Mr. Shevardnadze did not address this topic before the Supreme Soviet , the Soviet Union 's official <head> position </head> remains that the anthrax deaths were caused by 
</instance>
  </lexelt>
</corpus>

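Edit

The problem was that I forgot to pass 'xml' as the second argument, so the document was being parsed as HTML rather than XML. The solution is soup = BeautifulSoup(xml_data, 'xml')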

2 answers:

Answer 0 (score: 1)

from bs4 import BeautifulSoup

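# xml_data holds the XML document as a string; the 'xml' parser requires lxml
soup = BeautifulSoup(xml_data, 'xml')
# Collect the text of every <head> element in document order
head_datas = [head.get_text() for head in soup.find_all('head')]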

>>> head_datas
[' explains ', ' explain ', ' explained ', ' positions ', ' position ']

If <head> contains only a single string child, you can also use the .string attribute:

head_datas = [head.string for head in soup.find_all('head')]
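The difference between the two only shows up when a <head> element has more than one child. The sketch below uses two hypothetical fragments (not taken from the corpus) and the built-in 'html.parser' so that it runs without lxml; the same behaviour should apply with the 'xml' parser.

```python
from bs4 import BeautifulSoup

# Hypothetical fragments for illustration only.
simple = BeautifulSoup("<head> explains </head>", "html.parser").find("head")
nested = BeautifulSoup("<head> took <b>up</b> </head>", "html.parser").find("head")

# One string child: .string returns that string.
simple_string = simple.string

# Several children: .string is None, while get_text() joins all descendant text.
nested_string = nested.string
nested_text = nested.get_text()

print(repr(simple_string), nested_string, repr(nested_text))
```

In the corpus above every <head> holds a single string, so either form works there; get_text() is the safer choice if nested markup might appear.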

Answer 1 (score: 1)

>>> t = '''<?xml ...'''
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(t, 'xml')
>>> soup.find('head')
<head> explains </head>
>>> _.text
' explains '

Since you are working with a valid XML structure, you can also use other XML parsers, such as ElementTree:

>>> from xml.etree import ElementTree
>>> tree = ElementTree.fromstring(t)
>>> tree.find('.//head')
<Element 'head' at 0x00000000031226D8>
>>> _.text
' explains '
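Note that find() returns only the first match. To collect every <head> together with the id of its enclosing <instance>, something like the following may work. It is a minimal sketch: the sample string is a shortened, hypothetical stand-in for the corpus, and the name heads_by_instance is invented for illustration.

```python
from xml.etree import ElementTree

# Shortened, hypothetical sample in the same shape as the corpus above.
sample = """<corpus lang="en">
  <lexelt item="explain.v">
    <instance id="explain.v.4" corpus="wsj">
Subroto <head> explains </head> : Consumers offer security of markets .
</instance>
    <instance id="explain.v.10" corpus="wsj">
risen out of the ooze to <head> explain </head> why .
</instance>
  </lexelt>
</corpus>
"""

root = ElementTree.fromstring(sample)

# Map each instance id to the stripped text of its direct <head> child.
heads_by_instance = {
    instance.get("id"): instance.find("head").text.strip()
    for instance in root.iter("instance")
}

print(heads_by_instance)
# {'explain.v.4': 'explains', 'explain.v.10': 'explain'}
```

The strip() call removes the padding spaces that the corpus places around each head word; drop it if the original spacing matters.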