How to parse multiple XML files

Time: 2014-06-10 01:45:10

Tags: ruby xml nokogiri

I am trying to parse multiple XML files with Nokogiri. They are in the following format:

<?xml version="1.0" encoding="UTF-8"?>
<CRDoc>[Congressional Record Volume<volume>141</volume>, Number<number>213</number>(<weekday>Sunday</weekday>,<month>December</month>
  <day>31</day>,<year>1995</year>)]
[<chamber>Senate</chamber>]
[Page<pages>S19323</pages>]<congress>104</congress>
  <session>1</session>
  <document_title>UNANIMOUS-CONSENT REQUEST--HOUSE MESSAGE ON S. 1508</document_title>
  <speaker name="Mr. DASCHLE">Mr. DASCHLE</speaker>.<speaking name="Mr. DASCHLE">Mr. President, I said this on the floor yesterday 
afternoon, and I will repeat it this afternoon. I know that the 
distinguished majority leader wants an agreement as much as I do, and I 
do not hold him personally responsible for the fact that we are not 
able to overcome this impasse. I commend him for his efforts at trying 
to do so again today.</speaking>
  <speaking name="Mr. DASCHLE">Let me try one other option. We have already been unable to agree to 
a continuing resolution that would have put all Federal employees back 
to work with pay. We have been unable to agree to something that we 
agreed to last Friday, the 22d of December, which would have at least 
sent them back to their offices without pay. Perhaps we can try this.</speaking>
  <speaking name="Mr. DASCHLE">I ask unanimous consent that the Senate proceed to the message from 
the House on S. 1508, that the Senate concur in the House amendment 
with a substitute amendment that includes the text of Senator Dole's 
back-to-work bill, and the House-passed expedited procedures shall take 
effect only if the budget agreement does not cut Medicare more than 
necessary to ensure the solvency of the Medicare part A trust fund and, 
second, does not raise taxes on working Americans, does not cut funding 
for education or environmental enforcement, and maintains the 
individual health guarantee under Medicaid and, third, provides that 
any tax reductions in the budget agreement go only to Americans making 
under $100,000; that the motion to concur be agreed to, and the motion 
to reconsider be laid upon the table.</speaking>
  <speaker name="The ACTING PRESIDENT pro tempore">The ACTING PRESIDENT pro tempore</speaker>.<speaking name="The ACTING PRESIDENT pro tempore">Is there objection?</speaking>
  <speaker name="Mr. DOLE">Mr. DOLE</speaker>.<speaking name="Mr. DOLE">Mr. President, I want to say a few words. But I will 
object.</speaking>
  <speaking name="Mr. DOLE">We are working on a lot of these things in our meetings at the White 
House, where we have both been for a number of hours. I think we have 
made some progress. We are a long way from any solution yet.</speaking>
  <speaking name="Mr. DOLE">I think all of the things listed by the Democratic leader are areas 
of concern in the meetings we have had. And the meetings will start 
again on Tuesday. But it seems to me that it would not be appropriate 
to proceed under those terms, and therefore I object.</speaking>
  <speaker name="The ACTING PRESIDENT pro tempore">The ACTING PRESIDENT pro tempore</speaker>.<speaking name="The ACTING PRESIDENT pro tempore">Objection is heard.</speaking>
</CRDoc>

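The code I am using came from earlier help and worked for a while. However, the format of the XML files has since changed, and the code no longer works. This is my code: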

doc.xpath("//speech/speaking/@name").map(&:text).uniq.each do |name|
  speaker = Nokogiri::XML('<root/>')
  doc.xpath('//speech').each do |speech|
    speech_node = Nokogiri::XML('<speech/>')
    speech.xpath("*[@name='#{name}']").each do |speaking|
      speech_node.root.add_child(speaking)
    end
    speaker.root.add_child(speech_node.root) unless speech_node.root.children.empty?
  end
  File.open("test/" + name + "-" + year + ".xml", 'a+') do |f|
    f.write speaker.root.children
  end
end

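I want to create a new XML file for each speaker, containing everything that speaker said. The code needs to loop through the XML files in a directory and put each speech into the appropriate speaker's file. I was thinking this could be done with a find -exec command.

Ultimately, the code should:

  1. Create an XML file named after the speaker and the year, e.g. Mr. Boehner_2011.xml
  2. The XML file should hold all of that speaker's speeches for that year.
  3. The XML file should have a CRDoc root node.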

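2 answers:

Answer 0: (score: 4)

My suggestion is that, rather than continuing to use code you don't understand, you break it down into smaller pieces so that it is easier to understand, or at least easier to isolate problems.

Imagine being able to do this: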

crdoc = CongressionalRecordDocument.new(filename)

crdoc.year
#=> 1995

crdoc.speakers
#=> ["Mr. DASCHLE", "The ACTING PRESIDENT pro tempore", "Mr. DOLE"]

crdoc.speakers.each do |speaker|
  speech = crdoc.speaking_parts(speaker)
  #save speech to file
end

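This hides the details and makes the code much easier to read. Better still, it compartmentalizes them: if, for example, the way you retrieve the list of speakers changes, you only need to change one small part, and that part is easy to test. Let's implement it: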

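require 'nokogiri'

class CongressionalRecordDocument

  def initialize(xml_file)
    # Parse the file's contents; passing the path string alone would parse the path itself.
    @doc = Nokogiri::XML(File.read(xml_file))
  end

  def year
    # Return the year as an Integer (e.g. 1995) rather than the <year> node.
    @year ||= @doc.at('//year').text.to_i
  end

  def speakers
    # Unique speaker names, in document order.
    @speakers ||= @doc.xpath('//speaker/@name').map(&:text).uniq
  end

  def speaking_parts(speaker)
    # Text of every <speaking> element attributed to this speaker.
    @doc.xpath("//speaking[@name = '#{speaker}']").map(&:text)
  end
end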

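That doesn't look much more complicated now, does it? You might also want to create a class for the new document in a similar way, so that creating the output is just as simple.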

Also, you might want to find your files in Ruby instead of using find -exec:

Dir["/path/to/search/*.xml"].each do |file|
  crdoc = CongressionalRecordDocument.new(file)
  #etc
end

Answer 1: (score: 1)

Since you no longer have <speech> elements, you need to remove them from the code:

doc.xpath("//speaking/@name").map(&:text).uniq.each do |name|
  speaker = Nokogiri::XML('<root/>')
  doc.xpath('//CRDoc').each do |speech|
    speech_node = Nokogiri::XML('<speech/>')
    speech.xpath("*[@name='#{name}']").each do |speaking|
      speech_node.root.add_child(speaking)
    end
    speaker.root.add_child(speech_node.root) unless speech_node.root.children.empty?
  end
  File.open("test/" + name + "-" + year + ".xml", 'a+') do |f|
    f.write speaker.root.children
  end
end