Grab only the entries with a specific attribute value under a given node

Time: 2014-01-13 15:35:13

Tags: ruby xml xml-parsing nokogiri

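I am using an Xml::Parser wrapper around Nokogiri::XML::Reader to pull entries out of an XML file. I want to grab only the tags under "Property / PropertyID / Identification" whose OrganizationName attribute is 'northsteppe', but I can't figure out the right syntax for it. Below is the whole rake task I have been building, followed by a sample node with all of its information and tags. Any guidance would be much appreciated.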

================ UPDATE ================

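I'm using open-uri to fetch the file I'm parsing, since it comes from an external source. During development I'm working from a hard copy of an older version on my local machine to speed things up, because the file is 300MB+ in size. I tried using a SAX parser, but the logic was a bit too complex for me to really grasp what was going on, and I ran into the same problem: restricting the results to only those properties that have 'northsteppe' as the OrganizationName in the Identification tag. That said, I chose to attempt the same task with the current approach. I'm able to get nearly all of the information I need; I'm just missing the last piece mentioned above.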

=============== TO BE MORE SPECIFIC ===============

So, I feel like describing the exact task I'm trying to perform will help fill in any gaps. The task is as follows.

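Take every property in the XML file that has OrganizationName='northsteppe' in its <Identification> tag, then grab all of the corresponding information for each property individually and insert it into a hash. Once all of the information for a single property has been collected and put into that hash, it needs to be uploaded to the database as a single entry; the database is already structured the way it needs to be. After that property has been inserted, the rake task moves on to the next Property entry whose <Identification> tag has OrganizationName='northsteppe' and repeats the process until every matching property has been inserted. The purpose is to let me search the Northsteppe properties quickly without bogging the system down with every property in the XML file.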

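Eventually, I will use open-uri to pull the file from its external source and run a cron job that executes this rake task every 6 hours and replaces the database contents.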

================= CODE =================

namespace :db do

# RAKE TASK DESCRIPTION
desc "Fetch property information and insert it into the database"

# RAKE TASK NAME    
task :print_properties => :environment do

    require 'rubygems'
    require 'nokogiri'

    module Xml
      class Parser
        def initialize(node, &block)
          @node = node
          @node.each do
            self.instance_eval &block
          end
        end

        def name
          @node.name
        end

        def inner_xml
          @node.inner_xml.strip
        end

        def is_start?
          @node.node_type == Nokogiri::XML::Reader::TYPE_ELEMENT
        end

        def is_end?
          @node.node_type == Nokogiri::XML::Reader::TYPE_END_ELEMENT
        end

        def attribute(attribute)
          @node.attribute(attribute)
        end

        def for_element(name, &block)
          return unless self.name == name and is_start?
          self.instance_eval &block
        end

        def inside_element(name=nil, &block)
          return if @node.self_closing?
          return unless name.nil? or (self.name == name and is_start?)

          name = @node.name
          depth = @node.depth

          @node.each do
            return if self.name == name and is_end? and @node.depth == depth
            self.instance_eval &block
          end
        end
      end
    end


    Xml::Parser.new(Nokogiri::XML::Reader(open("app/assets/xml/mits.xml"))) do
        inside_element 'Property' do

            # OPEN AND PARSE THE <PropertyID> TAG
            inside_element 'PropertyID' do

                inside_element 'Identification' do
                    puts attribute_nodes()
                end

                # OPEN AND PARSE THE <Address> TAG
                inside_element 'Address' do
                    for_element 'AddressLine1' do puts "Street Address: #{inner_xml}" end
                    for_element 'City' do puts "City: #{inner_xml}" end
                    for_element 'PostalCode' do puts "Zipcode: #{inner_xml}" end
                end

            for_element 'MarketingName' do puts "Short Description: #{inner_xml}" end
            end

            # OPEN AND PARSE THE <Information> TAG
            inside_element 'Information' do
                for_element 'LongDescription' do puts "Long Description: #{inner_xml}" end
                inside_element 'Rents' do
                    for_element 'StandardRent' do puts "Rent: #{inner_xml}" end
                end
            end

            inside_element 'Fee' do
                for_element 'ApplicationFee' do puts "Application Fee: #{inner_xml}" end
            end

            inside_element 'ILS_Identification' do
                for_element 'Latitude' do puts "Latitude: #{inner_xml}" end
                for_element 'Longitude' do puts "Longitude: #{inner_xml}" end
            end

        end
    end

end # END PRINT_PROPERTIES TASK

end #END NAMESPACE

And a sample of the XML:

<Property IDValue="642da00e-9be3-4a7c-bd50-66a4f0d70af8">
<PropertyID>
  <Identification IDValue="642da00e-9be3-4a7c-bd50-66a4f0d70af8" OrganizationName="northsteppe" IDType="property"/>
  <Identification IDValue="6e1e61523972d5f0e260e3d38eb488337424f21e" OrganizationName="northsteppe" IDType="Company"/>
  <MarketingName>Spacious House Central Campus OSU, available fall</MarketingName>
  <WebSite>http://northsteppe.appfolio.com/listings/listings/642da00e-9be3-4a7c-bd50-66a4f0d70af8</WebSite>
  <Address AddressType="property">
    <Description>Address of Available Listing</Description>
    <AddressLine1>1689 N 4th St </AddressLine1>
    <City>Columbus</City>
    <State>OH</State>
    <PostalCode>43201</PostalCode>
    <Country>US</Country>
  </Address>
  <Phone PhoneType="office">
    <PhoneNumber>(614) 299-4110</PhoneNumber>
  </Phone>
  <Email>northsteppe.nsr@gmail.com</Email>
</PropertyID>
<ILS_Identification ILS_IdentificationType="Apartment" RentalType="Market Rate">
  <Latitude>39.997694</Latitude>
  <Longitude>-82.99903</Longitude>
  <LastUpdate Month="11" Day="11" Year="2013"/>
</ILS_Identification>
<Information>
  <StructureType>Standard</StructureType>
  <UnitCount>1</UnitCount>
  <ShortDescription>Spacious House Central Campus OSU, available fall</ShortDescription>
  <LongDescription>One of our favorites! This great house is perfect for students or a single family. With huge living and sleeping rooms, there is plenty of space. The kitchen is totally modernized with new appliances, and the bathroom has been updated. Natural woodwork and brick accents are seen within the house, and the decorative mantles. Ceiling fans and mini-blinds are included, as well as a FREE stack washer and dryer. The front and side deck. On site parking available.</LongDescription>
  <Rents>
    <StandardRent>2000.00</StandardRent>
  </Rents>
  <PropertyAvailabilityURL>http://northsteppe.appfolio.com/listings/listings/642da00e-9be3-4a7c-bd50-66a4f0d70af8</PropertyAvailabilityURL>
</Information>
<Fee>
  <ProrateType>Standard</ProrateType>
  <LateType>Standard</LateType>
  <LatePercent>0</LatePercent>
  <LateMinFee>0</LateMinFee>
  <LateFeePerDay>0</LateFeePerDay>
  <NonRefundableHoldFee>0</NonRefundableHoldFee>
  <AdminFee>0</AdminFee>
  <ApplicationFee>30.00</ApplicationFee>
  <BrokerFee>0</BrokerFee>
</Fee>
<Deposit DepositType="Security Deposit">
  <Amount AmountType="Actual">
    <ValueRange Exact="2000.00" Currency="USD"/>
  </Amount>
</Deposit>
<Policy>
  <Pet Allowed="false"/>
</Policy>
<Phase IDValue="642da00e-9be3-4a7c-bd50-66a4f0d70af8">
  <Name/>
  <Description/>
  <UnitCount>1</UnitCount>
  <RentableUnits>1</RentableUnits>
  <TotalSquareFeet>0</TotalSquareFeet>
  <RentableSquareFeet>0</RentableSquareFeet>
</Phase>
<Building IDValue="642da00e-9be3-4a7c-bd50-66a4f0d70af8">
  <Name/>
  <Description/>
  <UnitCount>1</UnitCount>
  <SquareFeet>0</SquareFeet>
</Building>
<Floorplan IDValue="642da00e-9be3-4a7c-bd50-66a4f0d70af8">
  <Name/>
  <UnitCount>1</UnitCount>
  <Room RoomType="Bedroom">
    <Count>4</Count>
    <Comment/>
  </Room>
  <Room RoomType="Bathroom">
    <Count>1</Count>
    <Comment/>
  </Room>
  <SquareFeet Min="0" Max="0"/>
  <MarketRent Min="2000" Max="2000"/>
  <EffectiveRent Min="2000" Max="2000"/>
</Floorplan>
<ILS_Unit IDValue="642da00e-9be3-4a7c-bd50-66a4f0d70af8">
  <Units>
    <Unit>
      <Identification IDValue="642da00e-9be3-4a7c-bd50-66a4f0d70af8" OrganizationName="UL Portfolio"/>
      <MarketingName>Spacious House Central Campus OSU, available fall</MarketingName>
      <UnitBedrooms>4</UnitBedrooms>
      <UnitBathrooms>1.0</UnitBathrooms>
      <MinSquareFeet>0</MinSquareFeet>
      <MaxSquareFeet>0</MaxSquareFeet>
      <SquareFootType>internal</SquareFootType>
      <UnitRent>2000.00</UnitRent>
      <MarketRent>2000.00</MarketRent>
      <Address AddressType="property">
        <AddressLine1>1689 N 4th St </AddressLine1>
        <City>Columbus</City>
        <PostalCode>43201</PostalCode>
        <Country>US</Country>
      </Address>
    </Unit>
  </Units>
  <Availability>
    <VacateDate Month="7" Day="23" Year="2014"/>
    <VacancyClass>Unoccupied</VacancyClass>
    <MadeReadyDate Month="7" Day="23" Year="2014"/>
  </Availability>
  <Amenity AmenityType="Other">
    <Description>All new stainless steel appliances!  Refinished hardwood floors</Description>
  </Amenity>
  <Amenity AmenityType="Other">
    <Description>Ceramic tile</Description>
  </Amenity>
  <Amenity AmenityType="Other">
    <Description>Ceiling fans</Description>
  </Amenity>
  <Amenity AmenityType="Other">
    <Description>Wrap-around porch</Description>
  </Amenity>
  <Amenity AmenityType="Dryer">
    <Description>Free Washer and Dryer</Description>
  </Amenity>
  <Amenity AmenityType="Washer">
    <Description>Free Washer and Dryer</Description>
  </Amenity>
  <Amenity AmenityType="Other">
    <Description>off-street parking available</Description>
  </Amenity>
</ILS_Unit>
<File Active="true" FileID="820982141">
  <FileType>Photo</FileType>
  <Description>Unit Photo</Description>
  <Name/>
  <Caption/>
  <Format>image/jpeg</Format>
  <Src>http://pa.cdn.appfolio.com/northsteppe/images/31077069-6e81-4373-8a89-508c57585543/medium.jpg</Src>
  <Width>360</Width>
  <Height>300</Height>
  <Rank>1</Rank>
</File>
<File Active="true" FileID="820982145">
  <FileType>Photo</FileType>
  <Description>Unit Photo</Description>
  <Name/>
  <Caption/>
  <Format>image/jpeg</Format>
  <Src>http://pa.cdn.appfolio.com/northsteppe/images/84e1be40-96fd-4717-b75d-09b39231a762/medium.jpg</Src>
  <Width>350</Width>
  <Height>265</Height>
  <Rank>2</Rank>
</File>
<File Active="true" FileID="820982149">
  <FileType>Photo</FileType>
  <Description>Unit Photo</Description>
  <Name/>
  <Caption/>
  <Format>image/jpeg</Format>
  <Src>http://pa.cdn.appfolio.com/northsteppe/images/cd419635-c37f-4676-a43e-c72671a2a748/medium.jpg</Src>
  <Width>350</Width>
  <Height>265</Height>
  <Rank>3</Rank>
</File>
<File Active="true" FileID="820982152">
  <FileType>Photo</FileType>
  <Description>Unit Photo</Description>
  <Name/>
  <Caption/>
  <Format>image/jpeg</Format>
  <Src>http://pa.cdn.appfolio.com/northsteppe/images/6b68dbd5-2cde-477c-99d7-3ca33f03cce8/medium.jpg</Src>
  <Width>350</Width>
  <Height>265</Height>
  <Rank>4</Rank>
</File>
<File Active="true" FileID="820982155">
  <FileType>Photo</FileType>
  <Description>Unit Photo</Description>
  <Name/>
  <Caption/>
  <Format>image/jpeg</Format>
  <Src>http://pa.cdn.appfolio.com/northsteppe/images/17b6c7c0-686c-4e46-865b-11d80744354a/medium.jpg</Src>
  <Width>350</Width>
  <Height>265</Height>
  <Rank>5</Rank>
</File>
<File Active="true" FileID="820982157">
  <FileType>Photo</FileType>
  <Description>Unit Photo</Description>
  <Name/>
  <Caption/>
  <Format>image/jpeg</Format>
  <Src>http://pa.cdn.appfolio.com/northsteppe/images/3545ac8b-471f-404a-94b2-fcd00dd16e25/medium.jpg</Src>
  <Width>350</Width>
  <Height>265</Height>
  <Rank>6</Rank>
</File>
<File Active="true" FileID="820982160">
  <FileType>Photo</FileType>
  <Description>Unit Photo</Description>
  <Name/>
  <Caption/>
  <Format>image/jpeg</Format>
  <Src>http://pa.cdn.appfolio.com/northsteppe/images/02471172-2183-4bf1-a3d7-33415f902c1c/medium.jpg</Src>
  <Width>350</Width>
  <Height>265</Height>
  <Rank>7</Rank>
</File>
  </Property>

2 Answers:

Answer 0 (score: 1)

I'd start by trying this:

require 'nokogiri'

doc = Nokogiri::XML(File.read('test.xml'))
doc.search('*[OrganizationName="northsteppe"]') 
# => [#<Nokogiri::XML::Element:0x3fd82499131c name="Identification" attributes=[#<Nokogiri::XML::Attr:0x3fd8249912b8 name="IDValue" value="642da00e-9be3-4a7c-bd50-66a4f0d70af8">, #<Nokogiri::XML::Attr:0x3fd8249912a4 name="OrganizationName" value="northsteppe">, #<Nokogiri::XML::Attr:0x3fd824991290 name="IDType" value="property">]>, #<Nokogiri::XML::Element:0x3fd824990a70 name="Identification" attributes=[#<Nokogiri::XML::Attr:0x3fd824990a0c name="IDValue" value="6e1e61523972d5f0e260e3d38eb488337424f21e">, #<Nokogiri::XML::Attr:0x3fd8249909f8 name="OrganizationName" value="northsteppe">, #<Nokogiri::XML::Attr:0x3fd8249909e4 name="IDType" value="Company">]>]

To make what Nokogiri found more readable:

puts doc.search('*[OrganizationName="northsteppe"]').map{ |n| n.to_xml }
# >> <Identification IDValue="642da00e-9be3-4a7c-bd50-66a4f0d70af8" OrganizationName="northsteppe" IDType="property"/>
# >> <Identification IDValue="6e1e61523972d5f0e260e3d38eb488337424f21e" OrganizationName="northsteppe" IDType="Company"/>

I find that CSS selectors are usually more readable than XPath. In this case it's a toss-up.
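For comparison, the XPath form of the same attribute test is `//Identification[@OrganizationName="northsteppe"]`, which with Nokogiri would be passed to `doc.xpath` instead of `doc.search`. The sketch below runs that expression with REXML rather than Nokogiri, purely so that it works without extra gems; `SAMPLE_XML` and `northsteppe_id_types` are made-up names for illustration, and the XML is a trimmed version of the sample from the question.

```ruby
require 'rexml/document'

# Trimmed copy of the sample: two matching tags plus one from another organization.
SAMPLE_XML = <<~XML
  <Property>
    <PropertyID>
      <Identification OrganizationName="northsteppe" IDType="property"/>
      <Identification OrganizationName="northsteppe" IDType="Company"/>
    </PropertyID>
    <ILS_Unit>
      <Units>
        <Unit>
          <Identification OrganizationName="UL Portfolio"/>
        </Unit>
      </Units>
    </ILS_Unit>
  </Property>
XML

# Returns the IDType of every <Identification> whose OrganizationName matches.
def northsteppe_id_types(xml)
  doc = REXML::Document.new(xml)
  matches = REXML::XPath.match(doc, '//Identification[@OrganizationName="northsteppe"]')
  matches.map { |element| element.attributes['IDType'] }
end

p northsteppe_id_types(SAMPLE_XML)
```

Like the CSS version, this loads the whole document into memory, so it only suits files small enough for that.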


  

...the actual file is 300MB, and loading it into the DOM crashes the server.

If your server can't handle the file size, your best bet is a SAX parser, which is memory-efficient. Here's a simple example using your sample XML:

require 'nokogiri'

class MyDocument < Nokogiri::XML::SAX::Document
  # Matching tags are kept per instance rather than in a class variable,
  # so separate parses do not share results.
  attr_reader :tags

  def initialize
    super
    @tags = []
  end

  def start_element name, attributes = []
    attribute_hash = Hash[attributes]
    if (name == 'Identification') && (attribute_hash['OrganizationName'] == 'northsteppe')
      @tags << {
        name: name,
        attributes: attribute_hash
      }
    end
  end
end

doc = MyDocument.new

# Create a new parser
parser = Nokogiri::XML::SAX::Parser.new(doc)

# Feed the parser some XML
parser.parse(File.open('test.xml'))

doc.tags 
# => [{:name=>"Identification",
#      :attributes=>
#       {"IDValue"=>"642da00e-9be3-4a7c-bd50-66a4f0d70af8",
#        "OrganizationName"=>"northsteppe",
#        "IDType"=>"property"}},
#     {:name=>"Identification",
#      :attributes=>
#       {"IDValue"=>"6e1e61523972d5f0e260e3d38eb488337424f21e",
#        "OrganizationName"=>"northsteppe",
#        "IDType"=>"Company"}}]
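The handler above records only the matching <Identification> tags, while the question asks for the whole property. One way that could be extended is sketched below: remember the current path inside each <Property>, note whether a PropertyID/Identification tag matched, and keep the collected fields only when it did. `PropertyCollector` and `FIELDS` are hypothetical names, and only a few fields are mapped. The callbacks use the same names as Nokogiri::XML::SAX::Document (`start_element`, `characters`, `end_element`), so the class could presumably inherit from it and be handed to Nokogiri::XML::SAX::Parser as shown earlier; here the events are driven by hand so the sketch runs without any gem.

```ruby
# Collects one hash per <Property> whose PropertyID/Identification tag
# names the wanted organization. Plain Ruby; no parser is loaded here.
class PropertyCollector
  # Element path relative to <Property> => key in the resulting hash.
  FIELDS = {
    'PropertyID/MarketingName'        => :short_description,
    'PropertyID/Address/AddressLine1' => :street_address,
    'PropertyID/Address/City'         => :city,
    'PropertyID/Address/PostalCode'   => :zipcode,
    'Information/Rents/StandardRent'  => :rent
  }.freeze

  attr_reader :properties

  def initialize(organization)
    @organization = organization
    @properties = []
    @current = nil   # hash being filled while a <Property> is open
    @path = []
    @text = ''.dup
    @matched = false
  end

  def start_element(name, attrs = [])
    if name == 'Property'
      @current = {}
      @path = []
      @matched = false
      return
    end
    return unless @current

    @path << name
    @text = ''.dup
    if @path == %w[PropertyID Identification]
      @matched ||= attrs.to_h['OrganizationName'] == @organization
    end
  end

  def characters(string)
    @text << string if @current
  end

  def end_element(name)
    return unless @current

    if name == 'Property'
      @properties << @current if @matched
      @current = nil
      return
    end
    key = FIELDS[@path.join('/')]
    @current[key] = @text.strip if key
    @path.pop
    @text = ''.dup
  end
end

# Hand-driven events standing in for a real parse of one property.
collector = PropertyCollector.new('northsteppe')
collector.start_element('Property')
collector.start_element('PropertyID')
collector.start_element('Identification', [%w[OrganizationName northsteppe]])
collector.end_element('Identification')
collector.start_element('Address')
collector.start_element('City')
collector.characters('Columbus')
collector.end_element('City')
collector.end_element('Address')
collector.end_element('PropertyID')
collector.end_element('Property')
p collector.properties
```

Because each property is finished when its closing tag arrives, the database insert could happen at that point instead of accumulating everything in `properties`, which would keep memory use flat for a 300MB file.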

Answer 1 (score: 0)

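So the solution I found lies in a little gem called Saxerator (https://github.com/soulcutter/saxerator). It does SAX parsing without my having to work with Nokogiri directly (thankfully), has excellent documentation, and runs very fast. I'd encourage anyone who needs a SAX parser in the future to look into this little gem (pun intended) and spare themselves the burden of dealing with all the dreadful Nokogiri documentation. The solution to my problem is below; it lives in my seeds.rb file.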

require 'saxerator'

parser = Saxerator.parser(File.new("app/assets/xml/mits_snip.xml")) do |config|
  config.put_attributes_in_hash!
  config.symbolize_keys!
end


parser.for_tag(:Property).each do |property|
    if property[:PropertyID][:Identification][1][:OrganizationName] == 'northsteppe'
        property_attributes = {
            street_address:     property[:PropertyID][:Address][:AddressLine1],
            city:               property[:PropertyID][:Address][:City],
            zipcode:            property[:PropertyID][:Address][:PostalCode],
            short_description:  property[:PropertyID][:MarketingName],
            long_description:   property[:Information][:LongDescription],
            rent:               property[:Information][:Rents][:StandardRent],
            application_fee:    property[:Fee][:ApplicationFee],
            vacancy_status:     property[:ILS_Unit][:Availability][:VacancyClass],
            month_available:    property[:ILS_Unit][:Availability][:MadeReadyDate][:Month],
            latitude:           property[:ILS_Identification][:Latitude],
            longitude:          property[:ILS_Identification][:Longitude]

        }

        if Property.create! property_attributes
            puts "wahoo"
        else
            puts "nope"
        end
    end
end
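The check on `[:Identification][1]` relies on each property having at least two <Identification> tags with the matching one in second position. A more forgiving test could look at every entry. The sketch below assumes, and this is an assumption about how Saxerator groups elements, that a repeated tag arrives as an array and a single tag as one hash; `listed_by?` is a made-up helper name, and plain hashes stand in for Saxerator's output.

```ruby
# True when any <Identification> under <PropertyID> names the organization.
# Accepts either a single hash or an array of hashes for :Identification.
def listed_by?(property, organization)
  ids = property.dig(:PropertyID, :Identification)
  ids = [ids] unless ids.is_a?(Array)
  ids.compact.any? { |id| id[:OrganizationName] == organization }
end

# Plain hashes standing in for what the parser would yield.
single  = { PropertyID: { Identification: { OrganizationName: 'northsteppe' } } }
several = { PropertyID: { Identification: [{ OrganizationName: 'other' },
                                           { OrganizationName: 'northsteppe' }] } }
missing = { PropertyID: {} }

puts listed_by?(single, 'northsteppe')
puts listed_by?(several, 'northsteppe')
puts listed_by?(missing, 'northsteppe')
```

In the loop above, the condition would then read `if listed_by?(property, 'northsteppe')`.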

============== UPDATE =================

So I actually rewrote this task to work better, and I just wanted to share it here for anyone who runs into this problem. This is my seeds.rb file:

require 'saxerator'
require 'nokogiri' # Nokogiri::XML is used below to read the filtered file
require 'open-uri'
@company_name = 'northsteppe'
parser = Saxerator.parser(File.new("../../shared/assets/xml/mits.xml")) do |config|
  config.put_attributes_in_hash!
  config.symbolize_keys!
end
puts "DELETED ALL EXISTING PROPERTIES" if Property.delete_all
puts "PULLING RELEVANT XML ENTRIES"
# A local counter; a class variable at the top level is rejected by newer Rubies.
count = 0
file = File.new("../../shared/assets/xml/nsr_properties.xml",'w')
properties = []
parser.for_tag(:Property).each do |property|
    print '*'
    if property[:PropertyID][:Identification][1][:OrganizationName] == @company_name
        properties << property
        count += 1
    end
    # break if count == 417 
end
file.write(properties.to_xml)
file.close
puts "ADDING PROPERTIES TO THE DATABASE"
nsr_properties = File.open("../../shared/assets/xml/nsr_properties.xml")
doc = Nokogiri::XML(nsr_properties)
doc.xpath("//saxerator-builder-hash-elements/saxerator-builder-hash-element").each do |property|
    print '.'
    @images =[]
    property.xpath("File/File").each do |image|
        @images << image.at_xpath("Src/text()").to_s
    end
    @amenities = []
    property.xpath("ILS-Unit/Amenity/Amenity").each do |amenity|
        @amenities << amenity.at_xpath("Description/text()").to_s
    end
    information = {
        "street_address" => property.at_xpath("PropertyID/Address/AddressLine1/text()").to_s,
        "city" => property.at_xpath("PropertyID/Address/City/text()").to_s,
        "zipcode" => property.at_xpath("PropertyID/Address/PostalCode/text()").to_s,
        "short_description" => property.at_xpath("PropertyID/MarketingName/text()").to_s,
        "long_description" => property.at_xpath("Information/LongDescription/text()").to_s,
        "rent" => property.at_xpath("Information/Rents/StandardRent/text()").to_s,
        "application_fee" => property.at_xpath("Fee/ApplicationFee/text()").to_s,
        "bedrooms" => property.at_xpath("ILS-Unit/Units/Unit/UnitBedrooms/text()").to_s,
        "bathrooms" => property.at_xpath("ILS-Unit/Units/Unit/UnitBathrooms/text()").to_s,
        "vacancy_status" => property.at_xpath("ILS-Unit/Availability/VacancyClass/text()").to_s,
        "month_available" => property.at_xpath("ILS-Unit/Availability/MadeReadyDate/@Month").to_s,
        "latitude" => property.at_xpath("ILS-Identification/Latitude/text()").to_s,
        "longitude" => property.at_xpath("ILS-Identification/Longitude/text()").to_s,
        "images" => @images,
        "amenities" => @amenities
    }
    Property.create!(information)
end
puts "DONE, WAHOO"
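It may be possible to skip the intermediate nsr_properties.xml file and the second Nokogiri pass by flattening each hash from the first loop directly. The sketch below is a guess at what that could look like, not a tested replacement. It assumes that Saxerator yields nested hashes keyed by tag name (with attributes merged in, as `put_attributes_in_hash!` suggests), that repeated tags such as File and Amenity arrive as arrays, and that each property has a single ILS_Unit as in the sample. `property_attributes` is a hypothetical helper, and only some of the fields are shown.

```ruby
# Flattens one parsed <Property> hash into model attributes.
# [value].flatten.compact turns nil, a single hash, or an array into an array.
def property_attributes(property)
  {
    'street_address'  => property.dig(:PropertyID, :Address, :AddressLine1),
    'city'            => property.dig(:PropertyID, :Address, :City),
    'rent'            => property.dig(:Information, :Rents, :StandardRent),
    'vacancy_status'  => property.dig(:ILS_Unit, :Availability, :VacancyClass),
    'month_available' => property.dig(:ILS_Unit, :Availability, :MadeReadyDate, :Month),
    'images'          => [property[:File]].flatten.compact.map { |f| f[:Src] },
    'amenities'       => [property.dig(:ILS_Unit, :Amenity)].flatten.compact
                                                             .map { |a| a[:Description] }
  }
end

# A plain hash standing in for one parsed property (URLs are placeholders).
sample = {
  PropertyID: { Address: { AddressLine1: '1689 N 4th St', City: 'Columbus' } },
  Information: { Rents: { StandardRent: '2000.00' } },
  ILS_Unit: {
    Availability: { VacancyClass: 'Unoccupied', MadeReadyDate: { Month: '7' } },
    Amenity: [{ Description: 'Ceramic tile' }, { Description: 'Ceiling fans' }]
  },
  File: { Src: 'http://example.com/medium.jpg' }
}

p property_attributes(sample)
```

With something like this, `Property.create!(property_attributes(property))` could run inside the `for_tag(:Property)` loop for each matching property.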