How can I use XML::Twig to extract specific elements, along with their children, from an XML file?

Asked: 2018-06-21 01:20:47

Tags: perl xml-parsing

I have the following large XML file (5-10 GB):

<?xml version="1.0"?>
<catalog>
   <book id="bk101">
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
      <genre>Computer</genre>
      <price>44.95</price>          
   </book>
   <car id="bk102">
      <author>Ralls, Kim</author>
      <title>Midnight Rain</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
   </car>
   <book id="bk101">
      <author>Joseph</author>
      <title>XML Developer's Guide</title>
      <genre>Computer</genre>
   </book>
   <magazine id="bk103">
      <author>Gambardella, Matthew</author>
      <title>Maeve Ascendant</title>
      <genre>Fantasy</genre>
   </magazine>
   .....
</catalog>

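How can I use XML::Twig, or any other method in Perl, to read only the book and magazine elements (ignoring car) and write just those elements (the whole block) whose author is Gambardella, Matthew to a new file? The desired output is: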

<catalog>
   <book id="bk101">
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
      <genre>Computer</genre>
      <price>44.95</price>          
   </book>      
   <magazine id="bk103">
      <author>Gambardella, Matthew</author>
      <title>Maeve Ascendant</title>
      <genre>Fantasy</genre>
   </magazine>
   .....
</catalog>

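1 Answer:

Answer 0 (score: 0)

This script expects the XML file as a command-line argument and removes every element that does not match $criteria. Note that it keeps every matching element in memory until parsing finishes, so for a multi-gigabyte input you should also consider splitting the file into smaller chunks to avoid running out of memory.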

#!/usr/bin/env perl

use warnings FATAL => 'all';
use strict;
use XML::Twig;

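my $criteria = 'Gambardella, Matthew';
my $file     = $ARGV[0] or die "Usage: $0 file.xml\n";

my $twig = XML::Twig->new(
  twig_handlers => {
    # Called for each direct child of <catalog> once it is fully parsed.
    'catalog/*' => \&filter_entry,
  },
  pretty_print => 'indented',
)->parsefile($file);

# Print what is left of the document after filtering.
print $twig->sprint;

sub filter_entry {
  my ($t, $elt) = @_;

  # Keep only <book> and <magazine> entries whose <author> matches;
  # everything else (including every <car>) is deleted from the tree.
  my $wanted = $elt->tag =~ /\A(?:book|magazine)\z/
    && $elt->first_child_text('author') eq $criteria;
  $elt->delete unless $wanted;

  return;
}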
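
For a 5-10 GB input, a streaming variant may be a better fit than building the filtered tree and printing it at the end. The following is only a sketch under a few assumptions: the input is UTF-8, each entry has a single author child, and the output should be wrapped in a fresh catalog element. The name extract_matching is a hypothetical helper, not part of XML::Twig. It uses twig_roots so that only book and magazine elements are ever built, prints each match straight to the output file, and calls purge to release memory as it goes.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use XML::Twig;

# Stream $in and copy matching <book>/<magazine> entries into $out.
# extract_matching is a hypothetical helper name, not an XML::Twig API.
sub extract_matching {
  my ($in, $out, $author) = @_;

  # Assumption: the document is UTF-8, so the output is written as UTF-8.
  open my $fh, '>:encoding(UTF-8)', $out or die "Cannot write $out: $!";
  print {$fh} qq{<?xml version="1.0"?>\n<catalog>\n};

  my $handler = sub {
    my ($twig, $elt) = @_;
    # Compare the text of the first <author> child with the wanted name.
    if ($elt->first_child_text('author') eq $author) {
      $elt->print($fh);
      print {$fh} "\n";
    }
    $twig->purge;    # discard everything parsed so far to keep memory flat
    return 1;
  };

  XML::Twig->new(
    # Only these elements are turned into twigs; <car> is never built.
    twig_roots => {
      'catalog/book'     => $handler,
      'catalog/magazine' => $handler,
    },
    pretty_print => 'indented',
  )->parsefile($in);

  print {$fh} "</catalog>\n";
  close $fh or die "Cannot close $out: $!";
  return;
}

# Usage: perl extract.pl input.xml output.xml
extract_matching(@ARGV, 'Gambardella, Matthew') if @ARGV == 2;
```

Because each element is written and then purged as soon as it has been parsed, memory use should stay roughly proportional to the largest single entry rather than to the size of the file, which would make splitting the input unnecessary.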