Splitting a large XML file based on N occurrences of a given element

Time: 2018-01-19 10:40:39

Tags: xml perl

I want to split a ~300MB XML file into separate files based on N occurrences of a given element.

My source XML is:

<?xml version="1.0" encoding="UTF-8"?>
<pmlcore:Sensor
        xsi:schemaLocation="urn:autoid:specification:interchange:PMLCore:xml:schema:1 ./PML/SchemaFiles/Interchange/PMLCore.xsd"
        xmlns:pmlcore="urn:autoid:specification:interchange:PMLCore:xml:schema:1"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xmlns:pmluid="urn:autoid:specification:universal:Identifier:xml:schema:1">
  <paraid:ID>1234</paraid:ID>      
  <pmlcore:Observation>
    <childtag>Name1</childtag>
    <childtag2>Number1</childtag2>
    <childtag3>
      <childtag4></childtag4>
    </childtag3>
  </pmlcore:Observation>
  <pmlcore:Observation>
    <childtag>Name2</childtag>
    <childtag2>Number2</childtag2>
    <childtag3>
      <childtag4></childtag4>
    </childtag3>
  </pmlcore:Observation>
  <pmlcore:Observation>
    <childtag>Name3</childtag>
    <childtag2>Number3</childtag2>
    <childtag3>
      <childtag4></childtag4>
    </childtag3>
  </pmlcore:Observation>
  <pmlcore:Observation>
    <childtag>Name4</childtag>
    <childtag2>Number4</childtag2>
    <childtag3>
      <childtag4></childtag4>
    </childtag3>
  </pmlcore:Observation>
</pmlcore:Sensor>

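If my input file is as above, I would like to split it into separate files on every 10 occurrences of the pmlcore:Observation element.

For testing purposes, with my input XML above, I'd like to see the file split on every two occurrences of the pmlcore:Observation element (with the opening lines of the input file, that is, the XML prolog and the header that follows it up to the first pmlcore:Observation, inserted into each split file).

My XML would then be split into two files: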

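Name1.txt (Name2.txt would have the same header and closing tag, with the Name3 and Name4 observations in place of Name1 and Name2):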

<?xml version="1.0" encoding="UTF-8"?>
<pmlcore:Sensor
        xsi:schemaLocation="urn:autoid:specification:interchange:PMLCore:xml:schema:1 ./PML/SchemaFiles/Interchange/PMLCore.xsd"
        xmlns:pmlcore="urn:autoid:specification:interchange:PMLCore:xml:schema:1"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xmlns:pmluid="urn:autoid:specification:universal:Identifier:xml:schema:1">
  <paraid:ID>1234</paraid:ID>      
  <pmlcore:Observation>
    <childtag>Name1</childtag>
    <childtag2>Number1</childtag2>
    <childtag3>
      <childtag4></childtag4>
    </childtag3>
  </pmlcore:Observation>
  <pmlcore:Observation>
    <childtag>Name2</childtag>
    <childtag2>Number2</childtag2>
    <childtag3>
      <childtag4></childtag4>
    </childtag3>
  </pmlcore:Observation>
</pmlcore:Sensor>

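I have been able to do this with awk, but the process is very slow. I'd like to know whether there is a simple but efficient way to do it with a Perl script (perhaps using XML::Twig).

3 answers:

Answer 0 (score: 3)

You can use XML::LibXML::Reader, the pull parser from libxml2: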

#!/usr/bin/perl
use strict;
use warnings;

use XML::LibXML::Reader;
use constant {
    SIZE    => 2, # 10
    PMLUID  => 'urn:autoid:specification:universal:Identifier:xml:schema:1',
    PMLCORE => 'urn:autoid:specification:interchange:PMLCore:xml:schema:1',
};

my $name_tally = 0;
sub output {
    my ($orig_root, $id, @observations) = @_;

    my $root = $orig_root->cloneNode;
    $root->addChild($id);
    $root->addChild($_) for @observations;
    ++$name_tally;
    open my $OUT, '>:encoding(UTF-8)', "Name$name_tally.txt" or die $!;
    print {$OUT} $root;
    print STDERR "$name_tally\n";
}

my $reader = 'XML::LibXML::Reader'->new(location => shift)
    or die;

$reader->read;
my $root = $reader->copyCurrentNode;

$reader->nextElement('ID', PMLUID) or die "No ID\n";
my $id = $reader->copyCurrentNode(1);

my @observations;
while ($reader->nextElement('Observation', PMLCORE)) {
    push @observations, $reader->copyCurrentNode(1);
    if (@observations == SIZE) {
        output($root, $id, @observations);
        @observations = ();
    }
}
# Output the remainder if the total is not a multiple of SIZE.
output($root, $id, @observations) if @observations;
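
The batching logic in the loop above does not depend on the XML parser, so it can be sketched on a plain list. This is only an illustration using core Perl: `chunk_list` is a hypothetical helper name that does not appear in the answer, and the strings stand in for the copied observation nodes.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of the answer): split a list into
# batches of at most $size items, keeping the original order.
sub chunk_list {
    my ($size, @items) = @_;
    die "size must be positive\n" if $size < 1;

    my (@batches, @current);
    for my $item (@items) {
        push @current, $item;
        if (@current == $size) {
            push @batches, [@current];   # store a copy of the full batch
            @current = ();
        }
    }
    # A partial batch is left over whenever the total is not a multiple of $size.
    push @batches, [@current] if @current;
    return @batches;
}

# Five placeholder items split by two give two full batches and one partial one.
my @batches = chunk_list(2, map { "Name$_" } 1 .. 5);
print scalar(@batches), " batches\n";
print join(',', @$_), "\n" for @batches;
```

With five items and a size of two, the last batch holds a single item, which is the case the final `output(...) if @observations` line handles.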

Answer 1 (score: 1)

You can certainly use XML::Twig, with purge to keep the memory footprint low.

#!/usr/bin/env perl

use strict;
use warnings;
use XML::Twig;

#the number of elements per file
my $subelt_count = 2;
#a running tally for numbering output files
my $count        = 1;
#a holding space for the current 'batch' that gets emptied each time
my @processed_elements;

#called when the parser hits a matching element, as it goes. 
sub process_xml {
   my ( $twig, $elt ) = @_;

   push( @processed_elements, $elt );

   if ( @processed_elements >= $subelt_count ) {

      #2 processed so far, start a new file
      open( my $output, ">", "file_" . $count++ . ".xml" ) or die $!;
      print {$output} $twig->sprint;
      close($output);

      #delete the elements we've already printed
      $_->delete for @processed_elements;
      @processed_elements = ();
      #Dump processed stuff from memory
      $twig -> purge;
   }
}

my $parser =
  XML::Twig->new(
   twig_handlers => { 'pmlcore:Observation' => \&process_xml } );
$parser->set_pretty_print('indented_a');
$parser->parsefile ( 'input_file_name.xml' );

#in case there's any trailing elements (e.g. there's not exactly a multiple of $subelt_count in the file), otherwise they'll be discarded
if ( $parser->get_xpath('//pmlcore:Observation') ) {
   $parser->print;
}

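Note: this operates at the <pmlcore:Observation> tag level, so your <paraid:ID>1234</paraid:ID> will only be printed in one document. I can't tell whether that is a special case that should be handled explicitly, but you could take a similar approach to preserve this tag. Otherwise, the first $twig -> purge flushes every closed tag held in memory, and that includes this one.

There is no way around that while using purge to save memory. If you don't purge you may still be fine, because we are deleting the elements as we go.

So you could either:

  • Comment out the $twig -> purge line (and accept that the memory overhead may increase, though it may well not).
  • 'Save' the 'paraid' element.

Something like this: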

#!/usr/bin/env perl

use strict;
use warnings;
use XML::Twig;

my $subelt_count = 2;
my $count        = 1;
my @processed_elements;

my $paraid; 

sub process_xml {
   my ( $twig, $elt ) = @_;

   push( @processed_elements, $elt );
   if ( my $new_paraid = $elt -> parent -> first_child('paraid:ID') ) {
      $paraid = $new_paraid;
      $paraid -> cut; 
   }
   if ( @processed_elements >= $subelt_count ) {

      #2 processed so far, start a new file
      open( my $output, ">", "file_" . $count++ . ".xml" ) or die $!;
      $paraid -> paste ( $twig -> root );
      print {$output} $twig->sprint;
      close($output);
      $_->delete for @processed_elements;
      @processed_elements = ();
      $twig -> purge;
   }
}

my $parser =
  XML::Twig->new(
   twig_handlers => { 'pmlcore:Observation' => \&process_xml } );
$parser->set_pretty_print('indented_a');
$parser->parsefile ( 'your_input_file.xml' );

#in case there's any trailing elements;
if ( $parser->get_xpath('//pmlcore:Observation') ) {
   $parser->print;
}

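Answer 2 (score: 1)

This will do as you ask. It uses the twig_roots facility of the XML::Twig module so that the whole of the XML data isn't accumulated in memory.

There is a callback for pmlcore:Observation elements. Each time one is found, if it brings the number of such elements up to CHUNK_SIZE, then print_chunk is called to write a new chunk file, and all the pmlcore:Observation elements are deleted, ready for new children to be added.

Once the file has been parsed, the twig is checked for any pmlcore:Observation elements that have not yet been written to disk. If any are found, print_chunk is called again.

Note that I have added a fifth pmlcore:Observation to the test data to test the case where the number of observations is not an exact multiple of the chunk size. This results in Name3.txt being written with just one observation.

I have also used the autodie pragma to avoid having to explicitly test the status of every IO operation, such as open and close.

This program expects the path to the input XML file as a parameter on the command line.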

use strict;
use warnings 'all';
use feature 'say';
use autodie;

use XML::Twig;

use constant CHUNK_SIZE => 2;

my ( $xml_file ) = @ARGV or die "No input XML file specified";

my $twig = XML::Twig->new(
    twig_roots               => { 'pmlcore:Observation' => \&handle_obs },
    twig_print_outside_roots => 0,
    pretty_print             => 'indented',
);

$twig->parsefile( $xml_file );

# Print any remaining chunks
print_chunk() if $twig->root->has_child( 'pmlcore:Observation' );

sub handle_obs {
    my ( $twig, $elem ) = @_;

    my $n = $twig->root->children_count( 'pmlcore:Observation' );
    print_chunk() if $n >= CHUNK_SIZE;
}

my $n;

sub print_chunk {

    my $filename = sprintf 'Name%d.txt', ++$n;

    open my $fh, '>', $filename;
    $twig->print( $fh );
    close $fh;

    say qq{"$filename" written};

    $_->delete for $twig->root->children( 'pmlcore:Observation' );
}

Output

Name1.txt

<pmlcore:Sensor xmlns:pmlcore="urn:autoid:specification:interchange:PMLCore:xml:schema:1" xmlns:pmluid="urn:autoid:specification:universal:Identifier:xml:schema:1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:autoid:specification:interchange:PMLCore:xml:schema:1 ./PML/SchemaFiles/Interchange/PMLCore.xsd">
  <paraid:ID>1234</paraid:ID>
  <pmlcore:Observation>
    <childtag>Name1</childtag>
    <childtag2>Number1</childtag2>
    <childtag3>
      <childtag4></childtag4>
    </childtag3>
  </pmlcore:Observation>
  <pmlcore:Observation>
    <childtag>Name2</childtag>
    <childtag2>Number2</childtag2>
    <childtag3>
      <childtag4></childtag4>
    </childtag3>
  </pmlcore:Observation>
</pmlcore:Sensor>

Name2.txt

<pmlcore:Sensor xmlns:pmlcore="urn:autoid:specification:interchange:PMLCore:xml:schema:1" xmlns:pmluid="urn:autoid:specification:universal:Identifier:xml:schema:1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:autoid:specification:interchange:PMLCore:xml:schema:1 ./PML/SchemaFiles/Interchange/PMLCore.xsd">
  <paraid:ID>1234</paraid:ID>
  <pmlcore:Observation>
    <childtag>Name3</childtag>
    <childtag2>Number3</childtag2>
    <childtag3>
      <childtag4></childtag4>
    </childtag3>
  </pmlcore:Observation>
  <pmlcore:Observation>
    <childtag>Name4</childtag>
    <childtag2>Number4</childtag2>
    <childtag3>
      <childtag4></childtag4>
    </childtag3>
  </pmlcore:Observation>
</pmlcore:Sensor>

Name3.txt

<pmlcore:Sensor xmlns:pmlcore="urn:autoid:specification:interchange:PMLCore:xml:schema:1" xmlns:pmluid="urn:autoid:specification:universal:Identifier:xml:schema:1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:autoid:specification:interchange:PMLCore:xml:schema:1 ./PML/SchemaFiles/Interchange/PMLCore.xsd">
  <paraid:ID>1234</paraid:ID>
  <pmlcore:Observation>
    <childtag>Name5</childtag>
    <childtag2>Number5</childtag2>
    <childtag3>
      <childtag4></childtag4>
    </childtag3>
  </pmlcore:Observation>
</pmlcore:Sensor>
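
A quick way to sanity-check chunk files like these is to count the pmlcore:Observation start tags in each one. The sketch below uses only core Perl and a regular expression rather than an XML parser, so it assumes the tag name never appears inside comments or CDATA sections. `count_observations` is a hypothetical helper name, not something from the answers.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: count pmlcore:Observation start tags in a string.
# The lookahead requires whitespace, '/' or '>' after the name, so longer
# element names that merely begin with "Observation" are not counted.
# End tags start with "</" and therefore never match.
sub count_observations {
    my ($xml) = @_;
    my $count = () = $xml =~ /<pmlcore:Observation(?=[\s\/>])/g;
    return $count;
}

# Report the count for every file named on the command line,
# for example: Name1.txt Name2.txt Name3.txt
for my $file (@ARGV) {
    open my $fh, '<:encoding(UTF-8)', $file or die "$file: $!";
    my $xml = do { local $/; <$fh> };   # slurp the whole chunk file
    close $fh;
    printf "%s: %d\n", $file, count_observations($xml);
}
```

Every file except possibly the last should report the chunk size; the last one reports whatever was left over.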