I want to split a ~300 MB XML file into separate files based on every N occurrences of a given element.
My source XML is:
<?xml version="1.0" encoding="UTF-8"?>
<pmlcore:Sensor
xsi:schemaLocation="urn:autoid:specification:interchange:PMLCore:xml:schema:1 ./PML/SchemaFiles/Interchange/PMLCore.xsd"
xmlns:pmlcore="urn:autoid:specification:interchange:PMLCore:xml:schema:1"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:pmluid="urn:autoid:specification:universal:Identifier:xml:schema:1">
<paraid:ID>1234</paraid:ID>
<pmlcore:Observation>
<childtag>Name1</childtag>
<childtag2>Number1</childtag2>
<childtag3>
<childtag4></childtag4>
</childtag3>
</pmlcore:Observation>
<pmlcore:Observation>
<childtag>Name2</childtag>
<childtag2>Number2</childtag2>
<childtag3>
<childtag4></childtag4>
</childtag3>
</pmlcore:Observation>
<pmlcore:Observation>
<childtag>Name3</childtag>
<childtag2>Number3</childtag2>
<childtag3>
<childtag4></childtag4>
</childtag3>
</pmlcore:Observation>
<pmlcore:Observation>
<childtag>Name4</childtag>
<childtag2>Number4</childtag2>
<childtag3>
<childtag4></childtag4>
</childtag3>
</pmlcore:Observation>
</pmlcore:Sensor>
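Given an input file like the one above, I want to split it into separate files at every 10 occurrences of the pmlcore:Observation element.
For testing purposes, with the input XML above, I would like the file to be split at every two occurrences of the pmlcore:Observation element (with the opening lines of the input file, the XML prolog and paraID:ID, inserted into each split file).
My XML would then be split into two files, the first of which is: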
<?xml version="1.0" encoding="UTF-8"?>
<pmlcore:Sensor
xsi:schemaLocation="urn:autoid:specification:interchange:PMLCore:xml:schema:1 ./PML/SchemaFiles/Interchange/PMLCore.xsd"
xmlns:pmlcore="urn:autoid:specification:interchange:PMLCore:xml:schema:1"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:pmluid="urn:autoid:specification:universal:Identifier:xml:schema:1">
<paraid:ID>1234</paraid:ID>
<pmlcore:Observation>
<childtag>Name1</childtag>
<childtag2>Number1</childtag2>
<childtag3>
<childtag4></childtag4>
</childtag3>
</pmlcore:Observation>
<pmlcore:Observation>
<childtag>Name2</childtag>
<childtag2>Number2</childtag2>
<childtag3>
<childtag4></childtag4>
</childtag3>
</pmlcore:Observation>
</pmlcore:Sensor>
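I have been able to do this with awk, but the process is very slow. I would like to know whether there is a simple but efficient way to do this with a Perl script (perhaps using XML::Twig).
Answer 0 (score: 3)
You can use XML::LibXML::Reader, the pull parser from libxml2: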
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML::Reader;
use constant {
    SIZE    => 2, # 10
    PMLUID  => 'urn:autoid:specification:universal:Identifier:xml:schema:1',
    PMLCORE => 'urn:autoid:specification:interchange:PMLCore:xml:schema:1',
};
my $name_tally = 0;
# Write one output file: a shallow copy of the root, the ID, and one batch of observations.
sub output {
    my ($orig_root, $id, @observations) = @_;
    my $root = $orig_root->cloneNode;
    $root->addChild($id);
    $root->addChild($_) for @observations;
    ++$name_tally;
    open my $OUT, '>:encoding(UTF-8)', "Name$name_tally.txt" or die $!;
    print {$OUT} $root;
    print STDERR "$name_tally\n";
}
# The input file path is taken from the first command-line argument.
my $reader = 'XML::LibXML::Reader'->new(location => shift)
    or die;
$reader->read;
my $root = $reader->copyCurrentNode;
# nextElement matches on local name and namespace URI, so the ID element must be in the PMLUID namespace.
$reader->nextElement('ID', PMLUID) or die "No ID\n";
my $id = $reader->copyCurrentNode(1);
my @observations;
while ($reader->nextElement('Observation', PMLCORE)) {
    push @observations, $reader->copyCurrentNode(1);
    if (@observations == SIZE) {
        output($root, $id, @observations);
        @observations = ();
    }
}
# Output the remainder if the total is not a multiple of SIZE.
output($root, $id, @observations) if @observations;
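The batching done by the loop above does not depend on the XML handling. As a minimal sketch, assuming a hypothetical helper named chunks_of (it is not part of XML::LibXML) and plain strings standing in for the copied nodes, the same grouping, including the final partial batch, might look like this:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper, for illustration only: split a list into batches
# of at most $size items. The last batch holds whatever remains.
sub chunks_of {
    my ($size, @items) = @_;
    die "size must be a positive integer\n" unless $size && $size > 0;
    my @chunks;
    # splice removes up to $size items from the front on each pass.
    push @chunks, [ splice @items, 0, $size ] while @items;
    return @chunks;
}

# Five stand-in "observations" with a batch size of 2.
my @chunks = chunks_of(2, map { "Observation$_" } 1 .. 5);
printf "Name%d.txt: %s\n", $_ + 1, join(', ', @{ $chunks[$_] }) for 0 .. $#chunks;
```

With five items and a batch size of 2 this yields three batches, the last holding a single item, which is the case the final output(...) call in the script above is there to handle.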
Answer 1 (score: 1)
You can definitely use XML::Twig, with purge to keep the memory footprint down.
#!/usr/bin/env perl
use strict;
use warnings;
use XML::Twig;
#the number of elements per file
my $subelt_count = 2;
#a running tally for numbering output files
my $count = 1;
#a holding space for the current 'batch' that gets emptied each time
my @processed_elements;
#called when the parser hits a matching element, as it goes.
sub process_xml {
    my ( $twig, $elt ) = @_;
    push( @processed_elements, $elt );
    if ( @processed_elements >= $subelt_count ) {
        #$subelt_count elements collected so far, start a new file
        open( my $output, ">", "file_" . $count++ . ".xml" ) or die $!;
        print {$output} $twig->sprint;
        close($output);
        #delete the elements we've already printed
        $_->delete for @processed_elements;
        @processed_elements = ();
        #Dump processed stuff from memory
        $twig -> purge;
    }
}
my $parser =
    XML::Twig->new(
        twig_handlers => { 'pmlcore:Observation' => \&process_xml } );
$parser->set_pretty_print('indented_a');
$parser->parsefile ( 'input_file_name.xml' );
#print any trailing elements to STDOUT (when the total isn't an exact multiple of $subelt_count); otherwise they'd be discarded
if ( $parser->get_xpath('//pmlcore:Observation') ) {
    $parser->print;
}
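Note: this operates at the level of the <pmlcore:Observation> tag, so your <paraid:ID>1234</paraid:ID> will only be printed in one document. I can't tell whether this is a special case that should be handled explicitly, but you can take a similar approach to preserve this tag. Otherwise, the first $twig -> purge will clear every closed tag held in memory, and that includes this one.
There is no way around that while using purge to save memory. If you don't purge, you may still be fine, because we are deleting the elements anyway.
So you can either remove the $twig -> purge line (and accept that this may increase the memory overhead, though it probably won't), or do something like this: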
#!/usr/bin/env perl
use strict;
use warnings;
use XML::Twig;
my $subelt_count = 2;
my $count = 1;
my @processed_elements;
#the most recently seen paraid:ID element, kept so it can be re-inserted into each file
my $paraid;
sub process_xml {
    my ( $twig, $elt ) = @_;
    push( @processed_elements, $elt );
    #detach the ID element from the tree so that purge doesn't discard it
    if ( my $new_paraid = $elt -> parent -> first_child('paraid:ID') ) {
        $paraid = $new_paraid;
        $paraid -> cut;
    }
    if ( @processed_elements >= $subelt_count ) {
        #$subelt_count elements collected so far, start a new file
        open( my $output, ">", "file_" . $count++ . ".xml" ) or die $!;
        #put the ID back under the root before printing
        $paraid -> paste ( $twig -> root );
        print {$output} $twig->sprint;
        close($output);
        $_->delete for @processed_elements;
        @processed_elements = ();
        $twig -> purge;
    }
}
my $parser =
    XML::Twig->new(
        twig_handlers => { 'pmlcore:Observation' => \&process_xml } );
$parser->set_pretty_print('indented_a');
$parser->parsefile ( 'your_input_file.xml' );
#print any trailing elements to STDOUT
if ( $parser->get_xpath('//pmlcore:Observation') ) {
    $parser->print;
}
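Answer 2 (score: 1)
This will do as you ask. It uses the twig_roots facility of the XML::Twig module so that the entire XML data doesn't accumulate in memory.
There is a callback for pmlcore:Observation elements, which is called whenever one of them is found. If this brings the number of such elements to CHUNK_SIZE, print_chunk is called to write a new chunk file and to delete all the pmlcore:Observation elements, ready for new child elements to be added.
After the file has been parsed, the twig is checked for pmlcore:Observation elements that were never written to disk. If any are found, print_chunk is called once more.
Note that I have added a fifth pmlcore:Observation to the test data, to test the case where the number of elements is not an exact multiple of the chunk size. This results in Name3.txt being written with only a single observation.
I have also used the autodie pragma to avoid having to explicitly test the status of every IO operation, such as open and close.
The program expects the path to the input XML file as a parameter on the command line.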
use strict;
use warnings 'all';
use feature 'say';
use autodie;
use XML::Twig;
use constant CHUNK_SIZE => 2;
my ( $xml_file ) = @ARGV or die "No input XML file specified";
my $twig = XML::Twig->new(
    twig_roots               => { 'pmlcore:Observation' => \&handle_obs },
    twig_print_outside_roots => 0,
    pretty_print             => 'indented',
);
$twig->parsefile( $xml_file );
# Print any remaining chunks
print_chunk() if $twig->root->has_child( 'pmlcore:Observation' );
# Called for each pmlcore:Observation; writes a chunk once CHUNK_SIZE have accumulated
sub handle_obs {
    my ( $twig, $elem ) = @_;
    my $n = $twig->root->children_count( 'pmlcore:Observation' );
    print_chunk() if $n >= CHUNK_SIZE;
}
# Counter used by print_chunk to number the output files
my $n;
# Write the current twig to the next NameN.txt file, then drop the observations just written
sub print_chunk {
    my $filename = sprintf 'Name%d.txt', ++$n;
    open my $fh, '>', $filename;
    $twig->print( $fh );
    close $fh;
    say qq{"$filename" written};
    $_->delete for $twig->root->children( 'pmlcore:Observation' );
}
Output, in Name1.txt, Name2.txt and Name3.txt respectively:
<pmlcore:Sensor xmlns:pmlcore="urn:autoid:specification:interchange:PMLCore:xml:schema:1" xmlns:pmluid="urn:autoid:specification:universal:Identifier:xml:schema:1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:autoid:specification:interchange:PMLCore:xml:schema:1 ./PML/SchemaFiles/Interchange/PMLCore.xsd">
<paraid:ID>1234</paraid:ID>
<pmlcore:Observation>
<childtag>Name1</childtag>
<childtag2>Number1</childtag2>
<childtag3>
<childtag4></childtag4>
</childtag3>
</pmlcore:Observation>
<pmlcore:Observation>
<childtag>Name2</childtag>
<childtag2>Number2</childtag2>
<childtag3>
<childtag4></childtag4>
</childtag3>
</pmlcore:Observation>
</pmlcore:Sensor>
<pmlcore:Sensor xmlns:pmlcore="urn:autoid:specification:interchange:PMLCore:xml:schema:1" xmlns:pmluid="urn:autoid:specification:universal:Identifier:xml:schema:1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:autoid:specification:interchange:PMLCore:xml:schema:1 ./PML/SchemaFiles/Interchange/PMLCore.xsd">
<paraid:ID>1234</paraid:ID>
<pmlcore:Observation>
<childtag>Name3</childtag>
<childtag2>Number3</childtag2>
<childtag3>
<childtag4></childtag4>
</childtag3>
</pmlcore:Observation>
<pmlcore:Observation>
<childtag>Name4</childtag>
<childtag2>Number4</childtag2>
<childtag3>
<childtag4></childtag4>
</childtag3>
</pmlcore:Observation>
</pmlcore:Sensor>
<pmlcore:Sensor xmlns:pmlcore="urn:autoid:specification:interchange:PMLCore:xml:schema:1" xmlns:pmluid="urn:autoid:specification:universal:Identifier:xml:schema:1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:autoid:specification:interchange:PMLCore:xml:schema:1 ./PML/SchemaFiles/Interchange/PMLCore.xsd">
<paraid:ID>1234</paraid:ID>
<pmlcore:Observation>
<childtag>Name5</childtag>
<childtag2>Number5</childtag2>
<childtag3>
<childtag4></childtag4>
</childtag3>
</pmlcore:Observation>
</pmlcore:Sensor>