Question

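I have nearly 500 XHTML files and I want to find duplicate IDs across all of them. The goal is to read the id="xxx" values from one file and make sure the same ID does not appear in any of the other files. If one is found, an error message should report that, for example, an ID from chapter1 also appears in some other chapter file.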

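I tried the code below and it produces the right result, but the program takes nearly 15 minutes to run. I would like more efficient code. Please help.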

foreach my $xhtml(@xhtml_files){
        my $htmlcnt = _open_file("$dirname\\$xhtml");
        my @Duplicate_xhtml_files = ();

        #-------------The external (ID) matched with (filename)-------------------
        @Duplicate_xhtml_files = _get_file_list($ARGV[0],1,0,'\.xhtml$',$xhtml); 
        my @array_ids = $htmlcnt =~ m{( id="[^>"]+")}isg;
        my $array_joinids = "##".join("##",@array_ids)."##";
        foreach my $file(@Duplicate_xhtml_files){
            my $duplicate_htmlcnt = _open_file("$dirname\\$file");
            while($duplicate_htmlcnt =~ m{( id="[^>"]+")}isg){
                my $pre = $`;   my $check_id = $1;
                if($array_joinids =~ m{\#\#$check_id\#\#}is){
                    ($ln, $cl) = LineCol($pre);
                    $Error .="\n\t[$ln:$cl]\: Error:[MV-1024]\: $file => The external ($check_id) matched with ($xhtml).\n";
                }
            }
        }
     }

Thanks in advance.

Answer 1

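Do as much work as possible outside the loop.

You have loops within loops within loops (within hidden loops). That can be slow.

Say there are 500 @xhtml_files, 100 files in @Duplicate_xhtml_files each time, and 100 IDs in each file. And let's not forget the search through $array_joinids, which is a hidden loop over the list of IDs! Say there are 100 IDs in $array_joinids.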

foreach my $xhtml (@xhtml_files) {
    ...500 times...
    foreach my $file(@Duplicate_xhtml_files) {
        ...50,000 times...
        while($duplicate_htmlcnt =~ m{( id="[^>"]+")}isg){
            ...5,000,000 times...

            # This is really looping over all the ids, so
            # you're looking at IDs 500,000,000 times.
            if($array_joinids =~ m{\#\#$check_id\#\#}is){
            }
        }
    }
}

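Those numbers are only a guess, but you get the idea: doing work in an inner loop multiplies its cost. Do everything you can as far up the loop nest as possible.

For example, if you need to find a particular element in a list, you have to loop over the whole list. That is too many loops within loops. And you are doing it in a very obscure way: turning the list into a ##-delimited string (a loop) and then repeatedly searching that string with a regex (another loop).

This...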

my @array_ids = $htmlcnt =~ m{ id="([^>"]+)"}isg;
my $array_joinids = "##".join("##",@array_ids)."##";

while($duplicate_htmlcnt =~ m{ id="([^>"]+)"}isg) {
    my $check_id = $1;
    if($array_joinids =~ m{\#\#$check_id\#\#}is) {
        ...
    }
}

...would be much better written like this...

my @ids = $htmlcnt =~ m{ id="([^>"]+)"}isg;
while($duplicate_htmlcnt =~ m{ id="([^>"]+)"}isg) {
    my $check_id = $1;
    if( grep { $_ eq $check_id } @ids ) {
        ...
    }
}

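But that still loops over all the IDs (the grep) in your critical inner loop. You could use List::Util::first to make it a bit faster, since it stops at the first match, but that is just rearranging deck chairs on the Titanic. The real performance win is getting rid of the innermost loop.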
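For completeness, here is a minimal sketch of that stopgap, assuming the IDs are already in an array. `first` from the core List::Util module returns the first element for which the block is true and stops scanning there. The `id_in_list` helper and the sample IDs are made up for illustration.

```perl
use strict;
use warnings;
use List::Util qw(first);

# Hypothetical helper: true if $check_id occurs in @$ids.
# first() stops at the first match, but the worst case is still a full scan.
sub id_in_list {
    my ($check_id, $ids) = @_;
    my $hit = first { $_ eq $check_id } @$ids;
    return defined $hit;
}

my @ids = qw(intro ch1-sec1 ch1-sec2);    # made-up sample IDs
print id_in_list('ch1-sec1', \@ids) ? "found\n" : "not found\n";
print id_in_list('ch9-sec9', \@ids) ? "found\n" : "not found\n";
```

This still costs time proportional to the number of IDs for every lookup that misses.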

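Instead, use a hash. Then, rather than looping over all the IDs in your critical innermost loop, you can do a fast hash lookup. A hash lookup takes roughly the same time no matter how many elements the hash holds.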

# Also only store the ID, not all the HTML around it.
my %ids = map { $_ => 1 } $htmlcnt =~ m{ id="([^>"]+)"}isg;

while($duplicate_htmlcnt =~ m{ id="([^>"]+)"}isg) {
    my $check_id = $1;
    if( $ids{$check_id} ) {
        ...it's a duplicate!...
    }
}

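Another obvious target is eliminating the nested loops altogether. Scan all the files once, store every ID, and then use %all_ids to check for duplicates. That avoids parsing the same XHTML file more than once.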

# This will hold what IDs are in what files.
my %all_ids;

# Record which IDs are in which files.
for my $xhtml (@all_xhtml_files) {
     ...
     while( $htmlcnt =~ m{ id="([^>"]+)"}isg ) {
         $all_ids{$1}{$xhtml} = 1;
     }
}

# Now go through the list of IDs and look for ones that are in
# more than one file.
for my $id (sort keys %all_ids) {
    # $all_ids{$id} is a hash ref keyed by file name.
    my @in_files = sort keys %{ $all_ids{$id} };
    if( @in_files > 1 ) {
        print "Duplicate ID $id seen in @in_files\n";
    }
}

You will have to modify this to get the details your duplicate report needs, but you get the idea.

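As a side note, do not use $` unless you are on Perl 5.20 or later, where this was fixed. It can seriously slow down every regex in the program. See perlvar for details and alternatives.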
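One alternative documented in perlvar is the `@-` array: `$-[0]` holds the offset at which the last successful match started, so the position can be computed without touching $`. The sketch below assumes you only need a line and column for the error message. `line_col` is a hypothetical stand-in for the question's `LineCol`, whose source was not shown, and the sample markup is made up.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the question's LineCol(): turn a character
# offset into a 1-based (line, column) pair.
sub line_col {
    my ($text, $offset) = @_;
    my $before = substr($text, 0, $offset);
    my $line   = 1 + ($before =~ tr/\n//);    # count newlines before the match
    my $col    = $offset - (rindex($before, "\n") + 1) + 1;
    return ($line, $col);
}

# Made-up sample content.
my $html = qq{<p id="a">one</p>\n<p>two <b id="b">three</b></p>\n};

while ($html =~ m{ id="([^>"]+)"}ig) {
    # $-[0] is the offset where the whole match starts, so $` is not needed.
    my ($ln, $cl) = line_col($html, $-[0]);
    print "[$ln:$cl] id=$1\n";
}
```

perlvar also describes `${^PREMATCH}` with the `/p` modifier, which gives the same text as $` for a single regex without affecting the others.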

Answer 2

There are several ways to do this, with or without perl. If you want to use perl, you have a couple of options:

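Read all the files and use every ID as a hash key, recording each place where the ID is encountered. That way you only have to read each file once, and for 500 files of, say, 500k each, my guess is that it would take under a minute. You did not show your complete program, so this is an approximation, but it would be something like: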

use File::Slurp;

# Map each ID to the list of places (file and line number) where it occurs.
my %id_values;
foreach my $file (@xhtml_files) {
    my @lines = read_file($file);
    my $line_index = 1;
    foreach my $line (@lines) {
        # Record every ID found on this line.
        while ($line =~ m{ id="([^>"]+)"}ig) {
            push @{ $id_values{$1} }, { file => $file, line => $line_index };
        }
        $line_index++;
    }
}

# Report every ID recorded more than once. Note that this also
# reports an ID repeated within a single file.
foreach my $key (sort keys %id_values) {
    my @seen = @{ $id_values{$key} };
    if (@seen > 1) {
        foreach my $dup (@seen) {
            print "$key:: $dup->{file} - $dup->{line}\n";
        }
    }
}

This is untested, but it should give you the general idea.

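Sort the files before parsing them, so that you can use a binary search to find the duplicates. This will take longer the first time you run it, but it will be faster if you need to run it many times. Writing a binary search function can be tricky, but there's a CPAN module for that.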
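To make the idea concrete, here is a hand-rolled sketch (not the CPAN module mentioned above). It assumes the IDs from the other files have already been collected into a sorted array, and it then binary-searches that array for each ID to check. The `binary_search` helper and the sample IDs are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: return the index of $target in the string-sorted
# array @$sorted, or -1 if it is absent.
sub binary_search {
    my ($sorted, $target) = @_;
    my ($lo, $hi) = (0, $#$sorted);
    while ($lo <= $hi) {
        my $mid = int(($lo + $hi) / 2);
        my $cmp = $sorted->[$mid] cmp $target;
        if    ($cmp < 0) { $lo = $mid + 1 }
        elsif ($cmp > 0) { $hi = $mid - 1 }
        else             { return $mid }
    }
    return -1;
}

my @known_ids = sort qw(sec3 intro sec1 sec2);    # made-up IDs from other files
for my $id (qw(sec2 appendix)) {
    my $where = binary_search(\@known_ids, $id);
    print "$id: ", ($where >= 0 ? "duplicate" : "unique"), "\n";
}
```

The array must be sorted with the same comparison the search uses (`cmp` here), otherwise lookups can miss entries that are present.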

Without perl, you could use bash to extract the values and then grep for them. Something like:

cat *.xhtml | cut -d'=' -f2 | sort | uniq -c | sort >> duplicate_ids.txt

(The "cut" command is approximate, since you did not provide an example.)

Answer 3

OK, I am going to suggest not doing what you are doing. The answer is "use a parser".

The problem is that without better sample input I cannot give a more definite answer, but it would go something like this:

#!/usr/bin/env perl

use strict;
use warnings;

use XML::Twig;

my %seen;

sub seen_id {
    my ( $twig, $tag ) = @_;
    my $id = $tag->att('id');
    $tag -> print;
    return unless $id;
    print "Duplicate spotted $id\n" if $seen{$id}++;
}

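# 'tag[@id]' fires only for elements named "tag" that carry an id attribute;
# substitute the element name used in your own files.
my $twig = XML::Twig -> new ( twig_handlers => { 'tag[@id]' => \&seen_id } );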

foreach my $xhtml ( glob "*.xhtml" ) {
    $twig -> parsefile ( $xhtml ); 
}

print join ( "\n", sort keys %seen );

Depending on the size of the XHTML files, I might purge, discard, or bail out early as I go. (Disk IO is usually the limiting factor.)

Or perhaps instead:

#!/usr/bin/env perl

use strict;
use warnings;

use XML::Twig;

my %seen;

foreach my $xhtml ( glob "*.xhtml" ) {
    my $twig = XML::Twig -> new(); 
    $twig -> parsefile ( $xhtml ); 
    # Get the first element (of any name) that has an 'id' attribute.
    my $id = $twig -> get_xpath('//*[@id]',0) -> att('id'); 
    print "$xhtml is a dupe\n" if $seen{$id}++; 
}

print join ( "\n", sort keys %seen );

Validating values across n files
