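Parallel::ForkManager -> populating a hash of hashes of arrays from child processes

Date: 2016-10-31 15:05:53

Tags: perl parallel-processing parent-child hash-of-hashes

I have this piece of code that I would like to parallelize (shown for reference):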

my (%fastas, %counts);

foreach my $sample ( sort keys %AC2 )
{
    foreach my $chrom (sort keys %{ $AC2{$sample} } )
    {
        foreach my $pos ( sort { $a <=> $b } (@{ $allAC2{$chrom} }) )
        {
            my $allele;

            #position was genotyped in sample
            # or is AC=1, but was also found in AC=2
            if( grep(/\b$pos\b/, @{ $AC2{$sample}{$chrom} }) || grep(/\b$pos\b/, @{ $finalAC1{$sample}{$chrom} }) ) #"\b" is for word boundary -> exact word match
            {
                $allele = @{ $vcfs{$sample}{$chrom}{$pos} }[2]; #ALT allele
            }
            #Make sure all SNP positions are in all samples
            #Fill with reference genome allele information
            else
            {
                #Fill with reference genome allele information
                $allele = substr( @{ $ref{$chrom} }[0], $pos-1, 1); #or die "$sample, $chrom, $pos";
            }
            push ( @{ $fastas{$sample}{$chrom}{$pos} }, $allele);
            push ( @{ $counts{$chrom}{$pos} }, $allele) unless (grep {$_ eq $allele} @{ $counts{$chrom}{$pos} } );
        }
    }
}

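Basically, the child processes need to populate two hashes. I searched, but only found examples showing how to use "run_on_finish" to return a variable from a child process. The "problem" is that every example/tutorial I found returns a scalar.

Is it possible to pass a hash (or two hashes) back from a child process?

Thanks, Marco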

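2 Answers:

Answer 0: (score: 3)

Don't return a hash; return a hash reference.

References in Perl are scalar values. So you only need to return references to %fastas and %counts.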
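
As a minimal sketch of that point (the sample, chromosome, and position values below are made up for illustration), taking a reference to a hash gives you one scalar, and several references can be bundled into one anonymous array, which is again a single scalar:

```perl
use strict;
use warnings;

# Hypothetical data shaped like the question's %fastas and %counts.
my %fastas = ( sampleA => { chr1 => { 100 => ['A'] } } );
my %counts = ( chr1 => { 100 => ['A'] } );

# A reference is a single scalar, however large the hash behind it is.
my $fastas_ref = \%fastas;
print ref($fastas_ref), "\n";    # prints "HASH"

# Two references bundled in an anonymous array: still one scalar.
my $payload = [ \%fastas, \%counts ];
print scalar(@$payload), "\n";   # prints "2"

# Dereference to get back at the data.
my ($f, $c) = @$payload;
print $f->{sampleA}{chr1}{100}[0], "\n";    # prints "A"
```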

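Here is a hacky example adapted from the Parallel::ForkManager documentation. Each child process builds a hash with as many elements as its input value indicates, then returns a reference to that hash to the parent, where the run_on_finish callback picks it up and stores it in the $overall data structure.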

use strict;
use warnings;
use Data::Printer;
use Parallel::ForkManager;

my $pm = Parallel::ForkManager->new(2);

my $overall; # will hold all results in the parent
$pm->run_on_finish( sub {
    my ($pid, $exit_code, $ident, $exit_signal, $core_dump, $data_structure_reference) = @_;

    $overall->{$pid} = $data_structure_reference;
});

DATA_LOOP:
foreach my $data (1 .. 10) {
  # Forks and returns the pid for the child:
  my $pid = $pm->start and next DATA_LOOP;

  my %child_result = map { $_ => 1 } 1 .. $data;

  $pm->finish( 0, \%child_result );
}

$pm->wait_all_children;
p $overall;

The output looks like this:

\ {
    1224   {
        1   1
    },
    1225   {
        1   1,
        2   1
    },
    1226   {
        1   1,
        2   1,
        3   1
    },
    1228   {
        1   1,
        2   1,
        3   1,
        4   1
    },
    1230   {
        1   1,
        2   1,
        3   1,
        4   1,
        5   1
    },
    1231   {
        1   1,
        2   1,
        3   1,
        4   1,
        5   1,
        6   1
    },
    1232   {
        1   1,
        2   1,
        3   1,
        4   1,
        5   1,
        6   1,
        7   1
    },
    1233   {
        1   1,
        2   1,
        3   1,
        4   1,
        5   1,
        6   1,
        7   1,
        8   1
    },
    1234   {
        1   1,
        2   1,
        3   1,
        4   1,
        5   1,
        6   1,
        7   1,
        8   1,
        9   1
    },
    1235   {
        1    1,
        2    1,
        3    1,
        4    1,
        5    1,
        6    1,
        7    1,
        8    1,
        9    1,
        10   1
    }
}

If you want to return two data structures, wrap them in an array reference.

$pm->finish( 0, [ \%fastas, \%counts ] ); 
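
On the parent side, the callback then has to unpack that array reference and merge both structures. The following is only a sketch under assumptions: the %input hash and the one-position data layout are hypothetical stand-ins for the question's real data, and the merge rules (one sample per child, each allele kept once per position) are inferred from the question's code.

```perl
use strict;
use warnings;
use Parallel::ForkManager;

my $pm = Parallel::ForkManager->new(2);

my (%fastas, %counts);    # filled in the parent only

$pm->run_on_finish( sub {
    my ($pid, $exit_code, $ident, $exit_signal, $core_dump, $data) = @_;
    return unless defined $data;    # child sent nothing back

    my ($child_fastas, $child_counts) = @$data;

    # Each child handles one sample, so top-level keys never collide.
    $fastas{$_} = $child_fastas->{$_} for keys %$child_fastas;

    # Counts overlap between children: keep each allele once per position.
    for my $chrom (keys %$child_counts) {
        for my $pos (keys %{ $child_counts->{$chrom} }) {
            my $seen = $counts{$chrom}{$pos} //= [];
            for my $allele (@{ $child_counts->{$chrom}{$pos} }) {
                push @$seen, $allele unless grep { $_ eq $allele } @$seen;
            }
        }
    }
});

# Hypothetical input: sample name => allele at chr1, position 100.
my %input = ( sampleA => 'A', sampleB => 'G', sampleC => 'A' );

for my $sample (sort keys %input) {
    $pm->start and next;

    my %child_fastas = ( $sample => { chr1 => { 100 => [ $input{$sample} ] } } );
    my %child_counts = ( chr1 => { 100 => [ $input{$sample} ] } );

    $pm->finish( 0, [ \%child_fastas, \%child_counts ] );
}

$pm->wait_all_children;
```

The order of alleles in %counts depends on which child finishes first, so sort the arrays before comparing or printing them if you need stable output.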

Answer 1: (score: 2)

I just thought I'd post my solution:

use Sys::CPU;
use Parallel::ForkManager;

my (%fastas, %counts);

#setting up the forking process
my $nCPU = Sys::CPU::cpu_count();
my $pm = Parallel::ForkManager -> new($nCPU);

$pm->run_on_finish(sub {
    my ($pid, $exit_code, $ident, $exit_signal, $core_dump, $data_structure_reference) = @_;

    %fastas = (%fastas, %{ $data_structure_reference->{fas} });
});

my @mySamples = sort keys %AC2;
my $s = $mySamples[0];

foreach my $sample (@mySamples)
{
    my $pid = $pm->start and next;

    my %allSeqs;

    foreach my $chrom (sort keys %{ $AC2{$sample} } )
    {
        foreach my $pos ( sort { $a <=> $b } (@{ $allAC2{$chrom} }) )
        {
            my $allele;

            #position was genotyped in sample
            # or is AC=1, but was also found in AC=2
            if( grep(/\b$pos\b/, @{ $AC2{$sample}{$chrom} }) || grep(/\b$pos\b/, @{ $finalAC1{$sample}{$chrom} }) ) #"\b" is for word boundary -> exact word match
            {
                $allele = @{ $vcfs{$sample}{$chrom}{$pos} }[2]; #ALT allele
            }
            #Make sure all SNP positions are in all samples
            #Fill with reference genome allele information
            else
            {
                #Fill with reference genome allele information
                $allele = substr( @{ $ref{$chrom} }[0], $pos-1, 1); #or die "$sample, $chrom, $pos";
            }

            push ( @{ $allSeqs{$sample}{$chrom}{$pos} }, $allele);
        }
    }
    $pm -> finish(0, { fas => \%allSeqs });
}

$pm -> wait_all_children();


#List ALT alleles found at each position
foreach my $sample ( sort keys %fastas )
{
    foreach my $chrom ( sort keys %{ $fastas{$sample} } )
    {
        foreach my $pos ( sort keys %{ $fastas{$sample}{$chrom} } )
        {
            my $allele = @{ $fastas{$sample}{$chrom}{$pos} }[0];
            push ( @{ $counts{$chrom}{$pos} }, $allele) unless (grep {$_ eq $allele} @{ $counts{$chrom}{$pos} } );
        }
    }
}
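
A note on the merge in run_on_finish above: the list assignment %fastas = (%fastas, ...) rebuilds the whole hash every time a child finishes. A hash slice should give the same result while only touching the new keys. This is a sketch with made-up child results, not a measured optimization:

```perl
use strict;
use warnings;

my %fastas;

# Hypothetical results from two children, one sample each.
my @child_results = (
    { sampleA => { chr1 => { 100 => ['A'] } } },
    { sampleB => { chr1 => { 100 => ['G'] } } },
);

for my $result (@child_results) {
    # Same effect as %fastas = (%fastas, %$result), without rebuilding %fastas.
    @fastas{ keys %$result } = values %$result;
}

print join(',', sort keys %fastas), "\n";    # prints "sampleA,sampleB"
```

Either form is a shallow merge: it is safe here because each child returns a different top-level sample key. If two children returned the same sample, the later result would replace the earlier one entirely.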

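I had to take %counts out of the main loop and compute it separately in the parent, because filling it requires checking the values already stored in it, and a child process only sees its own copy of the hash from the parent (I hope this explanation makes sense!).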
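
To illustrate the underlying behavior with plain fork (a minimal sketch with made-up data, independent of Parallel::ForkManager): whatever a child writes to a hash stays in the child's private copy, so the parent never sees it unless the data is explicitly sent back.

```perl
use strict;
use warnings;

my %counts = ( chr1 => { 100 => ['A'] } );

my $pid = fork();
die "fork failed: $!" unless defined $pid;

if ($pid == 0) {
    # Child: this changes the child's private copy only.
    push @{ $counts{chr1}{100} }, 'G';
    exit 0;
}

waitpid($pid, 0);

# Parent: still sees only the original allele.
print join(',', @{ $counts{chr1}{100} }), "\n";    # prints "A"
```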

Thanks everyone for the help, I really appreciate it! Marco