Parsing a pipe-delimited text file in Perl

Date: 2015-10-31 06:11:05

Tags: perl parsing

I have the following text file:

CUI|SDUI|HpoStr|MedGenStr|MedGenStr_SAB|STY|
CN000002|HP:0000001|All|All|HPO|Finding|
CN000003|HP:0000002|Abnormality of body height|Abnormality of body height|GTR|Finding|
CN000004|HP:0000003|Multicystic kidney dysplasia|Multicystic kidney dysplasia|GTR|Finding|
CN000006|HP:0000005|Mode of inheritance|Mode of inheritance|HPO|Finding|
C0443147|HP:0000006|Autosomal dominant inheritance|Autosomal dominant inheritance|GTR|Intellectual Product|
C0441748|HP:0000007|Autosomal recessive inheritance|Autosomal recessive inheritance|HPO|Intellectual Product|
CN000009|HP:0000008|Abnormality of female internal genitalia|Abnormality of female internal genitalia|GTR|Finding|

I want to parse it with Perl. This is what I have so far:

#!/usr/bin/perl

open (FILE, 'filename.txt');

while (<FILE>) {
    chomp;
    ($CUI, $SDUI, $HpoStr, $MedGenStr, $MedGenStr_SAB, $STY) = split("\t");
    print "CUI: $CUI\n";
    print "SDUI: $SDUI\n";
    print "HpoStr: $HpoStr\n";
    print "MedGenStr: $MedGenStr\n";
    print "MedGenStr_SAB: $MedGenStr_SAB\n";
    print "STY: $STY\n";
    print "---------\n";
}

close (FILE);
exit;

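When I run it from the nano editor I do get output, but when I run it with a command like perl filename.pl I get a lot of errors. I would like to know whether my code is wrong or whether there is a better way to structure it.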

In the code above I supply the input as a separate .txt file. These are the headers that I need to associate the entries with:

CUI|SDUI|HpoStr|MedGenStr|MedGenStr_SAB|STY|

The data rows are the same as in the file shown above.

If I want to pass the file as an input option, how should I go about it? The file can be as large as 1 GB.

1 answer:

Answer 0 (score: 1)

Your columns are separated by pipes (|), not tabs, so you need to split on that character instead:

use strict;
use warnings;
use Data::Dump;

while (<DATA>) {
    chomp;
    my @fields = split(/\|/, $_);
    dd(\@fields);   
}

__DATA__
CUI|SDUI|HpoStr|MedGenStr|MedGenStr_SAB|STY|
CN000002|HP:0000001|All|All|HPO|Finding|
CN000003|HP:0000002|Abnormality of body height|Abnormality of body height|GTR|Finding|
CN000004|HP:0000003|Multicystic kidney dysplasia|Multicystic kidney dysplasia|GTR|Finding|
CN000006|HP:0000005|Mode of inheritance|Mode of inheritance|HPO|Finding|
C0443147|HP:0000006|Autosomal dominant inheritance|Autosomal dominant inheritance|GTR|Intellectual Product|
C0441748|HP:0000007|Autosomal recessive inheritance|Autosomal recessive inheritance|HPO|Intellectual Product|
CN000009|HP:0000008|Abnormality of female internal genitalia|Abnormality of female internal genitalia|GTR|Finding|

Output:

["CUI", "SDUI", "HpoStr", "MedGenStr", "MedGenStr_SAB", "STY"]
["CN000002", "HP:0000001", "All", "All", "HPO", "Finding"]
[
  "CN000003",
  "HP:0000002",
  "Abnormality of body height",
  "Abnormality of body height",
  "GTR",
  "Finding",
]
[
  "CN000004",
  "HP:0000003",
  "Multicystic kidney dysplasia",
  "Multicystic kidney dysplasia",
  "GTR",
  "Finding",
]
...

If you want to supply a file to read, simply change while (<DATA>) to while (<>) and run the script like this:

perl script.pl input.txt

If you need to access the fields by name, you need a hash:

my @headers;

while (<DATA>) {
    chomp;  
    my @fields = split(/\|/, $_);

    if ($. == 1) {
        @headers = @fields;
        next;
    }

    my %data;
    @data{@headers} = @fields;
    dd(\%data); 
}
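
The same header-to-hash idea can also be applied to a file opened by name, which may suit the 1 GB case since only one line is held in memory at a time. The following is a minimal sketch rather than part of the original answer: the helper name parse_pipe_file and its callback argument are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original answer): reads a pipe-delimited
# file line by line and calls $callback with one hash reference per data row.
sub parse_pipe_file {
    my ($path, $callback) = @_;

    # Three-argument open with a lexical filehandle; report the OS error.
    open(my $fh, '<', $path) or die "Cannot open '$path': $!";

    my @headers;
    while (my $line = <$fh>) {
        $line =~ s/\r?\n\z//;    # strip Unix or Windows line endings
        next if $line eq '';     # skip blank lines

        # split drops trailing empty fields, so the final "|" on each
        # line does not produce an extra column.
        my @fields = split /\|/, $line;

        if (!@headers) {         # first non-blank line holds the column names
            @headers = @fields;
            next;
        }

        # Hash slice: pair each header name with the field in that position.
        my %row;
        @row{@headers} = @fields;
        $callback->(\%row);
    }

    close($fh);
    return;
}
```

It could be called as parse_pipe_file('input.txt', sub { my ($row) = @_; print "$row->{CUI}: $row->{HpoStr}\n" }); where input.txt stands in for your own file name. One caveat of plain split: a row whose last real field is empty will leave that key undefined in the hash.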

However, it looks like you are rapidly approaching the point where using Text::CSV would be better than trying to do this by hand.
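
As a rough sketch of what that might look like, assuming the Text::CSV module from CPAN is installed: the helper name read_pipe_records and its callback argument are hypothetical, and the option values are one reasonable choice rather than the only one.

```perl
use strict;
use warnings;
use Text::CSV;    # CPAN module, not part of core Perl

# Hypothetical helper: reads pipe-delimited records from an open filehandle
# and calls $callback with one hash reference per data row.
sub read_pipe_records {
    my ($fh, $callback) = @_;

    my $csv = Text::CSV->new({
        sep_char  => '|',    # fields are separated by pipes
        binary    => 1,      # allow non-ASCII characters in fields
        auto_diag => 1,      # report parse errors automatically
    }) or die "Cannot create Text::CSV object: " . Text::CSV->error_diag();

    # The first record holds the column names. The trailing "|" on each
    # line yields an empty final field, which is dropped from the header.
    my $header = $csv->getline($fh) or return;
    pop @$header while @$header && $header->[-1] eq '';
    $csv->column_names(@$header);

    # getline_hr returns one row at a time as a hash keyed by column name;
    # the extra empty field at the end of each row has no name and is ignored.
    while (my $row = $csv->getline_hr($fh)) {
        $callback->($row);
    }
    return;
}
```

The filehandle would come from something like open(my $fh, '<', 'input.txt'), with input.txt standing in for your own file. Compared with a bare split, the module also copes with quoted fields that contain the separator, should the data ever include them.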