AWK: add a zero-padded 4-digit sequential number

Date: 2017-06-09 12:58:25

Tags: awk sed bioinformatics fasta tr

How do I get from the following file, string.ext:

>Lipoprotein releasing system transmembrane protein LolC
MKWLWFAYQNVIRNRRRSLMTILIIAVGTAAILLSNGFALYTYDNLREGSALASGHVIIAHVDHFDKEEEIPMEYGLSDYEDIERHIAADDRVRMAIPRLQFSGLISNGDKSVIFMGTGVDPEGEFDIGGVLTNVLTGNTLSTHSAPDAVPEVMLAKDLAKQLHADIGGLLTLLATTADGALNALDVQVRGIFSTGVPEMDKRMLAVALPTAQELIMTDKVGTLSVYLHEIEQTDAMWAVLAEWYPNFATQPWWEQASFYFKVRALYDIIFGVMGVIILLIVFFTITNTLSMTIVERTRETGTLLALGTLPRQIMRNFALEALLIGLAGALLGMLIAGFTSITLFIAEIQMPPPPGSTEGYPLYIYFSPWLYGITSLLVVTLSIAAAFLTSRKAARKPIVEALAHV
>Phosphoserine phosphatase (EC 3.1.3.3)
MFQEHALTLAIFDLDNTLLAGDSDFLWGVFLVERGIVDGDEFERENERFYRAYQEGDLDIFEFLRFAFRPLRDNRLEDLKRWRQDFLREKIEPAILPMACELVEHHRAAGDTLLIITSTNEFVTAPIAEQLGIPNLIATVPEQLHGCYTGEAAGTPAFQAGKVKRLLDWLEETSTELAGSTFYSDSHNDIPLLEWVDHPVATDPDDRLRGYARDRGWPIISLREEIAP

to output in which each header line starts with string, followed by a sequential 4-digit number (starting with 0001), with | separating the parts, like this:

>string|0001|Lipoprotein_releasing_system_transmembrane_protein_LolC
MKWLWFAYQNVIRNRRRSLMTILIIAVGTAAILLSNGFALYTYDNLREGSALASGHVIIAHVDHFDKEEEIPMEYGLSDYEDIERHIAADDRVRMAIPRLQFSGLISNGDKSVIFMGTGVDPEGEFDIGGVLTNVLTGNTLSTHSAPDAVPEVMLAKDLAKQLHADIGGLLTLLATTADGALNALDVQVRGIFSTGVPEMDKRMLAVALPTAQELIMTDKVGTLSVYLHEIEQTDAMWAVLAEWYPNFATQPWWEQASFYFKVRALYDIIFGVMGVIILLIVFFTITNTLSMTIVERTRETGTLLALGTLPRQIMRNFALEALLIGLAGALLGMLIAGFTSITLFIAEIQMPPPPGSTEGYPLYIYFSPWLYGITSLLVVTLSIAAAFLTSRKAARKPIVEALAHV
>string|0002|Phosphoserine_phosphatase_(EC_3_1_3_3)
MFQEHALTLAIFDLDNTLLAGDSDFLWGVFLVERGIVDGDEFERENERFYRAYQEGDLDIFEFLRFAFRPLRDNRLEDLKRWRQDFLREKIEPAILPMACELVEHHRAAGDTLLIITSTNEFVTAPIAEQLGIPNLIATVPEQLHGCYTGEAAGTPAFQAGKVKRLLDWLEETSTELAGSTFYSDSHNDIPLLEWVDHPVATDPDDRLRGYARDRGWPIISLREEIAP

The commands I have come up with so far are ($faa refers to the file name string.ext):

faa=$1
var=$(basename "$faa" .ext)

awk '!/^>/ { printf "%s", $0; n = "\n" } /^>/ { print n $0; n = "" } END { printf "%s", n }' $faa >$faa.tmp
sed 's/ /_/g' $faa.tmp >$faa.tmp2
awk -v var="$var" '/>/{sub(">","&"var"|");sub(/\.ext/,x)}1' $faa.tmp2 >$faa.tmp3
awk '/>/{sub(/\|/,++i"|")}1' $faa.tmp3 >$faa.tmp4
tr '\.' '_' <$faa.tmp4 | tr '\:' '_' | sed 's/__/_/g' >$faa.tmp5

Edit: I also want to replace each of the following characters with a single underscore: / . :

4 answers:

Answer 0 (score: 2)

I would use perl here:

perl -pe '
    next unless /^>/;     # only transform the "header" lines
    s/[\h.]/_/g;          # change dots and horizontal whitespace
    substr($_,1,0) = sprintf("string|%04d|", ++$n)  # insert the counter
' file

Answer 1 (score: 1)

In awk:

$ awk '/^>/{n=sprintf("%04d",++i);sub(/^>/,">string|" n "|")}1' file
>string|0001|Lipoprotein releasing system transmembrane protein LolC
MKWLWFAYQNVIRNRRRSLMTILIIAVGTAAILLSNGFALYTYDNLREGSALASGHVIIAHVDHFDKEEEIPMEYGLSDYEDIERHIAADDRVRMAIPRLQFSGLISNGDKSVIFMGTGVDPEGEFDIGGVLTNVLTGNTLSTHSAPDAVPEVMLAKDLAKQLHADIGGLLTLLATTADGALNALDVQVRGIFSTGVPEMDKRMLAVALPTAQELIMTDKVGTLSVYLHEIEQTDAMWAVLAEWYPNFATQPWWEQASFYFKVRALYDIIFGVMGVIILLIVFFTITNTLSMTIVERTRETGTLLALGTLPRQIMRNFALEALLIGLAGALLGMLIAGFTSITLFIAEIQMPPPPGSTEGYPLYIYFSPWLYGITSLLVVTLSIAAAFLTSRKAARKPIVEALAHV
>string|0002|Phosphoserine phosphatase (EC 3.1.3.3)
MFQEHALTLAIFDLDNTLLAGDSDFLWGVFLVERGIVDGDEFERENERFYRAYQEGDLDIFEFLRFAFRPLRDNRLEDLKRWRQDFLREKIEPAILPMACELVEHHRAAGDTLLIITSTNEFVTAPIAEQLGIPNLIATVPEQLHGCYTGEAAGTPAFQAGKVKRLLDWLEETSTELAGSTFYSDSHNDIPLLEWVDHPVATDPDDRLRGYARDRGWPIISLREEIAP

Explanation:

$ awk '
/^>/ {                          # if string starts with >
    n=sprintf("%04d",++i)       # iterate i from 1 and zeropad
    sub(/^>/,">string|" n "|")  # replace the > with stuff
}1' file                        # implicit output

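Do not put an & in the replacement string (see comments): in awk's sub(), an unescaped & in the replacement stands for the matched text.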
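To illustrate the caveat, here is a small sketch. The prefix `a&b` is a made-up value, not something from the question. The first command passes it straight into the replacement of sub(), where the & expands to the matched `>`. The second builds the header by plain string concatenation, which never interprets &.

```shell
# Hypothetical prefix containing an ampersand (illustration only).
prefix='a&b'

# Unescaped: & in the replacement of sub() expands to the matched text (">").
naive=$(printf '>header\n' | awk -v p="$prefix" '{ sub(/^>/, ">" p "|") } 1')

# Concatenation instead of sub(): the & is copied literally.
safe=$(printf '>header\n' | awk -v p="$prefix" '/^>/ { $0 = ">" p "|" substr($0, 2) } 1')

printf '%s\n%s\n' "$naive" "$safe"
# prints:
# >a>b|header
# >a&b|header
```

If the prefix can come from a file name or user input, the concatenation form is probably the safer choice.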

Answer 2 (score: 1)

awk -F'[ .]' 'BEGIN{OFS="_"} /^>/{sub(/^>/,""); $1=$1; $0=sprintf(">string|%04d|%s",++a,$0)} 1' filename

Answer 3 (score: 1)

$ awk '
    FNR==1 {base=FILENAME; sub(/\.[^.]+$/,"",base) }
    sub(/^>/,"") { gsub(/[\/ .:]+/,"_"); $0=sprintf(">%s|%04d|%s",base,++c,$0) }
1' string.ext
>string|0001|Lipoprotein_releasing_system_transmembrane_protein_LolC
MKWLWFAYQNVIRNRRRSLMTILIIAVGTAAILLSNGFALYTYDNLREGSALASGHVIIAHVDHFDKEEEIPMEYGLSDYEDIERHIAADDRVRMAIPRLQFSGLISNGDKSVIFMGTGVDPEGEFDIGGVLTNVLTGNTLSTHSAPDAVPEVMLAKDLAKQLHADIGGLLTLLATTADGALNALDVQVRGIFSTGVPEMDKRMLAVALPTAQELIMTDKVGTLSVYLHEIEQTDAMWAVLAEWYPNFATQPWWEQASFYFKVRALYDIIFGVMGVIILLIVFFTITNTLSMTIVERTRETGTLLALGTLPRQIMRNFALEALLIGLAGALLGMLIAGFTSITLFIAEIQMPPPPGSTEGYPLYIYFSPWLYGITSLLVVTLSIAAAFLTSRKAARKPIVEALAHV
>string|0002|Phosphoserine_phosphatase_(EC_3_1_3_3)
MFQEHALTLAIFDLDNTLLAGDSDFLWGVFLVERGIVDGDEFERENERFYRAYQEGDLDIFEFLRFAFRPLRDNRLEDLKRWRQDFLREKIEPAILPMACELVEHHRAAGDTLLIITSTNEFVTAPIAEQLGIPNLIATVPEQLHGCYTGEAAGTPAFQAGKVKRLLDWLEETSTELAGSTFYSDSHNDIPLLEWVDHPVATDPDDRLRGYARDRGWPIISLREEIAP

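I am assuming from your posted example and code that you really want to convert every contiguous sequence of any combination of spaces, periods, forward slashes and/or colons into a single underscore.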
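As a sketch of how this could replace the question's chain of temporary files, the same logic can be wrapped in a shell function that takes the file as its first argument. The function name `renumber_fasta` is hypothetical. Unlike the command above, it derives the prefix with `basename` in the shell, on the assumption that the file may be given with a directory path.

```shell
# renumber_fasta is a hypothetical helper name, not from the original answers.
renumber_fasta() {
    faa=$1
    base=$(basename "$faa")   # drop any directory part
    base=${base%.*}           # drop the last extension, whatever it is
    awk -v base="$base" '
        /^>/ {
            desc = substr($0, 2)            # header text without the leading ">"
            gsub(/[\/ .:]+/, "_", desc)     # each run of / space . : becomes one "_"
            $0 = sprintf(">%s|%04d|%s", base, ++c, desc)
        }
        1                                   # print every line, changed or not
    ' "$faa"
}
```

It would be called as `renumber_fasta path/to/string.ext > renamed.faa`, where the output file name is only an example. Sequence lines pass through unchanged because only lines starting with `>` are rewritten.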
