Regular expressions in bash - editing lines

Date: 2017-05-02 15:53:27

Tags: regex bash sed

I have several files containing lines written in a specific format, for example:

>m.144 g.144  ORF g.144 m.144 type:internal len:123 (+) Pf1004_1/1_1.000_369:1-372(+)

I want to use sed with a regular expression to strip some characters so that the line ends up in this format:

>Pf1004_1/1_1.000_369

But it doesn't work :/. I used the following script:

 #!/bin/bash

 for file in *.fasta # Set of fasta files in the script directory
 do      
     sed -i "s/.+?\(\+\) />/g" $file
     sed -i "s/:.+//g" $file
 done

What is wrong? Here is an excerpt from one of my files:

>m.187 g.187  ORF g.187 m.187 type:internal len:115 (+) Ph1000_1/1_1.000_345:1-348(+)
LIILLTSVSVVVLLVENHLSPSHSVLDLSSEPPTGNATYHCWEVAETVIVIKECSPCSVF
EQKTNPACKETGYSQKVLCMLKDGTESKLPRSCPKITWVEEKQFWLFEVLMALLG
>m.188 g.188  ORF g.188 m.188 type:internal len:100 (+) Ph1002_1/1_1.000_302:1-303(+)
KTDTPRRQRSMSPVANVSCSPSVSSPNLLMKLLDSSDESESDTPHPNRVKVLKPDDMGIK
DFFKNTAAKQGLEERVDVSIQDFDHIINEASDRLPCTKKI
>m.189 g.189  ORF g.189 m.189 type:internal len:125 (+) Ph1007_1/1_1.000_376:1-378(+)
QSATPLHRAAEANRKQAVAELLHAGCDVNRQNEVSITPIFYPAQRGDDVTTRLLIQNGAD
PNVTDAEDWIPLHFASQNGHVATVDALTSARSMVNAAGSHGETPLLIAAEQGHDKVVKHL
LANGA
>m.190 g.190  ORF g.190 m.190 type:internal len:129 (+) Ph1010_1/1_1.000_387:1-390(+)
HVADTGTSSSPQLSPTHAERRPLKVEFIGMKDMASGDTSGRDKRPGVENDLKRINRKATN
CARYQQPRMSLLGKPLNYRAHKRDVRYRRAQAKVYNFLERPKDWRAISYHLLVYVELRDS
TLTVFHPSM
>m.191 g.191  ORF g.191 m.191 type:internal len:185 (+) Ph1014_1/1_1.000_555:1-558(+)
CLADLVTASDNMENDLSDNSNLDQSGTMYAFAAKRKSYGQVKDADHVDSGGDNPERQERP
MSPMCLKIRKSDNGLSPEARRPVTSPSPISPAAPVSDHVDADRDVIERAKELQKAELDKV
VASSFPVPQSGFRSVHSVDISPLHRISVPWPHPVHQPIFPHPHPVALQMSLSNSFRAQNP
DACIR
>m.192 g.192  ORF g.192 m.192 type:internal len:183 (+) Ph1025_1/1_1.000_551:1-552(+)
TQKDWRELLWTYCCCCSKRHVHAEDVDKSAVTSLSEVKAEKQLKSPAKIKTIRNHADVKS
ALSTSCLRRKKNFEEQTICKNELNVKHSDDDNRDMDKQDTKTAITLTPKCFVHFPKSVNH
LQLDQTPLYWGAVSKEAASLCSLPVRNGCTVAAVKDVQDPHLLEIGQVYQNDEEWTPKEL
TAD
>m.19 g.19  ORF g.19 m.19 type:internal len:348 (+) Ph103_1/1_1.000_1044:1-1047(+)
GGHLPSFNDRPGNTMAGSKDDKTNLSPVKLELISPCGPVLSNHVGCIVNNVLYIHGGINK
YLSKEPLNAFYKLNLNAPSPIWQEILDRNSPHLSHHACVVLDNRYLVLIGGWNGKQRTAD
MWAYDVQEAVWISLRTSGFPEGAGLSSHAALPLADGSILVIGREGSARIQRRYGNSWLIR
GSVMRGHFVYNEHQMSLASRSGHTMHVIGSDLTIIGGRSDRQVEQHGGYRTAMTSSAVAF
FSGLNQFVKRTPPMAKPPCGRKQHVSASGSGLILIHGGETFDGKSRHPVGDFYIISLRPT
VTWYHLGTSGVGRAGHVCCTAADKIIIHGGMGPRNAIYGDTYEISLSK
>m.193 g.193  ORF g.193 m.193 type:internal len:130 (+) Ph1046_1/1_1.000_390:1-393(+)
LFRLASESYHSSKMVQRLTLRRRLSYNTSSNRRRIVKTPGGRLVYHYTKKPGAIPICKSG
GCRTKLHGIRPSRPMQRRRMSKRLKTVNRTYGGVQCHTCVREKIIRAFLIEEQKIVVKVL
KAQAAQAKKA

4 answers:

Answer 0 (score: 0)

Replace the two sed expressions with this single one:

sed -E 's/^>.+\(\+\) ([^:]+):.+$/>\1/' "$file"
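
A quick way to check this expression is to run it on the sample header from the question. The sketch below does only that; the variable names `header` and `result` are illustrative and not part of the original answer. With `-E`, `\(\+\)` matches a literal `(+)` and `([^:]+)` captures the identifier up to the first colon.

```shell
# Sample header line copied from the question.
header='>m.144 g.144  ORF g.144 m.144 type:internal len:123 (+) Pf1004_1/1_1.000_369:1-372(+)'

# Apply the single extended-regex substitution and keep only the captured identifier.
result=$(printf '%s\n' "$header" | sed -E 's/^>.+\(\+\) ([^:]+):.+$/>\1/')

printf '%s\n' "$result"
```

Because the pattern is anchored with `^>`, sequence lines, which do not start with `>`, pass through unchanged.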

Answer 1 (score: 0)

Why not do it like this:

sed -e 's/^.*[ ]([+])[ ]/>/g' -e 's/[:].*$//' "$file"

The first expression:

's/^.*[ ]([+])[ ]/>/g'

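removes everything from the start of the line up to and including the last occurrence of a space, (+), and a space (`.*` is greedy), and replaces it with >.

The second expression: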

's/[:].*$//'

simply cuts off everything from the first : to the end of the line.

Example

$ echo ">m.144 g.144  ORF g.144 m.144 type:internal len:123 (+) Pf1004_1/1_1.000_369:1-372(+)" | \
  sed -e 's/^.*[ ]([+])[ ]/>/g' -e 's/[:].*$//'
>Pf1004_1/1_1.000_369
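
Applied to the loop from the question, the two expressions might be used as in the sketch below. To stay self-contained it creates a throwaway directory and a hypothetical `demo.fasta`, both assumptions made only for this example. It also restricts the substitutions to header lines with a `/^>/` address and writes through a temporary file instead of relying on `sed -i`, whose syntax differs between sed implementations.

```shell
#!/bin/bash
# Self-contained demo: build a small FASTA file in a temporary directory.
workdir=$(mktemp -d)
cat > "$workdir/demo.fasta" <<'EOF'
>m.187 g.187  ORF g.187 m.187 type:internal len:115 (+) Ph1000_1/1_1.000_345:1-348(+)
LIILLTSVSVVVLLVENHLSPSHSVLDLSSEPPTGNATYHCWEVAETVIVIKECSPCSVF
>m.188 g.188  ORF g.188 m.188 type:internal len:100 (+) Ph1002_1/1_1.000_302:1-303(+)
KTDTPRRQRSMSPVANVSCSPSVSSPNLLMKLLDSSDESESDTPHPNRVKVLKPDDMGIK
EOF

for file in "$workdir"/*.fasta; do
    # Only touch header lines (those starting with ">"):
    #   1st expression: drop everything up to and including " (+) ", leaving ">".
    #   2nd expression: drop everything from the first ":" onwards.
    sed -e '/^>/ s/^.*[ ]([+])[ ]/>/' -e '/^>/ s/[:].*$//' "$file" > "$file.tmp" &&
        mv "$file.tmp" "$file"
done

cat "$workdir/demo.fasta"
```

In a real run the `for` line would loop over `*.fasta` in the working directory, as in the question.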

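Answer 2 (score: 0)

I think the problem may be that sed's regular expressions do not behave the way you expect. See here for an explanation, especially of what "+" means: https://www.gnu.org/software/sed/manual/html_node/Regular-Expressions.html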
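
To make the point concrete: without `-E`, sed uses basic regular expressions, in which an unescaped `+` is a literal plus sign and `\(` ... `\)` delimit a group rather than matching literal parentheses. sed also has no non-greedy quantifier, so `.+?` from the question does not mean "as few characters as possible". The small sketch below (the test string and variable names are made up for illustration) shows how the same pattern `a+` behaves in each mode.

```shell
line='a+b aaab'

# Basic regex (default): "a+" is the letter a followed by a literal "+".
bre=$(printf '%s\n' "$line" | sed 's/a+/X/')

# Extended regex (-E): "a+" is one or more "a" characters.
ere=$(printf '%s\n' "$line" | sed -E 's/a+/X/')

printf 'BRE: %s\n' "$bre"   # BRE: Xb aaab
printf 'ERE: %s\n' "$ere"   # ERE: X+b aaab
```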

Answer 3 (score: 0)

Depending on how consistently your data is structured, this simple awk script may be enough:

awk -F '[ :]' '/^>/ { print ">" $12; next } 1' infile

Output:

>Ph1000_1/1_1.000_345
LIILLTSVSVVVLLVENHLSPSHSVLDLSSEPPTGNATYHCWEVAETVIVIKECSPCSVF
EQKTNPACKETGYSQKVLCMLKDGTESKLPRSCPKITWVEEKQFWLFEVLMALLG
>Ph1002_1/1_1.000_302
KTDTPRRQRSMSPVANVSCSPSVSSPNLLMKLLDSSDESESDTPHPNRVKVLKPDDMGIK
DFFKNTAAKQGLEERVDVSIQDFDHIINEASDRLPCTKKI
>Ph1007_1/1_1.000_376
QSATPLHRAAEANRKQAVAELLHAGCDVNRQNEVSITPIFYPAQRGDDVTTRLLIQNGAD
PNVTDAEDWIPLHFASQNGHVATVDALTSARSMVNAAGSHGETPLLIAAEQGHDKVVKHL
LANGA
>Ph1010_1/1_1.000_387
HVADTGTSSSPQLSPTHAERRPLKVEFIGMKDMASGDTSGRDKRPGVENDLKRINRKATN
CARYQQPRMSLLGKPLNYRAHKRDVRYRRAQAKVYNFLERPKDWRAISYHLLVYVELRDS
TLTVFHPSM
>Ph1014_1/1_1.000_555
CLADLVTASDNMENDLSDNSNLDQSGTMYAFAAKRKSYGQVKDADHVDSGGDNPERQERP
MSPMCLKIRKSDNGLSPEARRPVTSPSPISPAAPVSDHVDADRDVIERAKELQKAELDKV
VASSFPVPQSGFRSVHSVDISPLHRISVPWPHPVHQPIFPHPHPVALQMSLSNSFRAQNP
DACIR
>Ph1025_1/1_1.000_551
TQKDWRELLWTYCCCCSKRHVHAEDVDKSAVTSLSEVKAEKQLKSPAKIKTIRNHADVKS
ALSTSCLRRKKNFEEQTICKNELNVKHSDDDNRDMDKQDTKTAITLTPKCFVHFPKSVNH
LQLDQTPLYWGAVSKEAASLCSLPVRNGCTVAAVKDVQDPHLLEIGQVYQNDEEWTPKEL
TAD
>Ph103_1/1_1.000_1044
GGHLPSFNDRPGNTMAGSKDDKTNLSPVKLELISPCGPVLSNHVGCIVNNVLYIHGGINK
YLSKEPLNAFYKLNLNAPSPIWQEILDRNSPHLSHHACVVLDNRYLVLIGGWNGKQRTAD
MWAYDVQEAVWISLRTSGFPEGAGLSSHAALPLADGSILVIGREGSARIQRRYGNSWLIR
GSVMRGHFVYNEHQMSLASRSGHTMHVIGSDLTIIGGRSDRQVEQHGGYRTAMTSSAVAF
FSGLNQFVKRTPPMAKPPCGRKQHVSASGSGLILIHGGETFDGKSRHPVGDFYIISLRPT
VTWYHLGTSGVGRAGHVCCTAADKIIIHGGMGPRNAIYGDTYEISLSK
>Ph1046_1/1_1.000_390
LFRLASESYHSSKMVQRLTLRRRLSYNTSSNRRRIVKTPGGRLVYHYTKKPGAIPICKSG
GCRTKLHGIRPSRPMQRRRMSKRLKTVNRTYGGVQCHTCVREKIIRAFLIEEQKIVVKVL
KAQAAQAKKA
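
The `$12` in the command above depends on every header having exactly the same number of space- and colon-separated fields. If that cannot be guaranteed, a variant that takes the last whitespace-separated field and strips everything from its first colon may be more forgiving. This is only a sketch, not part of the original answer; it assumes the identifier is always the last field of the header and contains no colon, and the variable names are illustrative.

```shell
# Two sample lines from the question: one header and one sequence line.
input='>m.19 g.19  ORF g.19 m.19 type:internal len:348 (+) Ph103_1/1_1.000_1044:1-1047(+)
GGHLPSFNDRPGNTMAGSKDDKTNLSPVKLELISPCGPVLSNHVGCIVNNVLYIHGGINK'

# On header lines: copy the last field, drop the ":start-end(strand)" suffix,
# and print it after ">". All other lines are printed unchanged by the final "1".
cleaned=$(printf '%s\n' "$input" |
    awk '/^>/ { id = $NF; sub(/:.*/, "", id); print ">" id; next } 1')

printf '%s\n' "$cleaned"
```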