Putting information from a text file into a table

Time: 2013-12-27 14:06:43

Tags: bash

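I am trying to extract some information from a text log file (written by the Bowtie2 sequence aligner) and present it in a table. The text file looks like this: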

Time loading reference: 00:00:00
Time loading forward index: 00:00:00
Time loading mirror index: 00:00:00
Multiseed full-index search: 00:21:50
3746112 reads; of these:
  3746112 (100.00%) were paired; of these:
    2937631 (78.42%) aligned concordantly 0 times
    581094 (15.51%) aligned concordantly exactly 1 time
    227387 (6.07%) aligned concordantly >1 times
    ----
    2937631 pairs aligned 0 times concordantly or discordantly; of these:
      5875262 mates make up the pairs; of these:
        5382980 (91.62%) aligned 0 times
        400492 (6.82%) aligned exactly 1 time
        91790 (1.56%) aligned >1 times
28.15% overall alignment rate
Time searching: 00:21:50
Overall time: 00:21:50

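I defined some variables with the following commands. Some of them hold two space-separated strings; for example, for the file above RDS_P equals 3746112 (100.00%).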

RDS_T=`awk NR==5 GW2.log | awk '{print $1}'` #total number of reads
RDS_P=`awk NR==6 GW2.log | awk '{print $1, $2}'` #Paired reads and percentage (2 fields)
RDS_C1=`awk NR==8 GW2.log | awk '{print $1, $2}'` #concordantly once and percentage (2 fields)
RDS_C2=`awk NR==9 GW2.log | awk '{print $1, $2}'` #concordantly more than once and percentage (2 fields)
ALGN_T=`awk NR==16 GW2.log | awk '{print $1}'`

I used this to build the table, but it does not work well:

printf "File\t Reads\t Paired reads\t Conc reads1\t Conc Reads2\t Total align\n\n\n GW1\t "%s$RDS_T\t" "%s" "$RDS_P"\t "%s" "$RDS_C1"\t "%s" "$RDS_C2"\t "%s$ALGN_T"\n"

On their own, though, these work:

printf "%s$RDS_T"

printf "%s" "$RDS_P"

One thing I noticed is that \t is not interpreted.

Any ideas on how to do this? I am very new to bash, so please be gentle :)

Many thanks, Guy

3 Answers:

Answer 0: (score: 0)

There is no need to call awk several times. You can do everything in a single awk script. Try the following command:

awk -f t.awk GW2.log 

where t.awk is:

NR==5 {
    RDS_T=$1
}
NR==6 {
    RDS_P=$1" "$2
}
NR==8 {
    RDS_C1=$1" "$2
}
NR==9 {
    RDS_C2=$1" "$2
}
NR==16 {
    ALGN_T=$1
}

END {
    fmt="%-12s %-12s %-18s %-18s %-18s %-18s\n" 
    printf fmt, "File", "Reads", "Paired reads", "Conc reads1", "Conc Reads2", "Total align"
    printf fmt,  "GW2.log", RDS_T, RDS_P, RDS_C1, RDS_C2, ALGN_T
}

with output:

File         Reads        Paired reads       Conc reads1        Conc Reads2        Total align       
GW2.log      3746112      3746112 (100.00%)  581094 (15.51%)    227387 (6.07%)     28.15%  
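
Line numbers such as NR==16 depend on the exact layout of the log, which may shift when Bowtie2 is run with different options. As a sketch only, assuming the summary lines keep the wording shown in the log above, the same table could be built by matching text instead of line numbers. The temporary file (an excerpt of the sample log) and the lowercase variable names are illustrative, not part of the answer above.

```shell
# Write an excerpt of the sample log to a temporary file (illustration only).
log=$(mktemp)
cat > "$log" <<'EOF'
Multiseed full-index search: 00:21:50
3746112 reads; of these:
  3746112 (100.00%) were paired; of these:
    2937631 (78.42%) aligned concordantly 0 times
    581094 (15.51%) aligned concordantly exactly 1 time
    227387 (6.07%) aligned concordantly >1 times
28.15% overall alignment rate
EOF

# Select each line by its wording rather than by NR.
summary=$(awk '
    / reads; of these:/                   { rds_t  = $1 }
    /were paired; of these:/              { rds_p  = $1 " " $2 }
    /aligned concordantly exactly 1 time/ { rds_c1 = $1 " " $2 }
    /aligned concordantly >1 times/       { rds_c2 = $1 " " $2 }
    /overall alignment rate/              { algn_t = $1 }
    END {
        fmt = "%-12s %-12s %-18s %-18s %-18s %-18s\n"
        printf fmt, "File", "Reads", "Paired reads", "Conc reads1", "Conc Reads2", "Total align"
        printf fmt, "GW2.log", rds_t, rds_p, rds_c1, rds_c2, algn_t
    }' "$log")
printf '%s\n' "$summary"
rm -f "$log"
```

The trade-off is that this version breaks if the wording of the summary changes, whereas the NR version breaks if the number of lines changes.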

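Answer 1: (score: 0)

You are not using printf correctly.

The usage of the printf command is: printf format [arguments]. The format string comes first, and each %s in it is replaced by the next argument. (See the man page.)

For example: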

printf "My name is %s. I live in %s.\n" "John" "London"

So, change your command to:

printf "File\tReads\tPaired reads\tConc reads1\tConc Reads2\tTotal align\nGW1\t%s\t%s\t%s\t%s\t%s\n" "$RDS_T" "$RDS_P" "$RDS_C1" "$RDS_C2" "$ALGN_T"
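
To make the rule concrete, here is a minimal sketch. The two values are copied from the sample log in the question and hard-coded purely for illustration; in the real script they would come from the awk calls. It shows that escapes such as \t are interpreted in the format string, that a quoted argument containing a space still fills a single %s, and that printf reuses the format when there are more arguments than specifiers.

```shell
# Sample values copied from the log in the question (hard-coded for illustration).
RDS_T="3746112"
RDS_P="3746112 (100.00%)"

# The format string is the first argument; \t is interpreted there.
# Quoting "$RDS_P" keeps "3746112 (100.00%)" as one argument despite the space,
# and the % inside an argument is printed literally.
row=$(printf "GW1\t%s\t%s\n" "$RDS_T" "$RDS_P")
printf '%s\n' "$row"

# With more arguments than %s specifiers, the format is applied repeatedly,
# which is a compact way to print a header row.
header=$(printf "%s\t" "File" "Reads" "Paired reads")
printf '%s\n' "$header"
```

Tab-separated output does not line up when values differ a lot in length; fixed-width specifiers such as %-18s are an alternative when the table is meant to be read in a terminal.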

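Answer 2: (score: 0)

Here is my final script. I used dogbane's option because I wanted to introduce a loop and keep everything in a single file (i.e. no extra .awk file), so I did not use Håkon Hægland's approach (though I would love to learn how to do it within the current script). The script runs the Bowtie2 command for RNA-seq, creates the relevant directories, and places the .sam and .log file of each sequencing library in the directory created for it. Finally, it generates a small .txt table containing some of the information from the .log files (e.g. the total number of reads). I plan to extend the script so that it also runs Tophat2, Cufflinks, etc., and perhaps produces some graphs from their output (using Cuffdiff and cummeRbund).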

#!/bin/bash
#run from /rdata/ngseq/Playground/guy/bowtie2
#to execute run: /localhome/gw57/Notes/pipeline3.sh
#Generates a "Summary.txt" file from the GW files

INPUT=/rdata/ngseq/original_data/rna/illumina/2013-05-05_Guy #
DATE=$(date +%d%m%y) #needs to add hours when run more than once per day
ROOT=140213_root_No_8
BT2INDEX=Bowtie2Index_Arabidopsis/genome


for i in {1..4}
do
    # mkdir -p creates any missing parent directories and does not fail
    # if the directory already exists
    mkdir -p "./$ROOT/${DATE}_run/GW$i"

    # bowtie2 prints its alignment summary on stderr; 2>&1 sends it through tee into the log
    bowtie2 --local -q -5 30 -3 30 --phred33 -N 1 -L 10 --no-discordant -t --no-unal -p 12 -x "$BT2INDEX" \
        -1 "$INPUT/GW$i/fastq/R1.fastq" -2 "$INPUT/GW$i/fastq/R2.fastq" \
        -S "./$ROOT/${DATE}_run/GW$i/GW$i.sam" 2>&1 | tee -a "$ROOT/${DATE}_run/GW$i/GW$i.log"
done


printf "%-18s%-18s%-18s%-18s%-18s%-18s\n\n"\
 "File" "Reads" "Paired_reads" "Conc Reads_once" "Conc_Reads>1" "Total_reads" > $ROOT/$DATE"_run"/Summary.txt

for i in {1..4}
do
    RDS_T=`awk 'NR==5 {print $1}' $ROOT/$DATE"_run"/GW$i/GW$i.log`
    RDS_P=`awk 'NR==6 {print $1, $2}' $ROOT/$DATE"_run"/GW$i/GW$i.log`
    RDS_C1=`awk 'NR==8 {print $1, $2}' $ROOT/$DATE"_run"/GW$i/GW$i.log`
    RDS_C2=`awk 'NR==9 {print $1, $2}' $ROOT/$DATE"_run"/GW$i/GW$i.log`
    ALGN_T=`awk 'NR==18 {print $1}' $ROOT/$DATE"_run"/GW$i/GW$i.log`

    printf "%-18s%-18s%-18s%-18s%-18s%-18s\n" "GW$i" "$RDS_T" "$RDS_P" "$RDS_C1" "$RDS_C2" "$ALGN_T"
done >> $ROOT/$DATE"_run"/Summary.txt
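
One possible way to read each log in a single awk pass without a separate .awk file is to put the awk program inline in a shell function. The following is a sketch under stated assumptions, not the script above: summary_row is a hypothetical helper name, the line number of the overall alignment rate is passed as an argument because it differs between the script above (18) and the log shown in the question (16), and the demo reuses the question's log for both rows.

```shell
# Hypothetical helper: prints one table row for a Bowtie2 log.
# $1 = row label, $2 = log file, $3 = line number of the overall alignment rate.
summary_row() {
    awk -v label="$1" -v rate_line="$3" '
        NR == 5         { rds_t  = $1 }
        NR == 6         { rds_p  = $1 " " $2 }
        NR == 8         { rds_c1 = $1 " " $2 }
        NR == 9         { rds_c2 = $1 " " $2 }
        NR == rate_line { algn_t = $1 }
        END {
            printf "%-18s%-18s%-18s%-18s%-18s%-18s\n",
                   label, rds_t, rds_p, rds_c1, rds_c2, algn_t
        }' "$2"
}

# Demo input: the log from the question, standing in for every GW$i.log.
demo_dir=$(mktemp -d)
cat > "$demo_dir/sample.log" <<'EOF'
Time loading reference: 00:00:00
Time loading forward index: 00:00:00
Time loading mirror index: 00:00:00
Multiseed full-index search: 00:21:50
3746112 reads; of these:
  3746112 (100.00%) were paired; of these:
    2937631 (78.42%) aligned concordantly 0 times
    581094 (15.51%) aligned concordantly exactly 1 time
    227387 (6.07%) aligned concordantly >1 times
    ----
    2937631 pairs aligned 0 times concordantly or discordantly; of these:
      5875262 mates make up the pairs; of these:
        5382980 (91.62%) aligned 0 times
        400492 (6.82%) aligned exactly 1 time
        91790 (1.56%) aligned >1 times
28.15% overall alignment rate
Time searching: 00:21:50
Overall time: 00:21:50
EOF

# Header plus one row per library; in the demo the rate is on line 16.
table=$(
    printf "%-18s%-18s%-18s%-18s%-18s%-18s\n" \
        "File" "Reads" "Paired_reads" "Conc_Reads_once" "Conc_Reads>1" "Align_rate"
    for i in 1 2; do
        summary_row "GW$i" "$demo_dir/sample.log" 16
    done
)
printf '%s\n' "$table"
rm -rf "$demo_dir"
```

In the pipeline itself, the loop body would presumably become a call such as summary_row "GW$i" "$ROOT/${DATE}_run/GW$i/GW$i.log" 18, with the loop output redirected to Summary.txt as before.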