Efficient way to add/append to a huge file

Time: 2018-04-02 10:57:36

Tags: bash shell perl sed substr

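Below is a shell script that processes a huge file. It reads the fixed-width file line by line, extracts substrings, and appends them to another file as a delimited record. It works fine, but it is far too slow.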

array=() # Create array
while IFS='' read -r line || [[ -n "$line" ]] # Read a line
do
    coOrdinates="$(echo -e "${line}" | grep POSITION | cut -d'(' -f2 | cut -d')' -f1 | cut -d':' -f1,2)"
    if [[ -z "${coOrdinates// }" ]];
    then
        echo "Not adding"
    else
        array+=("$coOrdinates")
    fi
done < "$1_CTRL.txt"

while read -r line;
do
    result='"'
    for e in "${array[@]}"
    do
        SUBSTRING1=`echo "$e" | sed 's/.*://'`
        SUBSTRING=`echo "$e" | sed 's/:.*//'`
        result1=`perl -e "print substr('$line', $SUBSTRING,$SUBSTRING1)"`
        result1="$(echo -e "${result1}" | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//')"
        result=$result$result1'"'',''"'
    done
    echo $result >> $1_1.txt
done < "$1.txt"

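Earlier I used the cut command and then changed the script as shown above, but the run time did not improve. Please suggest what changes could reduce the processing time. Thanks in advance.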

Update

Sample content of the input file:

XLS01G702012        000034444132412342134

Control file:

OPTIONS (DIRECT=TRUE, ERRORS=1000, rows=500000) UNRECOVERABLE
  load data
   CHARACTERSET 'UTF8'
   TRUNCATE
   into table icm_rls_clientrel2_hg
   trailing nullcols
   (
   APP_ID POSITION(1:3) "TRIM(:APP_ID)",
   RELATIONSHIP_NO POSITION(4:21) "TRIM(:RELATIONSHIP_NO)"
  )

Output file:

"LS0","1G702012 0000"

4 answers:

Answer 0 (score: 4)

With perl:

#!/usr/bin/env perl
use strict;
use warnings;
use autodie;
use feature 'say';    # needed for say()

# read the control file
my $ctrl;
{
    local $/;          # slurp mode: read the whole file at once
    open my $fh, "<", shift @ARGV;
    $ctrl = <$fh>;
    close $fh;
}
# flat list of (offset, length) pairs, one pair per POSITION(m:n)
my @positions = ( $ctrl =~ /\((\d+):(\d+)\)/g );

# read the data file
open my $fh, "<", shift @ARGV;
while (<$fh>) {
    chomp;             # keep the trailing newline out of the last field
    my @words;
    for (my $i = 0; $i < scalar(@positions); $i += 2) {
        push @words, substr($_, $positions[$i], $positions[$i+1]);
    }
    say join ",", map {qq("$_")} @words;
}
close $fh;

perl parse.pl x_CTRL.txt x.txt
"LS0","1G702012        00003"

This differs from the result you asked for:

  • In the control file's POSITION(m:n) syntax, is n a length or an index?
  • In the data file, are those spaces or tabs?
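
The two readings of POSITION(m:n) can be compared directly on the sample line. The following is only a minimal bash sketch: the helper names `field_as_length` and `field_as_index` are hypothetical, and the 1-based inclusive convention used for the index reading is an assumption, not something the question confirms.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, for illustration only.

# Reading 1: m is a 0-based offset and n is a length (what perl's substr does).
field_as_length() {
    local line=$1 m=$2 n=$3
    printf '%s\n' "${line:m:n}"
}

# Reading 2 (assumption): m and n are 1-based, inclusive start and end columns.
field_as_index() {
    local line=$1 m=$2 n=$3
    printf '%s\n' "${line:m-1:n-m+1}"
}

line='XLS01G702012        000034444132412342134'
field_as_length "$line" 1 3   # prints LS0
field_as_index  "$line" 1 3   # prints XLS
```

Only the first reading reproduces the "LS0" shown in the question's expected output, which is why the script above treats n as a length.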

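Answer 1 (score: 2)

Updated answer

Here is a version that parses the control file with awk, saves the character positions, and then uses them while parsing the input file: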

awk '
/APP_ID/ {
     sub(/\).*/,"")   # Strip closing parenthesis and all that follows
     sub(/^.*\(/,"")  # Strip everything up to opening parenthesis
     split($0,a,":")  # Extract the two character positions separated by colon into array "a"
     next
   }
/RELATIONSHIP/ {      
     sub(/\).*/,"")      # Strip closing parenthesis and all that follows
     sub(/^.*\(/,"")     # Strip everything up to opening parenthesis
     split($0,b,"[():]") # Extract character positions into array "b"
     next
   }

FNR==NR{next}

{ f1=substr($0,a[1]+1,a[2]); f2=substr($0,b[1]+1,b[2]); printf("\"%s\",\"%s\"\n",f1,f2)}
' ControlFile InputFile

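Original answer

This is not a complete, rigorous answer, but it should give you an idea of how to do the extraction with awk once you have the POSITION parameters from the control file: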

awk -v a=2 -v b=3 -v c=5 -v d=21 '{f1=substr($0,a,b); f2=substr($0,c,d); printf("\"%s\",\"%s\"\n",f1,f2)}' InputFile

Sample output

"LS0","1G702012        00003"

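Try running it on your large input file to get an idea of the performance, then adjust the output. Reading the control file is not time-critical, so don't bother optimizing it.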
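
The same idea can be generalized to any number of POSITION(m:n) entries instead of two hard-coded field names. This is a sketch under assumptions rather than a tested drop-in: the wrapper name `extract_fields` is hypothetical, m is treated as a 0-based offset and n as a length, and leading and trailing blanks are trimmed as in the question's script.

```shell
#!/usr/bin/env bash
# extract_fields CONTROL_FILE DATA_FILE   (hypothetical helper name)
extract_fields() {
    awk '
    # First file: collect an offset/length pair from every POSITION(m:n) line.
    FNR == NR {
        if (match($0, /POSITION\([0-9]+:[0-9]+\)/)) {
            # skip "POSITION(" (9 chars) and drop the closing ")"
            split(substr($0, RSTART + 9, RLENGTH - 10), p, ":")
            n++
            off[n] = p[1]; len[n] = p[2]
        }
        next
    }
    # Second file: cut each field, trim it, and print one quoted CSV line.
    {
        out = ""
        for (i = 1; i <= n; i++) {
            f = substr($0, off[i] + 1, len[i])   # awk substr is 1-based
            gsub(/^[ \t]+|[ \t]+$/, "", f)
            out = out (i > 1 ? "," : "") "\"" f "\""
        }
        print out
    }' "$1" "$2"
}
```

It could be called as `extract_fields x_CTRL.txt x.txt > x_1.txt`, so the whole data file is handled by a single awk process and the output file is opened only once.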

Answer 2 (score: 2)

I suggest using pure bash and avoiding subshells:

if [[ $line =~ POSITION ]] ; then      # grep POSITION 
    coOrdinates="${line#*(}"           # cut -d'(' -f2
    coOrdinates="${coOrdinates%%)*}"   # cut -d')' -f1 (%% cuts at the first ')')
    # cut -d':' -f1,2 is a no-op here: the value is already "m:n"
    if   [[ -z "${coOrdinates// }" ]]; then
        echo "Not adding"
    else
        array+=("$coOrdinates")
    fi
fi

More efficient, as suggested by gniourf_gniourf:

if [[ $line =~ POSITION\(([[:digit:]]+):([[:digit:]]+)\) ]]; then
    array+=( "${BASH_REMATCH[1]}:${BASH_REMATCH[2]}" )
fi

Similarly:

SUBSTRING1=${e#*:} # $( echo "$e" | sed 's/.*://' )
SUBSTRING=${e%:*}  # $( echo "$e" | sed 's/:.*//' )

# to confirm, I don't know perl substr 
result1=${line:$SUBSTRING:$SUBSTRING1} # $( perl -e "print substr('$line', $SUBSTRING,$SUBSTRING1)" )


#result1= # "$(echo -e "${result1}" | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//')"
# trim, if necessary?
result1="${result1%${result1##*[^[:space:]]}}"    # right
result1="${result1#${result1%%[^[:space:]]*}}"    # left

gniourf_gniourf also suggested moving grep out of the loop:

while read ...; do
 ...
done < <(grep POSITION ...) 

This improves efficiency: while/read loops are very slow in Bash, so pre-filtering as much as possible will speed things up considerably.
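
Putting the pieces together, a pure-bash version with no subshell per field might look like the sketch below. The function name `convert_fixed` is hypothetical, and it assumes m is a 0-based offset and n a length, as in the question's perl substr call. It is still a while/read loop, so it should be much faster than the original but will likely remain slower than awk or perl on a huge file.

```shell
#!/usr/bin/env bash
# convert_fixed CONTROL_FILE DATA_FILE   (hypothetical helper name)
convert_fixed() {
    local ctrl=$1 data=$2 line field out i
    local re='POSITION\(([0-9]+):([0-9]+)\)'
    local -a offs=() lens=()

    # Collect the offset and length from each POSITION(m:n) line.
    while IFS= read -r line || [[ -n $line ]]; do
        if [[ $line =~ $re ]]; then
            offs+=( "${BASH_REMATCH[1]}" )
            lens+=( "${BASH_REMATCH[2]}" )
        fi
    done < "$ctrl"

    # Cut, trim, and quote each field using only parameter expansion.
    while IFS= read -r line || [[ -n $line ]]; do
        out=
        for i in "${!offs[@]}"; do
            field=${line:${offs[i]}:${lens[i]}}
            field=${field#"${field%%[![:space:]]*}"}   # trim left
            field=${field%"${field##*[![:space:]]}"}   # trim right
            out+="${out:+,}\"$field\""
        done
        printf '%s\n' "$out"
    done < "$data"
}
```

Called as `convert_fixed "$1_CTRL.txt" "$1.txt" > "$1_1.txt"`, it redirects once for the whole run instead of reopening the output file with `>>` for every line.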

Answer 3 (score: 0)

To avoid the (slow) while loop, you can use cut and paste:

#!/bin/bash    
inFile=${1:-checkHugeFile}.in
ctrlFile=${1:-checkHugeFile}_CTRL.txt
outFile=${1:-checkHugeFile}.txt
cat /dev/null > $outFile

typeset -a array # Create array
while read -r line # Read a line
do
    coOrdinates="${line#*(}"
    coOrdinates="${coOrdinates%%)*}"
    [[ -z "${coOrdinates// }" ]] && { echo "Not adding"; continue; }
    array+=("$coOrdinates")
done < <(grep POSITION "$ctrlFile"  )
echo coOrdinates: "${array[@]}"

for e in "${array[@]}"
do
    nr=$((nr+1))
    start=${e%:*}
    len=${e#*:}
    from=$(( start + 1 ))
    to=$(( start + len ))    # cut -c ranges are inclusive, so this yields len characters
    cut -c"$from-$to" "$inFile" > "${outFile}.$nr"
done
paste "$outFile".* | sed -e 's/^/"/' -e 's/\t/","/g' -e 's/$/"/' > "${outFile}"
rm "$outFile".[0-9]