Inserting delimiters into a file using a Linux script

Date: 2019-03-04 14:19:51

Tags: linux bash awk sed

I have an undelimited text file containing about a million lines.

Sample lines

1YBL LOYALTY EXT 1000101172019001
2000100101000011512753184907301010614199100919699034659      VIDYA.SAGAR1@bank.IN                                     VIDYA SAGAR                             CROSS                                   BANDRA                                  WM                                      DELHI                         456471
3000000027

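In every line that starts with the digit "2", "1", or "3" (the record type), I have to insert delimiters according to character counts (i.e. at the end of 0-1, 1-20, 21-25, and so on).

How can I do this with a Linux script?

Desired output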

1|YBL LOYALTY EXT |10001|01172019|001
2|00010010100001151|2753|184907301010614199100919699034659      |VIDYA.SAGAR1@bank.IN                                     |VIDYA SAGAR                             |CROSS                                   |BANDRA                                  |WM                                      |DELHI                         |456471
3|000000027

I tried this command

perl -ne ' if(/^2/) { @x=(1,19,6,4,3,8,20,60,40,40,40,40,30); $i=0;
       while($i<=$#x) { $s=$x[$i]; $_=~s/(.{$s})/printf("%s|",$1);""/e;$i++ } 
       print "$_"}   if(/^1/) { @x=(1,16,5,8); $i=0;
       while($i<=$#x) { $s=$x[$i]; $_=~s/(.{$s})/printf("%s|",$1);""/e;$i++ } 
       print "$_" }  if(/^3/) { @x=(1); $i=0;
       while($i<=$#x) { $s=$x[$i]; $_=~s/(.{$s})/printf("%s|",$1);""/e;$i++ } 
       print "$_" }'  filename

Input lines

1YBL LOYALTY EXT 1000112102018001
2000100101000002631653184911501010111199100919323739251      VIJAYPANDEY1191@GMAIL.COM                                   VIJAY PANDEY                            PART OF GROUND FLOOR & BASEMENT         SHOPPER STOP SV ROAD ANDHERI WEST       LANDMARK-ERSTWHILE CRASSWORD BOOK STORE MUMBAI                        400058
2000100101000019920453184964321010513199000919878857482      MAKSUDMASTER7775@GMAIL.COM                                  MOHAMAD MAQSHUD MASTER                  H COLLECTION NEW SHIVPURI               GALI NO 1                               NEAR MAKHAN SINGH CHOWK                 LUDHIANA                      141008
2000100101000023500853184923441010913197300919375580888      JAYNTITALA@GMAIL.COM                                        JAYANTIBHAI TADA                        44 KHODIYAR NAGAR B S ABHISHEK          SUDAMA CHOWK                            KHODIYARNAGAR MOTA VARACHHA             SURAT                         395006
3000000066

Expected output

1|YBL LOYALTY EXT |10001|12102018|001
2|0001001010000026316|531849|1150|101|01111991|00919323739251      |VIJAYPANDEY1191@GMAIL.COM                                   |VIJAY PANDEY                            |PART OF GROUND FLOOR & BASEMENT         |SHOPPER STOP SV ROAD ANDHERI WEST       |LANDMARK-ERSTWHILE CRASSWORD BOOK STORE |MUMBAI                        |400058
2|0001001010000199204|531849|6432|101|05131990|00919878857482      |MAKSUDMASTER7775@GMAIL.COM                                  |MOHAMAD MAQSHUD MASTER                  |H COLLECTION NEW SHIVPURI               |GALI NO 1                               |NEAR MAKHAN SINGH CHOWK                 |LUDHIANA                      |141008
2|0001001010000235008|531849|2344|101|09131973|00919375580888      |JAYNTITALA@GMAIL.COM                                        |JAYANTIBHAI TADA                        |44 KHODIYAR NAGAR B S ABHISHEK          |SUDAMA CHOWK                            |KHODIYARNAGAR MOTA VARACHHA             |SURAT                         |395006
3|000000066

Output I actually get

1|YBL LOYALTY EXT |10001|12102018|001
2|0001001010000026316|531849|1150|101|01111991|00919323739251      |VIJAYPANDEY1191@GMAIL.COM                                   |VIJAY PANDEY                            |PART OF GROUND FLOOR & BASEMENT         |SHOPPER STOP SV ROAD ANDHERI WEST       |LANDMARK-ERSTWHILE CRASSWORD BOOK STORE |MUMBAI                        |400058
2|0001001010000199204|531849|6432|101|05131990|00919878857482      |MAKSUDMASTER7775@GMAIL.COM                                  |MOHAMAD MAQSHUD MASTER                  |H COLLECTION NEW SHIVPURI               |GALI NO 1                               |NEAR MAKHAN SINGH CHOWK                 |LUDHIANA                      |141008
1|41008|
2|0001001010000235008|531849|2344|101|09131973|00919375580888      |JAYNTITALA@GMAIL.COM                                        |JAYANTIBHAI TADA                        |44 KHODIYAR NAGAR B S ABHISHEK          |SUDAMA CHOWK                            |KHODIYARNAGAR MOTA VARACHHA             |SURAT                         |395006
3|95006
3|000000066

5 answers:

Answer 0: (score: 4)

Using GNU awk's FIELDWIDTHS:

$ awk -v FIELDWIDTHS='1 17 4 *' -v OFS='|' '/^2/{$1=$1; gsub(/\s+/,"&"OFS)} 1' file
1YBL LOYALTY EXT 1000101172019001
2|00010010100001151|2753|184907301010614199100919699034659      |VIDYA.SAGAR1@bank.IN                                     |VIDYA |SAGAR                             |CROSS                                   |BANDRA                                  |WM                                      |DELHI                         |456471
3000000027

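The above use of FIELDWIDTHS says that the input should be split into four fields: one of width 1 character, one of 17 characters, one of 4 characters, and then the rest of the line.

When you assign a value to a field, awk rebuilds the record, replacing the input field separators with the value of OFS, so $1=$1 causes a | to be inserted between each of the fields described by FIELDWIDTHS.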
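
The rebuild-on-assignment behaviour is not specific to FIELDWIDTHS; it can be seen with ordinary whitespace splitting in any POSIX awk. A minimal sketch (the function name rebuild_with_pipe is made up for this illustration and does not appear in the answer):

```shell
# Hypothetical helper: re-joins whitespace-separated fields with "|".
# Assigning $1 to itself leaves the field unchanged, but it forces awk
# to rebuild $0 from its fields, joined by OFS.
rebuild_with_pipe() {
  awk -v OFS='|' '{ $1 = $1 } 1'
}

printf 'a b   c\n' | rebuild_with_pipe   # prints: a|b|c
```

Without the assignment, awk prints the record exactly as it was read and OFS is never used.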

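Having done that, you still need to add field separators to all of the remaining whitespace-separated text, so gsub() appends an OFS after each run of whitespace.

Older versions of gawk do not support * meaning "the rest of the line". If that is your situation, just replace * with a large value such as 99999.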

Answer 1: (score: 1)

You can also try Perl

perl -lpe ' if(/^2/) { @x=(1,17,4); 
           for $i (@x) { s/(.{$i})//; printf("%s|",$1) } }' input_file

With the given input

$ cat rahman.txt
1YBL LOYALTY EXT 1000101172019001
2000100101000011512753184907301010614199100919699034659      VIDYA.SAGAR1@bank.IN                                     VIDYA SAGAR                             CROSS                                   BANDRA                                  WM                                      DELHI                         456471
3000000027

$ perl -lpe ' if(/^2/) { @x=(1,17,4); 
             for $i (@x) { s/(.{$i})//; printf("%s|",$1) } }' rahman.txt
1YBL LOYALTY EXT 1000101172019001
2|00010010100001151|2753|184907301010614199100919699034659      VIDYA.SAGAR1@bank.IN                                     VIDYA SAGAR                             CROSS                                   BANDRA                                  WM                                      DELHI                         456471
3000000027

$

To split off more fields, just add entries to the list, e.g. change @x=(1,17,4) to @x=(1,17,4,10,20)

EDIT1:

To also add delimiters to the fields that can be split on whitespace, use the following

$ perl -lpe ' if(/^2/) { @x=(1,17,4); 
             for $i (@x) { s/(.{$i})//; printf("%s|",$1) } s/\S+\s+\K/|/g }' rahman.txt
1YBL LOYALTY EXT 1000101172019001
2|00010010100001151|2753|184907301010614199100919699034659      |VIDYA.SAGAR1@bank.IN                                     |VIDYA |SAGAR                             |CROSS                                   |BANDRA                                  |WM                                      |DELHI                         |456471
3000000027

$

Explanation of the code
perl -lpe   # -p prints $_ automatically after the code has run for each line
        # so a line that does not start with 2 is still printed, unchanged, after the if statement.

' if(/^2/)  # if - select line that starts with 2. $_ will have the current line
{ 
@x=(1,17,4); # x is an array to hold the widths of fields. - 1, 17, 4 
for $i (@x)  # open for loop to loop through the array x
{ 
s/(.{$i})//;  # no variable is specified, so the substitution acts on the $_ i.e current line
          # first instance is s/(.{1})// => match one character and store it in $1 capturing variable
          # replace the captured part with nothing and update $_
          # e.g if the line is "200010010100001151" .. loop one will capture "2" and $_ becomes "00010010100001151"
          # loop 2 => s/(.{17})// matches 17 character and $1 stores "00010010100001151"
printf("%s|",$1)  # print $1 along with delimiter pipe 
}  # end of for loop
}  # end of if
# the implicit print from -p runs here and prints whatever is left in $_ after all modifications
' input_file

EDIT2

With your input I get the result below. It works correctly... what problem are you seeing?

$ perl -ne ' if(/^2/) { @x=(1,19,6,4,3,8,20,60,40,40,40,40,30); $i=0;
>        while($i<=$#x) { $s=$x[$i]; $_=~s/(.{$s})/printf("%s|",$1);""/e;$i++ }
>        print "$_"}   if(/^1/) { @x=(1,16,5,8); $i=0;
>        while($i<=$#x) { $s=$x[$i]; $_=~s/(.{$s})/printf("%s|",$1);""/e;$i++ }
>        print "$_" }  if(/^3/) { @x=(1); $i=0;
>        while($i<=$#x) { $s=$x[$i]; $_=~s/(.{$s})/printf("%s|",$1);""/e;$i++ }
>        print "$_" }'  rahman.txt
1|YBL LOYALTY EXT |10001|01172019|001
2|0001001010000115127|531849|0730|101|06141991|00919699034659      |VIDYA.SAGAR1@bank.IN                                     VID|YA SAGAR                             CRO|SS                                   BAN|DRA                                  WM |                                     DEL|HI                         456|471
3|000000027

$

EDIT3:

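Found the problem. $_ is modified inside the loop, so at the end of the /^2/ if block $_ holds the value "141008", which then satisfies the next test, if(/^1/). To avoid this, copy $_ into a $line variable at the start and check $line against /^2/, /^3/ and /^1/ in the separate if blocks.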

$ perl -lne '$line=$_; if($line=~/^2/) { @x=(1,19,6,4,3,8,20,60,40,40,40,40,30); $i=0;
       while($i<=$#x) { $s=$x[$i]; $_=~s/(.{$s})/printf("%s|",$1);""/e;$i++ }
        print "$_" }
       if($line=~/^1/) { @x=(1,16,5,8); $i=0;
       while($i<=$#x) { $s=$x[$i]; $_=~s/(.{$s})/printf("%s|",$1);""/e;$i++ }
        print "$_" }
       if($line=~/^3/) { @x=(1); $i=0;
       while($i<=$#x) { $s=$x[$i]; $_=~s/(.{$s})/printf("%s|",$1);""/e;$i++ }
       print "$_" }'  rahman2.txt
1|YBL LOYALTY EXT |10001|12102018|001
2|0001001010000026316|531849|1150|101|01111991|00919323739251      |VIJAYPANDEY1191@GMAIL.COM                                   |VIJAY PANDEY                            |PART OF GROUND FLOOR & BASEMENT         |SHOPPER STOP SV ROAD ANDHERI WEST       |LANDMARK-ERSTWHILE CRASSWORD BOOK STORE |MUMBAI                        |400058
2|0001001010000199204|531849|6432|101|05131990|00919878857482      |MAKSUDMASTER7775@GMAIL.COM                                  |MOHAMAD MAQSHUD MASTER                  |H COLLECTION NEW SHIVPURI               |GALI NO 1                               |NEAR MAKHAN SINGH CHOWK                 |LUDHIANA                      |141008
2|0001001010000235008|531849|2344|101|09131973|00919375580888      |JAYNTITALA@GMAIL.COM                                        |JAYANTIBHAI TADA                        |44 KHODIYAR NAGAR B S ABHISHEK          |SUDAMA CHOWK                            |KHODIYARNAGAR MOTA VARACHHA             |SURAT                         |395006
3|000000066

$

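Answer 2: (score: 0)

The file does have delimiters, you just can't see them: spaces/tabs. So you only need to replace them with a command like sed 's/xxx/|/g' (where xxx stands for the space or TAB character). If you are not sure whether the characters are spaces or tabs, you can open the file in a hex editor (a space is ASCII code 32 (hex: 20), a TAB is 9 (hex: 09)).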
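
As a rough sketch of both steps without a hex editor, the tabs can be counted with tr and wc, and runs of whitespace replaced with sed. The function names count_tabs and ws_to_pipe are hypothetical, chosen only for this example.

```shell
# Hypothetical helpers, named only for this illustration.

# Count the TAB characters on standard input.
count_tabs() {
  tr -cd '\t' | wc -c | tr -d ' '
}

# Replace every run of whitespace with a single "|".
ws_to_pipe() {
  sed 's/[[:space:]][[:space:]]*/|/g'
}

printf 'A\tB  C D\n' | count_tabs   # prints: 1
printf 'A\tB  C D\n' | ws_to_pipe   # prints: A|B|C|D
```

Note that a blanket replacement like this also splits values that contain a single space, such as VIDYA SAGAR, and it does nothing for the leading run of digits, which contains no whitespace at all.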

Answer 3: (score: 0)

You can try this with GNU sed:

sed -E '/^2/{s//&|/;s/(.{19})(....)(\S+\s+)/\1|\2|\3|/}' infile
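
In this command the empty pattern in s//&|/ reuses the most recently used regular expression, here the address /^2/, so the leading 2 becomes 2|. That is also why the next substitution counts 19 characters: the 2| plus the 17 digits that follow. A reduced sketch of just the first step, which should work in any POSIX sed (the function name mark_type2 is hypothetical):

```shell
# Hypothetical helper: on lines starting with "2", put "|" after the type digit.
# The empty regex in s//.../ stands for the last regex used, i.e. ^2.
mark_type2() {
  sed '/^2/s//&|/'
}

printf '2abc\n1abc\n' | mark_type2
# prints:
# 2|abc
# 1abc
```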

Answer 4: (score: 0)

If you don't have FIELDWIDTHS, try the following.

awk -v var="1,17,4" -v OFS="|" '
BEGIN{
  num=split(var,width,",")                 # field widths, in order
}
/^2/{
  val=""; pos=1
  for(i=1;i<=num;i++){
     val=val substr($0,pos,width[i]) OFS   # cut one fixed-width field
     pos+=width[i]                         # advance to the next field
  }
  rest=substr($0,pos)                      # remaining, whitespace-separated text
  gsub(/[[:space:]]+/,"&"OFS,rest)         # add OFS after each run of whitespace
  $0=val rest
}
1
'   Input_file
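
The same substr approach can be extended to all three record types. The following is only a sketch under stated assumptions: the width lists are copied from the Perl attempt in the question and are assumed to describe the real layout, whatever follows the listed widths is emitted as the last field, and the function name delimit_records is made up for the example.

```shell
# Hypothetical helper: delimits all three record types with POSIX awk.
# Width lists are taken from the Perl attempt in the question (an assumption
# about the real layout); text after the listed widths becomes the last field.
delimit_records() {
  awk '
    BEGIN {
      widths["1"] = "1 16 5 8"
      widths["2"] = "1 19 6 4 3 8 20 60 40 40 40 40 30"
      widths["3"] = "1"
    }
    {
      type = substr($0, 1, 1)                  # record type from the untouched line
      if (!(type in widths)) { print; next }   # unknown type: pass through
      n = split(widths[type], w, " ")
      out = ""; pos = 1
      for (i = 1; i <= n; i++) {
        out = out substr($0, pos, w[i]) "|"    # one fixed-width field
        pos += w[i]
      }
      print out substr($0, pos)                # remainder of the line
    }
  ' "$@"
}

printf '1YBL LOYALTY EXT 1000112102018001\n3000000066\n' | delimit_records
# prints:
# 1|YBL LOYALTY EXT |10001|12102018|001
# 3|000000066
```

Because the record type is read from $0 before anything is cut and the output is built in a separate variable, a leftover fragment can never be mistaken for a new record. The function also accepts file names, as in delimit_records Input_file.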