Question

I am working with a file like the following:

site Date time value1 value2
0023 2014-01-01 00:00 32.0 23.7
0023 2014-01-01 01:00 38.0 29.9
0023 2014-01-01 02:00 85.0 26.6
0023 2014-01-01 03:00 34.0 25.3
0023 2014-01-01 04:00 37.0 23.8
0023 2014-01-01 05:00 80.0 20.3
0023 2014-01-01 06:00 90.0 20.0
0023 2014-01-01 07:00 180.0 20.0
0023 2014-01-01 08:00 30.0 20.0

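The first column is the site, the second is the date (all of 2014), the third is the time (hourly, from 00:00 to 23:00 each day), and the fourth and fifth columns are values. I need to compare columns 4 and 5 according to the following conditions:

For each site (column 1), if column 4 exceeds 3 times column 5, this pattern lasts for 3 hours or longer, and the maximum value within that run is greater than 100, then print all rows that meet the criteria and count how many such cases each site has. There are about 150 sites in total, and each site has hourly data for every day. This is the output I want: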

0023 2014-01-01 05:00 80.0 20.3 1
0023 2014-01-01 06:00 90.0 20.0 1
0023 2014-01-01 07:00 180.0 20.0 1
0023 2014-06-30 23:00 200.0 30.3 2
0023 2014-07-01 00:00 303.0 30.3 2
0023 2014-07-01 01:00 134.0 30.3 2
0025 2014-07-01 01:00 136.0 25.3 1
0025 2014-07-01 02:00 116.0 25.3 1
0025 2014-07-01 03:00 106.0 25.3 1

Any help is greatly appreciated!

Answer 1

I tried to guess what you have in mind. First, here is my test input:

#cat in2

This is the test input.
The 2nd and 3thd column can be any
so I didn't fill it now.

Abreviation: 3XCDT == "3 times $4>$5 condition"
           : because == bcs
site Date time value1 value2
0023 NC   NC   4.0  1.0
0023 NC   NC   4.0  1.0
0023 NC   NC   4.0  1.0
Next data ln breaks the 3XCDT
 but won't be prnt bcs mxvl<100 was.
00023 NC   NC   1.1  1.0 
00023 NC   NC   5.1  1.0 
00023 NC   NC   5.1  1.0 
00023 NC   NC   5.1  1.0 
Next data ln is new site
 but won't be prnt bcs mxvl<100 was.
00024 NC   NC   6.2  1.0 
00024 NC   NC   100.2  1.0 
00024 NC   NC   6.2  1.0 
Next data ln breaks the 3XCDT
 and will be prnt.
00024 NC   NC   1.1  1.0
00024 NC   NC   200.1  1.0
00024 NC   NC   200.1  1.0
Next data ln new site
 but won't be prnt bcs filo was smaller than 3
00025 NC   NC   7.1  1.1
00025 NC   NC   107.1  1.1
00025 NC   NC   7.1  1.1
Next data ln breaks the 3XCDT
 and will be prnt bcs mxvl>100 was.
00025 NC   NC   1.1  1.1
00025 NC   NC   8.1  1.1
00025 NC   NC   108.1  1.1
00025 NC   NC   8.1  1.1
No more data ln but the END condition
will prnt and see the counter will be 2

Output:

#./msr in2
00024 NC   NC   6.2  1.0  1
00024 NC   NC   100.2  1.0  1
00024 NC   NC   6.2  1.0  1
00025 NC   NC   7.1  1.1 1
00025 NC   NC   107.1  1.1 1
00025 NC   NC   7.1  1.1 1
00025 NC   NC   8.1  1.1 2
00025 NC   NC   108.1  1.1 2
00025 NC   NC   8.1  1.1 2

My awk program: #cat msr

#!/bin/bash
(($#!=1))&& { echo "Usage $0 inp_file"; exit 1; }

awk '
 BEGIN {ix=0; stn=-1;}                             # ix: index of filo, stn: sitnum non-exist  
 $1~"[^0-9]" || NF!=5 || $4 $5 ~ "[^0-9.]" {next;} # skip no data lines

 $1 != stn  {chck_prnt(); stn=$1; stc=1;}          # new site, set counter to 1
 $4 <= 3*$5 {chck_prnt(); next;}                   # broken the col4>3*col5
            {filo[ix++]=$0; if($4>mxvl)mxvl=$4;}   # put candidate data into filo & refresh mxvl
 END        {chck_prnt();}                         # no more data line

 #func for prnt if need it & clr filo, mxvl
 function chck_prnt(  i){                          # (i is a local var) 
    if(ix>=3 && mxvl>100){                         # prnt condition 
        for(i=0; i<ix; i++)printf("%s %d\n", filo[i],stc); # prnt all filo
        stc++;                                     # increase counter at site
    }
    ix=0; mxvl=0;                                  # clr filo & maxvl
 }
' "$1"

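@Kelly, before I can change the code I need the specification clarified further. Considering your new example, please tell me why the counter (at the end of the line) needs to be incremented.

Is it because the date column changed (2014-01-01 -> 2014-01-02)?
Or do we simply need to count every run of three lines? Judging by your first example, I would bet that you want to count the triplets.

Another question: am I right that the time column (hh:mm) does not matter? (Because 3 lines means at least 3 hours?)

Is there always exactly 1 hour between two consecutive lines (within a given site)?

(From your first comment I thought everything was fine and you had simply forgotten to accept the answer.)

/I think you can edit your original question as many times as you need; that is probably better than posting an updated question./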
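Whether consecutive lines of a site are always one hour apart can also be checked directly on the input. The following is only a rough sketch: the helper name report_gaps is made up here, and it compares the hour field alone, so a gap of exactly one or more whole days would go unnoticed.

```shell
#!/bin/bash
# report_gaps: hypothetical helper, not part of msr.
# Lists rows whose hour does not follow the previous row of the same
# site by exactly one. Dates are NOT compared (assumption of this sketch).
report_gaps() {
    awk '
        NF != 5 || $1 ~ /[^0-9]/ { next }        # skip header and non-data lines
        {
            key = $1 ""                          # force string comparison of site ids
            h = substr($3, 1, 2) + 0             # hour as a number
            if (key == site && h != (prev + 1) % 24)
                print "gap before line " NR ": " $0
            site = key; prev = h
        }' "$1"
}

# Small demonstration on a two-line temporary file.
tmp=$(mktemp)
printf '%s\n' "0023 2014-01-01 23:00 130.0 20" \
              "0023 2014-01-02 16:00 130.0 20" > "$tmp"
report_gaps "$tmp"   # prints: gap before line 2: 0023 2014-01-02 16:00 130.0 20
rm -f "$tmp"
```

If this prints nothing for the real file, the hours are consecutive within each site and the time column can probably be ignored.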

Answer 2

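@Kelly, your comment is not really an answer to my questions (in my opinion). But I will try once more to guess what the real specification is.

I hope the key idea is this: when the time difference between two consecutive lines is greater than 1 hour, we also need to print the candidates (from filo) if the "criteria" hold for them.

I needed to add one more function to compute the time difference. The main part has one extra line that calls it when filo is not empty. I also added a filter line that checks the date and time format in the input.

Please note: my chk1h() function is sufficient for this purpose, but there are other ways to compute the time difference between timestamps: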

1/ the mktime() function in gawk.

2/ the bash shell date command, for example: date "+%s" -d "2014-03-28 11:48:30"
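As a rough illustration of option 1/, the sketch below turns two "date time" stamps into epoch seconds with mktime() and prints their difference in hours. The helper name hours_between is made up for this example. It assumes that awk is gawk (or another awk that provides mktime(), which is not part of POSIX awk), and it forces TZ=UTC so that daylight-saving shifts cannot distort the result.

```shell
#!/bin/bash
# hours_between: hypothetical helper, not part of msr2.
# Assumes an awk that provides mktime() (gawk does; POSIX awk does not).
hours_between() {
    TZ=UTC awk -v a="$1" -v b="$2" '
        function epoch(ts,  s) {       # ts looks like "2014-01-01 23:00"
            s = ts
            gsub(/[-:]/, " ", s)       # -> "2014 01 01 23 00"
            return mktime(s " 00")     # mktime wants "YYYY MM DD HH MM SS"
        }
        BEGIN { print (epoch(b) - epoch(a)) / 3600 }'
}

hours_between "2014-01-01 23:00" "2014-01-02 00:00"   # prints 1
hours_between "2014-01-01 23:00" "2014-01-02 16:00"   # prints 17
```

A difference of exactly 1 corresponds to the case where chk1h() returns 1. The calendar rules (month lengths, leap years) are then handled by mktime() instead of by hand.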

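I checked the program with your input and with other inputs, and it works fine.

If there is still a problem, you need to provide a longer, complete, representative input series. Do not give 2 lists (input and output). The input list is enough; write the corresponding site counter at the end of each line that should be shown. Lines that should not be printed will have no 6th column.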
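One way such an annotated list could be used is sketched below. The helper names plain_input and expected_rows are made up for this example, and the sketch assumes every data line has 5 columns plus an optional counter as the 6th.

```shell
#!/bin/bash
# Hypothetical helpers for an annotated input file, in which rows that
# should be printed carry the expected site counter as a 6th column.

# Rows that are expected in the output: exactly those with a 6th column.
expected_rows() { awk 'NF == 6' "$1"; }

# Plain program input: the same rows with the counter column removed.
plain_input() { awk '{ print $1, $2, $3, $4, $5 }' "$1"; }

# Small demonstration on a two-line temporary file.
tmp=$(mktemp)
printf '%s\n' "0023 2014-01-01 20:00 30.0 20" \
              "0023 2014-01-01 21:00 90.0 20 1" > "$tmp"
plain_input "$tmp"     # both rows, 5 columns each
expected_rows "$tmp"   # only the row that ends in the counter 1
rm -f "$tmp"
```

The output of plain_input can be saved and fed to the script, and the script output can then be compared with the output of expected_rows, for example with diff -b so that differences in spacing are ignored.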

cat msr2

#!/bin/bash
(($#!=1))&& { echo "Usage $0 inp_file"; exit 1; }
awk '
 BEGIN {ix=0; stn=-1;}                             # ix: index of filo, stn: non-exist
 $1~"[^0-9]" || NF!=5 || $4 $5 ~ "[^0-9.]" {next;} # skip no data lines
 $2 " " $3 !~ "^[1-2][0-9][0-9][0-9]-[0-1][0-9]-[0-3][0-9] [0-2][0-9]:00$" {  # dt&tm format filtering
     printf("Unexpected dt,tm format:\nInput ln:%d\nContent: %s\n",NR,$0); exit(1);}

 $1 != stn  {chck_prnt(); stn=$1; stc=1;}          # new site, set counter to 1
 $4 <= 3*$5 {chck_prnt(); next;}                   # broken the col4>3*col5
 ix         {if(!chk1h(ld, lt, $2, $3))chck_prnt();} # filo not empty-->need to chck 1h diff
            {filo[ix++]=$0; ld=$2; lt=$3;          # put into filo & set last dt,tm,mxvl
                            if($4>mxvl)mxvl=$4;}
 END        {chck_prnt();}                         # no more data line

 function chck_prnt(  i){                          # (i is a local var) 
    if(ix>=3 && mxvl>100){                         # prnt condition 
        for(i=0; i<ix; i++)printf("%s %d\n", filo[i],stc); # prnt all filo
        stc++;                                     # increase counter at site
    }
    ix=0; mxvl=0;                                  # clr filo & maxvl
 }

 function chk1h(d1,t1,d2,t2,  h1,h2,dy,dm,dd){     # ret 1 if dt of current ln - last dt in filo == 1h othrwise 0
   h1=substr(t1,1,2); h2=substr(t2,1,2);
   if(h2-h1==1 && d1==d2)return(1);                # most of case in same day 1h
   if(h1!=23||h2!="00")return(0);                  # not 1h
   split(d1,v1,"-"); split(d2,v2,"-");             # v1[1-3]=ymd last in filo, v2[1-3] current
   dy=v2[1]-v1[1];                                 # diff of year
   dm=v2[2]-v1[2];                                 # diff of month
   dd=v2[3]-v1[3];                                 # diff of day
   if(dd==1 && !dy && !dm)return(1);               # 23h-->00h 1h in same month
   if(v2[3]!="01")return(0);                       # not 1h
   if(v1[3]==31)                                   # chng of month, three type of prev month
       if(!dy && dm==1 || dy==1 && dm==-11)return(1); # 1h
       else return(0);                             # not 1h
   if(v1[3]==30)
       if("04 06 09 11" ~ v1[2] && !dy && dm==1)return(1); # 1h
       else return(0);                             # not 1h
   if("28 29" ~ v1[3] && v1[2]=="02" && !dy && dm==1)return(1); # 1h
   return(0);                                      # not 1h
 }
' "$1"

./msr2 in3

0023 2014-01-01 21:00 90.0 20 1
0023 2014-01-01 22:00 80.0 20 1
0023 2014-01-01 23:00 130.0 20 1
0023 2014-01-02 16:00 130.0 20 2
0023 2014-01-02 17:00 200.0 30.3 2
0023 2014-01-02 18:00 303.0 30.3 2

But if you want formatted output, run it like this:

./msr2 in3|awk 'NF==6&&$1!~"[^0-9]"{printf("%s %s %s %6.1f %6.1f %3u\n",$1,$2,$3,$4,$5,$6);}'

0023 2014-01-01 21:00   90.0   20.0   1
0023 2014-01-01 22:00   80.0   20.0   1
0023 2014-01-01 23:00  130.0   20.0   1
0023 2014-01-02 16:00  130.0   20.0   2
0023 2014-01-02 17:00  200.0   30.3   2
0023 2014-01-02 18:00  303.0   30.3   2

Answer 3

@László, there is a problem with the code: I just found that the input is not continuous, as in the following input:

0023 2014-01-01 21:00 90.0 20
0023 2014-01-01 22:00 80.0 20
0023 2014-01-01 23:00 130.0 20
0023 2014-01-02 16:00 130.0 20
0023 2014-01-02 17:00 200.0 30.3
0023 2014-01-02 18:00 303.0 30.3

Here is the output of the code:

0023 2014-01-01 21:00 90.0 20 1
0023 2014-01-01 22:00 80.0 20 1
0023 2014-01-01 23:00 130.0 20 1
0023 2014-01-02 16:00 130.0 20 1
0023 2014-01-02 17:00 200.0 30.3 1
0023 2014-01-02 18:00 303.0 30.3 1

But the desired output should be:

0023 2014-01-01 21:00 90.0 20 1
0023 2014-01-01 22:00 80.0 20 1
0023 2014-01-01 23:00 130.0 20 1
0023 2014-01-02 16:00 130.0 20 2
0023 2014-01-02 17:00 200.0 30.3 2
0023 2014-01-02 18:00 303.0 30.3 2

Sorry for posting this as an answer; it was too long to post as a comment.

Counting values over three consecutive rows in awk

3 answers: