Question

Currently I have this sample data in test.csv:

    0004F2426603,74.214.224.150,16/Apr/2020
    0004F2426603,74.214.224.150,17/Apr/2020
    0004F2426603,74.214.224.150,18/Apr/2020
    00085D20A469,1.2.3.4,16/Apr/2020
    00085D20A469,1.2.3.4,17/Apr/2020
    00085D20A469,1.2.3.4,18/Apr/2020
    00085D20A469,8.8.8.8,16/Apr/2020
    64167F801BF5,1.2.3.4,16/Apr/2020
    64167F801BF5,1.2.3.4,17/Apr/2020
    64167F801BF5,1.2.3.4,18/Apr/2020
    64167F801BF5,8.8.8.8,16/Apr/2020

I have been using datamash to group by column 1 (the MAC address) and analyse the IP addresses.

I can get it to produce output like the following:

    datamash -st, -g1 unique 2 < test.csv
    0004F2426603,74.214.224.150
    00085D20A469,1.2.3.4,8.8.8.8
    64167F801BF5,1.2.3.4,8.8.8.8

    datamash -st, -g1,2 count 2 < test.csv
    0004F2426603,74.214.224.150,3
    00085D20A469,1.2.3.4,3
    00085D20A469,8.8.8.8,1
    64167F801BF5,1.2.3.4,3
    64167F801BF5,8.8.8.8,1

But how can I discard the first line, whose MAC is not repeated because it has only one IP address, and produce output similar to the following?

    00085D20A469,1.2.3.4,3,8.8.8.8,1
    64167F801BF5,1.2.3.4,3,8.8.8.8,1

And if there were 3 IPs, this:

    64167F801BF5,1.2.3.4,3,8.8.8.8,1,9.9.9.9,1

I would like the entries with the lowest count to appear on the left. I suspect awk can do this, but I am really struggling.

Answer 1

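To collapse the values, you can take the output of the second command, change the first "," to another delimiter such as "@" with sed, and then feed the result to datamash again, collapsing on the second field (which now combines field 2, field 3, and so on).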

$ datamash -st@ --output-delimiter=, -g1 collapse 2 \
  < <(datamash -st, -g1,2 count 2 < test.csv | sed 's/,/@/')
    0004F2426603,74.214.224.150,3
    00085D20A469,1.2.3.4,3,8.8.8.8,1
    64167F801BF5,1.2.3.4,3,8.8.8.8,1
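
The sed step works because the expression s/,/@/ has no g flag, so it replaces only the first comma on each line: the MAC ends up separated from the rest by "@", while the IP and its count stay joined by a comma. A quick check on a single line of the intermediate output (the variable names here are only for illustration):

```shell
# One line as produced by: datamash -st, -g1,2 count 2 < test.csv
line='00085D20A469,1.2.3.4,3'

# Without the g flag, sed replaces only the first match on each line.
converted=$(printf '%s\n' "$line" | sed 's/,/@/')

printf '%s\n' "$converted"
```

This prints 00085D20A469@1.2.3.4,3, which the second datamash call then treats as two "@"-separated fields.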

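If I understand correctly, you now want to drop the first entry, which has only three fields. For that you can use awk and print only the lines with more than three fields: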

$ datamash -st@ --output-delimiter=, -g1 collapse 2 \
  < <(datamash -st, -g1,2 count 2 < test.csv | sed 's/,/@/') | awk -F, 'NF>3'
    00085D20A469,1.2.3.4,3,8.8.8.8,1
    64167F801BF5,1.2.3.4,3,8.8.8.8,1
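
The NF>3 filter relies on each IP contributing two comma-separated fields (the address and its count) after the MAC, so a MAC with a single IP gives exactly 3 fields and a MAC with two or more IPs gives at least 5. A small sketch on the collapsed lines shown earlier, which does not need datamash (the variable names are illustrative):

```shell
# Collapsed lines as printed by the first datamash pipeline.
collapsed='0004F2426603,74.214.224.150,3
00085D20A469,1.2.3.4,3,8.8.8.8,1
64167F801BF5,1.2.3.4,3,8.8.8.8,1'

# Prefix each line with its field count to see what NF>3 is testing.
counts=$(printf '%s\n' "$collapsed" | awk -F, '{ print NF ": " $0 }')
printf '%s\n' "$counts"

# Keep only the MACs that have more than one IP,count pair.
kept=$(printf '%s\n' "$collapsed" | awk -F, 'NF > 3')
printf '%s\n' "$kept"
```

The first line reports 3 fields and is dropped; the other two report 5 fields and are kept.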

Answer 2

You can feed the data from the csv through a pipe. I saved it to the file fil1.txt to keep the solution uncluttered.

Input data (fil1.txt):

0004F2426603,74.214.224.150,3
00085D20A469,1.2.3.4,3
00085D20A469,8.8.8.8,1
64167F801BF5,4.3.2.1,3,3.3.3.3,2
64167F801BF5,9.9.9.9,1
0004F2426603,74.214.224.150,4

Awk script (fil1.awk):

// {
  if (l==$1) {
    print($0","r)
  }
  l = $1
  r = $2
}

Invocation:

cat fil1.txt |sed 's/,/ /' |awk -f fil1.awk

Output:

00085D20A469 8.8.8.8,1,1.2.3.4,3
64167F801BF5 9.9.9.9,1,4.3.2.1,3,3.3.3.3,2

Explanation:

Because sed has turned the first comma into a space, $1 is the MAC and $2 is everything after it.
The empty pattern // matches every line, so the block runs for each line.
if (l == $1) checks whether the variable l equals the first field ($1).
On line 1 of the txt file, l has no value yet, so the body of the if is skipped; the first field of that line is then assigned to l and the second field to r.
On line 2, l and $1 differ, so the body of the if is skipped again.
On line 3, l and $1 are the same, so print($0","r) prints the whole of line 3 ($0), a literal comma, and the second field stored from the previous line.

The same steps repeat for the remaining lines of the txt file.

As you asked in the comment, this version works for sorted lines with any number of repeated MAC addresses:

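// {
  if (l == $1) {
    # same MAC as the previous line: append the previous IP,count to s
    s = s","r
  }
  else {
    # new MAC: print the finished group if it had more than one line
    if (s != "") {
      printf("%s %s%s\n", l, r, s)
      s = ""
    }
  }
  l = $1
  r = $2
}
# flush the last group, which no following line would otherwise trigger
END {
  if (s != "") {
    printf("%s %s%s\n", l, r, s)
  }
}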
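
To try the generalized logic without creating any files, here is a self-contained sketch that runs it on an inline sample in which the last MAC is the repeated one. It includes an END block so that this final group is also printed; without such a block a repeated MAC on the last input lines would never be flushed, because printing is only triggered when a different MAC arrives. The sample data and the variable names input and merged are made up for this illustration.

```shell
# Pre-counted input (MAC,IP,count), sorted by MAC; the last MAC is repeated.
input='0004F2426603,74.214.224.150,3
00085D20A469,1.2.3.4,3
00085D20A469,8.8.8.8,1
64167F801BF5,1.2.3.4,3
64167F801BF5,8.8.8.8,1
64167F801BF5,9.9.9.9,1'

merged=$(printf '%s\n' "$input" | sed 's/,/ /' | awk '
{
  if (l == $1) {
    s = s "," r              # same MAC as before: stash the previous IP,count
  } else {
    if (s != "") printf("%s %s%s\n", l, r, s)
    s = ""
  }
  l = $1                     # remember the current MAC
  r = $2                     # remember the current IP,count
}
END {
  if (s != "") printf("%s %s%s\n", l, r, s)   # flush the final group
}')

printf '%s\n' "$merged"
```

On this sample it prints:

    00085D20A469 8.8.8.8,1,1.2.3.4,3
    64167F801BF5 9.9.9.9,1,1.2.3.4,3,8.8.8.8,1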

Answer 3

Using GNU awk for arrays of arrays:

$ cat tst.awk
BEGIN { FS="," }
{ mac_ips[$1][$2]++ }
END {
    for ( mac in mac_ips ) {
        if ( length(mac_ips[mac]) > 1 ) {
            printf "%s", mac
            for ( ip in mac_ips[mac] ) {
                printf ",%s,%d", ip, mac_ips[mac][ip]
            }
            print ""
        }
    }
}

$ awk -f tst.awk file
00085D20A469,1.2.3.4,3,8.8.8.8,1
64167F801BF5,1.2.3.4,3,8.8.8.8,1
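
Note that awk does not define the order in which for (x in array) visits elements, so the order of MACs and IPs in the output above is not guaranteed. If GNU awk is not available, or if the IPs should come out in a fixed order, a rough alternative is to count with a composite key in POSIX awk, sort, and then merge. The sketch below assumes that "lowest count on the left" means ordering each MAC's IPs by ascending count; the variable names input and result are illustrative.

```shell
# Same rows as test.csv in the question.
input='0004F2426603,74.214.224.150,16/Apr/2020
0004F2426603,74.214.224.150,17/Apr/2020
0004F2426603,74.214.224.150,18/Apr/2020
00085D20A469,1.2.3.4,16/Apr/2020
00085D20A469,1.2.3.4,17/Apr/2020
00085D20A469,1.2.3.4,18/Apr/2020
00085D20A469,8.8.8.8,16/Apr/2020
64167F801BF5,1.2.3.4,16/Apr/2020
64167F801BF5,1.2.3.4,17/Apr/2020
64167F801BF5,1.2.3.4,18/Apr/2020
64167F801BF5,8.8.8.8,16/Apr/2020'

# 1. Count each MAC,IP pair.  2. Sort by MAC, then count, then IP.
# 3. Join the lines of each MAC and print only MACs with more than one IP.
result=$(printf '%s\n' "$input" |
  awk -F, '{ n[$1 FS $2]++ } END { for (k in n) print k FS n[k] }' |
  sort -t, -k1,1 -k3,3n -k2,2 |
  awk -F, '
    # First line or a new MAC: emit the previous group and start a new one.
    # The "" forces string comparison even for MACs that look numeric.
    NR == 1 || ($1 "") != mac {
      if (ips > 1) print line
      mac = $1 ""; line = mac; ips = 0
    }
    { line = line FS $2 FS $3; ips++ }
    END { if (ips > 1) print line }')

printf '%s\n' "$result"
```

On the sample data this prints:

    00085D20A469,8.8.8.8,1,1.2.3.4,3
    64167F801BF5,8.8.8.8,1,1.2.3.4,3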

Answer 4

This is a typical SQL problem, so on Linux you can solve it with sqlite3. Give it a try.

$ cat a.sh
#!/bin/sh
sqlite3 << EOF
create table t1(id, ip_addr,dt);
.separator ,
.import $1 t1
select id, group_concat(ip_addr||','||c1) from (
select id, ip_addr, count(*) c1 from t1 where id in (
select id from ( select id, ip_addr, count(*) c from t1 group by id, ip_addr) t group by id having count(id) >1)
group by id, ip_addr )
group by id
;

EOF
$ cat ip.dat
 0004F2426603,74.214.224.150,16/Apr/2020
 0004F2426603,74.214.224.150,17/Apr/2020
 0004F2426603,74.214.224.150,18/Apr/2020
 00085D20A469,1.2.3.4,16/Apr/2020
 00085D20A469,1.2.3.4,17/Apr/2020
 00085D20A469,1.2.3.4,18/Apr/2020
 00085D20A469,8.8.8.8,16/Apr/2020
 64167F801BF5,1.2.3.4,16/Apr/2020
 64167F801BF5,1.2.3.4,17/Apr/2020
 64167F801BF5,1.2.3.4,18/Apr/2020
 64167F801BF5,8.8.8.8,16/Apr/2020
$ a.sh ip.dat  # Execute a.sh by passing the file as parameter
 00085D20A469,1.2.3.4,3,8.8.8.8,1
 64167F801BF5,1.2.3.4,3,8.8.8.8,1
$

Merge multiple lines based on column 1
