How can I remove n-fold repeated lines from a file in the Linux shell?

Time: 2014-09-17 16:33:07

Tags: linux shell uniq

I have a few words:

one
two
two
three

I have a file in which each of these words is repeated n times. For example, with n = 2 the given file is:

one
two
two
three
two
three
two
one

The question is how to recover the original set of words (I know the number $n).

Note that the word "two" should appear twice, so neither sort -u file.txt nor sort file.txt | uniq is the answer!
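To illustrate the point about deduplication, here is a small sketch that runs sort -u on the sample input. The temporary file and the variable names (tmp, dedup) are placeholders introduced for this illustration only.

```shell
# Sample input from the question (n = 2), written to a temporary file.
tmp=$(mktemp)
printf '%s\n' one two two three two three two one > "$tmp"

# sort -u keeps each distinct line exactly once, so the second "two"
# of the original word list cannot be recovered this way.
dedup=$(sort -u "$tmp")
printf '%s\n' "$dedup"
# prints:
# one
# three
# two

rm -f "$tmp"
```

The result has three lines, while the original word list has four.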

2 answers:

Answer 0 (score: 4)

This one-liner gives you the original lines, unsorted:

awk -v n="2" '{a[$0]++}END{for(x in a)for(i=1;i<=a[x]/n;i++)print x}' file

n can be a variable; I hard-coded 2 here. With your current input file, it outputs:

two
two
three
one

The output is unsorted because the input file alone gives no way to know the order of the "original" file.
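As a minimal sketch of passing n from a shell variable instead of hard-coding it, the one-liner can be wrapped in a function. The function name restore_words and the temporary file are hypothetical names used only for this example, and the trailing sort is an addition here to make the output order reproducible, since awk's for (x in a) order is unspecified.

```shell
# restore_words N FILE
# Prints every distinct line of FILE (count / N) times, sorted.
restore_words() {
    awk -v n="$1" '{a[$0]++} END{for (x in a) for (i = 1; i <= a[x]/n; i++) print x}' "$2" | sort
}

# Sample input from the question, with n held in a shell variable.
tmp=$(mktemp)
printf '%s\n' one two two three two three two one > "$tmp"
n=2

result=$(restore_words "$n" "$tmp")
printf '%s\n' "$result"
# prints:
# one
# three
# two
# two

rm -f "$tmp"
```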

Testing with other examples:

#still n=2
kent$  cat f  
one
one
one
one
three
three
two
two
two
two
two
two

kent$  awk -v n="2" '{a[$0]++}END{for(x in a)for(i=1;i<=a[x]/n;i++)print x}' f
three
two
two
two
one
one

#now n=4:

kent$  cat f
one
one
one
one
one
one
one
one
three
three
three
three
two
two
two
two
two
two
two
two
two
two
two
two

kent$  awk -v n="4" '{a[$0]++}END{for(x in a)for(i=1;i<=a[x]/n;i++)print x}' f
three
two
two
two
one
one

Answer 1 (score: 1)

Another one:

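n=2
inp="./in"

# uniq -c emits "count word" pairs; print each word count/n times
while read -r cnt word
do
        for (( i = 0; i < cnt / n; i++ ))
        do
                printf '%s\n' "$word"
        done
done < <(sort "$inp" | uniq -c)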

This prints:

one
three
two
two
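The loop above relies on the "count word" pairs that sort | uniq -c produces. As a rough illustration, the sketch below shows that intermediate step for the question's sample input. The temporary file is a placeholder, and the extra awk stage is added here only to normalize the leading padding that uniq -c puts in front of each count.

```shell
# Sample input from the question (n = 2).
tmp=$(mktemp)
printf '%s\n' one two two three two three two one > "$tmp"

# Count the occurrences of each distinct line; awk strips the padding.
counts=$(sort "$tmp" | uniq -c | awk '{print $1, $2}')
printf '%s\n' "$counts"
# prints:
# 2 one
# 2 three
# 4 two

rm -f "$tmp"
```

Dividing each count by n = 2 gives one "one", one "three", and two "two", which matches the output shown above.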

A perl variant (with n hard-coded as 2):

perl -nE '$s{$_}++}{print "$_"x($s{$_}/2) for keys %s' < in

Finally, bash (4+):

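file="./in"
div=2

declare -A w    # w[word] = number of occurrences of word
order=()        # distinct words, in order of first appearance
while read -r word
do
    [[ -z "${w[$word]}" ]] && order+=("$word")
    w[$word]=$(( ${w[$word]:-0} + 1 ))
done < "$file"
for word in "${order[@]}"
do
    cnt=$(( ${w[$word]} / div ))
    for (( i = 0; i < cnt; i++ ))
    do
        printf '%s\n' "$word"
    done
done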

This prints the words in the order in which they first appear in the input, e.g.:

one
two
two
three
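The same first-appearance ordering can also be sketched in awk, for shells without associative arrays. This is only an illustration, not part of either answer: the function name restore_in_order and the awk array names cnt and order are made up for this example, and the result matches the original list only when the first appearances happen to follow the original order.

```shell
# restore_in_order N FILE
# Prints every distinct line of FILE (count / N) times, keeping the
# order in which the lines first appear in FILE.
restore_in_order() {
    awk -v n="$1" '
        !($0 in cnt) { order[++k] = $0 }   # remember first appearance
        { cnt[$0]++ }                      # count every occurrence
        END {
            for (j = 1; j <= k; j++)
                for (i = 1; i <= cnt[order[j]] / n; i++)
                    print order[j]
        }' "$2"
}

# Sample input from the question (n = 2).
tmp=$(mktemp)
printf '%s\n' one two two three two three two one > "$tmp"

restored=$(restore_in_order 2 "$tmp")
printf '%s\n' "$restored"
# prints:
# one
# two
# two
# three

rm -f "$tmp"
```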