Question

I have a few words:

one
two
two
three

I have a file in which each of these words is repeated n times. For example, with n = 2 the file is:

one
two
two
three
two
three
two
one

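The question is how to recover the original set of words (the number $n is known).

Note that the word "two" must appear twice in the result, so sort -u file.txt or sort file.txt | uniq is not the answer!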
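
To see why plain deduplication is not enough, it can help to look at the per-word counts first. A minimal sketch, assuming the n = 2 sample above is written to a scratch file (the temporary file and the `counts` variable are only for illustration):

```shell
# Recreate the n = 2 sample file from the question in a temporary location.
tmp=$(mktemp)
printf '%s\n' one two two three two three two one > "$tmp"

# Count how often each distinct line occurs, printed as "word count".
counts=$(sort "$tmp" | uniq -c | awk '{print $2, $1}')
printf '%s\n' "$counts"

rm -f "$tmp"
```

Each count divided by n is the number of copies to keep: "two" occurs 4 times, so 4 / 2 = 2 copies belong in the result, which sort -u cannot produce.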

Answer 1

This one-liner gives you the original lines, unsorted:

awk -v n="2" '{a[$0]++}END{for(x in a)for(i=1;i<=a[x]/n;i++)print x}' file

n can be a variable; I hard-coded 2 here. With your current input file, it outputs:

two
two
three
one

The output is unsorted because the order of the "original" file cannot be determined from your input file alone.
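
If first-appearance order is an acceptable stand-in for the unknown original order, the same counting idea can also remember when each line was first seen. This is only a sketch, not part of the one-liner above: the function name `restore_words` and the `order` array are illustrative, and it assumes every count is a multiple of n (any remainder is silently dropped).

```shell
# restore_words N FILE: print each distinct line count/N times,
# in the order the lines first appear in FILE.
restore_words() {
    awk -v n="$1" '
        !($0 in cnt) { order[++k] = $0 }   # remember first-seen order
        { cnt[$0]++ }                      # count every occurrence
        END {
            for (j = 1; j <= k; j++)
                for (i = 1; i <= cnt[order[j]] / n; i++)
                    print order[j]
        }' "$2"
}
```

Called as `restore_words 2 file` on the n = 2 sample from the question, this should print one, two, two, three, one per line. Taking n as a function argument also shows how a shell variable can be passed in through `-v n=...` instead of hard-coding it.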

Testing with other examples:

#still n=2
kent$  cat f  
one
one
one
one
three
three
two
two
two
two
two
two

kent$  awk -v n="2" '{a[$0]++}END{for(x in a)for(i=1;i<=a[x]/n;i++)print x}' f
three
two
two
two
one
one

#now n=4:

kent$  cat f
one
one
one
one
one
one
one
one
three
three
three
three
two
two
two
two
two
two
two
two
two
two
two
two

kent$  awk -v n="4" '{a[$0]++}END{for(x in a)for(i=1;i<=a[x]/n;i++)print x}' f
three
two
two
two
one
one

Answer 2

Another one:

n=2
inp="./in"

while read -r cnt word
do
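        # print the word cnt/n times (GNU seq rejects a -f format that has no % directive)
        for (( i = 0; i < cnt / n; i++ )); do printf '%s\n' "$word"; done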
done < <(sort "$inp" | uniq -c)

This prints:

one
three
two
two

A Perl variant:

perl -nE '$s{$_}++}{print "$_"x($s{$_}/2) for keys %s' < in

Finally, pure bash (4+):

file="./in"
div=2

declare -A w        # w[word] = number of occurrences
order=()            # words in order of first appearance
while read -r word
do
    [[ -z "${w[$word]}" ]] && order+=("$word")
    w[$word]=$(( ${w[$word]:-0} + 1 ))
done < "$file"
for word in "${order[@]}"
do
    cnt=$(( ${w[$word]} / div ))
    for (( i = 0; i < cnt; i++ ))
    do
        echo "$word"
    done
done

This prints the words in the order in which they first appear in the input, e.g.:

one
two
two
three
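
All of the variants above divide each count by n and drop any remainder, so a file that is not an exact n-fold repetition gets a result without any warning. A hedged sketch of a pre-check, assuming that silent truncation is unwanted (the function name `check_multiples` is hypothetical and not part of either answer):

```shell
# check_multiples N FILE: succeed only if every distinct line in FILE
# occurs a multiple of N times; report offending lines on stderr otherwise.
check_multiples() {
    awk -v n="$1" '
        { cnt[$0]++ }
        END {
            for (x in cnt)
                if (cnt[x] % n != 0) {
                    printf("%s: %d occurrences, not a multiple of %d\n", x, cnt[x], n) > "/dev/stderr"
                    bad = 1
                }
            exit bad + 0
        }' "$2"
}
```

It could be run before any of the solutions, for example `check_multiples "$div" "$file" || exit 1`.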

How to remove lines repeated n times from a file in the Linux shell?
