Question

I'd like to create a script with any combination of bash, sed, awk, or perl that deletes the newline character of a line if the next line is less than a certain length. Let's say we want to delete the newline character if the next line is less than 5 characters. If we have this source text file:

hi hi hi hi hi
bye
fun fun fun fun fun
batman
shirt shirt shirt
pants pants pants
belt
paper paper paper

Here's the desired output:

hi hi hi hi hibye
fun fun fun fun fun
batman
shirt shirt shirt
pants pants pantsbelt
paper paper paper

Here's a script that identifies all the lines that are less than 5 characters:

cat source.txt | awk 'length($0) < 5 { print NR }'

It returns this.

2
7

Here's a script that gets rid of the newlines (it's the line numbers from the previous script minus one):

perl -pe 'chomp if $.==1||$.==6' source.txt

How do I combine these two scripts? Or is there a better way to solve this?

Update

There were multiple correct answers (some didn't work on my Mac, but I think they'd work on other machines). Here's how long the correct answers took on my machine with a 769,811 line CSV file (40,000 lines had the newline character removed).

Ed Morton's awk solution: 23.7 seconds
wolfrevokcats perl with slurp: 4.5 seconds
John1024's solution didn't work on my Mac (but I think it works on other OSs)
ikegami's perl without slurp: Killed the task after 7 minutes

Answer 1

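Just as in life, in software it is much easier to act on what has already happened than on what is going to happen. Don't think of any problem as needing to do X if the NEXT line contains Y; think of it as needing to do Z if the CURRENT line contains Y. The solution is then always simple and clear, e.g.: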

$ cat tst.awk
NR>1{ printf "%s%s", prev, (length() < 5 ? "" : ORS) }
{ prev = $0 }
END{ print prev }

$ awk -f tst.awk file
hi hi hi hi hibye
fun fun fun fun fun
batman
shirt shirt shirt
pants pants pantsbelt
paper paper paper

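Above, we print a newline after the previous line only if the CURRENT line is 5 or more characters long. It is clear and simple, and it works with any awk in any shell on any UNIX machine.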
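The same idea can be wrapped in a shell function that takes the length threshold as an argument. This is a minimal sketch and not part of the original answer: the function name join_short and the awk variable max are hypothetical names chosen for illustration, and the NR guard in END is an added assumption so that empty input produces no output.

```shell
# join_short MAX: read stdin and join each line shorter than MAX characters
# onto the line before it. (Hypothetical helper name, for illustration only.)
join_short() {
    awk -v max="$1" '
        # From the second line on, emit the buffered previous line; end it
        # with a newline only when the current line is long enough.
        NR > 1 { printf "%s%s", prev, (length($0) < max ? "" : ORS) }
               { prev = $0 }
        # Flush the last buffered line, but only if there was any input.
        END    { if (NR) print prev }
    '
}

# Demo on part of the sample data from the question.
printf '%s\n' 'hi hi hi hi hi' 'bye' 'fun fun fun fun fun' 'batman' | join_short 5
```

Passing a different threshold, for example join_short 7, would also join the 6-character line batman onto the line before it.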

Answer 2

perl -p0777e "s{\r?\n(?=.{0,5}$)}{}mg" test.txt

Output

hi hi hi hi hibye
fun fun fun fun fun
batman
shirt shirt shirt
pants pants pantsbelt
paper paper paper

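[I spent 2 minutes writing the one-liner and about an hour explaining it.]

Here is the explanation:

Switches

-p - reads the input file line by line, runs the code given by -e for each line, and prints the variable $_ (as modified by the code)

-0[octal number] - the input line separator; if we specify 0777, the whole file is treated as a single line and read at once

-l - strips the input line separator from the end of each line and sets the output line separator to \n. (I removed it, because it isn't actually needed here.)

Now the regular expression:

s{\r?\n(?=.{0,5}$)}{}mg

s{pattern}{replacement} - searches for pattern in the variable $_ and replaces it with replacement

The pattern part:

\r?\n - matches every newline. For Unix, \n is enough; \r? is an optional match of CR, which may be required for old perl versions under Windows. Actually, I think \r? can be removed as well.

(?=pattern) - a positive lookahead for pattern; it is a zero-width match, i.e. it does not consume characters.

.{0,5}$ - matches zero to five characters followed by the end of a line. (To join only lines strictly shorter than 5 characters, the quantifier would be {0,4}.)

Modifiers of the s{}{} operator:

m - multiline matching; makes $ match before every \n in the text, not only at the very end.

g - global matching; replaces every occurrence in the text.

Finally, how it works:

Perl slurps the whole file (-0777 together with -p), then searches for every occurrence of \r?\n that is followed by no more than 5 non-newline characters and the end of a line: (?=.{0,5}$).
Each match is replaced with the empty string {}.

I think I was clear enough.

More information is available from: perldoc perlrun, perldoc perlre, perldoc perlop.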

Answer 3

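If you want to avoid slurping but still need to look ahead, the general solution is to buffer as many lines as you want to look ahead. In this case, that's one.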

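perl -ne'
   chomp;
   if (defined $buf) {
      # Emit the buffered line; end it only if the current line is long enough.
      print $buf;
      print "\n" if length >= 5;
   }

   $buf = $_;   # buffer exactly one line (assign, do not append)

   END { print "$buf\n" if defined $buf; }
'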

In this particular case, you can use the following:

perl -pe'chomp; print "\n" if length >= 5 && $. > 1; END { print "\n" if $. }'

Both solutions handle input whose last line has no trailing newline.

For usage, see Specifying file to process to Perl one-liner.

Answer 4

sed also works well for simple substitutions such as this one:

$ sed -E ':a; N; s/\n(.{,4})$/\1/; ba' source
hi hi hi hi hibye
fun fun fun fun fun
batman
shirt shirt shirt
pants pants pantsbelt
paper paper paper

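How it works:

:a

This defines a label a.
N

This reads the next line and appends it, with a newline, to the current contents of the pattern space.
s/\n(.{,4})$/\1/

If the newline is followed by at most 4 characters before the end of the pattern space, the newline is removed.
ba

This branches back to label a unconditionally, so the loop repeats for each remaining line of input.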

BSD / Mac systems

The above was tested with GNU sed. For BSD / macOS sed, try:

sed -E -e :a -e N -e 's/\n(.{0,4})$/\1/' -e ba source
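For comparison, here is a hedged sketch of a more defensive variant of the same loop; it is not the answer author's command. It writes the interval as {0,4}, since an interval with no lower bound is not guaranteed to work outside GNU sed, and it guards both N and the branch with $! so that the script never runs N on the last line, where implementations differ on whether the pattern space is still printed. The function name join_short_sed is a hypothetical name, and the sketch assumes a sed that accepts -E.

```shell
# join_short_sed: read stdin and join each line of at most 4 characters onto
# the line before it. (Hypothetical helper name, for illustration only.)
join_short_sed() {
    # :a      loop label
    # $!N     unless on the last line, append the next line to the pattern space
    # s/.../  drop the newline when at most 4 characters follow it
    # $!ba    unless on the last line, loop; otherwise fall through and print
    sed -E -e ':a' -e '$!N' -e 's/\n(.{0,4})$/\1/' -e '$!ba'
}

printf '%s\n' 'pants pants pants' 'belt' 'paper paper paper' | join_short_sed
```

Like the loop above, this accumulates the whole input in the pattern space before printing, so memory use grows with the size of the file.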

Answer 5

You can try this sed (works OK on OpenBSD):

sed -e '$b' -e 'N;/\n...../{P;D' -e '};y/\n/ /;s/ \([^ ]*$\)/\1/' infile

Delete newline character in text file if next line is less than a certain length
