How to output only captured groups with sed?

Time: 2010-05-06 00:04:41

Tags: regex sed

Is there any way to tell sed to output only captured groups? For example, given the input:

This is a sample 123 text and some 987 numbers

and the pattern:

/([\d]+)/

could I get only 123 and 987 as output, formatted by way of back-references?

9 answers:

Answer 0 (score: 279)

The key to getting this to work is to tell sed to exclude what you don't want in the output as well as specifying what you do want.

string='This is a sample 123 text and some 987 numbers'
echo "$string" | sed -rn 's/[^[:digit:]]*([[:digit:]]+)[^[:digit:]]+([[:digit:]]+)[^[:digit:]]*/\1 \2/p'

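This says:

  • don't print each line by default (-n)
  • exclude zero or more non-digits
  • include one or more digits
  • exclude one or more non-digits
  • include one or more digits
  • exclude zero or more non-digits
  • print the substitution (p)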

In general, in sed you capture groups using parentheses and output what you captured using a back-reference:

echo "foobarbaz" | sed 's/^foo\(.*\)baz$/\1/'

will output "bar". If you use -r (-E on OS X) for extended regular expressions, you don't need to escape the parentheses:

echo "foobarbaz" | sed -r 's/^foo(.*)baz$/\1/'

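There can be up to 9 capture groups and their back-references. Back-references are numbered in the order the groups appear, but they can be used in any order and can be repeated: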

echo "foobarbaz" | sed -r 's/^foo(.*)b(.)z$/\2 \1 \2/'

outputs "a bar a".

If you have GNU grep (it may also work in BSD, including OS X):

echo "$string" | grep -Po '\d+'

or variations such as:

echo "$string" | grep -Po '(?<=\D )(\d+)'

The -P option enables Perl Compatible Regular Expressions. See man 3 pcrepattern or man 3 pcresyntax.

Answer 1 (score: 51)

Sed has up to nine remembered patterns, but you need to use escaped parentheses to remember portions of the regular expression.

For examples and more detail, see here.
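
A minimal sketch of what "escaped parentheses" means in sed's default (basic) syntax. The sample string and the variable names are made up for illustration:

```shell
# Basic regular expressions: groups are written \( ... \) and recalled as \1 .. \9.
line='key=value'
# Remember the text on each side of "=" and print the two parts swapped.
swapped=$(printf '%s\n' "$line" | sed 's/^\(.*\)=\(.*\)$/\2=\1/')
printf '%s\n' "$swapped"
```

With that input the command should print value=key.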

Answer 2 (score: 29)

You can use grep:

grep -Eow "[0-9]+" file
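
To see what the -w flag adds, here is a small sketch comparing the command with and without it. The sample string is made up, and it assumes a grep that supports -o (GNU and BSD grep do, but it is not required by POSIX):

```shell
str='This is 75577 a sam33ple 123 text'
# -o prints only the matched text, one match per line.
all_runs=$(printf '%s\n' "$str" | grep -Eo '[0-9]+')
# -w also requires each match to be a whole word, so the 33 inside "sam33ple" is skipped.
whole_words=$(printf '%s\n' "$str" | grep -Eow '[0-9]+')
printf '%s\n' "$all_runs" '---' "$whole_words"
```

So drop -w if digits embedded in words should be reported too.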

Answer 3 (score: 8)

I believe the pattern given in the question was only an example, and the goal is to match any pattern.

If you have a sed with the GNU extension that allows inserting a newline into the pattern space, one suggestion is:

> set string = "This is a sample 123 text and some 987 numbers"
>
> set pattern = "[0-9][0-9]*"
> echo $string | sed "s/$pattern/\n&\n/g" | sed -n "/$pattern/p"
123
987
> set pattern = "[a-z][a-z]*"
> echo $string | sed "s/$pattern/\n&\n/g" | sed -n "/$pattern/p"
his
is
a
sample
text
and
some
numbers

These examples are with tcsh (yes, I know it's the wrong shell) under CYGWIN. (Edit: for bash, remove set, and the spaces around =.)
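
For reference, a sketch of the same idea in bash or any POSIX-style shell, following the edit note. It assumes GNU sed, since \n in the replacement text is a GNU extension; the variable name matches is introduced here for illustration:

```shell
string='This is a sample 123 text and some 987 numbers'
pattern='[0-9][0-9]*'
# First sed: put every match on a line of its own.
# Second sed: print only the lines that match the pattern.
matches=$(printf '%s\n' "$string" | sed "s/$pattern/\n&\n/g" | sed -n "/$pattern/p")
printf '%s\n' "$matches"
```

This should print 123 and 987 on separate lines, as in the tcsh session.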

Answer 4 (score: 7)

run(s) of digits

This answer works with any count of digit groups. Example:

$ echo 'Num123that456are7899900contained0018166intext' |
> sed -En 's/[^0-9]*([0-9]{1,})[^0-9]*/\1 /gp'
123 456 7899900 0018166

Expanded answer.

Is there any way to tell sed to output only captured groups?

Yes. Replace all text with the capture group:

$ echo 'Number 123 inside text' | sed 's/[^0-9]*\([0-9]\{1,\}\)[^0-9]*/\1/'
123

s/[^0-9]*                           # zero or more non-digits
         \([0-9]\{1,\}\)            # followed by one or more digits
                        [^0-9]*     # and followed by zero or more non-digits.
                               /\1/ # gets replaced only by the digits.

Or with extended syntax (fewer backslashes, and it allows the use of +):

$ echo 'Number 123 in text' | sed -E 's/[^0-9]*([0-9]+)[^0-9]*/\1/'
123

To avoid printing the original text when there is no number, use:

$ echo 'Number xxx in text' | sed -En 's/[^0-9]*([0-9]+)[^0-9]*/\1/p'
  • (-n) Do not print the input by default.
  • (/p) print only if a replacement was done.

And to match several numbers (and also print them):

$ echo 'N 123 in 456 text' | sed -En 's/[^0-9]*([0-9]+)[^0-9]*/\1 /gp'
123 456

That works for any count of digit runs:

$ str='Test Num(s) 123 456 7899900 contained as0018166df in text'
$ echo "$str" | sed -En 's/[^0-9]*([0-9]{1,})[^0-9]*/\1 /gp'
123 456 7899900 0018166

Which is very similar to the grep command:

$ str='Test Num(s) 123 456 7899900 contained as0018166df in text'
$ echo "$str" | grep -Po '\d+'
123
456
7899900
0018166

About \d

and pattern: /([\d]+)/

Sed does not recognize the '\d' (shortcut) syntax. The ASCII range [0-9] used above is not exactly equivalent to it. The only alternative is to use a character class: '[[:digit:]]'.

The selected answer uses such "character classes" to build a solution:

$ str='This is a sample 123 text and some 987 numbers'
$ echo "$str" | sed -rn 's/[^[:digit:]]*([[:digit:]]+)[^[:digit:]]+([[:digit:]]+)[^[:digit:]]*/\1 \2/p'

That solution only works for (exactly) two runs of digits.
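
A quick sketch of that limitation. The helper name two_groups and the sample inputs are made up for illustration, and -E is used in place of -r so the same line also runs on BSD sed:

```shell
# The two-group command from the selected answer, wrapped in a
# hypothetical helper so it can be tried on different inputs.
two_groups() {
  printf '%s\n' "$1" |
    sed -En 's/[^[:digit:]]*([[:digit:]]+)[^[:digit:]]+([[:digit:]]+)[^[:digit:]]*/\1 \2/p'
}

one=$(two_groups 'only 1 here')     # one run: the pattern cannot match, nothing is printed
two=$(two_groups 'a 1 b 2 c')       # two runs: both are captured
three=$(two_groups 'a 1 b 2 c 3')   # three runs: the third is left over, with no separator
printf '[%s] [%s] [%s]\n' "$one" "$two" "$three"
```

The last line should print [] [1 2] [1 23], which shows both failure modes: no output for a single run, and an unseparated leftover for a third run.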

Of course, since the command runs inside the shell, we can define a couple of variables to make that answer shorter:

$ str='This is a sample 123 text and some 987 numbers'
$ d=[[:digit:]]     D=[^[:digit:]]
$ echo "$str" | sed -rn "s/$D*($d+)$D+($d+)$D*/\1 \2/p"

But, as has been already explained, using a s/…/…/gp command is better:

$ str='This is 75577 a sam33ple 123 text and some 987 numbers'
$ d=[[:digit:]]     D=[^[:digit:]]
$ echo "$str" | sed -rn "s/$D*($d+)$D*/\1 /gp"
75577 33 123 987

That will cover both repeated runs of digits and writing a short(er) command.

Answer 5 (score: 5)

Try

sed -n -e "/[0-9]/s/^[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\).*$/\1 \2 \3 \4 \5 \6 \7 \8 \9/p"

I got this under cygwin:

$ (echo "asdf"; \
   echo "1234"; \
   echo "asdf1234adsf1234asdf"; \
   echo "1m2m3m4m5m6m7m8m9m0m1m2m3m4m5m6m7m8m9") | \
  sed -n -e "/[0-9]/s/^[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\).*$/\1 \2 \3 \4 \5 \6 \7 \8 \9/p"

1234
1234 1234
1 2 3 4 5 6 7 8 9
$

Answer 6 (score: 2)

This is not what the OP asked for (capture groups), but you can extract the numbers using:

S='This is a sample 123 text and some 987 numbers'
echo "$S" | sed 's/ /\n/g' | sed -r '/([0-9]+)/ !d'

which gives the following:

123
987
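
A possible variant of the same idea, sketched with tr for the splitting step so it does not depend on \n in a sed replacement (a GNU extension). The variable names words and mixed are made up for illustration:

```shell
S='This is a sample 123 text and some 987 numbers'
# tr turns each space into a newline; sed -n '/[0-9]/p' keeps only the
# words that contain at least one digit.
words=$(printf '%s\n' "$S" | tr ' ' '\n' | sed -n '/[0-9]/p')
printf '%s\n' "$words"

# Caveat: whole words are kept, so a mixed word is printed unchanged.
mixed=$(printf '%s\n' 'abc123 def' | tr ' ' '\n' | sed -n '/[0-9]/p')
printf '%s\n' "$mixed"
```

Like the original, this filters words rather than extracting digit runs, so it only gives clean numbers when the numbers are separated from other text by spaces.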

Answer 7 (score: 0)

You can use ripgrep, which also seems to work as a sed replacement for simple substitutions, like this:

rg '(\d+)' -or '$1'

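where ripgrep uses -o (--only-matching) and -r (--replace) to output only the first capture group via $1 (quoted so the shell does not interpret it as a variable). It is printed twice because there are two matches.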

Answer 8 (score: 0)

I want to give a simpler example of "output only captured groups with sed".

I have /home/me/myfile-99 and want to output the serial number of the file: 99

My first try, which didn't work, was:

echo "/home/me/myfile-99" | sed -r 's/myfile-(.*)$/\1/'
# output: /home/me/99

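To make this work, the pattern also has to match the unwanted part, here by capturing it in a group of its own, so that it gets replaced as well: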

echo "/home/me/myfile-99" | sed -r 's/^(.*)myfile-(.*)$/\2/'
# output: 99

*) Note that sed does not have \d
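
As a side note, the unwanted prefix only has to be matched, not captured. A minimal sketch of that variant, using -E and adding -n with the p flag so that nothing is printed when the pattern does not match (the variable names are made up for illustration):

```shell
path='/home/me/myfile-99'
# ^.* swallows everything before "myfile-", so the whole line is replaced
# by the single capture group.
serial=$(printf '%s\n' "$path" | sed -En 's/^.*myfile-(.*)$/\1/p')
printf '%s\n' "$serial"

# A line without "myfile-" produces no output at all.
none=$(printf '%s\n' '/home/me/other' | sed -En 's/^.*myfile-(.*)$/\1/p')
```

For the sample path this should print 99.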