Question

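I know that matching XML with regex in sed or awk is not the best approach, but I have no other choice in the environment where I ran into this problem.

I couldn't find any existing answer that solves my problem.

Given the following XML content: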

<testTag name="findThisName">
    <content>...</content>
</testTag>
<testTag name="doNOTfindThisName">
   <content>...</content>
</testTag>

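The regex should match the entire tag whose name is findThisName, including all of its content. The following regex works, but only if the content is on a single line: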

<testTag name(?:(?!<\/testTag>).)*findThisName.*?<\/testTag>

Does anyone know how to solve this with sed or awk? Thanks!

Answer 1

awk doesn't have all of Perl's regex features, but this may work for you:

$ awk '/<testTag[^>]*name="findThisName"/,/<\/testTag>/{next} 1' file
<testTag name="doNOTfindThisName">
   <content>...</content>
</testTag>

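How it works

awk lets us specify a range of lines in the form /regex1/,/regex2/. The range begins at a line matching regex1 and ends at the first following line matching regex2. We use it to skip the unwanted lines: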

/<testTag[^>]*name="findThisName"/,/<\/testTag>/{next}

For every line in the range that begins with a match of <testTag[^>]*name="findThisName" and ends with a match of <\/testTag>, skip to the next line without printing.

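The chosen start regex, <testTag[^>]*name="findThisName", allows testTag to have multiple attributes. We do not require name="findThisName" to be the first attribute.

1

For all other lines, this tells awk to print them. 1 is awk's cryptic shorthand for printing the line. If you prefer to be explicit, replace it with {print $0}.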
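Since the question also asks about sed, the same range idea can be sketched there. This is a minimal sketch and not part of the original answer; the temporary sample file is hypothetical. One difference worth knowing: when the second address is a regex, sed starts looking for the end of the range on the line after the start, so a target tag that opens and closes on a single line would not end the range on that line, whereas awk also tests the end pattern on the starting line.

```shell
# Hypothetical sample file holding the XML from the question.
file=$(mktemp)
cat > "$file" <<'EOF'
<testTag name="findThisName">
    <content>...</content>
</testTag>
<testTag name="doNOTfindThisName">
   <content>...</content>
</testTag>
EOF

# Delete every line from the matching opening tag through the next </testTag>.
# The slash in </testTag> is escaped because / delimits the sed address.
result=$(sed '/<testTag[^>]*name="findThisName"/,/<\/testTag>/d' "$file")
printf '%s\n' "$result"
rm -f "$file"
```

With the sample input, only the doNOTfindThisName block should remain.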

Answer 2

Something like this seems to work in awk. Since you mentioned removing this tag, I don't print those lines. Note that this fails on nested testTags.

awk 'BEGIN {open=0} 
    $0 ~ /<testTag name="findThisName">/ {open=1}
    open==1 && $0 ~ /<\/testTag>/ {open=0; next;}
    open==1 {next;}
    open==0 {print;}'

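It sets a flag when the opening tag is found, then checks whether the tag closes on the same line and drops that line if so. If not, it skips lines until the closing tag is reached. Outside the targeted tag, it simply prints each line.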
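The same-line case described above can be checked with a small sketch. The input here is hypothetical (a target tag written on one line), and the flag is renamed to inside purely for readability; the logic otherwise follows the answer.

```shell
# Hypothetical input: one target tag on a single line, then an ordinary tag.
input='<testTag name="findThisName"><content>1</content></testTag>
<testTag name="doNOTfindThisName">
   <content>...</content>
</testTag>'

# Same rules as the answer, with the flag renamed to "inside".
result=$(printf '%s\n' "$input" | awk '
    /<testTag name="findThisName">/ { inside = 1 }        # opening tag seen
    inside && /<\/testTag>/         { inside = 0; next }  # closing tag: drop line, stop skipping
    inside                          { next }              # still inside: drop line
                                    { print }             # outside: keep line
')
printf '%s\n' "$result"
```

Because the closing-tag rule runs right after the opening-tag rule on the same line, the one-line target tag should be dropped and the flag cleared before the next line is read.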

With this test input:

<testTag name="findThisName">
    <content>1</content>
</testTag>
<testTag name="doNOTfindThisName">
   <content>...</content>
</testTag>
<testTag name="findThisName">
    <content>2</content>
</testTag>
<testTag name="doNOTfindThisName">
   <content>...</content>
</testTag>
<testTag name="doNOTfindThisName">
   <content>...</content>
</testTag>
<testTag name="doNOTfindThisName">
   <content>...</content>
</testTag>
<testTag name="findThisName">
    <content>3</content>
</testTag>
<testTag name="doNOTfindThisName">
   <content>...</content>
</testTag>

It seems to work as expected:

~$ awk 'BEGIN {open=0}
        $0 ~ /<testTag name="findThisName">/ {open=1}
        open==1 && $0 ~ /<\/testTag>/ {open=0; next;}
        open==1 {next;}
        open==0 {print;}' testxml.txt

<testTag name="doNOTfindThisName">
   <content>...</content>
</testTag>
<testTag name="doNOTfindThisName">
   <content>...</content>
</testTag>
<testTag name="doNOTfindThisName">
   <content>...</content>
</testTag>
<testTag name="doNOTfindThisName">
   <content>...</content>
</testTag>
<testTag name="doNOTfindThisName">
   <content>...</content>
</testTag>

Answer 3

$ awk '/<testTag name="findThisName"/{f=1} !f; /<\/testTag>/{f=0}' file
<testTag name="doNOTfindThisName">
   <content>...</content>
</testTag>
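This answer comes without an explanation, so here is one reading of the one-liner: f is set on the opening tag, the bare pattern !f prints a line only while the flag is unset, and the flag is cleared after the closing tag has already been tested for printing. Flipping !f to f should print only the targeted block, which is closer to the matching that the question literally asks for. The inverted variant below is an adaptation, not part of the original answer, and the temporary sample file is hypothetical.

```shell
# Hypothetical sample file holding the XML from the question.
file=$(mktemp)
cat > "$file" <<'EOF'
<testTag name="findThisName">
    <content>...</content>
</testTag>
<testTag name="doNOTfindThisName">
   <content>...</content>
</testTag>
EOF

# f is 1 while inside the target tag; the bare pattern "f" prints the line when true.
# The closing-tag rule comes last so that </testTag> itself is still printed.
result=$(awk '/<testTag name="findThisName"/{f=1} f; /<\/testTag>/{f=0}' "$file")
printf '%s\n' "$result"
rm -f "$file"
```

With the sample input, only the findThisName block should be printed.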

Answer 4

If you don't need a full regex match, don't bother with one. The following should work for you.

awk '
# Find the start line and set our flag.
/^<testTag name="findThisName">$/ {f=1}

# Print the line if we aren't currently in the flagged tag.
!f {print; next}

# Find the end of the flagged tag and unset our flag.
f && /^<\/testTag>$/ {f=0}
'

This does not work for nested <testTag> elements, though, because the first </testTag> clears the flag.
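If nested <testTag> elements do occur, one possible workaround is to count nesting depth instead of using a single flag. This is only a sketch under stated assumptions, not something from the answers: every opening and closing testTag sits on its own line, and the variable name depth and the sample input are hypothetical.

```shell
# Hypothetical input with a testTag nested inside the target tag.
input='<testTag name="findThisName">
    <testTag name="inner">
        <content>...</content>
    </testTag>
    <content>...</content>
</testTag>
<testTag name="doNOTfindThisName">
   <content>...</content>
</testTag>'

result=$(printf '%s\n' "$input" | awk '
    # Not skipping yet: start when the target opening tag appears.
    !depth && /<testTag[^>]*name="findThisName"/ { depth = 1; next }
    # While skipping, each nested opening tag goes one level deeper.
    depth && /<testTag[ >]/ { depth++ }
    # Each closing tag comes back up one level; depth 0 ends the skipped block.
    depth && /<\/testTag>/  { depth--; next }
    # Any other line inside the skipped block is dropped.
    depth                   { next }
    # Everything outside the target tag is printed.
    { print }
')
printf '%s\n' "$result"
```

The inner </testTag> only lowers the depth from 2 to 1, so skipping should continue until the outer closing tag. This remains a line-based heuristic and does not parse XML.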

Matching multi-line XML with regex and sed or awk
