Question

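I have the following string and I want to remove <bpt *>*</bpt> and <ept *>*</ept> (note that the additional tag content inside them has to be removed as well) without using an XML parser (too much overhead for such tiny strings).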

The big <bpt i="1" x="1" type="bold"><b></bpt>black<ept i="1"></b></ept> <bpt i="2" x="2" type="ulined"><u></bpt>cat<ept i="2"></u></ept> sleeps.

Any regex in VB.NET or C# will do.

Answer 1

If you just want to remove all the tags from the string, use this (C#):

try {
    yourstring = Regex.Replace(yourstring, "(<[be]pt[^>]+>.+?</[be]pt>)", "");
} catch (ArgumentException ex) {
    // Syntax error in the regular expression
}

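Edit:

I decided to add on to my solution with a better option. The previous option would not work if there were embedded tags. This new solution should strip all <**pt*> tags, embedded or not. In addition, it uses a backreference to the original [be] match so that the exact matching end tag is found. It also creates a reusable Regex object for improved performance, so that each iteration does not have to recompile the regular expression: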

try {
    // Build the Regex once so it is not re-parsed on every pass of the loop.
    Regex regex = new Regex(@"<([be])pt[^>]+>.+?</\1pt>");
    // Keep replacing until no <bpt>/<ept> element is left in the string.
    while (regex.IsMatch(yourstring)) {
        yourstring = regex.Replace(yourstring, "");
    }
} catch (ArgumentException ex) {
    // Syntax error in the regular expression
}

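Additional notes:

In the comments a user expressed a worry that the '.' pattern matcher would be CPU intensive. While this is true for a standalone greedy '.', the non-greedy modifier '?' makes the regex engine look ahead only until it finds the first match of the next character in the pattern, whereas a greedy '.' requires the engine to look ahead all the way to the end of the string. I use RegexBuddy as a regex development tool, and it includes a debugger that lets you see the relative performance of different regex patterns. It also auto-comments your regexes if desired, so I decided to include those comments here to explain the regex used above: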

// <([be])pt[^>]+>.+?</\1pt>
// 
// Match the character "<" literally «<»
// Match the regular expression below and capture its match into backreference number 1 «([be])»
//    Match a single character present in the list "be" «[be]»
// Match the characters "pt" literally «pt»
// Match any character that is not a ">" «[^>]+»
//    Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
// Match the character ">" literally «>»
// Match any single character that is not a line break character «.+?»
//    Between one and unlimited times, as few times as possible, expanding as needed (lazy) «+?»
// Match the characters "</" literally «</»
// Match the same text as most recently matched by backreference number 1 «\1»
// Match the characters "pt>" literally «pt>»
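
To make the greedy-versus-lazy point concrete, here is a small sketch that runs both forms of the pattern over the sample string from the question. The class and method names are made up for illustration, and the greedy variant is included only for comparison; it is not something the answer proposes.

```csharp
using System;
using System.Text.RegularExpressions;

static class GreedyVersusLazy
{
    // Sample string copied from the question.
    public const string Sample =
        "The big <bpt i=\"1\" x=\"1\" type=\"bold\"><b></bpt>black<ept i=\"1\"></b></ept> " +
        "<bpt i=\"2\" x=\"2\" type=\"ulined\"><u></bpt>cat<ept i=\"2\"></u></ept> sleeps.";

    // Lazy quantifier (.+?): stops at the first closing tag that fits.
    static readonly Regex Lazy = new Regex(@"<([be])pt[^>]+>.+?</\1pt>");

    // Greedy quantifier (.+): runs to the end of the line, then backtracks
    // to the LAST closing tag that fits. Shown for comparison only.
    static readonly Regex Greedy = new Regex(@"<([be])pt[^>]+>.+</\1pt>");

    public static string StripLazy(string s) => Lazy.Replace(s, "");
    public static string StripGreedy(string s) => Greedy.Replace(s, "");

    static void Main()
    {
        Console.WriteLine(StripLazy(Sample));   // The big black cat sleeps.
        Console.WriteLine(StripGreedy(Sample)); // The big cat sleeps.
    }
}
```

The greedy form swallows everything between the first <bpt ...> and the last </bpt>, so the word "black" is lost along with the tags.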

Answer 2

I presume you want to drop the tags entirely?

(<bpt .*?>.*?</bpt>)|(<ept .*?>.*?</ept>)

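The ? after the * makes it non-greedy, so it will try to match as few characters as possible.

One problem you will run into is nested tags: the pattern will not see the second opening tag, because the first one has already matched.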
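
The nesting caveat can be seen with a tiny experiment. The input below is a made-up nested string, not something from the question, and the class name is hypothetical; the pattern is the one from this answer.

```csharp
using System;
using System.Text.RegularExpressions;

static class NestedTagDemo
{
    // Pattern from the answer above.
    static readonly Regex Pattern =
        new Regex(@"(<bpt .*?>.*?</bpt>)|(<ept .*?>.*?</ept>)");

    public static string Strip(string s) => Pattern.Replace(s, "");

    static void Main()
    {
        // Hypothetical nested input: the outer <bpt> is paired with the
        // INNER </bpt>, so "z</bpt>" is left behind.
        Console.WriteLine(Strip("a<bpt i=\"1\">x<bpt i=\"2\">y</bpt>z</bpt>b"));
        // az</bpt>b
    }
}
```

The lazy match ends at the first </bpt> it meets, which belongs to the inner element, so a stray closing tag survives.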

Answer 3

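Why do you say the overhead is too large? Did you measure it? Or are you guessing?

Using a regex instead of a proper parser is a shortcut that will trip you up as soon as someone comes along with something like <bpt foo="bar>">.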

Answer 4

Does the .NET regex engine support negative lookaheads? If so, you can use

(<([eb])pt[^>]+>((?!</\2pt>).)+</\2pt>)

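which turns the string above into "The big black cat sleeps." if you remove all matches. Keep in mind, however, that it will not work if you have nested bpt/ept elements. You might also want to add \s in some places to allow for extra whitespace in closing elements and the like.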
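
For what it is worth, .NET's Regex class does accept the (?!...) negative lookahead syntax, so the pattern can be tried as-is. A minimal sketch against the sample string from the question follows; the class and method names are invented for the example.

```csharp
using System;
using System.Text.RegularExpressions;

static class LookaheadDemo
{
    // Sample string copied from the question.
    public const string Sample =
        "The big <bpt i=\"1\" x=\"1\" type=\"bold\"><b></bpt>black<ept i=\"1\"></b></ept> " +
        "<bpt i=\"2\" x=\"2\" type=\"ulined\"><u></bpt>cat<ept i=\"2\"></u></ept> sleeps.";

    // Pattern from the answer above. Group 2 captures "b" or "e"; the
    // lookahead stops the inner loop right before the matching close tag.
    static readonly Regex Pattern =
        new Regex(@"(<([eb])pt[^>]+>((?!</\2pt>).)+</\2pt>)");

    public static string Strip(string s) => Pattern.Replace(s, "");

    static void Main()
    {
        Console.WriteLine(Strip(Sample)); // The big black cat sleeps.
    }
}
```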

Answer 5

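If you are going to use a regex to remove XML elements, you had better be sure that your input XML does not use elements from different namespaces, and does not contain CDATA sections whose content you do not want to modify.

The proper (i.e. both performant and correct) way to do this is with XSLT. An XSLT transform that copies everything except a specific element to the output is a trivial extension of the identity transform. Once the transform is compiled, it will execute extremely quickly. And it won't contain any hidden defects.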
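
A rough sketch of that idea in C#, using XslCompiledTransform with an identity transform plus one empty template for bpt and ept. It assumes the input is well-formed XML, which the sample in the question is not as pasted (the inner <b> is never closed inside <bpt>), so the test input here is a hypothetical variant wrapped in a made-up <seg> root with the inner markup escaped. The class and method names are also invented.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Xsl;

static class XsltStrip
{
    // Identity transform plus an empty template that drops bpt and ept.
    const string Stylesheet =
        @"<xsl:stylesheet version=""1.0"" xmlns:xsl=""http://www.w3.org/1999/XSL/Transform"">
  <xsl:template match=""@*|node()"">
    <xsl:copy><xsl:apply-templates select=""@*|node()""/></xsl:copy>
  </xsl:template>
  <xsl:template match=""bpt|ept""/>
</xsl:stylesheet>";

    // Compile once and reuse for every input.
    static readonly XslCompiledTransform Compiled = Load();

    static XslCompiledTransform Load()
    {
        var xslt = new XslCompiledTransform();
        using (var reader = XmlReader.Create(new StringReader(Stylesheet)))
            xslt.Load(reader);
        return xslt;
    }

    public static string Strip(string xml)
    {
        var output = new StringBuilder();
        var settings = new XmlWriterSettings { OmitXmlDeclaration = true };
        using (var input = XmlReader.Create(new StringReader(xml)))
        using (var writer = XmlWriter.Create(output, settings))
            Compiled.Transform(input, writer);
        return output.ToString();
    }

    static void Main()
    {
        // Hypothetical well-formed input; <seg> is an assumed wrapper element.
        Console.WriteLine(Strip(
            "<seg>The big <bpt i=\"1\">&lt;b&gt;</bpt>black<ept i=\"1\">&lt;/b&gt;</ept> cat.</seg>"));
    }
}
```

Whether this is worth it for very small strings is exactly what the question doubts, so it would need measuring.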

Answer 6

Is there any way to get a general-purpose regex pattern for XML-like text? That way I could get rid of the Replace function and just use the regex. The trouble is analyzing whether the < and > come in order or not, and also replacing reserved characters such as ' and & and so on. Here is the code:

' Handling special chars functions
Friend Function ReplaceSpecChars(ByVal str As String) As String
  Dim arrLessThan As New Collection
  Dim arrGreaterThan As New Collection
  If Not IsDBNull(str) Then
    str = CStr(str)
    If Len(str) > 0 Then
      str = Replace(str, "&", "&amp;")
      str = Replace(str, "'", "&apos;")
      str = Replace(str, """", "&quot;")
      arrLessThan = FindLocationOfChar("<", str)
      arrGreaterThan = FindLocationOfChar(">", str)
      str = ChangeGreaterLess(arrLessThan, arrGreaterThan, str)
      str = Replace(str, Chr(13), "chr(13)")
      str = Replace(str, Chr(10), "chr(10)")
    End If
    Return str
  Else
    Return ""
  End If
End Function

Friend Function ChangeGreaterLess(ByVal lh As Collection, ByVal gr As Collection, ByVal str As String) As String
  For i As Integer = 0 To lh.Count
    If CInt(lh.Item(i)) > CInt(gr.Item(i)) Then
      str = Replace(str, "<", "&lt;") ' ///////// problem ////
    End If
  Next

  str = Replace(str, ">", "&gt;")
End Function

Friend Function FindLocationOfChar(ByVal chr As Char, ByVal str As String) As Collection
  Dim arr As New Collection
  For i As Integer = 1 To str.Length() - 1
    If str.ToCharArray(i, 1) = chr Then
      arr.Add(i)
    End If
  Next
  Return arr
End Function

I am having trouble at the line marked as the problem.

This is standard XML with different tags that I want to analyze.
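
As a point of comparison for the hand-written replacements above, the framework already has a helper that escapes the five reserved XML characters in one call: System.Security.SecurityElement.Escape. The sketch below is in C# rather than VB.NET, the wrapper name is made up, and it does not reproduce the chr(13)/chr(10) substitutions from the posted code.

```csharp
using System;
using System.Security;

static class EscapeDemo
{
    // Escapes <, >, &, " and ' as XML entities; null input becomes "".
    public static string EscapeXml(string s) => SecurityElement.Escape(s) ?? "";

    static void Main()
    {
        Console.WriteLine(EscapeXml("a < b & \"c\" > 'd'"));
        // a &lt; b &amp; &quot;c&quot; &gt; &apos;d&apos;
    }
}
```

Because every < and > is escaped regardless of position, there is no need to track whether the brackets come in order.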

Answer 7

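Have you measured this? I have run into performance problems with .NET's regex engine, but by contrast I have parsed XML files of around 40 GB with the XML parser without any problem (you will need to use XmlReader for larger strings, though).

Please post an actual code sample and mention your performance requirements: if performance matters, I doubt the Regex class is the best solution.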
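
Here is a minimal sketch of what the XmlReader route could look like: it streams through the input, skips every bpt and ept element together with its content, and collects the remaining text. It assumes well-formed XML, so the example input is a hypothetical variant of the question's string wrapped in a made-up <seg> root with the inner markup escaped; the class and method names are invented as well.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

static class StreamingStrip
{
    public static string TextWithoutTags(string xml)
    {
        var text = new StringBuilder();
        using (var reader = XmlReader.Create(new StringReader(xml)))
        {
            reader.Read();
            while (!reader.EOF)
            {
                bool isTag = reader.NodeType == XmlNodeType.Element
                             && (reader.Name == "bpt" || reader.Name == "ept");
                if (isTag)
                {
                    // Skip() jumps past the element's end tag and leaves the
                    // reader ON the next node, so no Read() here.
                    reader.Skip();
                    continue;
                }
                if (reader.NodeType == XmlNodeType.Text
                    || reader.NodeType == XmlNodeType.Whitespace
                    || reader.NodeType == XmlNodeType.SignificantWhitespace)
                {
                    text.Append(reader.Value);
                }
                reader.Read();
            }
        }
        return text.ToString();
    }

    static void Main()
    {
        Console.WriteLine(TextWithoutTags(
            "<seg>The big <bpt i=\"1\">&lt;b&gt;</bpt>black<ept i=\"1\">&lt;/b&gt;</ept> cat.</seg>"));
        // The big black cat.
    }
}
```

Since the reader never holds more than the current node, the same loop works on a file stream of any size.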

Regex to remove XML tags and their content
