Regex to remove everything between two characters

Date: 2017-12-19 10:49:54

Tags: c# regex

I have the following string:

"<a href=\"/formentries/formfile/13978\" target=\"_blank\">dog-00.jpg|image/jpeg</a>  <a href='/FormEntries/Delete' class='btnDeleteAttachment'  data-form-entry-id='366793'  data-attachment-id='13978'> [remove]</a><br /><a href=\"/formentries/formfile/13979\" target=\"_blank\">dog-01.docx|application/vnd.openxmlformats-officedocument.wordprocessingml.document</a>  <a href='/FormEntries/Delete' class='btnDeleteAttachment'  data-form-entry-id='366793'  data-attachment-id='13979'> [remove]</a><br /><a href=\"/formentries/formfile/13980\" target=\"_blank\">dog-02.png|image/png</a>  <a href='/FormEntries/Delete' class='btnDeleteAttachment'  data-form-entry-id='366793'  data-attachment-id='13980'> [remove]</a>"

Formatted nicely, it looks something like this:

<a href=\"/formentries/formfile/13978\" target=\"_blank\">dog-00.jpg|image/jpeg</a>
<a href='/FormEntries/Delete' class='btnDeleteAttachment'  data-form-entry-id='366793'  data-attachment-id='13978'> [remove]</a>
<br />

<a href=\"/formentries/formfile/13979\" target=\"_blank\">dog-01.docx|application/vnd.openxmlformats-officedocument.wordprocessingml.document</a>
<a href='/FormEntries/Delete' class='btnDeleteAttachment'  data-form-entry-id='366793'  data-attachment-id='13979'> [remove]</a>
<br />

<a href=\"/formentries/formfile/13980\" target=\"_blank\">dog-02.png|image/png</a>
<a href='/FormEntries/Delete' class='btnDeleteAttachment'  data-form-entry-id='366793'  data-attachment-id='13980'> [remove]</a>

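So I have a bunch of anchor tags with line breaks between them. In each anchor's text, I want to remove the pipe character and the file type:

dog-00.jpg|image/jpeg

becomes

dog-00.jpg

The regex should also work for any future file types, for example:

dog-01.docx|application/vnd.openxmlformats-officedocument.wordprocessingml.document

becomes

dog-01.docx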

I still need the anchors intact, so after removing the file types the text becomes:

<a href=\"/formentries/formfile/13978\" target=\"_blank\">dog-00.jpg</a>
<a href='/FormEntries/Delete' class='btnDeleteAttachment'  data-form-entry-id='366793'  data-attachment-id='13978'> [remove]</a>
<br />

<a href=\"/formentries/formfile/13979\" target=\"_blank\">dog-01.docx</a>
<a href='/FormEntries/Delete' class='btnDeleteAttachment'  data-form-entry-id='366793'  data-attachment-id='13979'> [remove]</a>
<br />

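I'm not good with regex, and none of the combinations I tried matched.

3 answers:

Answer 0: (score: 1)

Don't use regex to parse complex HTML; you can use HtmlAgilityPack instead. I also use string methods such as Contains, IndexOf, and Remove in place of a regex: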

var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html); // pass in your HTML string

foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
    string text = link.InnerText;
    if (text.Contains('|'))
        link.InnerHtml = text.Remove(text.IndexOf('|')); // you can't modify InnerText directly but this works
}

string result = doc.DocumentNode.OuterHtml; // your desired result

Answer 1: (score: 0)

Input:
dog-00.jpg|image/jpeg

A regex that matches only the part before the | pipe:
([^|]+)

Description:
The regex above matches everything up to the first pipe character.

C# code:

var input = @"dog-00.jpg|image/jpeg";
var regex = new Regex(@"([^|]+)");
var m = regex.Match(input);
string name = null;
if (m.Success)
{
     name = m.Groups[1].Value;
}

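Edit:
If this is just a matter of splitting the string on the pipe character, then Dylan Nicholson's variant using input.Split (or .Substring + .IndexOf) will probably perform better than a regex.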
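
For reference, a minimal sketch of what the Split and Substring + IndexOf variants mentioned above might look like. The class and method names (PipeTrim, BeforePipe, BeforePipeSplit) are made up for illustration, and both methods assume they are given a single anchor text such as dog-00.jpg|image/jpeg, not the whole HTML string:

```csharp
using System;

public static class PipeTrim
{
    // Hypothetical helper: returns the part of the input before the first '|',
    // or the input unchanged when it contains no pipe.
    public static string BeforePipe(string input)
    {
        int i = input.IndexOf('|');
        return i < 0 ? input : input.Substring(0, i);
    }

    // Same result via Split: element 0 is everything before the first pipe
    // (or the whole string when there is no pipe).
    public static string BeforePipeSplit(string input)
    {
        return input.Split('|')[0];
    }
}
```

Calling PipeTrim.BeforePipe("dog-00.jpg|image/jpeg") should give dog-00.jpg. The IndexOf version avoids allocating an array for the split parts, which is probably why it tends to be the cheaper of the two.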

EDIT2:
Is a regex required? If not, try the following:

public static string Clean(string input)
{
    var sb = new StringBuilder(input);
    int m1 = -1, m2 = -1;
    for(var i = 0; i < sb.Length; i++)
    {
        if (sb[i] == '|')
            m1 = i;
        if (sb[i] == '<')
            m2 = i;
        if (m1 > -1 && m2 > -1 && m2 > m1)
        {
            sb.Remove(m1, m2 - m1);
            i = m1;
            m1 = -1;
            m2 = -1;
        }
    }
    return sb.ToString();
}

Answer 2: (score: 0)

Update

You can use this regex:

(?<=<a[^>]*>[^|]+?)\|.*?(?=</a>)

For C#:

 your_string = Regex.Replace(your_string, "(?<=<a[^>]*>[^|]+?)\\|.*?(?=</a>)", "",
    RegexOptions.IgnoreCase | RegexOptions.Multiline);

Just run a replace on your string with this regex.
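
To make the pattern easier to try out, here is a small self-contained sketch that wraps the same regex in a helper. The names AnchorCleaner and StripFileTypes are hypothetical. It relies on .NET supporting variable-length lookbehind, and it assumes each anchor sits on a single line, since . does not match newlines by default:

```csharp
using System;
using System.Text.RegularExpressions;

public static class AnchorCleaner
{
    // Lookbehind: an opening <a ...> tag followed by text containing no pipe.
    // Match: the pipe and everything up to (but not including) the closing </a>.
    private static readonly Regex PipeSuffix =
        new Regex(@"(?<=<a[^>]*>[^|]+?)\|.*?(?=</a>)", RegexOptions.IgnoreCase);

    // Hypothetical helper: removes the "|mime/type" suffix from every anchor text.
    public static string StripFileTypes(string html)
    {
        return PipeSuffix.Replace(html, "");
    }
}
```

Passing the string from the question to AnchorCleaner.StripFileTypes should leave the [remove] links untouched, because their text contains no pipe. One caveat: [^|]+? can also run across tags, so a pipe appearing outside an anchor's text but after an earlier anchor would match as well if a </a> follows it on the same line.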