Splitting HTML into words

Time: 2009-05-10 13:55:59

Tags: c# html split

Suppose I have the following string:

Hellotoevryone<img height="115" width="150" alt="" src="/Content/Edt/image/b4976875-8dfb-444c-8b32-cc b47b2d81e0.jpg" />Iamsogladtoseeall.

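This string is a sequence of characters that are not separated by spaces, with an HTML image tag embedded in it. I want to split the string into words of 10 characters each, so the output should be: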

1)Hellotoevr
2)yone<img height="115" width="150" alt="" src="/Content/Edt/image/b4976875-8dfb-444c-8b32-cc b47b2d81e0.jpg" />Iamsog
3)ladtoseeal
4)l.

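So the idea is to treat the content of any HTML tag as zero-length characters.

I have written the method below, but it does not take HTML tags into account: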

public static string EnsureWordLength(this string target, int length)
{
    string[] words = target.Split(' ');
    for (int i = 0; i < words.Length; i++)
        if (words[i].Length > length)
        {
            var possible = true;
            var ord = 1;
            do
            {
                var lengthTmp = length*ord+ord-1;
                if (lengthTmp < words[i].Length) words[i] = words[i].Insert(lengthTmp, " ");
                else possible = false;
                ord++;
            } while (possible); 

        }

    return string.Join(" ", words);
}

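I would like to see code that performs the split I described. Thanks.

2 answers:

Answer 0: (score: 3)

Here is a regex solution that meets your requirements. Bear in mind that it may stop working if you change your requirements even slightly, which is in keeping with a well-known quote.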

using System.Text.RegularExpressions;

string[] samples = {
    @"Hellotoevryone<img height=""115"" width=""150"" alt="""" src=""/Content/Edt/image/b4976875-8dfb-444c-8b32-cc b47b2d81e0.jpg"" />Iamsogladtoseeall.",
    "Testing123Hello.World",
    @"Test<a href=""http://stackoverflow.com"">StackOverflow</a>",
    @"Blah<a href=""http://stackoverflow.com"">StackOverflow</a>Blah<a href=""http://serverfault.com"">ServerFault</a>",
    @"Test<a href=""http://serverfault.com"">Server Fault</a>", // has a space, not matched
    "Stack Overflow" // has a space, not matched
};

// use these 2 lines if you don't want to use regex comments
//string pattern = @"^((?:\S(?:\<[^>]+\>)?){1,10})+$";
//Regex rx = new Regex(pattern);

// regex comments spanning multiple lines requires use of RegexOptions.IgnorePatternWhitespace
string pattern = @"^(               # match line/string start, begin group
                    (?:\S           # match (but don't capture) non-whitespace chars
                    (?:\<[^>]+\>)?  # optionally match (doesn't capture) an html <...> tag
                                    # to match img tags only change to (?:\<img[^>]+\>)?
                    ){1,10}         # match up to 10 chars (tags don't count per your example)
                    )+$             # match at least once, and match end of line/string
                    ";
Regex rx = new Regex(pattern, RegexOptions.IgnorePatternWhitespace);

foreach (string sample in samples)
{
    if (rx.IsMatch(sample))
    {
        foreach (Match m in rx.Matches(sample))
        {
            // using group index 1, group 0 is the entire match which I'm not interested in
            foreach (Capture c in m.Groups[1].Captures)
            {
                Console.WriteLine("Capture: {0} -- ({1})", c.Value, c.Value.Length);
            }
        }
    }
    else
    {
        Console.WriteLine("Not a match: {0}", sample);
    }

    Console.WriteLine();
}

With the samples above, this is the output (the number in parentheses is the length of the captured string):

Capture: Hellotoevr -- (10)
Capture: yone<img height="115" width="150" alt="" src="/Content/Edt/image/b4976875-8dfb-444c-8b32-cc b47b2d81e0.jpg" />Iamsog -- (116)
Capture: ladtoseeal -- (10)
Capture: l. -- (2)

Capture: Testing123 -- (10)
Capture: Hello.Worl -- (10)
Capture: d -- (1)

Capture: Test<a href="http://stackoverflow.com">StackO -- (45)
Capture: verflow</a> -- (11)

Capture: Blah<a href="http://stackoverflow.com">StackO -- (45)
Capture: verflow</a>Bla -- (14)
Capture: h<a href="http://serverfault.com">ServerFau -- (43)
Capture: lt</a> -- (6)

Not a match: Test<a href="http://serverfault.com">Server Fault</a>

Not a match: Stack Overflow
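
If the goal is a single string with spaces inserted, as in the question's EnsureWordLength, one possible variation is to drop the ^ and $ anchors and let Regex.Matches walk the input. This is only a minimal sketch, not part of the solution above: the names RegexChunker, Chunk and InsertBreaks are made up for illustration, and the inner `?` is changed to `*` on the assumption that several consecutive tags should all count as zero-length. It keeps the pattern's other limits (a tag at the very start of the input or a `>` inside an attribute value is not handled), and `length` is assumed to be at least 1.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class RegexChunker
{
    // Returns pieces holding at most `length` visible characters each.
    // Any run of <...> tags directly after a character rides along with it.
    public static string[] Chunk(string target, int length)
    {
        string pattern = @"(?:\S(?:<[^>]+>)*){1," + length + "}";
        return Regex.Matches(target, pattern)
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToArray();
    }

    // Joins the pieces with single spaces, like the question's EnsureWordLength.
    public static string InsertBreaks(string target, int length)
    {
        return string.Join(" ", Chunk(target, length));
    }
}
```

Because the pattern is no longer anchored, inputs that already contain spaces (such as "Stack Overflow") are split at those spaces instead of being rejected. Whitespace between pieces is not captured, so runs of whitespace outside tags collapse to a single space when the pieces are joined.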

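Answer 1: (score: 1)

The following code handles the cases you provided, but it will break on anything more complex. Also, since you did not specify how to handle long-form tags that contain inner text or HTML, it treats every tag as a short-form tag (run the code to see what I mean).

Works with this input: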

Hellotoevryone<img height="115" width="150" alt="" src="/Content/Edt/image/b4976875-8dfb-444c-8b32-cc b47b2d81e0.jpg" />Iamsogladtoseeall.
Hellotoevryone<img src="/Content/Edt/image/b4976875-8dfb-444c-8b32-cc b47b2d81e0.jpg" />Iamsoglad<img src="baz.jpeg" />toseeall.
Hello<span class="foo">toevryone</span>Iamso<em>glad</em>toseeallTheQuickBrown<img src="bar.jpeg" />FoxJumpsOverTheLazyDog.
Hello<span class="foo">toevryone</span>Iamso<em>glad</em>toseeall.
Loremipsumdolorsitamet,consecteturadipiscingelit.Nullamacnibhelit,quisvolutpatnunc.Donecultrices,ipsumquisaccumsanconvallis,tortortortorgravidaante,etsollicitudinipsumnequeeulorem.

Breaks on this input (note the incomplete tag):

Hellotoevryone<img height="115" width="150" alt="" src="/Content/Edt/image/b4976875-8dfb-444c-8b32-cc b47b2d81e0.jpg" /Iamsogladtoseeall.

using System;
using System.Text.RegularExpressions;
using System.IO;
using System.Collections.Generic;

public static class CustomSplit {
  public static void Main(String[] args) {
    if (args.Length > 0 && File.Exists(args[0])) {
      String[] lines;
      // the using block disposes the reader once the file has been read
      using (StreamReader sr = new StreamReader(args[0])) {
        lines = sr.ReadToEnd().Split(new String[]{Environment.NewLine}, StringSplitOptions.None);
      }

      int counter = 0;
      foreach (String line in lines) {
        Console.WriteLine("########### Line {0} ###########", ++counter);
        Console.WriteLine(line);
        Console.WriteLine(line.EnsureWordLength(10));
      }
    }
  }

}

public static class EnsureWordLengthExtension {
  public static String EnsureWordLength(this String target, int length) {
    List<List<Char>> words = new List<List<Char>>();

    words.Add(new List<Char>());

    for (int i = 0; i < target.Length; i++) {
      words[words.Count - 1].Add(target[i]);

      if (target[i] == '<') {
        do {
          i++;
          words[words.Count - 1].Add(target[i]);
        } while(target[i] != '>');
      }

      if ((new String(words[words.Count - 1].ToArray())).CountCharsWithoutTags() == length) {
        words.Add(new List<Char>());
      }
    }

    String[] result = new String[words.Count];
    for (int j = 0; j < words.Count; j++) {
      result[j] = new String(words[j].ToArray());
    }

    return String.Join(" ", result);
  }

  private static int CountCharsWithoutTags(this String target) {
    return Regex.Replace(target, "<.*?>", "").Length;
  }
}
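
The incomplete-tag input fails because the inner do/while loop keeps reading until it finds '>' and runs past the end of the string. A minimal sketch of one way around that follows. The name EnsureWordLengthSafe and the decision to copy an unterminated tag through unchanged are assumptions made for illustration, not part of the code above. The sketch also keeps a running count of visible characters instead of re-running the regex after every character.

```csharp
using System;
using System.Text;

public static class SafeWordLength
{
    // Hypothetical variant of EnsureWordLength: tags count as zero-length,
    // and an unterminated '<' is copied through to the end instead of throwing.
    public static string EnsureWordLengthSafe(string target, int length)
    {
        if (length <= 0) throw new ArgumentOutOfRangeException("length");

        var sb = new StringBuilder();
        int visible = 0; // non-tag characters in the current word
        int i = 0;
        while (i < target.Length)
        {
            if (target[i] == '<')
            {
                // Copy the whole tag; if '>' is missing, copy the rest of the input.
                int close = target.IndexOf('>', i);
                int end = close < 0 ? target.Length : close + 1;
                sb.Append(target, i, end - i);
                i = end;
                continue;
            }
            if (visible == length)
            {
                // The current word is full: break before the next visible character.
                sb.Append(' ');
                visible = 0;
            }
            sb.Append(target[i]);
            visible++;
            i++;
        }
        return sb.ToString();
    }
}
```

For example, SafeWordLength.EnsureWordLengthSafe("Testing123Hello.World", 10) returns "Testing123 Hello.Worl d". Unlike the code above, a tag that directly follows the last character of a word stays with that word, and no trailing space is added when the text ends exactly on a word boundary. Like the code above, existing spaces are counted as ordinary characters.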