Question

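I want to use C# to get the first few words (100 or 200) from a long summary, which may be a plain string or HTML.

My requirement is to show a short description of a long content summary (the content may contain HTML elements). I can do this for a plain string, but when the input is HTML the cut lands in the middle of elements. For example, I get something like this: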

<span style="FONT-FAMILY: Trebuchet MS">Heading</span>
</H3><span style="FONT-FAMILY: Trebuchet MS">
<font style="FONT-SIZE: 15px;

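But it should return a string with complete HTML elements.

I use the Yahoo UI editor to collect content from the user, and I pass that text to the method below to get a short summary: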

public static string GetFirstFewWords(string input, int numberWords)
{
    if (input.Split(new char[] { ' ' },
            StringSplitOptions.RemoveEmptyEntries).Length > numberWords)
    {
        // Number of words we still want to display.
        int words = numberWords;
        // Loop through entire summary.
        for (int i = 0; i < input.Length; i++)
        {
            // Decrement the remaining word count on each space.
            if (input[i] == ' ')
            {
                words--;
            }
            // If we have no more words to display, return the substring.
            if (words == 0)
            {
                return input.Substring(0, i);
            }
        }
        return string.Empty;
    }
    else
    {
        return input;
    }
}

I am trying to take article content from users and show a short summary of it on a listing page.

Answer 1

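Two options:

1. Write code that does this properly: count words while skipping HTML tags, push each opening tag onto a stack, and when the word threshold is reached, pop the tags still open off the stack and append their closing tags to the end of the string.

Pro: full control, and you get exactly N visible words. Con: it is a bit tricky to implement cleanly.

2. Cut off the words, then feed the broken HTML to HtmlAgilityPack (a free download that can help repair broken HTML), and you are done.

Pro: almost no coding, a proven solution, maintainable. Con: when you make the .Substring() call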
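
As a rough illustration of the first option, here is a minimal sketch of the stack approach. The names HtmlSummary, TruncateHtml, and VoidTags are made up for this example. It assumes reasonably well-formed input: no '>' inside attribute values or comments, every tag treated as a word boundary, and entities such as &nbsp; counted as ordinary word characters. Treat it as a starting point rather than a complete HTML parser.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class HtmlSummary
{
    // Void elements have no closing tag, so they never go on the stack.
    private static readonly HashSet<string> VoidTags = new HashSet<string>(
        new[] { "area", "base", "br", "col", "embed", "hr", "img",
                "input", "link", "meta", "source", "track", "wbr" },
        StringComparer.OrdinalIgnoreCase);

    // Keeps the first maxWords visible words and closes any tags left open.
    public static string TruncateHtml(string html, int maxWords)
    {
        var output = new StringBuilder();
        var openTags = new Stack<string>();
        int words = 0;
        bool inWord = false;
        int i = 0;

        while (i < html.Length)
        {
            char c = html[i];
            if (c == '<')
            {
                int end = html.IndexOf('>', i);
                if (end < 0) break;              // unterminated tag: drop the rest
                string tag = html.Substring(i, end - i + 1);
                TrackTag(tag, openTags);
                output.Append(tag);              // tags are copied, never counted
                i = end + 1;
                inWord = false;                  // simplification: a tag ends a word
                continue;
            }
            if (char.IsWhiteSpace(c))
            {
                inWord = false;
            }
            else if (!inWord)
            {
                if (words == maxWords)
                {
                    // A new word would exceed the limit: trim trailing spaces and stop.
                    while (output.Length > 0 && char.IsWhiteSpace(output[output.Length - 1]))
                        output.Length--;
                    break;
                }
                words++;
                inWord = true;
            }
            output.Append(c);
            i++;
        }

        // Close whatever is still open, innermost first.
        while (openTags.Count > 0)
            output.Append("</").Append(openTags.Pop()).Append('>');
        return output.ToString();
    }

    private static void TrackTag(string tag, Stack<string> openTags)
    {
        if (tag.StartsWith("<!") || tag.StartsWith("<?")) return;   // comment, doctype
        bool closing = tag.StartsWith("</");
        int start = closing ? 2 : 1;
        int end = start;
        while (end < tag.Length && char.IsLetterOrDigit(tag[end])) end++;
        string name = tag.Substring(start, end - start);
        if (name.Length == 0) return;

        if (closing)
        {
            // Pop only when the closing tag matches the innermost open tag.
            if (openTags.Count > 0 &&
                string.Equals(openTags.Peek(), name, StringComparison.OrdinalIgnoreCase))
                openTags.Pop();
        }
        else if (!tag.EndsWith("/>") && !VoidTags.Contains(name))
        {
            openTags.Push(name);
        }
    }
}
```

For example, HtmlSummary.TruncateHtml("<p>one <b>two three</b> four</p>", 2) returns "<p>one <b>two</b></p>": the cut happens inside the b element, and both b and p are closed from the stack.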

Answer 2

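Have you considered letting the Html Agility Pack do the work for you?

It is not perfect, but here is an idea that achieves (more or less) what you are after: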

// Requires the HtmlAgilityPack package:
// using HtmlAgilityPack;

// Retrieve a summary of the html made of whole top-level elements.
// Elements are added until at least 'max' words are included, so the
// result can run over 'max' by up to one element's worth of words.
string GetSummary(string html, int max)
{
    string summaryHtml = string.Empty;

    // load our html document
    HtmlDocument htmlDoc = new HtmlDocument();
    htmlDoc.LoadHtml(html);

    int wordCount = 0;

    foreach (var element in htmlDoc.DocumentNode.ChildNodes)
    {
        // stop once we have used enough words
        if (wordCount >= max)
        {
            break;
        }

        // inner text will strip out all html, and give us plain text
        string elementText = element.InnerText;

        // split on any whitespace, ignoring empty entries,
        // to get all the words in this element
        string[] elementWords = elementText.Split(
            (char[])null, StringSplitOptions.RemoveEmptyEntries);

        // add the *outer* HTML (which will have proper
        // html formatting for this fragment) to the summary
        summaryHtml += element.OuterHtml;

        wordCount += elementWords.Length;
    }

    return summaryHtml;
}

Answer 3

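You should keep content and markup separate. Can you give more information about what you are trying to do (for example, where this string comes from and why you want to do this)?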

Get the first few words from a long summary (plain string or HTML)
