Question

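I have a web page made up of several <div> elements.

I want to write a program that prints all the li elements that come after the <h4> header inside a <div>. Can anyone give me some help or sample code?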

<div id="content">
    <h4>Header</h4>
    <ul>
        <li><a href...></a> THIS IS WHAT I WANT TO GET</li>
    </ul>
</div>

Answer 1

When parsing HTML in C#, don't try to write your own parser. The HTML Agility Pack is almost certainly capable of doing what you want!

Which parts are constant:

The 'id' of the DIV?
The h4?

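Searching the complete HTML document and reacting to every H4 on its own is likely to be a mess, whereas if you know the DIV has the ID "content", you can just look for that!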

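// Requires: using System.Linq;
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(yourHtml);

// SelectNodes returns null when nothing matches, so guard before filtering.
var allDivs = doc.DocumentNode.SelectNodes("//div");
if ( allDivs != null )
{
   // The inner lambda needs its own parameter name; reusing the outer one does not compile.
   var divs = allDivs
                 .Where(div => div.Descendants().Any(d => d.Name == "h4"));

   // You now have all of the divs that contain an 'h4'.

   // The rest of the element structure, if constant, needs to be examined to get
   // the rest of the content you're after.
}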
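
To go from the matching div to the list items themselves, one XPath query can do the narrowing described above. This is a minimal sketch, assuming the HtmlAgilityPack NuGet package is referenced and the markup looks like the snippet in the question; `ListItemExtractor` and `GetItemsAfterHeader` are names made up for illustration and are not part of the library.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper for illustration; not part of HTML Agility Pack.
public static class ListItemExtractor
{
    // Returns the trimmed text of each <li> in the first <ul> that follows
    // an <h4> directly inside <div id="content">.
    public static List<string> GetItemsAfterHeader(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection items = doc.DocumentNode.SelectNodes(
            "//div[@id='content']/h4/following-sibling::ul[1]/li");

        return items == null
            ? new List<string>()
            : items.Select(li => li.InnerText.Trim()).ToList();
    }
}

public static class Program
{
    public static void Main()
    {
        // Sample markup modeled on the question; the href value is a placeholder.
        const string html =
            "<div id=\"content\"><h4>Header</h4><ul>" +
            "<li><a href=\"#\"></a> THIS IS WHAT I WANT TO GET</li>" +
            "</ul></div>";

        foreach (string item in ListItemExtractor.GetItemsAfterHeader(html))
        {
            Console.WriteLine(item);
        }
    }
}
```

Putting the id test and the h4 test into the XPath keeps the C# side down to a null check and a projection. If the real page nests the list more deeply, the path would need adjusting.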

Answer 2

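If it is your own web page, why do you need to parse the HTML at all? Doesn't the technology you are using to build the page give you access to all of its elements? For example, if you are using ASP.NET, you could assign ids to the UL and LI elements (with the runat="server" attribute) and they would be available in the code-behind.

Can you explain your scenario? If you are making a web request and downloading the HTML as a string, then scraping the HTML would make sense.

Edit: I think this should work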

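HtmlDocument doc = new HtmlDocument();
doc.Load(myHtmlFile);
foreach (HtmlNode div in doc.DocumentNode.SelectNodes("//div"))
{
    // GetAttributeValue avoids a NullReferenceException on divs without an id.
    if (div.GetAttributeValue("id", "") == "content")
    {
        foreach (HtmlNode ul in div.Elements("ul"))
        {
            // PreviousSibling may be a whitespace text node, so step back to the nearest element.
            HtmlNode prev = ul.PreviousSibling;
            while (prev != null && prev.NodeType != HtmlNodeType.Element) prev = prev.PreviousSibling;
            if (prev != null && prev.Name == "h4" && prev.InnerText.Trim() == "Header")
            {
                foreach (HtmlNode li in ul.Elements("li"))
                {
                    // li is each <li> in the list that follows the header
                }
            }
        }
    }
}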

Answer 3

If all you want is the content of every <li></li> tag that sits inside the <div id="content"> tag and comes right after the <h4> tag, then this should suffice:

//Load your document first.
//Load() accepts a Stream, a TextReader, or a string path to the file on your computer.
//If the entire document is loaded into a string, then use .LoadHtml() instead.
HtmlDocument mainDoc = new HtmlDocument();
//Verbatim string (@"...") so that "\f" is not read as a form-feed escape.
mainDoc.Load(@"c:\foobar.html");

//Select all the <li> nodes that are inside of the element with the id of "content"
// and come directly after an <h4> tag. The leading "." keeps the search relative to
// that element; a bare "//" would search the whole document.
HtmlNodeCollection processMe = mainDoc.GetElementbyId("content")
                                      .SelectNodes(".//h4/following-sibling::*[1]//li");

//Iterate through each <li> node and print the inner text to the console.
//SelectNodes returns null when nothing matches, so check before looping.
if (processMe != null)
{
    foreach (HtmlNode listElement in processMe)
    {
        Console.WriteLine(listElement.InnerText);
    }
}

Get all <li> elements from a certain <div> using C#
