Question

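I'm working on a small hobby project. I've already written code that fetches a URL, downloads the headers and returns the MIME type / content type.

However, the step before this is the one I'm stuck on - I need to retrieve all of the URLs on the page that sit inside a tag and in quotes, i.e.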

...
<link rel='shortcut icon' href="/static/favicon.ico" type="image/x-icon" />
...

would find the favicon link.

Is there anything in the .NET library to help, or does this have to be a case for regex?

Answer 1

I'd look at using the Html Agility Pack.

Here's an example from their examples page on how to find all the links in a page:

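 HtmlWeb hw = new HtmlWeb();
 HtmlDocument doc = hw.Load(/* url */);
 // SelectNodes returns null (not an empty collection) when nothing matches.
 HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a[@href]");
 if (links != null)
 {
     foreach (HtmlNode link in links)
     {
         // link.GetAttributeValue("href", "") holds the URL of this anchor
     }
 }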
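The sample above only visits `<a>` elements, while the question's favicon lives in a `<link>` tag. A minimal sketch that collects `href` values from both is shown below. It assumes the HtmlAgilityPack NuGet package is referenced; the `HrefCollector` class and `CollectHrefs` method are hypothetical names used only for illustration, and the HTML is inlined so that no network call is needed.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

static class HrefCollector
{
    // Hypothetical helper: returns the raw href values of all <a> and <link> elements.
    public static List<string> CollectHrefs(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var hrefs = new List<string>();

        // The XPath union selects <a> and <link> elements that carry an href attribute.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//a[@href] | //link[@href]");

        // SelectNodes returns null when nothing matches, so guard before iterating.
        if (nodes == null)
        {
            return hrefs;
        }

        foreach (HtmlNode node in nodes)
        {
            hrefs.Add(node.GetAttributeValue("href", ""));
        }

        return hrefs;
    }
}

class Program
{
    static void Main()
    {
        string html =
            "<html><head><link rel='shortcut icon' href=\"/static/favicon.ico\" type=\"image/x-icon\" /></head>" +
            "<body><a href=\"/about\">About</a><a name=\"top\">no href here</a></body></html>";

        foreach (string href in HrefCollector.CollectHrefs(html))
        {
            Console.WriteLine(href);
        }
    }
}
```

The anchor without an `href` is skipped by the XPath predicate, so no empty entries end up in the list.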

Answer 2

You need to use the HTML Agility Pack.

For example:

var doc = new HtmlWeb().Load(url);
var linkTags = doc.DocumentNode.Descendants("link");
var linkedPages = doc.DocumentNode.Descendants("a")
                                  .Select(a => a.GetAttributeValue("href", null))
                                  .Where(u => !String.IsNullOrEmpty(u));
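The `href` values returned this way are often relative (the question's favicon is `/static/favicon.ico`), so they usually need to be combined with the page URL before a follow-up request can be made. The sketch below shows one way to do that using only `System.Uri` from the BCL. `HrefResolver` and `ToAbsolute` are hypothetical names, and the example.com URLs are placeholders.

```csharp
using System;
using System.Collections.Generic;

static class HrefResolver
{
    // Hypothetical helper: resolves raw href values against the URL of the page they came from.
    public static List<Uri> ToAbsolute(Uri pageUrl, IEnumerable<string> hrefs)
    {
        var result = new List<Uri>();

        foreach (string href in hrefs)
        {
            // TryCreate returns false instead of throwing when the value cannot be resolved.
            if (Uri.TryCreate(pageUrl, href, out Uri absolute))
            {
                result.Add(absolute);
            }
        }

        return result;
    }
}

class Program
{
    static void Main()
    {
        var page = new Uri("http://example.com/blog/index.html");
        var hrefs = new[] { "/static/favicon.ico", "post.html", "http://other.example/a" };

        // Prints:
        //   http://example.com/static/favicon.ico
        //   http://example.com/blog/post.html
        //   http://other.example/a
        foreach (Uri url in HrefResolver.ToAbsolute(page, hrefs))
        {
            Console.WriteLine(url.AbsoluteUri);
        }
    }
}
```

Values that are already absolute pass through unchanged. A caller that only wants web pages may still want to check `Uri.Scheme`, since `mailto:` links and similar also parse as absolute URIs.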

Answer 3

There isn't anything built into the BCL, but fortunately you can use the HTML Agility Pack to accomplish this task.

As for your specific problem, please see Easily extracting links from a snippet of html with HtmlAgilityPack:

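private List<string> ExtractAllAHrefTags(HtmlDocument htmlSnippet)
{
    List<string> hrefTags = new List<string>();

    // SelectNodes returns null when the snippet contains no matching anchors.
    HtmlNodeCollection links = htmlSnippet.DocumentNode.SelectNodes("//a[@href]");
    if (links == null)
    {
        return hrefTags;
    }

    foreach (HtmlNode link in links)
    {
        HtmlAttribute att = link.Attributes["href"];
        hrefTags.Add(att.Value);
    }

    return hrefTags;
}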

Answer 4

How about Regex?

<(a|link).*?href=(\"|')(.+?)(\"|').*?>

with the IgnoreCase and Singleline flags.

See the demo on systemtextregularexpressions.com regex.matches.
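As a rough illustration of how that pattern might be applied, the sketch below runs it through `Regex.Matches` with both flags and reads the third capture group, which holds the text between the quotes. `RegexLinkExtractor` and `ExtractHrefs` are hypothetical names. Note that `<(a|link)` has no word boundary, so the pattern as written would also match tags such as `<area>` or `<abbr>` that happen to be followed by an `href`.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class RegexLinkExtractor
{
    // The pattern from the answer above, written as a C# string literal.
    private const string Pattern = "<(a|link).*?href=(\"|')(.+?)(\"|').*?>";

    // Hypothetical helper: returns the href value of every match.
    public static List<string> ExtractHrefs(string html)
    {
        var hrefs = new List<string>();
        RegexOptions options = RegexOptions.IgnoreCase | RegexOptions.Singleline;

        foreach (Match match in Regex.Matches(html, Pattern, options))
        {
            // Group 1 is the tag name, group 2 the opening quote, group 3 the URL.
            hrefs.Add(match.Groups[3].Value);
        }

        return hrefs;
    }
}

class Program
{
    static void Main()
    {
        string html =
            "<link rel='shortcut icon' href=\"/static/favicon.ico\" type=\"image/x-icon\" />\n" +
            "<A HREF='http://example.com/page'>page</A>";

        // Prints /static/favicon.ico, then http://example.com/page
        foreach (string href in RegexLinkExtractor.ExtractHrefs(html))
        {
            Console.WriteLine(href);
        }
    }
}
```

`IgnoreCase` is what lets the upper-case `<A HREF=...>` match, and `Singleline` lets `.` cross line breaks inside a tag.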

Get all links on html page?
