Question

I'm developing a web crawler, and I need to save some evidence that the crawler did its job.

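I'm looking for a way to download all the HTML, CSS and JS for a given URL and recreate the target site's folder structure.

I have to use Azure Functions to run the crawler.

The idea is to crawl a site, download its content, and save it in Azure Blob storage.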

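I found this article, but it only shows how to download the HTML. I need to reproduce exactly what the crawler saw (with images, CSS and processed JS).

I believe all absolute paths will work as they are; the real problem is the relative paths, for which I will have to create folders to save the files.

Can anyone help me?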

Answer 1

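Well, I believe this answer will be useful to anyone who runs into the same situation I did.

My solution was to download the HTML (using HttpWebRequest) and write it to a file (stored in Azure Blob storage).

In my case, I wrote a function that corrects all the relative paths in the HTML file, as shown below: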

using System;
using HtmlAgilityPack;

private static HtmlDocument CorrectHTMLReferencies(string urlRoot, string htmlContent)
{
    HtmlDocument document = new HtmlDocument();
    document.LoadHtml(htmlContent);

    // Base URI against which relative and protocol-relative ("//host/path") links are resolved.
    Uri baseUri = new Uri(urlRoot, UriKind.Absolute);

    var nodesIMG = document.DocumentNode.SelectNodes("//img");
    var nodesCSS = document.DocumentNode.SelectNodes("//link");
    var nodesJS = document.DocumentNode.SelectNodes("//script");

    // Rewrites the given attribute on every node in the collection to an absolute URL.
    void WatchURl(HtmlNodeCollection colNodes, string attr)
    {
        // SelectNodes returns null (not an empty collection) when nothing matches.
        if (colNodes == null)
            return;

        foreach (HtmlNode node in colNodes)
        {
            // Skip nodes without the attribute, e.g. inline <script> blocks.
            string link = node.GetAttributeValue(attr, "");
            if (string.IsNullOrWhiteSpace(link))
                continue;

            // Links that are already absolute are kept; relative, root-relative and
            // protocol-relative links are resolved against baseUri.
            if (Uri.TryCreate(baseUri, link, out Uri absolute))
                node.SetAttributeValue(attr, absolute.AbsoluteUri);
        }
    }

    WatchURl(nodesIMG, "src");
    WatchURl(nodesCSS, "href");
    WatchURl(nodesJS, "src");
    return document;
}
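
To make the different link cases concrete, here is a minimal sketch of how System.Uri resolves a link against the URL of the page it was found on. The class name LinkResolution, the helper Resolve and all example.com URLs are made up for illustration; this is just one way to get absolute URLs, not necessarily what every site needs.

```csharp
using System;

public static class LinkResolution
{
    // Resolves a link found in a page against that page's URL (hypothetical helper).
    public static string Resolve(string pageUrl, string link)
    {
        var baseUri = new Uri(pageUrl, UriKind.Absolute);
        return new Uri(baseUri, link).AbsoluteUri;
    }

    public static void Main()
    {
        const string page = "https://example.com/blog/post.html"; // made-up URL

        // Protocol-relative link: takes the scheme of the page.
        Console.WriteLine(Resolve(page, "//cdn.example.com/lib.js"));   // https://cdn.example.com/lib.js
        // Link relative to the page's folder.
        Console.WriteLine(Resolve(page, "css/site.css"));               // https://example.com/blog/css/site.css
        // Link relative to the site root.
        Console.WriteLine(Resolve(page, "/img/logo.png"));              // https://example.com/img/logo.png
        // Link that is already absolute: left unchanged.
        Console.WriteLine(Resolve(page, "http://other.example/a.png")); // http://other.example/a.png
    }
}
```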

I only download a single file rather than the entire website. This worked for me ;)
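
If the site's folder structure does need to be reproduced, one option is to derive the blob name from the resource URL, since "/" in a blob name acts as a virtual folder in Azure Blob storage. The following is only a sketch under assumptions and not part of the solution above: the class BlobPaths, the helper ToBlobName and the index.html fallback are hypothetical choices, and query strings are simply dropped.

```csharp
using System;

public static class BlobPaths
{
    // Maps a resource URL to a blob name that mirrors the site's folders (hypothetical helper).
    public static string ToBlobName(Uri resource)
    {
        // AbsolutePath excludes the query string, so URLs that differ only
        // by query would map to the same blob in this simplified sketch.
        string path = resource.AbsolutePath.TrimStart('/');

        // Folder-style URLs get a default file name (an assumption of this sketch).
        if (path.Length == 0 || path.EndsWith("/", StringComparison.Ordinal))
            path += "index.html";

        // Prefix with the host so resources from different hosts do not collide.
        return resource.Host + "/" + path;
    }

    public static void Main()
    {
        Console.WriteLine(ToBlobName(new Uri("https://example.com/")));                 // example.com/index.html
        Console.WriteLine(ToBlobName(new Uri("https://example.com/css/site.css?v=2"))); // example.com/css/site.css
        Console.WriteLine(ToBlobName(new Uri("https://example.com/blog/")));            // example.com/blog/index.html
    }
}
```

The resulting name could then be used as the blob name when uploading each downloaded file, so that the container listing mirrors the paths of the original site.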
