Question

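Following the example in the post below, I created my HttpWebRequest to list the files in a web server directory: C# HttpWebRequest command to get directory listing

I'm trying to use that example to list the files on my web server. I can list the files from the sample server referenced in that post, but for my own server only the last file added is shown. My code is just like the example there. I noticed that my HTML is a bit different. Does anyone have an idea?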

<html>
<head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
    <title>186.215.156.154 - /download/Zatix/Zatix - Satisfação Geral/</title>
</head>
<body>
    <h1>
        186.215.156.154 - /download/Zatix/Zatix - Satisfação Geral/</h1>
    <hr>
    <pre>
    <a href="/download/Zatix/">[Para a pasta superior]</a>
    <br>
    <br>
    sexta-feira, 19 de novembro de 2010    11:17        52355 <a href="/download/Zatix/Zatix%20-%20Satisfa%C3%A7%C3%A3o%20Geral/Zatix%20-%20Satisfa%C3%A7%C3%A3o%20Geral_3_00.zip">Zatix - Satisfação Geral_3_00.zip</a><br>sexta-feira, 19 de novembro de 2010    11:17        52355 <a href="/download/Zatix/Zatix%20-%20Satisfa%C3%A7%C3%A3o%20Geral/Zatix%20-%20Satisfa%C3%A7%C3%A3o%20Geral_4_00.zip">Zatix - Satisfação Geral_4_00.zip</a>
    <br>
</pre>
    <hr>
</body>
</html>

I think I have to change something in the regex returned by the GetDirectoryListingRegexForUrl method.

My code looks like this:

private string GetDirectoryListingRegexForUrl(string url)
{
    if (url.Equals(Url))
    {
        return "<A HREF=\".*\">(?<name>.*)</A>";                   
    }
    throw new NotSupportedException();
}

public void ListStudies()
{
    Url = BaseUrl + this.clientName + "/" + this.activeStudy + "/";
    Console.WriteLine(Url);
    CookieContainer cookies;
    HttpWebResponse response;
    HttpWebRequest req = (HttpWebRequest)System.Net.WebRequest.Create(Url);            

    req.Credentials = _NetworkCredential;
    req.CookieContainer = new CookieContainer();
    req.AllowAutoRedirect = true;
    cookies = req.CookieContainer;

    try
    {
        response = (HttpWebResponse)req.GetResponse();

        if (response.StatusCode != HttpStatusCode.OK)
            Console.WriteLine("URL NÃO RESPONDEU");
        else
            Console.WriteLine("URL OK");

        using (response)
        {
            using (StreamReader reader = new StreamReader(response.GetResponseStream()))
            {
                string html = reader.ReadToEnd();
                Regex regex = new Regex(GetDirectoryListingRegexForUrl(Url));
                MatchCollection matches = regex.Matches(html);                                             

                if (matches.Count > 0)
                {
                    foreach (Match match in matches)
                    {
                        if (match.Success)
                        {
                            Console.WriteLine(match.Groups["name"]);                                    
                        }                                
                    }
                }
            }
        }
    }
    catch (Exception e)
    {
        MessageBox.Show(e.Message, "Update Error", MessageBoxButtons.OK, MessageBoxIcon.Error);
    }            
}

I hope you can help me! Thanks.

Answer 1

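There are two main problems here.

1) The output of a request like this is completely arbitrary and is not even guaranteed to exist. It is entirely up to the server.

2) Regular expressions are not a suitable means of parsing HTML or any similar structure, because HTML is not a regular grammar. Assuming your response is reliable at all, your best option is to rely on something like HtmlAgilityPack to coerce it into a strict XHTML document (which may not even be necessary, if you are lucky) and then read it as an XML document, using XPath queries to extract what you are interested in.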
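As a rough illustration of the second point, the sketch below loads the listing with HtmlAgilityPack and selects the links with an XPath query. It assumes the HtmlAgilityPack NuGet package is installed; the class name `DirectoryListingParser`, the method name `GetFileNames`, and the rule that an `href` ending in `/` denotes a folder are illustrative assumptions, not part of the original code.

```csharp
// Requires the third-party HtmlAgilityPack NuGet package.
using System;
using System.Collections.Generic;
using System.Net;
using HtmlAgilityPack;

public static class DirectoryListingParser
{
    // Hypothetical helper: returns the link text of every file entry in a listing page.
    public static List<string> GetFileNames(string html)
    {
        var names = new List<string>();
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a[@href]");
        if (links == null)
            return names;

        foreach (HtmlNode link in links)
        {
            string href = link.GetAttributeValue("href", "");

            // Assumption: an href ending in "/" is a folder (such as the
            // parent-folder link at the top of the listing), so skip it.
            if (href.EndsWith("/"))
                continue;

            // InnerText may still contain HTML entities; decode them.
            names.Add(WebUtility.HtmlDecode(link.InnerText).Trim());
        }
        return names;
    }
}
```

In `ListStudies`, the `html` string read from the response could be passed to `GetFileNames` in place of the regex loop. This still depends on the server's listing format, as the first point notes.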

Answer 2

Here is the correct regular expression:

<A HREF=\".*?\">(?<name>.*?)</A>

Compare it with the original:

<A HREF=\".*\">(?<name>.*)</A>

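The problem is that the repetition operator .* is greedy by default. Greedy means that when looking for a match, the regex extends as far as it can. So it starts at the first <A and finishes at the last A> in the string, taking everything in between. "Everything" includes the other <A...A> pairs in the middle.

You need to specify that the repetition operator is lazy. You do this by adding a ? after .*, which gives .*?.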
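The difference can be seen with a small sketch that runs both patterns over a shortened, made-up listing line. The class name `GreedyVsLazyDemo`, the method `ExtractNames`, and the sample file names are hypothetical; `RegexOptions.IgnoreCase` is added here as an assumption, because the HTML shown in the question uses lowercase `<a href=...>` while the pattern is uppercase.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class GreedyVsLazyDemo
{
    // Returns the "name" group of every match of `pattern` in `html`.
    public static List<string> ExtractNames(string html, string pattern)
    {
        var names = new List<string>();

        // IgnoreCase lets the uppercase <A HREF=...> pattern match
        // lowercase <a href=...> markup as well.
        foreach (Match m in Regex.Matches(html, pattern, RegexOptions.IgnoreCase))
        {
            names.Add(m.Groups["name"].Value);
        }
        return names;
    }

    // The two patterns discussed in the answer.
    public const string Greedy = "<A HREF=\".*\">(?<name>.*)</A>";
    public const string Lazy = "<A HREF=\".*?\">(?<name>.*?)</A>";

    // Made-up sample: two file links on a single line, as in the question's listing.
    public const string SampleHtml =
        "11:17 52355 <a href=\"/d/a.zip\">a.zip</a><br>" +
        "11:17 52355 <a href=\"/d/b.zip\">b.zip</a>";
}
```

With `SampleHtml`, `ExtractNames(SampleHtml, Greedy)` yields a single entry, `b.zip`, because the first `.*` runs to the last `">` on the line. `ExtractNames(SampleHtml, Lazy)` yields both `a.zip` and `b.zip`. That matches the symptom described in the question, where only the last file was listed.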

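P.S. Parsing HTML with regular expressions is a bad idea. It is fine if you need a quick and dirty fix, but not as a long-term solution. Beyond that, in your case the output will vary from server to server and between server versions, so the code will not work universally. Consider another approach, such as negotiating directly with the server to get the directory listing (if you have access, of course).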

Finally, some interesting reading on the topic:

Parsing Html The Cthulhu Way

RegEx match open tags except XHTML self-contained tags
