How to scrape a web page after logging in

Asked: 2019-02-13 15:53:39

Tags: c# web-scraping html-agility-pack

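I can log in to the site without problems and navigate to the page at urlCanada. However, when I try to load that page into htmlCanada and inspect it in the debugger, it contains the HTML of the login screen instead of the HTML of the page I navigated to. Am I missing something? If I pass the navigated page's URL to GetStringAsync to fill htmlCanada, why does it return the login page?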

        var urlCanada = webBrowserCanada.Url;
        //Creates a client for you to store the webpage in
        var httpClientCanada = new HttpClient();
        var htmlCanada = await httpClientCanada.GetStringAsync(urlCanada);
        //Allows parsing the information out
        var htmlDocumentCanada = new HtmlAgilityPack.HtmlDocument();
        htmlDocumentCanada.LoadHtml(htmlCanada);
        //Parse the information
        var ProductsHtml = htmlDocumentCanada.DocumentNode
           .SelectSingleNode("//table[@id='tableid']")
            .Descendants("tr")
            .Skip(1)
            .Where(tr => tr.Elements("td").Count() > 1)
            .Select(tr => tr.Elements("td").Select(td => td.InnerText.Trim()).ToList())
            .ToList();

This is the table's HTML:

<table class="GridViewMFG" rules="all" id="ctl00_mainContent_GridViewIssuedParts" style="width:100%;border-collapse:collapse;" cellspacing="0" cellpadding="4" border="1">
</table>

P.S. When I debug and inspect webBrowserCanada.Url, it points to the page I navigated to, not the login page.

1 Answer:

Answer 0 (score: 0)

I was able to solve this fairly easily. Since the webBrowserCanada control already had the page I needed loaded, I removed these two lines of code:

        var httpClientCanada = new HttpClient();
        var htmlCanada = await httpClientCanada.GetStringAsync(urlCanada);

and replaced them with

        var htmlCanada = webBrowserCanada.DocumentText;

So now the whole code reads

        var htmlCanada = webBrowserCanada.DocumentText;
        //Allows parsing the information out
        var htmlDocumentCanada = new HtmlAgilityPack.HtmlDocument();
        htmlDocumentCanada.LoadHtml(htmlCanada);
        //Parse the information
        var ProductsHtml = htmlDocumentCanada.DocumentNode
           .SelectSingleNode("//table[@id='tableid']")
            .Descendants("tr")
            .Skip(1)
            .Where(tr => tr.Elements("td").Count() > 1)
            .Select(tr => tr.Elements("td").Select(td => td.InnerText.Trim()).ToList())
            .ToList();
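
For context on why the original version returned the login page: a newly created HttpClient does not share the WebBrowser control's session cookies, so the server most likely treated the request as coming from an anonymous visitor and served the login screen. Reading DocumentText avoids a second request altogether. If a separate HTTP request is still wanted, the following is only a rough sketch of carrying cookies over to HttpClient. The SessionCookies class and its method names are hypothetical helpers, not part of any library, and the approach assumes the site's session cookies are visible through webBrowserCanada.Document.Cookie. Cookies marked HttpOnly are not exposed there, in which case this will not work.

```csharp
using System;
using System.Net;
using System.Net.Http;

// Hypothetical helper for illustration only; not part of any library.
public static class SessionCookies
{
    // Builds a CookieContainer from a "name=value; name2=value2" string,
    // which is the format returned by WebBrowser.Document.Cookie.
    public static CookieContainer FromDocumentCookie(Uri pageUri, string documentCookie)
    {
        var container = new CookieContainer();
        if (string.IsNullOrWhiteSpace(documentCookie))
        {
            return container;
        }

        foreach (var pair in documentCookie.Split(';'))
        {
            var separator = pair.IndexOf('=');
            if (separator <= 0)
            {
                continue; // skip entries without a name or without '='
            }

            var name = pair.Substring(0, separator).Trim();
            var value = pair.Substring(separator + 1).Trim();
            if (name.Length == 0)
            {
                continue;
            }

            // Assumption: scope every cookie to the page's host and the root path.
            container.Add(new Cookie(name, value, "/", pageUri.Host));
        }

        return container;
    }

    // Creates an HttpClient that sends the given cookies with its requests.
    public static HttpClient CreateClient(CookieContainer cookies)
    {
        var handler = new HttpClientHandler
        {
            CookieContainer = cookies,
            UseCookies = true
        };
        return new HttpClient(handler);
    }
}
```

With this in place, the client from the question could be created as SessionCookies.CreateClient(SessionCookies.FromDocumentCookie(webBrowserCanada.Url, webBrowserCanada.Document.Cookie)) before calling GetStringAsync. Since many sites mark their login cookies HttpOnly, using DocumentText as shown above remains the simpler route.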