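I'm new to web scraping. I'm trying to scrape a web page, but I can't get it to work. I have tried adding different request headers. Here is my code: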
using System;
using System.Linq;
using System.Net.Http;
using System.Threading.Tasks;
using HtmlAgilityPack;

// Returns Task instead of void so the caller can await it and observe exceptions.
private static async Task GetPersonHtmlAsync()
{
    var url = "http://www.lrs.lt/sip/portal.show?p_r=8801&p_k=1&p_a=498&p_asm_id=51970";
    string html = await GetPageAsStringAsync(url);
    var htmlDocument = new HtmlDocument();
    htmlDocument.LoadHtml(html);
    Console.WriteLine(htmlDocument.ParsedText);
    // Guard against a missing <head> or <title> to avoid a NullReferenceException.
    var head = htmlDocument.DocumentNode.Descendants("head").FirstOrDefault();
    var title = head?.Descendants("title").FirstOrDefault();
    Console.WriteLine(title?.InnerText ?? "(no <title> found)");
}

public static async Task<string> GetPageAsStringAsync(string url)
{
    using (var client = new HttpClient())
    {
        // Pretend to be a desktop browser.
        client.DefaultRequestHeaders.Add("User-Agent",
            "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2; WOW64; Trident/6.0)");
        HttpResponseMessage response = await client.GetAsync(url);
        return await response.Content.ReadAsStringAsync();
    }
}
This is the response I get:
<html style="height:100%"><head><META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW"><meta name="format-detection" content="telephone=no"><meta name="viewport" content="initial-scale=1.0"><meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"></head><body style="margin:0px;height:100%"><iframe src="/_Incapsula_Resource?CWUDNSAI=9&xinfo=10-4507639-0%202CNN%20RT%281518808565894%2010%29%20q%280%20-1%20-1%20-1%29%20r%280%20-1%29%20B12%284%2c316%2c0%29&incident_id=723000330015808054-22736350026596890&edet=12&cinfo=04000000" frameborder=0 width="100%" height="100%" marginheight="0px" marginwidth="0px">Request unsuccessful. Incapsula incident ID: 723000330015808054-22736350026596890</iframe></body></html>
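The site appears to use a service that blocks requests from bots. I searched for a solution, and the only suggestion I could find was to change the request headers so that my call appears to come from a browser. That did not work.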
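To make the failure easier to spot in code, here is a minimal sketch that checks whether a downloaded page looks like the block page rather than the real content. The class name `BlockPageCheck` and the method `LooksBlocked` are hypothetical names made up for illustration. The two marker strings are copied from the response above, and treating them as a reliable signature of the block page is an assumption.

```csharp
using System;

public static class BlockPageCheck
{
    // Marker strings copied from the response shown in the question.
    // Assumption: they only appear on the block page, not on real content.
    private static readonly string[] Markers =
    {
        "_Incapsula_Resource",
        "Incapsula incident ID"
    };

    // Hypothetical helper: returns true when the HTML contains any marker.
    public static bool LooksBlocked(string html)
    {
        if (string.IsNullOrEmpty(html))
        {
            return false;
        }

        foreach (var marker in Markers)
        {
            // Case-insensitive search, in case the casing ever differs.
            if (html.IndexOf(marker, StringComparison.OrdinalIgnoreCase) >= 0)
            {
                return true;
            }
        }

        return false;
    }
}
```

Calling `BlockPageCheck.LooksBlocked(html)` on the string returned by `GetPageAsStringAsync` before parsing it would at least distinguish "request was blocked" from "page has no title". It does not get around the blocking itself.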