Text is corrupted after downloading data from a website

Time: 2016-10-18 18:51:28

Tags: c#

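I need to download content from a website and display it in a richTextBox. The problem is that after I download the content and filter it, the text comes out corrupted. How can I fix this? Here is my code: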

String website = "https://www.basketnews.lt/news-102294-nba-klubu-vadovai-finalas-nesikeis-mvp-iskovos-jamesas.html";

// MyWebClient is my own class (definition not shown); presumably a WebClient subclass.
MyWebClient webClientObj = new MyWebClient();
webClientObj.Encoding = System.Text.Encoding.UTF8;
String data = webClientObj.DownloadString(website);

// Parse the downloaded markup with HtmlAgilityPack.
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(data);

// Append every paragraph inside <div class="text"> to the rich text box.
foreach (HtmlAgilityPack.HtmlNode node2 in doc.DocumentNode.SelectNodes("//div[@class= 'text']//p"))
{
  string content = node2.InnerText;
  this.richTextBox1.AppendText('\t' + content + '\n');
}

I want it to look like this:

Desktop app with Richtext box with extracted text from the page - expected

Currently it looks like this:

Desktop app with Richtext box with extracted text from the page - wrong

1 answer:

Answer 0: (score: 2)

The text contains HTML-encoded parts (character entities). Run it through HtmlDecode:

var content = System.Web.HttpUtility.HtmlDecode(node2.InnerText);
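
To illustrate the decoding step on its own, here is a minimal sketch. It uses System.Net.WebUtility.HtmlDecode, which should behave the same way for this purpose and does not need a reference to the System.Web assembly. The helper name DecodeNodeText and the sample string are assumptions made for illustration; they do not come from the page in the question.

```csharp
using System;
using System.Net;

public static class HtmlDecodeDemo
{
    // Hypothetical helper: turns HTML entities such as "&#353;" or "&amp;"
    // back into the characters they stand for.
    public static string DecodeNodeText(string innerText)
    {
        return WebUtility.HtmlDecode(innerText);
    }

    public static void Main()
    {
        // Made-up sample input: InnerText returns entities undecoded,
        // so a Lithuanian "š" may arrive as the numeric entity "&#353;".
        string raw = "&#353;iandien &amp; rytoj";
        Console.WriteLine(DecodeNodeText(raw));
    }
}
```

In the loop from the question, the same call would wrap node2.InnerText before the text is passed to AppendText.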