字符编码下载网页

时间:2015-04-16 15:42:32

标签: c# html encoding character-encoding

我要将网页下载到文本文件并分析单词。

他们是在不同的聊天集,iso-8859-1,windows-1252 ......我尝试了几个来自SO thisthis等的解决方案,但它们没有用,我还在读mí nimo(当然没有空格)我应该读mínimo或M& e acute; xico

有人可以让我走上正确的道路吗?谢谢!

public static string DownloadString(string address)
{
    string strWebPage = "";
    // create request
    System.Net.WebRequest objRequest = System.Net.HttpWebRequest.Create(address);
    // get response
    System.Net.HttpWebResponse objResponse;
    objResponse = (System.Net.HttpWebResponse)objRequest.GetResponse();
    // get correct charset and encoding from the server's header
    string Charset = objResponse.CharacterSet;
    Encoding encoding = Encoding.GetEncoding(Charset);

    // read response into memory stream
    MemoryStream memoryStream;
    using (Stream responseStream = objResponse.GetResponseStream())
    {
        memoryStream = new MemoryStream();

        byte[] buffer = new byte[1024];
        int byteCount;
        do
        {
            byteCount = responseStream.Read(buffer, 0, buffer.Length);
            memoryStream.Write(buffer, 0, byteCount);
        } while (byteCount > 0);
    }

    // set stream position to beginning
    memoryStream.Seek(0, SeekOrigin.Begin);

    StreamReader sr = new StreamReader(memoryStream, encoding);
    strWebPage = sr.ReadToEnd();

    // Check real charset meta-tag in HTML
    int CharsetStart = strWebPage.IndexOf("charset=");
    if (CharsetStart > 0)
    {
        CharsetStart += 8;
        int CharsetEnd = strWebPage.IndexOfAny(new[] { ' ', '\"', ';' }, CharsetStart);
        string RealCharset =
               strWebPage.Substring(CharsetStart, CharsetEnd - CharsetStart);

        // real charset meta-tag in HTML differs from supplied server header???
        if (RealCharset != Charset)
        {
            // get correct encoding
            Encoding CorrectEncoding = Encoding.GetEncoding(RealCharset);

            // reset stream position to beginning
            memoryStream.Seek(0, SeekOrigin.Begin);

            // reread response stream with the correct encoding
            StreamReader sr2 = new StreamReader(memoryStream, CorrectEncoding);

            strWebPage = sr2.ReadToEnd();
            // Close and clean up the StreamReader
            sr2.Close();
        }
    }

    // dispose the first stream reader object
    sr.Close();

    return strWebPage;
}

1 个答案:

答案 0 :(得分:0)

这不是编码问题,像í这些有趣的字符串被称为HTML entities

转换为正确的编码后,使用HttpUtility.HttpDecode(来自System.Web程序集)转换html实体。