Words appear garbled when using the Wikipedia API server-side

Time: 2014-05-06 05:31:52

Tags: php curl wikipedia wikipedia-api

I am trying to get a short extract from a Wikipedia article. I open the following URL in my browser: http://en.wikipedia.org//w/api.php?action=query&prop=extracts&format=txt&exsentences=2&exlimit=10&exintro=&explaintext=&iwurl=&titles=Greek%20language

I get the following result in the browser:

Array
(
[query] => Array
    (
        [pages] => Array
            (
                [11887] => Array
                    (
                        [pageid] => 11887
                        [ns] => 0
                        [title] => Greek language
                        [extract] => Greek (Modern Greek: ελληνικά [eliniˈka] "Greek" and ελληνική γλώσσα [eliniˈci ˈɣlosa] ( ) "Greek language") is an independent branch of the Indo-European family of languages. Native to the southern Balkans, western Asia Minor, Greece, the Aegean Islands, and Cyprus it has the longest documented history of any Indo-European language, spanning 34 centuries of written records. 
                    )

            )

    )

)

Which is fine.

The problem is that when I fetch the same URL server-side in PHP with cURL, the non-English letters show up as gibberish. Here is how I am doing it:

$url = 'http://en.wikipedia.org//w/api.php?action=query&prop=extracts&format=txt&exsentences=2&exlimit=10&exintro=&explaintext=&iwurl=&titles=Greek%20language';
$ch = curl_init($url);
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt ($ch, CURLOPT_USERAGENT, "TestScript"); 
$c = curl_exec($ch);
echo $c;

This gives me the following result:

Array ( [query] => Array ( [pages] => Array ( [11887] => Array ( [pageid] => 11887 [ns] => 0 [title] => Greek language [extract] => Greek (Modern Greek: ελληνικά [eliniˈka] "Greek" and ελληνική γλώσσα [eliniˈci ˈɣlosa] ( ) "Greek language") is an independent branch of the Indo-European family of languages. Native to the southern Balkans, western Asia Minor, Greece, the Aegean Islands, and Cyprus it has the longest documented history of any Indo-European language, spanning 34 centuries of written records. ) ) ) )

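But the non-English text is gibberish. I get the same result with other articles that contain non-English text. How can I receive and display the non-English letters correctly?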

1 Answer:

Answer 0: (score: 1)

You need to set the header:

<?php
header('Content-Type: text/html;charset=utf-8'); //<--- Add this

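This is because the characters are Unicode (the response is UTF-8 encoded), so you need to explicitly set the header to declare that charset. Without it, the browser may guess a different encoding and render the multi-byte characters as gibberish.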
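To put the two pieces together, here is a minimal sketch of the question's script with the header added, assuming nothing has been output before `header()` is called. The helper names `fetchBody()` and `wrapForHtml()` are made up for illustration, and the 10-second timeout is an assumption. The URL is copied from the question unchanged; whether that endpoint and `format=txt` still respond the same way today is not verified here.

```php
<?php
// Sketch only: the question's request plus the answer's header() call.
// fetchBody() and wrapForHtml() are hypothetical helper names, not from the original post.

/** Fetch $url with cURL; returns the raw response body, or null if the request fails. */
function fetchBody(string $url): ?string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);    // return the body instead of printing it
    curl_setopt($ch, CURLOPT_USERAGENT, 'TestScript'); // same user agent as in the question
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);             // assumption: give up after 10 seconds
    $body = curl_exec($ch);
    return $body === false ? null : $body;
}

/** Escape text for HTML as UTF-8 and wrap it in <pre> so line breaks survive. */
function wrapForHtml(string $text): string
{
    return '<pre>' . htmlspecialchars($text, ENT_QUOTES, 'UTF-8') . '</pre>';
}

// URL copied unchanged from the question.
$url = 'http://en.wikipedia.org//w/api.php?action=query&prop=extracts&format=txt&exsentences=2&exlimit=10&exintro=&explaintext=&iwurl=&titles=Greek%20language';

// Declare the charset before any output is sent, as the answer suggests.
header('Content-Type: text/html; charset=utf-8');

$body = fetchBody($url);
echo $body === null ? 'Request failed.' : wrapForHtml($body);
```

The `<pre>` wrapper and the escaping go beyond what the answer proposes. They are included because the response is plain text, and a browser rendering it as HTML collapses the whitespace into a single line, as in the second output shown in the question.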