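Wikipedia JSON API: retrieving page content without links

Date: 2012-05-21 05:11:20

Tags: iphone objective-c xml json wikipedia-api

I'm using the Wikipedia JSON API, and I want to retrieve page content without the link markup. For example, this request: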

https://en.wikipedia.org/w/api.php?action=query&format=json&titles=May_21&prop=revisions&rvprop=content&rvsection=1

returns content such as:

[[293]] – Roman Emperors [[Diocletian]] and [[Maximian]] appoint [[Galerius]] as [[Caesar (title)|''Caesar'']] to Diocletian, beginning the period of four rulers known as the [[Tetrarchy]].

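&ndash; should be replaced with -

[[Caesar (title)|''Caesar'']] should become Caesar

I'm using Objective-C.

How can I retrieve the same page content, but without the link characters?

Thanks!

4 Answers:

Answer 0: (score: 2)

Use an HTML-to-text converter (for example links, or a browser emulator such as PhantomJS). That is less painful than converting wikitext to text, where you would also have to deal with templates.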
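As a rough illustration of the HTML-to-text route, here is a minimal JavaScript sketch. It assumes the rendered HTML has already been fetched (for instance with action=parse), and the htmlToText helper and its small entity table are hypothetical names introduced for this example. A real converter such as links handles far more cases than this regex-based approach.

```javascript
// Naive HTML-to-text conversion: a sketch only, not a full HTML parser.
function htmlToText(html) {
  // A small, non-exhaustive entity table (assumption: extend as needed).
  const entities = {
    '&ndash;': '-', '&mdash;': '-', '&nbsp;': ' ',
    '&lt;': '<', '&gt;': '>', '&quot;': '"', '&#39;': "'"
  };
  return html
    // Drop script/style blocks together with their content.
    .replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1>/gi, '')
    // Drop all remaining tags but keep the text between them.
    .replace(/<[^>]+>/g, '')
    // Decode the entities listed in the table above.
    .replace(/&(?:ndash|mdash|nbsp|lt|gt|quot|#39);/g, m => entities[m])
    // Decode &amp; last so that "&amp;lt;" is not decoded twice.
    .replace(/&amp;/g, '&')
    // Collapse runs of spaces and tabs.
    .replace(/[ \t]+/g, ' ')
    .trim();
}

const html = '<li><a href="/wiki/293" title="293">293</a> &ndash; Roman Emperors ' +
  '<a href="/wiki/Diocletian">Diocletian</a> and <a href="/wiki/Maximian">Maximian</a> ' +
  'appoint <a href="/wiki/Galerius">Galerius</a> as <i>Caesar</i>.</li>';
console.log(htmlToText(html));
```

Because the links arrive as ordinary anchor tags, no wikitext-specific handling is needed; templates have already been expanded by the server.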

Answer 1: (score: 1)

This should do it :-)

NSString *stringToParse = @"{\"query\":{\"normalized\":[{\"from\":\"May_21\",\"to\":\"May 21\"}],\"pages\":{\"19684\":{\"pageid\":19684,\"ns\":0,\"title\":\"May 21\",\"revisions\":[{\"*\":\"==Events==\\n* [[293]] &ndash; Roman Emperors [[Diocletian]] and [[Maximian]] appoint [[Galerius]] as [[Caesar (title)|''Caesar'']] to Diocletian, beginning the period of four rulers known as the [[Tetrarchy]].\\n* [[878]] &ndash; [[Syracuse, Italy]], is [[Muslim conquest of Sicily|captured]] by the ...";

// Replace the whole entity "&ndash;" (including the semicolon) with "-"
stringToParse = [stringToParse stringByReplacingOccurrencesOfString:@"&ndash;" withString:@"-"];

// [[Caesar (title)|''Caesar'']] should become Caesar,
// [[Maximian]] should become Maximian,
// and likewise [[1972]] -> 1972.
// Group 1 is the optional "target|" prefix (any characters except "]" and "|"),
// groups 2 and 4 are the optional '' italic markers, group 3 is the visible text.
NSString *regexToReplaceWikiLinks = @"\\[\\[([^\\]|]+?\\|)?(\\'\\')?(.+?)(\\'\\')?\\]\\]";

NSError *error = nil;
NSRegularExpression *regex = [NSRegularExpression regularExpressionWithPattern:regexToReplaceWikiLinks
                                                                       options:NSRegularExpressionCaseInsensitive
                                                                         error:&error];

// Note: each match is replaced with the contents of the third capture group
NSString *modifiedString = [regex stringByReplacingMatchesInString:stringToParse
                                                           options:0
                                                             range:NSMakeRange(0, [stringToParse length])
                                                      withTemplate:@"$3"];

NSLog(@"%@", modifiedString);

Result:

{"query":{"normalized":[{"from":"May_21","to":"May 21"}],"pages":{"19684":{"pageid":19684,"ns":0,"title":"May 21","revisions":[{"*":"==Events==\n* 293 - Roman Emperors Diocletian and Maximian appoint Galerius as Caesar to Diocletian, beginning the period of four rulers known as the Tetrarchy.\n* 878 - Syracuse, Italy, is captured by the ...
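The code above runs the regular expression over the raw JSON string. It is usually safer to decode the JSON first and clean only the revision text, so that keys and escape sequences are left untouched. The sketch below shows that extraction step in JavaScript; extractWikitext is a hypothetical helper name, and the response shape is assumed to match the result printed above.

```javascript
// Pull the wikitext of every returned page out of an action=query response.
// Assumption: `response` has the shape shown in the result above.
function extractWikitext(response) {
  const pages = (response.query && response.query.pages) || {};
  return Object.keys(pages).map(function (id) {
    const revisions = pages[id].revisions || [];
    // In this response format the revision text is stored under the "*" key.
    return revisions.length > 0 ? revisions[0]['*'] : '';
  });
}

// A shortened copy of the response, used only for illustration.
const raw = '{"query":{"pages":{"19684":{"pageid":19684,"ns":0,"title":"May 21",' +
  '"revisions":[{"*":"==Events==\\n* [[293]] &ndash; Roman Emperors [[Diocletian]]"}]}}}}';

const texts = extractWikitext(JSON.parse(raw));
console.log(texts[0]);
```

The link-stripping replacement can then be applied to each string in texts instead of to the whole payload.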

Answer 2: (score: 0)

Regular expressions are the way to solve this. Here is an example in JavaScript (the same approach works in any language that has regular expressions):

<dl>
    <script type="text/javascript">

        var source = "[[293]] &ndash; Roman Emperors [[Diocletian]] and [[Maximian]] appoint [[Galerius]] as [[Caesar (title)|''Caesar'']] to Diocletian, beginning the period of four rulers known as the [[Tetrarchy]].";

        document.writeln('<dt> Original </dt>');
        document.writeln('<dd>' + source + '</dd>');

        // Replace piped links [[target|label]] with their label;
        // the '' italic markers around the label are optional
        var matchTitles = /\[\[([^\]|]+?)\|(?:'')?(.+?)(?:'')?\]\]/ig; /* <- Answer */
        source = source.replace(matchTitles, '$2');

        document.writeln('<dt> First Pass </dt>');
        document.writeln('<dd style="color: green;">' + source + '</dd>');

        // Replace links with contents
        var matchLinks = /\[\[(.+?)\]\]/ig;
        source = source.replace(matchLinks, '$1');

        document.writeln('<dt> Second Pass </dt>');
        document.writeln('<dd>' + source + '</dd>');
    </script>
</dl>

You can also see this in action here: http://jsfiddle.net/NujmB/
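The two passes can also be folded into one reusable function that does not depend on the browser. This is only a sketch: stripWikiLinks is a hypothetical name, it strips every run of two or more apostrophes on the assumption that they are bold or italic markup, and it does not attempt to handle templates or nested File: links.

```javascript
// Remove simple wiki link markup, keeping only the visible text.
function stripWikiLinks(wikitext) {
  return wikitext
    // The question asks for &ndash; to become a plain hyphen.
    .replace(/&ndash;/g, '-')
    // [[target|label]] -> label, [[target]] -> target
    .replace(/\[\[(?:[^\]|]*\|)?([^\]]+?)\]\]/g, '$1')
    // ''italic'' and '''bold''' markers -> plain text (assumed to be markup)
    .replace(/'{2,}/g, '');
}

const sample = "[[293]] &ndash; Roman Emperors [[Diocletian]] and [[Maximian]] appoint " +
  "[[Galerius]] as [[Caesar (title)|''Caesar'']] to Diocletian, beginning the period " +
  "of four rulers known as the [[Tetrarchy]].";
console.log(stripWikiLinks(sample));
```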

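Answer 3: (score: 0)

I don't know Objective-C, but here is the JavaScript code I use for the same purpose
(it can serve as pseudocode for you, and may help other users coming from JavaScript).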

var url = 'http://en.wikipedia.org/w/api.php?callback=?&action=parse&page=facebook&prop=text&format=json&section=0';
// section=0 takes the first section of the wiki page, i.e. the introduction only
$.getJSON(url, function (response) {
    // Take only the first paragraph of the introduction
    var intro = $(response.parse.text['*']).filter('p:eq(0)').html();
    var wikiBox = $('#wikipediaBox .wikipedia div.overview');
    wikiBox.empty().html(intro);
    // Convert relative links into absolute ones and open them in a new window
    wikiBox.find("a:not(.references a)").attr("href", function () { return "http://www.wikipedia.org" + $(this).attr("href"); });
    wikiBox.find("a").attr("target", "_blank");
    // Remove footnote reference markers
    wikiBox.find('sup.reference').remove();
});
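The request URL in this answer is written out by hand. As a small sketch, the same query string can be assembled from named parameters so that page titles are escaped correctly. buildParseUrl is a hypothetical helper, and the jQuery-specific callback=? parameter (used for JSONP) is deliberately left out here.

```javascript
// Build an action=parse request URL for a given page and section.
function buildParseUrl(page, section) {
  const params = new URLSearchParams({
    action: 'parse',
    page: page,
    prop: 'text',
    format: 'json',
    section: String(section) // 0 selects the introduction
  });
  return 'https://en.wikipedia.org/w/api.php?' + params.toString();
}

console.log(buildParseUrl('facebook', 0));
console.log(buildParseUrl('May 21', 1));
```

URLSearchParams keeps the parameters in insertion order and encodes a space in a title as "+".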