Parsing a Wikipedia introduction with PHP

Asked: 2011-05-03 10:28:27

Tags: php xml parsing wikipedia

I have read the other questions on this site, and I used the example answer given here:

wikipedia api: get parsed introduction only

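With that I can get the first section of a Wikipedia article. However, the first section includes images as well as text, and all I want is the text. This is the HTML in the output of my cURL response: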

 Array
(
[parse] => Array
    (
        [text] => Array
            (
                [*] => <div class="dablink">This article is about sports known as    football.  For the ball used in these sports, see <a href="/wiki/Football_(ball)">Football  (ball)</a>.</div> 
   <div class="thumb tright"> 
   <div class="thumbinner" style="width:227px;"><a href="/wiki/File:Football4.png" class="image"><img alt="" src="http://upload.wikimedia.org/wikipedia/commons/thumb/d/d2/Football4.png/225px-Football4.png" width="225" height="274" class="thumbimage" /></a>
   <div class="thumbcaption"> 
   <div class="magnify"><a href="/wiki/File:Football4.png" class="internal" title="Enlarge"><img src="http://bits.wikimedia.org/skins-1.17/common/images/magnify-clip.png" width="15" height="11" alt="" /></a></div>
   Some of the many different games known as football. From top left to bottom right:      <a href="/wiki/Association_football">Association football</a> or soccer, <a   href="/wiki/Australian_rules_football">Australian rules football</a>, <a  href="/wiki/International_rules_football">International rules football</a>, <a  href="/wiki/Rugby_Union" class="mw-redirect" title="Rugby Union">Rugby Union</a>, <a  href="/wiki/Rugby_League" class="mw-redirect" title="Rugby League">Rugby League</a>, and <a  href="/wiki/American_Football" class="mw-redirect" title="American Football">American   Football</a>.</div> 
  </div> 
  </div> 
  <p>The game of <b>football</b> is any of several similar <a href="/wiki/Team_sport"  title="Team sport">team sports</a>, of similar origins which involve advancing a ball into   a goal area in an attempt to score. Many of these involve <a href="/wiki/Kick_(football)"  title="Kick (football)">kicking</a> a ball with the foot to score a <a  href="/wiki/Goal_(sport)" title="Goal (sport)">goal</a>, though not all codes of football  using kicking as a primary means of advancing the ball or scoring. The most popular of these sports worldwide is <a href="/wiki/Association_football">association football</a>,   more commonly known as just "football" or "soccer". Unqualified, the word <i><a  href="/wiki/Football_(word)" title="Football (word)">football</a></i> applies to whichever  form of football is the most popular in the regional context in which the word appears,  including <a href="/wiki/American_football">American football</a>, <a href="/wiki/Australian_rules_football">Australian rules football</a>, <a  href="/wiki/Canadian_football">Canadian football</a>, <a  href="/wiki/Gaelic_football">Gaelic football</a>, <a href="/wiki/Rugby_league">rugby  league</a>, <a href="/wiki/Rugby_union">rugby union</a> and other related games. These variations are known as "codes".</p> 
    <div class="toclimit-3"></div> 

If it helps, the part I actually want is the text inside the paragraph tags (starting from the words "The game").

The URL I am fetching the data from is this:

 'http://en.wikipedia.org/w/api.php?action=parse&page='.$search.'&redirects=1&format=json&prop=text&section=0'
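For reference, here is a minimal sketch of how that URL could be built and how the HTML could be pulled out of the JSON response. The helper names `buildParseUrl` and `extractSectionHtml` are hypothetical, and the sample JSON is a stand-in for a real response so the sketch runs without network access. It assumes the response has the `[parse][text][*]` nesting shown in the dump above.

```php
<?php
// Hypothetical helper: builds the same API URL as above, but lets
// http_build_query() URL-encode the page title (spaces, punctuation).
function buildParseUrl(string $search): string
{
    $query = http_build_query([
        'action'    => 'parse',
        'page'      => $search,
        'redirects' => 1,
        'format'    => 'json',
        'prop'      => 'text',
        'section'   => 0,
    ]);
    return 'http://en.wikipedia.org/w/api.php?' . $query;
}

// Hypothetical helper: decodes the JSON body and follows the
// [parse][text][*] nesting; returns null if the JSON is invalid
// or the key is missing.
function extractSectionHtml(string $json): ?string
{
    $data = json_decode($json, true);
    return $data['parse']['text']['*'] ?? null;
}

// Stand-in for a real API response (assumption, not real output).
$sampleJson = '{"parse":{"text":{"*":"<p>The game of <b>football</b> ...</p>"}}}';

echo buildParseUrl('Football'), "\n";
echo extractSectionHtml($sampleJson), "\n";
```

In real use, the body returned by cURL would be passed to `extractSectionHtml` in place of `$sampleJson`.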

The sample code I tried:

 <?php

 include_once('simple_html_dom.php');

 $html = file_get_html('http://amazon.co.uk/');

 foreach($html->find('p') as $element)   
 {
 echo $element->plaintext . '<br>';
 }

 ?>

Unfortunately, this returns a blank page.

1 answer:

Answer 0: (score: 1)

Just download the Simple HTML DOM parser, then use it like this:

include_once('simple_html_dom.php');

$html = file_get_html('http://en.wikipedia.org/wiki/Football');

foreach($html->find('p') as $element)   
{
    echo $element->plaintext . '<br>';
    break;
}
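If the goal is to work on the HTML fragment returned by the API rather than the whole article page, something similar can probably be done with PHP's built-in `DOMDocument`, with no extra library. The following is a rough sketch under that assumption: `firstParagraphText` is a hypothetical helper name, and `$fragment` is a shortened stand-in for the `[parse][text][*]` string from the question. Because the image caption lives in a `<div>` and not in a `<p>`, it is skipped.

```php
<?php
// Hypothetical helper: returns the plain text of the first non-empty <p>
// in an HTML fragment, or null if the fragment has no such paragraph.
function firstParagraphText(string $html): ?string
{
    $doc = new DOMDocument();
    // Collect parser warnings internally instead of printing them.
    libxml_use_internal_errors(true);
    // The XML declaration prefix is a common way to hint UTF-8 to loadHTML().
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();

    foreach ($doc->getElementsByTagName('p') as $p) {
        // textContent drops the tags (<b>, <a>, ...) and keeps the text.
        $text = trim($p->textContent);
        if ($text !== '') {
            return $text;
        }
    }
    return null;
}

// Shortened stand-in for the API output: a thumbnail caption in a <div>,
// followed by the introduction paragraph.
$fragment = '<div class="thumb tright"><div class="thumbcaption">'
          . 'Some of the many different games known as football.</div></div>'
          . '<p>The game of <b>football</b> is any of several similar team sports.</p>';

echo firstParagraphText($fragment), "\n";
```

Returning on the first non-empty paragraph plays the same role as the `break` in the loop above.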