Simple HTML DOM Parser - skip elements with a specific ID

Date: 2014-08-08 14:20:40

Tags: dom simple-html-dom

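I'm using Simple HTML DOM Parser to query Google for particular keywords and then loop through the results. However, I don't want to pick up the ads or the news box. The ads are easy to exclude because their list elements have a different class, but the newsbox li element has the same class as a normal result, plus an additional ID.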

Result li element

<li class="g">...</li>

Newsbox li element

<li class="g" id="newsbox">...</li>

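How can I exclude the li element with the ID newsbox?

I had a read around on here, and following another person's advice this is the closest I got, but it didn't work: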

$query = file_get_html('https://google.com/search?q=test');    
$li_elements = $query->find('li[class=g id!=newsbox]');

Any other ideas, or has someone solved this problem before?

Update

I'm still working on it and I'm almost there. Here is my latest code:

include('simple_html_dom.php');

$html = file_get_html('https://www.google.co.uk/search?q=football');

// Collect the title and link of every result heading
$articles = array();
foreach($html->find('#res h3.r') as $article) {
    $item['title']     = $article->plaintext;
    $item['intro']    = $article->find('a', 0)->href;
    $articles[] = $item;
}

print_r($articles);

This is the printed array:

Array
(
[0] => Array
    (
        [title] => BBC Sport - Football
        [intro] => /url?q=http://www.bbc.co.uk/sport/0/football/&amp;sa=U&amp;ei=NkblU-s8h6nQBcCJgOAI&amp;ved=0CBQQFjAA&amp;usg=AFQjCNGHTFqXJoRjHKBSCdKFiW_BX6eGDw
    )

[1] => Array
    (
        [title] => News for football
        [intro] => /search?q=football&amp;ie=UTF-8&amp;prmd=ivnsl&amp;source=univ&amp;tbm=nws&amp;tbo=u&amp;sa=X&amp;ei=NkblU-s8h6nQBcCJgOAI&amp;ved=0CB8QqAI
    )

[2] => Array
    (
        [title] => Football Games, Results, Scores, Transfers, News | Sky Sports
        [intro] => /url?q=http://www1.skysports.com/football/&amp;sa=U&amp;ei=NkblU-s8h6nQBcCJgOAI&amp;ved=0CCgQFjAE&amp;usg=AFQjCNE4VP4WAHIYJAoPIBJoUx1pC-1jBA
    )

[3] => Array
    (
        [title] => Local business results for football near London NW5
        [intro] => https://maps.google.co.uk/maps?um=1&amp;ie=UTF-8&amp;fb=1&amp;gl=uk&amp;q=football&amp;hq=football&amp;hnear=0x48761a535791ef6f:0x493f677c231558c8,London+NW5&amp;sa=X&amp;ei=NkblU-s8h6nQBcCJgOAI&amp;ved=0CC4QtQM
    )

[4] => Array
    (
        [title] => Football news, match reports and fixtures | Football | The Guardian
        [intro] => /url?q=http://www.theguardian.com/football&amp;sa=U&amp;ei=NkblU-s8h6nQBcCJgOAI&amp;ved=0CE4QFjAM&amp;usg=AFQjCNHPhgIljb53cFPRHlb1vCa1fmWJag
    )

[5] => Array
    (
        [title] => NewsNow: Football News | Breaking News &amp; Search 24/7
        [intro] => /url?q=http://www.newsnow.co.uk/h/Sport/Football&amp;sa=U&amp;ei=NkblU-s8h6nQBcCJgOAI&amp;ved=0CFQQFjAN&amp;usg=AFQjCNEmmlrEayvHdebKTfPykGhHxRioLA
    )

[6] => Array
    (
        [title] => Football365 - Football News, Views, Gossip and much more...
        [intro] => /url?q=http://www.football365.com/&amp;sa=U&amp;ei=NkblU-s8h6nQBcCJgOAI&amp;ved=0CFoQFjAO&amp;usg=AFQjCNFKIP3xgtxw9DhNtOhVfpT4pbpLPw
    )

[7] => Array
    (
        [title] => Football - Wikipedia, the free encyclopedia
        [intro] => /url?q=http://en.wikipedia.org/wiki/Football&amp;sa=U&amp;ei=NkblU-s8h6nQBcCJgOAI&amp;ved=0CGAQFjAP&amp;usg=AFQjCNF2Fk8WH4rzEvWzmYIEUycZnjvjpg
    )

[8] => Array
    (
        [title] => Football in London - Things To Do - visitlondon.com
        [intro] => /url?q=http://www.visitlondon.com/things-to-do/whats-on/sport/football&amp;sa=U&amp;ei=NkblU-s8h6nQBcCJgOAI&amp;ved=0CGYQFjAQ&amp;usg=AFQjCNEdSNJc-mlVpaWEY9yPjcoDSaDLIw
    )

[9] => Array
    (
        [title] => London Football Leagues - 5-a-side - 7-a-side - 11-a-side - Midweek ...
        [intro] => /url?q=http://www.londonfootball.co.uk/&amp;sa=U&amp;ei=NkblU-s8h6nQBcCJgOAI&amp;ved=0CHMQFjAR&amp;usg=AFQjCNGnZtZQxUmUYQtDF0Tj5nJRnR2Yig
    )

[10] => Array
    (
        [title] => Football Tickets and Event Details | Ticketmaster UK Sport
        [intro] => /url?q=http://www.ticketmaster.co.uk/browse/football-catid-11/sport-rid-10004&amp;sa=U&amp;ei=NkblU-s8h6nQBcCJgOAI&amp;ved=0CHkQFjAS&amp;usg=AFQjCNFwTfpq-klboIEf0EbhlMQWvzHeKQ
    )

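I don't understand why the second result, array[1][title], is stored in the array, because according to the line $html->find('#res h3.r') as $article it shouldn't be there. It is neither contained in the div with id #res nor inside an h3 tag.

Any ideas?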
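
In the printed array above, the ordinary results all have an href starting with /url?q=, while the news and local-business entries do not. One possible way to drop the unwanted entries after scraping is to filter on that prefix. The following is only a minimal sketch under that assumption: the sample entries are shortened copies of three items from the output, and the prefix check is a heuristic taken from this one result page, not documented Google behaviour.

```php
<?php
// Shortened sample entries taken from the printed array above.
$articles = array(
    array('title' => 'BBC Sport - Football',
          'intro' => '/url?q=http://www.bbc.co.uk/sport/0/football/'),
    array('title' => 'News for football',
          'intro' => '/search?q=football&tbm=nws'),
    array('title' => 'Local business results for football near London NW5',
          'intro' => 'https://maps.google.co.uk/maps?q=football'),
);

// Keep only the entries whose link starts with the assumed "/url?q=" prefix.
$organic = array();
foreach ($articles as $item) {
    if (strpos($item['intro'], '/url?q=') === 0) {
        $organic[] = $item;
    }
}

print_r($organic);
```

Filtering on the link keeps the scraping selector unchanged, at the cost of relying on a URL pattern that may change.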

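2 answers:

Answer 0: (score: 0)

Unfortunately, Simple HTML DOM Parser doesn't offer this kind of flexibility, but there is a workaround.

You can remove the unwanted block first and then retrieve the ones you want: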

    $query->find('li#newsbox', 0)->outertext = '';
    $li_elements = $query->find('li.g');

Edit:

Here is sample code showing how it works:

    include('simple_html_dom.php');

    // Sample markup: one node to discard and one node to keep
    $input = <<<_DATA_
    <div class="g" id="newsbox">Bad node</div>
    <div class="g">Useful node</div>
    _DATA_;
    
    // Create a DOM object
    $html = new simple_html_dom();
    // Load HTML from a string
    $html->load($input);
    
    // Remove the bad node
    $html->find('div#newsbox', 0)->outertext = ''; // Comment this line to print the original html content
    
    echo $html;
    

    Working code
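
For comparison, the same "remove first, then retrieve" idea can be sketched with PHP's built-in DOM extension instead of Simple HTML DOM. This is an illustration under assumptions, not the answerer's code: the sample markup and the variable names are hypothetical, and only standard DOMDocument and DOMXPath calls are used.

```php
<?php
// Hypothetical sample markup mirroring the two kinds of <li> in the question.
$input = '<ol>'
       . '<li class="g" id="newsbox">News box</li>'
       . '<li class="g">First result</li>'
       . '<li class="g">Second result</li>'
       . '</ol>';

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($input);
$xpath = new DOMXPath($doc);

// Step 1: detach the unwanted node from the tree.
foreach (iterator_to_array($xpath->query('//li[@id="newsbox"]')) as $node) {
    $node->parentNode->removeChild($node);
}

// Step 2: retrieve the remaining <li> elements that carry the class "g".
$classG = '//li[contains(concat(" ", normalize-space(@class), " "), " g ")]';
$titles = array();
foreach ($xpath->query($classG) as $li) {
    $titles[] = trim($li->textContent);
}

print_r($titles);
```

Because removeChild really detaches the node, the second query no longer sees it.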

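Answer 1: (score: 0)

simple_html_dom claims to support it, so this looks like a bug.

The proper CSS way to select these elements, li.g:not(#newsbox), is not supported by Simple HTML DOM, but it is supported by this one.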
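
As a rough illustration of what li.g:not(#newsbox) expresses, here is an XPath equivalent run through PHP's built-in DOMXPath. This is a sketch under assumptions: the helper name find_result_items and the sample markup are hypothetical, and it does not use either of the libraries mentioned in this answer.

```php
<?php
// Hypothetical helper: returns the text of every <li class="g"> whose id is not "newsbox".
function find_result_items($html)
{
    // Collect parser warnings internally instead of printing them.
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    $xpath = new DOMXPath($doc);

    // XPath counterpart of the CSS selector li.g:not(#newsbox).
    $query = '//li[contains(concat(" ", normalize-space(@class), " "), " g ")]'
           . '[not(@id="newsbox")]';

    $items = array();
    foreach ($xpath->query($query) as $li) {
        $items[] = trim($li->textContent);
    }
    return $items;
}

$sample = '<ul><li class="g">Result A</li>'
        . '<li class="g" id="newsbox">News for football</li>'
        . '<li class="g">Result B</li></ul>';

print_r(find_result_items($sample));
```

The negation sits in the query itself, so the document is left unmodified.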