Question

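I want to scrape Google search results to collect IMDb URLs. Every time I run the XPath query //ol[@id="rso"]//li[@class="g"], the DOMNodeList is empty and nothing is returned. Debugging with var_dump gives object(DOMNodeList)#38 (0) { }. Here is the script: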

function crawlIMDB($vtitle, $vid){
    // Build the Google search URL for "<title> imdb".
    $vtitle .= ' imdb';
    $vtitle = urlencode($vtitle);
    $plus = str_replace('%20', '+', $vtitle);
    $url = 'http://www.google.com/search?q='.$vtitle.'&gws_rd=ssl#q='.$plus;

    // Fetch the page with cURL, following redirects and returning the body as a string.
    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, TRUE);
    curl_setopt($curl, CURLOPT_HEADER, FALSE);
    $response = curl_exec($curl);
    curl_close($curl);

    // Parse the HTML, keeping libxml warnings about malformed markup internal.
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);
    $doc->loadHTML($response);
    $xpath = new DOMXPath($doc);

    // This query comes back as an empty DOMNodeList.
    $entries = $xpath->query('//ol[@id="rso"]//li[@class="g"]');
    die(var_dump($entries));
}

When I debug the query with the Chrome extension XPath Helper, it looks fine and returns results.

The DOM I am looking for:

<ol id="rso">
  <div class="srg">
    <li class="g"></li>
  </div>
</ol>

Answer 1

I will answer my own question.

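When I use cURL, Google sends a different response, so the DOM has a different structure from the one shown in the browser. This XPath query should be used instead to collect the links from the Google search results: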

//h3[@class="r"]/a
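For illustration, here is a minimal sketch of how that query could be applied with DOMXPath. The helper name extractResultLinks, the sample markup, and the example.com URLs are made up for this example. It assumes the response still wraps each result link in an h3 with class "r", which is what I saw but may well change over time.

```php
<?php
// Hypothetical helper (not part of the original script): returns the href
// of every <a> that is a direct child of <h3 class="r">.
function extractResultLinks(string $html): array
{
    $doc = new DOMDocument();
    // Keep libxml parse warnings internal; scraped HTML is rarely valid.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $links = [];
    foreach ($xpath->query('//h3[@class="r"]/a') as $anchor) {
        $links[] = $anchor->getAttribute('href');
    }
    return $links;
}

// Made-up markup standing in for the body returned by curl_exec().
// Note that it contains no <ol id="rso">, so the original query finds nothing here.
$sample = '<html><body><div id="search">'
    . '<div class="g"><h3 class="r"><a href="https://example.com/title/1">First</a></h3></div>'
    . '<div class="g"><h3 class="r"><a href="https://example.com/title/2">Second</a></h3></div>'
    . '</div></body></html>';

$links = extractResultLinks($sample);
foreach ($links as $link) {
    echo $link, PHP_EOL;
}
```

In the real script you would pass $response to the helper in place of $sample. Saving $response to a file and opening it is a quick way to check which structure cURL actually received.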

Hope this helps. Thanks.

Scraping Google: DOMNodeList is always empty
