How do I use cURL to get the text inside an <a> This Text </a> tag?

Asked: 2013-04-14 23:10:00

Tags: php parsing curl html-parsing

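I get the error "Fatal error: Call to undefined method DOMText::getAttribute()" with this code. I want to capture the text of each link rather than its source (I don't know what that is called). Can someone explain my mistake or show me a different way to do this? Here is my code: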

<?php

$target_url = "SITE I WANT";
$userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';

// make the cURL request to $target_url
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_URL,$target_url);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$html= curl_exec($ch);
if (!$html) {
    echo "<br />cURL error number:" .curl_errno($ch);
    echo "<br />cURL error:" . curl_error($ch);
    exit;
}

// parse the html into a DOMDocument
$dom = new DOMDocument();
@$dom->loadHTML($html);

// grab the text of all the links on the page
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a/text()");

for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i);
    $url = $href->getAttribute('href');
    storeLink($url,$target_url);
    echo "<br />Link stored: $url";
}
$id = "12";
$query = "DELETE FROM links WHERE id<=$id";
if (!mysql_query($query)) {
    echo "DELETE failed: $query<br />" .
        mysql_error() . "<br /><br />";
}
?>

1 Answer:

Answer 0 (score: 0)

Here you go:

$document = new DOMDocument();
$document->loadHTML($html);
$selector = new DOMXPath($document);
$anchors = $selector->query('/html/body//a');

foreach ($anchors as $a) {
    $text = $a->nodeValue;
    $href = $a->getAttribute('href');
    echo($text . ' : ' . $href . '<br />');
}
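
For context on why the question's code fails: the XPath expression `/html/body//a/text()` selects the text nodes inside each link, which are `DOMText` objects, and `DOMText` has no `getAttribute()` method. Selecting the `<a>` elements themselves returns `DOMElement` objects, which expose both the text and the attributes. The following is a minimal, self-contained sketch of that idea. The inline `$html` string and the `$links` array are assumptions made for illustration; in the question, `$html` would come from `curl_exec()`.

```php
<?php
// Hypothetical sample markup standing in for the cURL response ($html in the question).
$html = '<html><body><p><a href="/first">First link</a> and '
      . '<a href="/second">Second <b>link</b></a></p></body></html>';

$dom = new DOMDocument();
// Collect parser warnings internally instead of silencing them with @.
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Select the <a> elements themselves, not their text() children.
$links = [];
foreach ($xpath->query('/html/body//a') as $a) {
    $links[] = [
        'text' => trim($a->textContent),     // link text, including text in nested tags
        'href' => $a->getAttribute('href'),  // valid here because $a is a DOMElement
    ];
}

foreach ($links as $link) {
    echo $link['text'] . ' : ' . $link['href'] . PHP_EOL;
}
```

Reading `textContent` (or `nodeValue`) on the element also picks up text inside nested tags such as `<b>`, whereas `a/text()` only returns the text nodes that are direct children of the link. If the `text()` query has to stay, the attribute should still be reachable from each text node through `$node->parentNode->getAttribute('href')`.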