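Slow PHP DOMNode manipulation breaks

Time: 2012-01-08 06:12:57

Tags: php

I have a script that takes some HTML and tries to extract some data from it. The data I am working with consists of field names, each contained in a span.CardTitle, with the field's data in the text that follows it. Unfortunately, all of the fields and their data are siblings of one another, which makes extraction difficult. Here is my current script (abbreviated to the relevant parts):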

$time = microtime(true);

$curr_card = array();
$item = $list->item($i);
$cardPath = getHTML($base . $item->getAttribute('href'));

$time = microtime(true) - $time;
echo 'Time to download and load card info: ' . $time . '<br />';

$title = $cardPath->evaluate('//div[@class=\'WordSection1\']/h4')->item(0)->textContent;
preg_match('/\s\(([A-Za-z0-9]+)\)/', $title, $curr_set);
$curr_card['set'] = $curr_set[1];
$curr_card['card_name'] = preg_replace('/\s\([A-Za-z0-9]+\)/', '', $title);

echo 'Getting field data for ' . $curr_card['card_name'] . '<br />';

$fields = $cardPath->evaluate('//div[@class=\'WordSection1\']/p[@class=\'Definition\']/span[@class=\'CardTitle\']');

$time = $field_time = microtime(true);
echo '# of fields: ' . $fields->length . '<br />';

for($a = 0; $a < $fields->length; $a++)
{
    $field = $fields->item($a);

    $fieldName = $field->textContent;
    echo 'Field Name: ' . $fieldName . '<br />';

    $fieldData = recursiveSibling($field->nextSibling);
    echo 'Field Data: ' . $fieldData . '<br />';

    $field_time = microtime(true) - $field_time;
    $fieldnum = $a + 1;
    echo 'Field #' . $fieldnum . ' took ' . $field_time . ' to process. <br />';

    $field_time = microtime(true);
}
$time = microtime(true) - $time;
echo 'Time to extract card info: ' . $time . '<br />';

function getHTML($url, $xpath = true)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_USERAGENT, 'Firefox (WindowsXP) – Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6');
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_FAILONERROR, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_AUTOREFERER, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    $html = curl_exec($ch);
    if (!$html) {
        echo "<br />cURL error number:" .curl_errno($ch);
        echo "<br />cURL error:" . curl_error($ch);
        exit;
    }
    if($xpath)
    {
        $dom = new DOMDocument();
        @$dom->loadHTML($html); 
        return new DOMXPath($dom);
    }
    else
        return $html;
}

function recursiveSibling($node)
{
    if(strstr($node->nodeName, 'span') === false)
    {
        $text = $node->textContent . recursiveSibling($node->nextSibling);
        return $text;
    }
}

Here is what the script outputs:

Time to download and load master list: 0.495495080948
Time to download and load card info: 0.106231927872

Getting field data for A Child is Born
# of fields: 9

Field Name: Type: 
Field Data: Hero Enh. • 
Field #1 took 3.60012054443E-5 to process. 

Field Name: Brigade: 
Field Data: White • 
Field #2 took 1.00135803223E-5 to process. 

Field Name: Ability: 
Field Data: None • 
Field #3 took 8.10623168945E-6 to process. 

Field Name: Class: 
Field Data: None • 
Field #4 took 7.15255737305E-6 to process. 

Field Name: Special Ability: 
Field Data: Discard all Demons in Play. Cannot be interrupted, negated, or prevented. • 
Field #5 took 3.31401824951E-5 to process. 

Field Name: Errata: 
Field Data: Discard all demons in play. Cannot be negated. • 
Field #6 took 1.50203704834E-5 to process. 

Field Name: Identifiers: 
Field Data: None • 
Field #7 took 6.91413879395E-6 to process. 

Field Name: Verse: 
Field Data: None • 
Field #8 took 5.96046447754E-6 to process. 

Field Name: Availability: 

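I don't understand why execution takes so long (about 40 seconds), nor do I understand why the last field breaks the script. If it helps, this is the page I am extracting from: http://www.redemptionreg.com/REG/Master/achildisbornp.htm

If anyone could explain what I am doing wrong and how to make this faster, I would greatly appreciate it. There are over 2000 cards to run this on, and at 45 seconds each that is more than 24 hours of script execution!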

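1 answer:

Answer 0 (score: 0)

I figured out the problem. The whole issue was that there is no span after the last field (Availability), so there was nothing to stop the recursiveSibling function once it ran past the last sibling, and it went into infinite recursion. After adding a condition that checks whether there is another node, it works.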
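The answer does not show the changed function, so the following is only a minimal sketch of what such a check could look like. It assumes a loop in place of recursion and an exact match on the node name rather than the original strstr() test. The name siblingText and the sample HTML are made up for illustration.

```php
<?php
// Hypothetical replacement for recursiveSibling(): collect the text of every
// sibling after a field label, stopping at the next <span> or when there are
// no more siblings (nextSibling is null after the last node).
function siblingText($node)
{
    $text = '';
    while ($node !== null && $node->nodeName !== 'span') {
        $text .= $node->textContent;
        $node = $node->nextSibling;
    }
    return $text;
}

// Made-up sample markup: the last field has no <span> after it,
// which is the case that broke the original script.
$html = '<p class="Definition">'
      . '<span class="CardTitle">Type: </span>Hero Enh. '
      . '<span class="CardTitle">Availability: </span>Promo'
      . '</p>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$fields = $xpath->query('//span[@class="CardTitle"]');

foreach ($fields as $field) {
    echo $field->textContent . siblingText($field->nextSibling) . "\n";
}
```

The null check is what ends the walk for the last field. A loop also avoids building up a call stack when a field is followed by many sibling nodes.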