Question

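I have a script that takes some HTML and tries to extract some data from it. The data I'm after has its field names in span.CardTitle elements, with the data in the text that follows each one. Unfortunately, all the fields and data are siblings of each other, which makes extraction difficult. Here is my current script (abbreviated to the relevant parts):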

$time = microtime(true);

$curr_card = array();
$item = $list->item($i);
$cardPath = getHTML($base . $item->getAttribute('href'));

$time = microtime(true) - $time;
echo 'Time to download and load card info: ' . $time . '<br />';

$title = $cardPath->evaluate('//div[@class=\'WordSection1\']/h4')->item(0)->textContent;
preg_match('/\s\(([A-Za-z0-9]+)\)/', $title, $curr_set);
$curr_card['set'] = $curr_set[1];
$curr_card['card_name'] = preg_replace('/\s\([A-Za-z0-9]+\)/', '', $title);

echo 'Getting field data for ' . $curr_card['card_name'] . '<br />';

$fields = $cardPath->evaluate('//div[@class=\'WordSection1\']/p[@class=\'Definition\']/span[@class=\'CardTitle\']');

$time = $field_time = microtime(true);
echo '# of fields: ' . $fields->length . '<br />';

for($a = 0; $a < $fields->length; $a++)
{
    $field = $fields->item($a);

    $fieldName = $field->textContent;
    echo 'Field Name: ' . $fieldName . '<br />';

    $fieldData = recursiveSibling($field->nextSibling);
    echo 'Field Data: ' . $fieldData . '<br />';

    $field_time = microtime(true) - $field_time;
    $fieldnum = $a + 1;
    echo 'Field #' . $fieldnum . ' took ' . $field_time . ' to process. <br />';

    $field_time = microtime(true);
}
$time = microtime(true) - $time;
echo 'Time to extract card info: ' . $time . '<br />';

function getHTML($url, $xpath = true)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_USERAGENT, 'Firefox (WindowsXP) – Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6');
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_FAILONERROR, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_AUTOREFERER, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    $html = curl_exec($ch);
    if (!$html) {
        echo "<br />cURL error number:" .curl_errno($ch);
        echo "<br />cURL error:" . curl_error($ch);
        exit;
    }
    if($xpath)
    {
        $dom = new DOMDocument();
        @$dom->loadHTML($html); 
        return new DOMXPath($dom);
    }
    else
        return $html;
}

function recursiveSibling($node)
{
    if(strstr($node->nodeName, 'span') === false)
    {
        $text = $node->textContent . recursiveSibling($node->nextSibling);
        return $text;
    }
}

Here is what the script outputs:

Time to download and load master list: 0.495495080948
Time to download and load card info: 0.106231927872

Getting field data for A Child is Born
# of fields: 9

Field Name: Type: 
Field Data: Hero Enh. Â• 
Field #1 took 3.60012054443E-5 to process. 

Field Name: Brigade: 
Field Data: White Â• 
Field #2 took 1.00135803223E-5 to process. 

Field Name: Ability: 
Field Data: None Â• 
Field #3 took 8.10623168945E-6 to process. 

Field Name: Class: 
Field Data: None Â• 
Field #4 took 7.15255737305E-6 to process. 

Field Name: Special Ability: 
Field Data: Discard all Demons in Play. Cannot be interrupted, negated, or prevented. Â• 
Field #5 took 3.31401824951E-5 to process. 

Field Name: Errata: 
Field Data: Discard all demons in play. Cannot be negated. Â• 
Field #6 took 1.50203704834E-5 to process. 

Field Name: Identifiers: 
Field Data: None Â• 
Field #7 took 6.91413879395E-6 to process. 

Field Name: Verse: 
Field Data: None Â• 
Field #8 took 5.96046447754E-6 to process. 

Field Name: Availability:

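I don't understand why this takes so long to execute (about 40 seconds), nor do I understand why the last field breaks the script. If it helps, this is the page I'm extracting from: http://www.redemptionreg.com/REG/Master/achildisbornp.htm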

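If anyone can explain what I'm doing wrong and how to make this faster, I'd greatly appreciate it. There are over 2000 cards to do this for, and at 45 seconds each that is over 24 hours of script execution!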

Answer 1

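I figured out the problem. The whole issue was that there is no span after the last field (Availability), so the recursiveSibling function went into infinite recursion. After adding a condition to check whether there is another node, it works fine.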
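
For anyone hitting the same thing, here is a minimal sketch of what such a check might look like. It is not the exact code from the fix: the helper name collectSiblingText and the sample markup are made up for illustration, and it is written as a loop rather than recursion, which avoids deep call stacks on long sibling chains.

```php
<?php
// Hypothetical replacement for recursiveSibling(): gather the text of every
// sibling that follows a field-name span, stopping at the next span OR when
// there are no siblings left.
function collectSiblingText($node)
{
    $text = '';
    // $node is null once the last sibling has been passed. That is the case
    // the original function never checked for.
    while ($node !== null && $node->nodeName !== 'span') {
        $text .= $node->textContent;
        $node = $node->nextSibling;
    }
    return $text;
}

// Made-up markup modelled on the question: the last field (Availability)
// has no span after its data.
$dom = new DOMDocument();
$dom->loadHTML('<p class="Definition"><span class="CardTitle">Verse: </span>None <span class="CardTitle">Availability: </span>Promo</p>');
$xpath = new DOMXPath($dom);
$fields = $xpath->query('//span[@class="CardTitle"]');

$results = array();
foreach ($fields as $field) {
    // Key by field name, value is the text up to the next span or the end.
    $results[trim($field->textContent)] = trim(collectSiblingText($field->nextSibling));
}
print_r($results);
```

With the null check in place, the last field returns whatever text follows it instead of walking off the end of the parent node.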
