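I have a script that takes some HTML and tries to extract data from it. The data I'm working with has its field names in span.CardTitle elements, with each field's data in the text that follows. Unfortunately, all the fields and their data are siblings of one another, which makes extraction difficult. Here is my current script (abbreviated to the relevant parts):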
$time = microtime(true);
$curr_card = array();
$item = $list->item($i);
$cardPath = getHTML($base . $item->getAttribute('href'));
$time = microtime(true) - $time;
echo 'Time to download and load card info: ' . $time . '<br />';
$title = $cardPath->evaluate('//div[@class=\'WordSection1\']/h4')->item(0)->textContent;
preg_match('/\s\(([A-Za-z0-9]+)\)/', $title, $curr_set);
$curr_card['set'] = $curr_set[1];
$curr_card['card_name'] = preg_replace('/\s\([A-Za-z0-9]+\)/', '', $title);
echo 'Getting field data for ' . $curr_card['card_name'] . '<br />';
$fields = $cardPath->evaluate('//div[@class=\'WordSection1\']/p[@class=\'Definition\']/span[@class=\'CardTitle\']');
$time = $field_time = microtime(true);
echo '# of fields: ' . $fields->length . '<br />';
for($a = 0; $a < $fields->length; $a++)
{
    $field = $fields->item($a);
    $fieldName = $field->textContent;
    echo 'Field Name: ' . $fieldName . '<br />';
    $fieldData = recursiveSibling($field->nextSibling);
    echo 'Field Data: ' . $fieldData . '<br />';
    $field_time = microtime(true) - $field_time;
    $fieldnum = $a + 1;
    echo 'Field #' . $fieldnum . ' took ' . $field_time . ' to process. <br />';
    $field_time = microtime(true);
}
$time = microtime(true) - $time;
echo 'Time to extract card info: ' . $time . '<br />';
function getHTML($url, $xpath = true)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_USERAGENT, 'Firefox (WindowsXP) – Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6');
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_FAILONERROR, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_AUTOREFERER, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    $html = curl_exec($ch);
    if (!$html) {
        echo "<br />cURL error number:" . curl_errno($ch);
        echo "<br />cURL error:" . curl_error($ch);
        exit;
    }
    if($xpath)
    {
        $dom = new DOMDocument();
        @$dom->loadHTML($html);
        return new DOMXPath($dom);
    }
    else
        return $html;
}
function recursiveSibling($node)
{
    if(strstr($node->nodeName, 'span') === false)
    {
        $text = $node->textContent . recursiveSibling($node->nextSibling);
        return $text;
    }
}
This is what the script outputs:
Time to download and load master list: 0.495495080948
Time to download and load card info: 0.106231927872
Getting field data for A Child is Born
# of fields: 9
Field Name: Type:
Field Data: Hero Enh. •
Field #1 took 3.60012054443E-5 to process.
Field Name: Brigade:
Field Data: White •
Field #2 took 1.00135803223E-5 to process.
Field Name: Ability:
Field Data: None •
Field #3 took 8.10623168945E-6 to process.
Field Name: Class:
Field Data: None •
Field #4 took 7.15255737305E-6 to process.
Field Name: Special Ability:
Field Data: Discard all Demons in Play. Cannot be interrupted, negated, or prevented. •
Field #5 took 3.31401824951E-5 to process.
Field Name: Errata:
Field Data: Discard all demons in play. Cannot be negated. •
Field #6 took 1.50203704834E-5 to process.
Field Name: Identifiers:
Field Data: None •
Field #7 took 6.91413879395E-6 to process.
Field Name: Verse:
Field Data: None •
Field #8 took 5.96046447754E-6 to process.
Field Name: Availability:
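I don't understand why it takes so long to execute (about 40 seconds), nor do I understand why the last field breaks the script. If it helps, this is the page I'm extracting from: http://www.redemptionreg.com/REG/Master/achildisbornp.htm
If anyone could explain what I'm doing wrong and how to make it faster, I would greatly appreciate it. There are over 2000 cards to process, and at 45 seconds each that is more than 24 hours of script execution!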
Answer 0 (score: 0)
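I figured out the problem. The whole issue was that there is no span after the last field (Availability), so the recursiveSibling function went into infinite recursion. After I added a condition to check whether there is another node, it works fine.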
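The revised function is not shown here, so the following is only a minimal sketch of what the guarded version might look like. It assumes the guard is a plain null check on the node, since nextSibling is null after the last sibling. The sample markup and the "Promo" value are made up for illustration and are not taken from the real page.

```php
<?php
// Sketch of recursiveSibling with an added end-of-siblings check.
// Assumption: the missing condition is a null test on $node.
function recursiveSibling($node)
{
    // No further sibling: stop recursing instead of reading from null.
    if ($node === null) {
        return '';
    }
    // A span marks the start of the next field name, so stop there too.
    if (strstr($node->nodeName, 'span') !== false) {
        return '';
    }
    // Otherwise keep this node's text and continue with the next sibling.
    return $node->textContent . recursiveSibling($node->nextSibling);
}

// Hypothetical markup that mimics the page: the last field has no span after it.
$dom = new DOMDocument();
$dom->loadHTML('<p class="Definition"><span class="CardTitle">Verse:</span> None <span class="CardTitle">Availability:</span> Promo</p>');
$xpath = new DOMXPath($dom);
$fields = $xpath->evaluate("//p[@class='Definition']/span[@class='CardTitle']");

foreach ($fields as $field) {
    echo trim($field->textContent) . ' ' . trim(recursiveSibling($field->nextSibling)) . PHP_EOL;
}
```

Returning an empty string in both stop cases also means the function always returns a string, which the original did not do when it hit a span.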