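Fetching products from an e-shop using Simple HTML DOM parser and pagination

Time: 2014-01-22 14:24:30

Tags: php parsing pagination simple-html-dom

I want to parse the links, names, and prices of some products. Here is my code. I'm running into problems with the parsing because I don't know how to get the product link and name. The price works fine; I've got that part. The pagination doesn't work either.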

 <h2>Telefonai Pigu</h2>
</br>
<?php
  include_once('simple_html_dom.php'); 
  $url = "http://pigu.lt/foto_gsm_mp3/mobilieji_telefonai/";
  // Start from the main page
  $nextLink = $url;

// Loop on each next link as long as it exists
while ($nextLink) {
echo "<hr>nextLink: $nextLink<br>";
//Create a DOM object
$html = new simple_html_dom();
// Load HTML from a url
$html->load_file($nextLink);


$phones = $html->find('div#productList span.product');

foreach($phones as $phone) {
    // Get the link
    $linkas = $phone->href;

    // Get the name
    $pavadinimas = $phone->find('a[alt]', 0)->plaintext;

    // Get the name price and extract the useful part using regex
    $kaina = $phone->find('strong[class=nw]', 0)->plaintext;
    // This captures the integer part of decimal numbers: In "123,45" will capture      "123"... Use @([\d,]+),?@ to capture the decimal part too

    echo $pavadinimas, " #----# ", $kaina, " #----# ", $linkas, "<br>";

  //$query = "insert into telefonai (pavadinimas,kaina,linkas) VALUES (?,?,?)";
//  $this->db->query($query, array($pavadinimas,$kaina, $linkas));
}


// Extract the next link, if not found return NULL
$nextLink = ( ($temp = $html->find('div.pagination a[="rel"]', 0)) ? "https://www.pigu.lt".$temp->href : NULL );

// Clear DOM object
$html->clear();
unset($html);
}
?>

Output:

nextLink: http://pigu.lt/foto_gsm_mp3/mobilieji_telefonai/
A PHP Error was encountered
Severity: Notice
Message: Trying to get property of non-object
Filename: views/pigu_view.php
Line Number: 26
#----# 999,00 Lt #----#
A PHP Error was encountered
Severity: Notice
Message: Trying to get property of non-object
Filename: views/pigu_view.php
Line Number: 26

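1 answer:

Answer 0 (score: 1)

Please inspect the source of the page you are working with carefully, then retrieve the nodes you want based on what you find. It is normal that code written for another site does not work here, because the two sites do not share the same source/structure!

Let's go through it step by step...

$phones = $html->find('div#productList span.product'); will give you all the "phone containers", or "blocks". Each block has the following structure: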

<span class="product">
   <div class="fakeProductContainer">
      <p class="productPhoto">
         <span class="">
         <span class="flags flag-disc-value" title="Akcija"><strong>500<br><span class="currencySymbol">Lt</span></strong></span>
         <span class="flags freeShipping" title="Nemokamas prekių atsiemimas į POST24 paštomatus. Pasiūlymas galioja iki sausio 31 d."></span>
         </span>
         <a href="/foto_gsm_mp3/mobilieji_telefonai/telefonas_sony_xperia_acro_s?id=4522595" title="Telefonas Sony Xperia acro S" class="photo-medium nobr"><img src="http://lt1.pigugroup.eu//colours/48355/16/4835516/c503caf69ad97d889842a5fd5b3ff372_medium.jpg" title="Telefonas Sony Xperia acro S" alt="Telefonas Sony Xperia acro S"></a>
      </p>
      <div class="price">
         <strong class="nw">999,00 Lt</strong>
         <del class="nw">1.499,00 Lt *</del>
      </div>
      <h3><a href="/foto_gsm_mp3/mobilieji_telefonai/telefonas_sony_xperia_acro_s?id=4522595" title="Telefonas Sony Xperia acro S">Sony Xperia acro S</a></h3>
      <p class="descFields">
         3G: <em>HSDPA 14.4 Mbps, HSUPA 5.76 Mbps</em><br>
         GPS: <em>Taip</em><br>
         NFC: <em>Taip</em><br>
         Operacinė sistema: <em>Android OS</em><br>
      </p>
   </div>
</span>

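The anchor holding the product link sits inside <p class="productPhoto">, and it is the only anchor there, so to retrieve it just use $linkas = $phone->find('p.productPhoto a', 0)->href; (then complete it with the site's host, since it is only a relative link).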
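As a minimal sketch of "completing" the relative link, the helper below joins a base host and an href. The function name toAbsoluteUrl is hypothetical (it is not part of simple_html_dom), it only handles absolute URLs and root-relative paths, and it assumes PHP 7 or later for the type declarations.

```php
<?php
// Hypothetical helper, not part of simple_html_dom: turn a root-relative
// href into an absolute URL. Does not resolve "../" or scheme-relative URLs.
function toAbsoluteUrl(string $base, string $href): string
{
    // Already absolute (starts with http:// or https://): return unchanged
    if (preg_match('@^https?://@i', $href)) {
        return $href;
    }
    // Join with exactly one slash between the host and the path
    return rtrim($base, '/') . '/' . ltrim($href, '/');
}

echo toAbsoluteUrl(
    'http://pigu.lt',
    '/foto_gsm_mp3/mobilieji_telefonai/telefonas_sony_xperia_acro_s?id=4522595'
), "\n";
```

Plain concatenation with "http://pigu.lt" works just as well for this site; the helper only guards against doubled or missing slashes.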

The product name is in the <h3> tag; again, we retrieve it with $pavadinimas = $phone->find('h3 a', 0)->plaintext;

The price is contained in <div class="price"><strong>; again, we retrieve it with $kaina = $phone->find('div[class=price] strong', 0)->plaintext

However, not all phones display a price, so we must check that the price node was actually found before reading it.
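One way to handle both the missing price and the "999,00 Lt" format is a small parsing helper. This is only a sketch: parsePrice is a hypothetical name, and it assumes that "." is the thousands separator and "," the decimal mark, as in the sample block above ("1.499,00 Lt"). It needs PHP 7.1 or later for the nullable types.

```php
<?php
// Hypothetical helper: convert a price string such as "1.499,00 Lt" into a
// float, or return null when no price text is available.
function parsePrice(?string $text): ?float
{
    if ($text === null) {
        return null;
    }
    // Grab the first run of digits, dots and commas
    if (!preg_match('@\d[\d.,]*@', $text, $m)) {
        return null;
    }
    // Assumption: "." separates thousands and "," marks the decimals
    $normalized = str_replace(array('.', ','), array('', '.'), $m[0]);
    return (float) $normalized;
}

var_dump(parsePrice('999,00 Lt'));
var_dump(parsePrice('1.499,00 Lt *'));
var_dump(parsePrice(null));
```

With simple_html_dom you would pass the node's plaintext when find() returned a node, and null otherwise.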

Finally, the HTML containing the next link looks like this:

<div id="ListFootPannel">
   <div class="pages-list">
      <strong>1</strong>
      <a href="/foto_gsm_mp3/mobilieji_telefonai?page=2">2</a>
      <a href="/foto_gsm_mp3/mobilieji_telefonai?page=3">3</a>
      <a href="/foto_gsm_mp3/mobilieji_telefonai?page=4">4</a>
      <a href="/foto_gsm_mp3/mobilieji_telefonai?page=5">5</a>
      <a href="/foto_gsm_mp3/mobilieji_telefonai?page=6">6</a>
      <a rel="next" href="/foto_gsm_mp3/mobilieji_telefonai?page=2">Toliau</a>      
   </div>
   <div class="pages-info">
      Prekių 
   </div>
</div>

So we are interested in the <a rel="next"> element, which can be retrieved with $html->find('div#ListFootPannel a[rel="next"]', 0)
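To check the selector logic without the library, here is a sketch that runs the equivalent query with PHP's bundled DOM extension (DOMDocument and DOMXPath) instead of simple_html_dom. The function name findNextHref is hypothetical, and the snippet is a shortened copy of the pagination markup above.

```php
<?php
// Sketch using PHP's DOM extension rather than simple_html_dom.
$snippet = <<<HTML
<div id="ListFootPannel">
   <div class="pages-list">
      <strong>1</strong>
      <a href="/foto_gsm_mp3/mobilieji_telefonai?page=2">2</a>
      <a rel="next" href="/foto_gsm_mp3/mobilieji_telefonai?page=2">Toliau</a>
   </div>
</div>
HTML;

// Hypothetical helper: return the href of the rel="next" anchor, or null
function findNextHref(string $html): ?string
{
    $doc = new DOMDocument();
    // Suppress the warnings that loosely structured real-world HTML triggers
    @$doc->loadHTML($html);
    $xpath = new DOMXPath($doc);
    // XPath equivalent of the CSS selector div#ListFootPannel a[rel="next"]
    $nodes = $xpath->query('//div[@id="ListFootPannel"]//a[@rel="next"]');
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return $nodes->item(0)->getAttribute('href');
}

$next = findNextHref($snippet);
echo $next === null ? "no next page\n" : "http://pigu.lt" . $next . "\n";
```

On the last page there is presumably no rel="next" anchor, so the helper returns null and a crawling loop can stop.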

So, applying these changes to the original code gives:

$url = "http://pigu.lt/foto_gsm_mp3/mobilieji_telefonai/";

// Start from the main page
$nextLink = $url;

// Loop on each next link as long as it exists
while ($nextLink) {
    echo "nextLink: $nextLink<br>";
    //Create a DOM object
    $html = new simple_html_dom();
    // Load HTML from a url
    $html->load_file($nextLink);

    ////////////////////////////////////////////////
    /// Get phone blocks and extract useful info ///
    ////////////////////////////////////////////////
    $phones = $html->find('div#productList span.product');

    foreach($phones as $phone) {
        // Get the link
        $linkas = "http://pigu.lt" . $phone->find('p.productPhoto a', 0)->href;

        // Get the name
        $pavadinimas = $phone->find('h3 a', 0)->plaintext;

        // find() with an index returns null when nothing matches; fall back to "000" in that case
        if ( $tempPrice = $phone->find('div[class=price] strong', 0) ) {
            // Get the price and extract the useful part using a regex
            $kaina = $tempPrice->plaintext;
            // This captures the integer part of decimal numbers: in "123,45" it captures "123"... Use @([\d,]+),?@ to capture the decimal part too
            // Note: with a thousands separator ("1.499,00") it captures only "1"
            preg_match('@(\d+),?@', $kaina, $matches);
            $kaina = $matches[1];
        }
        else
            $kaina = "000";


        echo $pavadinimas, " #----# ", $kaina, " #----# ", $linkas, "<br>";

    }
    ////////////////////////////////////////////////
    ////////////////////////////////////////////////

    // Extract the next link, if not found return NULL
    $nextLink = ( ($temp = $html->find('div#ListFootPannel a[rel="next"]', 0)) ? "http://pigu.lt".$temp->href : NULL );

    // Clear DOM object
    $html->clear();
    unset($html);

    echo "<hr>";
}

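The loop above trusts the site to stop offering a rel="next" link eventually. As a hedged sketch of a safety net, the function below keeps a set of visited URLs and a page limit so the loop cannot run forever. The names crawl, $getNext and $fakeSite are hypothetical; $getNext stands in for "load the page and read its next link", and no network access happens here.

```php
<?php
// Hypothetical sketch: follow "next" links while guarding against loops.
// $getNext receives a URL and returns the next URL, or null on the last page.
function crawl(string $start, callable $getNext, int $maxPages = 50): array
{
    $visited = array();
    $url = $start;
    // Stop on: no next link, an already visited URL, or the page limit
    while ($url !== null && !isset($visited[$url]) && count($visited) < $maxPages) {
        $visited[$url] = true;
        $url = $getNext($url);
    }
    return array_keys($visited);
}

// Made-up link map standing in for the real site; page 3 links back to page 1
$fakeSite = array(
    '/list?page=1' => '/list?page=2',
    '/list?page=2' => '/list?page=3',
    '/list?page=3' => '/list?page=1',
);
$pages = crawl('/list?page=1', function ($url) use ($fakeSite) {
    return $fakeSite[$url] ?? null;
});
echo implode("\n", $pages), "\n";
```

In the real scraper, the callback would load the page with simple_html_dom, process the products, and return the completed next link or null.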