PHP XPath crawler does not get everything within the tag

Time: 2017-02-06 12:57:55

Tags: php xpath

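I am trying to develop a crawler in PHP that tracks the best price of certain products on a webshop price-comparison site. I have a txt file with links; I crawl each link and extract the information I need from it.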

<!DOCTYPE html>
<html>
<head>
    <link rel='stylesheet' type='text/css' href='crawlerStyle.css'>
</head>
<body>
<div class='div-table-row'>
<div class='div-table-col-title'><span class='span-title'>Name</span></div>
<div class='div-table-col-title'><span class='span-title'>Best Pricerunner price</span></div>
</div>
<?php 

$myfile = fopen("urls.txt", "r") or die("Unable to open file!");
if ($myfile) {
    while (($line = fgets($myfile)) !== false) {
        @follow_links($line);
    }
    fclose($myfile);
}

function getPRPrice($priceTag){
    return substr($priceTag, 2).",00 DKK";
}
function follow_links($line) {
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    @$doc->loadHTML(file_get_contents($line));
    $xpath = new DOMXpath($doc);    

    $name = $xpath->query( '////span[@class="fn" and @itemprop="name"]')->item(0);
    $price = $xpath->query( '//ul[@class="itemlist" and li[@class="shoppingcol" and p[@class="button" and a[@class="button-a google-analytic-retailer-data"]]]]/*/*/*/*/*/strong[@class="validated-shipping"]')->item(0);
    $company = $xpath->query( '//ul[@class="itemlist" and li[@class="shoppingcol" and p[@class="button" and a[@class="button-a google-analytic-retailer-data"]]]]/*/*/a[@class="google-analytic-retailer-data"]//img/@src')->item(0);

    echo "<div class='div-table-row'>\n";
    echo "<div class='div-table-col'><span>".substr($name->textContent, 0, -18)."</span></div>\n";
    echo "<div class='div-table-col'><img style='display: inline-block; vertical-align:middle' src='".$company->textContent."'><a href='".$line."' target='_blank'><span>".getPRPrice($price->textContent)."</span></a></div>\n";
    echo "</div>\n";
}
?>
</body>
</html> 

Here is some CSS styling, so that you can see what I see:

.div-table-row{
  display:table;
  clear:both;
}
.div-table-col{
  float: none;
  border-style: solid;
  width: 250px;         
  display: table-cell;
  text-align:center;
  vertical-align: middle;
  height: 100%;
}
.div-table-col-title{
  float: none;
  border-style: solid;
  width: 250px;         
  display: table-cell;
  text-align:center;
  vertical-align: middle;
  font-size: 30px;
  height: 100%;
  background: rgb(30, 139, 45) !important;
}
.productImg{
    display:none; 
    position: absolute;
    width: 200px;
}
span{
  height: 100%;
  width: 100%;
  padding-left:10px; 
  padding-right:10px; 
  vertical-align: middle;
  text-align:center;
  font-size: 16px;
  font-weight: 600;
  font-family: "Helvetica Neue",Helvetica,Arial,sans-serif;
}
.span-title{
  height: 100%;
  width: 100%;
  padding-left:10px; 
  padding-right:10px; 
  vertical-align: middle;
  text-align:center;
  font-size: 20px;
  color: white;
  font-weight: 900;
  font-family: "Helvetica Neue",Helvetica,Arial,sans-serif;
}

This is how it looks for some of the products on the web page I am trying to crawl:

[Image: how it looks]

But the span I select for the name does not seem to be returned in full.

[Image: product from Pricerunner]

Does anyone have any insight into this problem?

Thanks!

Edit: I am using the following links for testing:

http://www.pricerunner.dk/pl/1-3140663/Mobiltelefoner/Microsoft-Lumia-650-Sammenlign-Priser
http://www.pricerunner.dk/pl/1-3098807/Mobiltelefoner/Apple-iPhone-6S-64GB-Sammenlign-Priser
http://www.pricerunner.dk/pl/1-3141579/Mobiltelefoner/Samsung-Galaxy-S7-Edge-32GB-Sammenlign-Priser 
http://www.pricerunner.dk/pl/1-3154462/Mobiltelefoner/HTC-10-32GB-Sammenlign-Priser

2 answers:

Answer 0 (score: 1)

This works fine for me:

// Please notice the use of only two slashes and not four like you did
$name = $xpath->query('//span[@class="fn"]')->item(0)->textContent;

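The problem comes from the substr you apply afterwards: substr($name->textContent, 0, -18) always drops the last 18 characters of the name, whether or not there is anything there to remove.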
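To see the effect in isolation, here is a minimal sketch. It assumes a made-up HTML fragment: only the `class="fn"` and `itemprop="name"` attributes come from the question, while the surrounding markup and the product name are illustrative and not copied from Pricerunner.

```php
<?php
// Hypothetical markup; only the span's attributes are taken from the question.
$html = '<html><body><h1><span class="fn" itemprop="name">Microsoft Lumia 650</span></h1></body></html>';

libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// item(0) returns null when nothing matches, so guard before reading textContent.
$node = $xpath->query('//span[@class="fn" and @itemprop="name"]')->item(0);
$full = $node !== null ? $node->textContent : '';

// A negative length tells substr() to drop that many characters from the end.
$cut = substr($full, 0, -18);

echo $full, "\n";   // Microsoft Lumia 650
echo $cut, "\n";    // M
```

The XPath query returns the whole name (19 characters here); it is the fixed -18 that leaves only the first character.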

Answer 1 (score: 0)

This is embarrassing! The problem was that substr was cutting off the $name variable. I had added it a while ago to trim the name.
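If some trailing text still needs to be removed from the name, one option is to strip it only when it is actually present, instead of cutting a fixed number of characters. The following is a sketch under assumptions: `stripSuffix` is a hypothetical helper, and the suffix string is a placeholder, since the post does not say what the 18 removed characters originally were.

```php
<?php
/**
 * Remove $suffix from the end of $text, but only when $text really ends with it.
 * Hypothetical helper for illustration; not part of the original code.
 */
function stripSuffix(string $text, string $suffix): string
{
    $len = strlen($suffix);
    if ($len > 0 && substr($text, -$len) === $suffix) {
        return substr($text, 0, -$len);
    }
    return $text;   // suffix absent: leave the name untouched
}

// ' - hypothetical suffix' stands in for whatever text the page used to append.
echo stripSuffix('Apple iPhone 6S 64GB - hypothetical suffix', ' - hypothetical suffix'), "\n";
echo stripSuffix('Apple iPhone 6S 64GB', ' - hypothetical suffix'), "\n";
```

Both calls print the same product name, so the output stays correct if the site later stops appending the extra text.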