PHP scraper seems to be stuck in an infinite loop

Date: 2013-05-16 06:17:06

Tags: php

(By the way, I am scraping this material with the permission of the site in question.)

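It's a very simple web scraper. It works fine when I load all the links by hand, but when I try to load them through JSON and variables (so that one script can handle many scrapes, and the process becomes more modular because I only need to add more links to the JSON), it runs in an infinite loop.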

(The page has been loading for about 15 minutes.)

Here is my JSON. There is only one store in it for testing, but there are about 15 more to come.

[
   {
      "store":"Incu Men",
      "cat":"Accessories",
      "general_cat":"Accessories",
      "spec_cat":"accessories",
      "url":"http://www.incuclothing.com/shop-men/accessories/",
      "baseurl":"http://www.incuclothing.com",
      "next_select":"a.next",
      "prod_name_select":".infobox .fn",
      "label_name_select":".infobox .brand",
      "desc_select":".infobox .description",
      "price_select":"#price",
      "mainImg_select":"",
      "more_imgs":".product-images",
      "product_url":".hproduct .photo-link"
   }
]

Here is the PHP scraper code:

<?php
//Set infinite time limit
set_time_limit (0);
// Include simple html dom
include('simple_html_dom.php');
// Defining the basic cURL function
function curl($url) {
  $ch = curl_init();
    // Initialising cURL
    curl_setopt($ch, CURLOPT_URL, $url);
    // Setting cURL's URL option with the $url variable passed into the function
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
    // Setting cURL's option to return the webpage data
    $data = curl_exec($ch);
    // Executing the cURL request and assigning the returned data to the $data variable
    curl_close($ch);
    // Closing cURL
    return $data;
    // Returning the data from the function
}

function getLinks($catURL, $prodURL, $baseURL, $next_select) {
    $urls = array();

    while($catURL) {
        echo "Indexing: $url" . PHP_EOL;
        $html = str_get_html(curl($catURL));

        foreach ($html->find($prodURL) as $el) {
            $urls[] = $baseURL . $el->href;
        }

        $next = $html->find($next_select, 0);
        $url = $next ? $baseURL . $next->href : null;

        echo "Results: $next" . PHP_EOL;
    }

    return $urls;
}

$string     = file_get_contents("jsonWorkers/incuMens.json");
$json_array = json_decode($string,true);

foreach ($json_array as $value){

    $baseURL = $value['baseurl'];
    $catURL = $value['url'];
    $store = $value['store'];
    $general_cat = $value['general_cat'];
    $spec_cat = $value['spec_cat'];
    $next_select = $value['next_select'];
    $prod_name = $value['prod_name_select'];
    $label_name = $value['label_name_select'];
    $description = $value['desc_select'];
    $price = $value['price_select'];
    $prodURL = $value['product_url'];

    if (!is_null($value['mainImg_select'])){
        $mainImg = $value['mainImg_select'];
    }
    $more_imgs = $value['more_imgs'];



    $allLinks = getLinks($catURL, $prodURL, $baseURL, $next_select);

}

?>

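Any ideas why the script runs indefinitely without returning, stopping, or printing anything to the screen? I'm going to leave it running until it stops. When I do this by hand it takes only a minute or so, sometimes less, so I'm sure the problem is with my variables or the JSON, but I can't for the life of me see what it is.

Could anyone take a quick look and point me in the right direction?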

1 answer:

Answer 0 (score: 3)

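There is a problem with your while ($catURL) loop: what do you intend it to do? $catURL is never reassigned inside the loop (the next-page URL is stored in $url instead), so the condition never becomes false. Also, you can call flush() to force output to be shown in the browser while the script is still running.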
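To illustrate the point about the loop, here is a minimal sketch of a pagination loop that does terminate. It is not the original script: collectLinks, $fetchPage, $maxPages and the in-memory $fakeSite array with its paths are all hypothetical names invented for this example, and the fake site stands in for the real pages so the sketch can run offline. The idea is simply that the variable tested by the loop condition is the one that gets reassigned, with two extra guards (a visited set and a page cap) in case the "next" link ever points back to an earlier page.

```php
<?php
// Sketch only: follow "next" links until none remain, with guards against
// looping forever. $fetchPage is a hypothetical callable that takes a URL and
// returns ['links' => string[], 'next' => string|null].

function collectLinks(string $startUrl, callable $fetchPage, int $maxPages = 500): array
{
    $urls    = [];
    $visited = [];        // URLs already processed, used to detect cycles
    $pageUrl = $startUrl; // the loop condition below tests this variable

    while ($pageUrl !== null && !isset($visited[$pageUrl]) && count($visited) < $maxPages) {
        $visited[$pageUrl] = true;

        $page = $fetchPage($pageUrl);
        foreach ($page['links'] as $link) {
            $urls[] = $link;
        }

        // The crucial step: reassign the same variable the condition tests.
        $pageUrl = $page['next'] ?? null;
    }

    return $urls;
}

// In-memory stand-in for a paginated category (made-up paths).
$fakeSite = [
    '/cat?page=1' => ['links' => ['/p/1', '/p/2'], 'next' => '/cat?page=2'],
    '/cat?page=2' => ['links' => ['/p/3'], 'next' => '/cat?page=1'], // deliberately cycles back
];

$fetchFake = function (string $url) use ($fakeSite): array {
    return $fakeSite[$url] ?? ['links' => [], 'next' => null];
};

$allLinks = collectLinks('/cat?page=1', $fetchFake);
echo implode(PHP_EOL, $allLinks), PHP_EOL;
```

In the script from the question, the callable would presumably wrap the existing curl() and str_get_html() calls and build 'links' and 'next' from $html->find($prodURL) and $html->find($next_select, 0). Echoing a progress line inside the loop and calling flush() after it (plus ob_flush() when an output buffer is active) would show how far the crawl has got.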
