Question

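I am trying to create a simple alert app for some friends.

Basically, I want to be able to extract the "price" and "stock availability" data from the following two web pages:

I already have the email and SMS alert part working, but now I want to be able to get the quantity and price from the web pages (those two or any others), so that I can compare the available price and quantity and alert us to place an order if a product falls within certain thresholds.

I have tried some regular expressions (found in a few tutorials, but I am way too much of a n00b for this) and have not managed to get it working. Any good tips or examples?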

Answer 1

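// Fetch the product page (requires allow_url_fopen to be enabled).
$content = file_get_contents('http://www.sparkfun.com/commerce/product_info.php?products_id=9279');
if ($content === false) {
    die("Could not fetch the page\n");
}

// The patterns mirror the page's markup; the non-greedy (.*?) stops at the first closing tag.
$price = preg_match('#<tr><th>(.*?)</th> <td><b>price</b></td></tr>#', $content, $match) ? $match[1] : 'unknown';

$in_stock = preg_match('#<input type="hidden" name="quantity_on_hand" value="(.*?)">#', $content, $match) ? $match[1] : 'unknown';

echo "Price: $price - Availability: $in_stock\n";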
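
To get from the extracted strings to the alert the question asks about, the values still have to be turned into numbers and compared against limits. A minimal sketch, assuming the price arrives as text such as "$24.95" with "." as the decimal separator and the stock as a plain integer string; the helper names parse_price and should_alert and the threshold values are hypothetical, not part of the answer above:

```php
<?php
// Hypothetical helper: strip currency symbols and thousands separators, return a float.
function parse_price(string $raw): ?float
{
    // Keep only digits and the decimal point (assumes "." is the decimal separator).
    $clean = preg_replace('/[^0-9.]/', '', $raw);
    return is_numeric($clean) ? (float) $clean : null;
}

// Hypothetical helper: alert when the price is at or below $maxPrice
// and at least $minStock units are available.
function should_alert(?float $price, int $stock, float $maxPrice, int $minStock): bool
{
    return $price !== null && $price <= $maxPrice && $stock >= $minStock;
}

$price = parse_price('$24.95');
$stock = (int) '12';

if (should_alert($price, $stock, 25.00, 5)) {
    echo "Time to order: price $price, $stock in stock\n";
}
```

Returning null for an unparseable price keeps a layout change on the page from silently triggering (or suppressing) an alert.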

Answer 2

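This is called screen scraping, in case you need to google it.

I suggest you use a DOM parser and XPath expressions. Run the HTML through HtmlTidy first to make sure it is valid markup.

For example: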

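$html = file_get_contents("http://www.example.com");
$html = tidy_repair_string($html);  // requires the tidy extension
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
// Now query the document; <th> cells sit inside <tr> rows, hence the descendant axis (//):
foreach ($xpath->query('//table[@class="pricing"]//th') as $node) {
  echo $node->nodeValue, "\n";  // a DOMNode cannot be echoed directly
}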

Answer 3

Whatever you do: do not use regular expressions to parse HTML, or bad things will happen. Use a parser instead.
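
A small illustration of the point, using two made-up snippets that a browser would treat identically: a pattern tuned to one attribute order and quoting style misses the other, while a DOM parser reads both. The function name quantity_from_dom is hypothetical, and the sketch assumes PHP's dom extension is available:

```php
<?php
// Two made-up inputs that differ only in attribute order and quoting.
$a = '<input type="hidden" name="quantity_on_hand" value="12">';
$b = "<input name='quantity_on_hand' value='12' type='hidden'>";

// A pattern written against the first layout only.
$pattern = '#<input type="hidden" name="quantity_on_hand" value="(.*?)">#';

// Hypothetical helper: return the value attribute of the quantity_on_hand input, or null.
function quantity_from_dom(string $html): ?string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query('//input[@name="quantity_on_hand"]/@value');
    return $nodes->length > 0 ? $nodes->item(0)->nodeValue : null;
}

var_dump(preg_match($pattern, $a));   // the regex matches the first layout
var_dump(preg_match($pattern, $b));   // and misses the second
var_dump(quantity_from_dom($a));      // the parser reads both
var_dump(quantity_from_dom($b));
```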

Answer 4

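You are best off loading the HTML into a DOM parser like this one and then searching for the "pricing" table. However, any kind of scraping will break whenever they change the page layout, and it may be illegal without their consent.

The best approach, though, would be to talk to the people who run the site and see whether they offer a more reliable form of data delivery (web services, RSS, or a database export).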

Answer 5

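First, this question goes into too much detail. Second, extracting data from a website might not be legitimate. However, I have some hints:

Use Firebug or the Chrome / Safari Inspector to explore the HTML content and find patterns around the interesting information
Test your RegEx to see whether it matches. You may need to do this several times (multi-pass parsing/extraction)
Write a client via cURL or, even simpler, use file_get_contents (note that some hosts disable loading URLs with file_get_contents)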
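
The last hint could be sketched roughly as follows: try cURL when the extension is loaded and fall back to file_get_contents otherwise. The function name fetch_page and the 10-second default timeout are assumptions made for illustration:

```php
<?php
// Hypothetical helper: return the body found at $url, or null on failure.
function fetch_page(string $url, int $timeout = 10): ?string
{
    if (function_exists('curl_init')) {
        $ch = curl_init($url);
        curl_setopt_array($ch, [
            CURLOPT_RETURNTRANSFER => true,     // return the body instead of printing it
            CURLOPT_FOLLOWLOCATION => true,     // follow redirects
            CURLOPT_TIMEOUT        => $timeout, // give up after $timeout seconds
        ]);
        $body = curl_exec($ch);
        return $body === false ? null : $body;
    }

    // Fallback: only works for URLs when allow_url_fopen is enabled on the host.
    $body = @file_get_contents($url);
    return $body === false ? null : $body;
}

$html = fetch_page('http://www.example.com');
echo $html === null ? "Fetch failed\n" : strlen($html) . " bytes fetched\n";
```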

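For me, I would rather use Tidy to convert the page to valid XHTML and then use XPath to extract the data, instead of RegEx. Why? Because XHTML is not regular and XPath is very flexible. You can learn XSLT to do transformations.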
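
A rough sketch of that Tidy-then-XPath route, run on a made-up, deliberately sloppy table rather than a real page. The "pricing" class name and the sample values are assumptions; Tidy is only applied when the extension is loaded, otherwise the sketch relies on the lenient HTML parser behind DOMDocument:

```php
<?php
// Made-up markup with omitted closing tags, standing in for a fetched page.
$html = '<table class="pricing"><tr><th>1-9<td>$24.95<tr><th>10-99<td>$22.46</table>';

// Repair with Tidy when the extension is loaded.
if (function_exists('tidy_repair_string')) {
    $html = tidy_repair_string($html, ['output-xhtml' => true], 'utf8');
}

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$prices = [];
// Walk each row of the pricing table and pair the first cell with the second.
foreach ($xpath->query('//table[@class="pricing"]//tr') as $row) {
    $cells = $xpath->query('th|td', $row);
    if ($cells->length >= 2) {
        $prices[trim($cells->item(0)->textContent)] = trim($cells->item(1)->textContent);
    }
}

print_r($prices);
```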

Answer 6

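The simplest method to extract data from a website. I analysed the page and found that all of my data is contained within <h3> tags only, so I prepared this one.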

<?php
    include('simple_html_dom.php');
    // Create DOM from URL; put your target web URL in $page
    $page = 'http://facebook4free.com/category/facebookstatus/amazing-facebook-status/';
    $html = new simple_html_dom();

    // The web page is loaded into $html for further operations
    $html->load_file($page);

    // Collect all matching elements
    $links = array();
    // find('h3') fetches the content of <h3> tags only. Change as per your requirement.
    foreach ($html->find('h3') as $element)
    {
        $links[] = $element;
    }
    reset($links);
    // $out holds each HTML element you searched for within that web page
    foreach ($links as $out)
    {
        echo $out;
    }

?>

Extracting data from a website with PHP
