Question

I'm trying to extract product prices from a web page with a PHP script. The string in question contains the following HTML:

<div class="pd_warranty col-xs-12 no-padding">
    <p class="selectWty txtLeft">Available Options</p>
    <div class="vspace clear"></div>

<div class="subProd col-xs-4 noPadLR">
    <a href="https://www.example.com/single” class="selected">
        <div class="col-xs-12 cellTable pad5All">
            <div class="col-xs-8 noPadLR cellTableCell">
                <p class="noMar txtLeft">Single</p>
                <p class="noMar txtLeft sml">$99.99</p>
            </div>
        </div>
    </a>
</div>

<div class="subProd col-xs-4 noPadLR">
    <a href="https://www.example.com/2pack” class="">
        <div class="col-xs-12 cellTable pad5All">
            <div class="col-xs-8 noPadLR cellTableCell">
                <p class="noMar txtLeft">2-PACK</p>
                <p class="noMar txtLeft sml">$159.99</p>
            </div>
        </div>
    </a>
</div>

<div class="subProd col-xs-4 noPadLR">
    <a href="https://www.example.com/4pack” class="">
        <div class="col-xs-12 cellTable pad5All">
            <div class="col-xs-8 noPadLR cellTableCell">
                <p class="noMar txtLeft">4-PACK</p>
                <p class="noMar txtLeft sml">$249.99</p>
            </div>
        </div>
    </a>
</div>

</div>

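Most products have three sets of prices: Single, 2-PACK and 4-PACK.

Some pages may be missing the 2-PACK, the 4-PACK, or both.

I tried to write a regular expression, but I can't extract the information I need from a variable holding the string above. I'm trying to build a PHP regex that extracts the Single / 2-PACK / 4-PACK labels and their prices into an array of [type] [price], so the array shows which types have a price in the HTML.

Any help with the regex is much appreciated.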

Answer 1

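Note that parsing HTML with regular expressions is fragile and will break most of the time the HTML changes. You constantly have to compromise between matching too specifically and matching too loosely.

This: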

$pattern = '#<div class="subProd.*?<p class="noMar[^>]+>(?P<product>[^<]+).*?<p class="noMar[^>]+>(?P<price>[^<]+)<#smi';
if (preg_match_all($pattern, $html, $matches)) {
    $products = array_combine($matches['product'], $matches['price']);

    var_dump($products);
}

will dump:

array(3) {
   ["Single"]=> string(6) "$99.99"
   ["2-PACK"]=> string(7) "$159.99"
   ["4-PACK"]=> string(7) "$249.99"
}
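
For pages that lack the 2-PACK or 4-PACK option, the same pattern simply returns fewer matches. The following is a minimal sketch, not part of the original answer: the trimmed-down $html sample and the $expected list of labels are assumptions made for illustration. It marks every label that was not found as null.

```php
<?php
// Hypothetical sample: a page that offers only Single and 4-PACK.
$html = <<<'HTML'
<div class="subProd col-xs-4 noPadLR">
    <a href="https://www.example.com/single" class="selected">
        <p class="noMar txtLeft">Single</p>
        <p class="noMar txtLeft sml">$99.99</p>
    </a>
</div>
<div class="subProd col-xs-4 noPadLR">
    <a href="https://www.example.com/4pack" class="">
        <p class="noMar txtLeft">4-PACK</p>
        <p class="noMar txtLeft sml">$249.99</p>
    </a>
</div>
HTML;

// Same pattern as in the answer.
$pattern = '#<div class="subProd.*?<p class="noMar[^>]+>(?P<product>[^<]+).*?<p class="noMar[^>]+>(?P<price>[^<]+)<#smi';

// Collect whatever label => price pairs the page actually contains.
$found = [];
if (preg_match_all($pattern, $html, $matches)) {
    $found = array_combine($matches['product'], $matches['price']);
}

// Assumed list of labels, taken from the question.
$expected = ['Single', '2-PACK', '4-PACK'];

// Every expected label gets a key; missing ones are set to null.
$prices = [];
foreach ($expected as $type) {
    $prices[$type] = $found[$type] ?? null;
}

var_export($prices);
```

With this sample, $prices keeps all three keys and 2-PACK is null because the page has no such block.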

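Pattern explanation:

# is the pattern delimiter.
<div class="subProd matches that string literally.
.*? matches any character any number of times, but is not greedy. That means it matches the shortest possible string up to the next part of the pattern that matches.
<p class="noMar matches that string literally.
[^>]+> is a character class followed by a literal >. It matches any character except > at least once, until it finds >.
(?P<product>[^<]+) is a named capture group (the part inside the parentheses). It makes the match available afterwards under the key product in $matches. It matches any character except < at least once.
.*? any character, ungreedy.
<p class="noMar literal string.
[^>]+> any character except >, until >.
(?P<price>[^<]+)< any character except <, until <. The part before the < is captured in the group price.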
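
The same breakdown can live inside the pattern itself. The sketch below is one possible rewrite, not the answer's original code: it adds the x (extended) modifier so that whitespace in the pattern is ignored and # starts a comment. Two changes follow from that and are my assumptions about how to adapt it: the delimiter becomes a tilde, and the literal spaces become \s+, which is slightly more permissive than a single space. The one-block $html sample is hypothetical.

```php
<?php
// Extended mode (x): unescaped whitespace in the pattern is ignored and
// "# ..." runs to the end of the line as a comment.
$pattern = '~
    <div\s+class="subProd      # start of one option block
    .*?                        # lazily skip to the first paragraph
    <p\s+class="noMar[^>]+>    # rest of the opening p tag
    (?P<product>[^<]+)         # label text, such as Single
    .*?                        # lazily skip to the next paragraph
    <p\s+class="noMar[^>]+>    # opening tag of the price paragraph
    (?P<price>[^<]+)<          # price text, up to the next tag
~smix';

// Hypothetical sample with a single option block.
$html = '<div class="subProd col-xs-4 noPadLR">'
      . '<p class="noMar txtLeft">Single</p>'
      . '<p class="noMar txtLeft sml">$99.99</p>'
      . '</div>';

// PREG_SET_ORDER groups the captures per match instead of per group.
preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);
foreach ($matches as $m) {
    echo $m['product'], ' => ', $m['price'], PHP_EOL;
}
```

The comments inside the pattern must not contain the delimiter or a single quote, since either would end the pattern or the PHP string early.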

Answer 2

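There are many ways to tailor the XPath and the node iteration, but this works for your sample string. You can refine this solution as much or as little as you need.

(Jakub compelled me to post this answer, because I don't want you to have to resort to regex.)

Code: (Demo)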

$dom = new DOMDocument;
$dom->loadHTML(str_replace('”', '"', $html));  // normalize the quoting; extend as needed
$xpath = new DOMXPath($dom);
$result = [];  // initialize so var_export() works even when nothing matches
//                        actually targeting this div ---------vvv
foreach ($xpath->evaluate("//div[contains(@class, 'subProd')]//div[contains(p/@class, 'noMar')]") as $div) {
    $type = $xpath->query("p[contains(@class, 'noMar') and not(contains(@class, 'sml'))]", $div)[0]->nodeValue;
    $price = $xpath->query("p[contains(@class, 'noMar') and contains(@class, 'sml')]", $div)[0]->nodeValue;
    $result[$type] = $price;
}
var_export($result);

Output:

array (
  'Single' => '$99.99',
  '2-PACK' => '$159.99',
  '4-PACK' => '$249.99',
)

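Explanation...

The input to foreach() targets each div that contains one or more child p elements with noMar in the class attribute. For each qualifying div found in the HTML...

type is the text extracted from the p element whose class has noMar but not sml
price is the text extracted from the p element whose class has both noMar and sml

I store the extracted data as a one-dimensional associative array.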
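
The steps above can also be wrapped in a function that tolerates missing options. This is a rough sketch under stated assumptions rather than the answer's own code: extractPrices() and the $hasClass closure are hypothetical names, and the XPath class test is swapped from a substring check to a whole-token check, so that sml would not also match a class such as smaller.

```php
<?php
/**
 * Hypothetical helper (not from the answer above): returns [type => price]
 * for every option block that has both a label and a price paragraph.
 */
function extractPrices(string $html): array
{
    // Collect libxml parse warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML(str_replace('”', '"', $html)); // normalize curly quotes
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);

    // Builds an XPath test that is true only when $name is a whole class token.
    $hasClass = function (string $name): string {
        return "contains(concat(' ', normalize-space(@class), ' '), ' $name ')";
    };

    $result = [];
    foreach ($xpath->query('//div[' . $hasClass('subProd') . ']') as $block) {
        // Label: has noMar but not sml. Price: has both noMar and sml.
        $type  = $xpath->query('.//p[' . $hasClass('noMar') . ' and not(' . $hasClass('sml') . ')]', $block)->item(0);
        $price = $xpath->query('.//p[' . $hasClass('noMar') . ' and ' . $hasClass('sml') . ']', $block)->item(0);

        // Skip blocks where either paragraph is missing.
        if ($type !== null && $price !== null) {
            $result[trim($type->nodeValue)] = trim($price->nodeValue);
        }
    }

    return $result;
}

// Hypothetical sample: a page that offers only the Single option.
$html = <<<'HTML'
<div class="subProd col-xs-4 noPadLR">
    <a href="https://www.example.com/single" class="selected">
        <div class="col-xs-8 noPadLR cellTableCell">
            <p class="noMar txtLeft">Single</p>
            <p class="noMar txtLeft sml">$99.99</p>
        </div>
    </a>
</div>
HTML;

var_export(extractPrices($html));
```

A page without any subProd block yields an empty array, so the caller can check with isset() or array_key_exists() whether a given type is offered.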

Regex help extracting prices from HTML
