Question

I have the following sample text:

<p>in <span class="nanospell-typo">der</span> <span class="nanospell-typo"><dreipc data-type="abbreviation" data-uid="41">DDR</dreipc></span> <span class="nanospell-typo">kollaborieren</span>, <span class="nanospell-typo">gibt</span> es</p>
<li>per Post an <strong>&lt;dreipc data-type="abbreviation" data-uid="48"&gt;someAbbreviation&lt;/dreipc&gt;, 10106 Berlin</strong> oder</li>

and the following two regular expression patterns:

/(?:\<dreipc\ )(?:[^\>]*)(?:data\-type\=\")(.*?)(?:\"\ data\-uid\=\")(.*?)(?:\>)(.*?)(?:\<\/dreipc\>)/
/(?:&lt;dreipc\ )(?:[^\>]*)(?:data\-type\=\")(.*?)(?:\"\ data\-uid\=\")(.*?)(?:&gt;)(.*?)(?:&lt;\/dreipc&gt;)/

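The first regex works both on regex101.com and in PHP. The second one matches on regex101.com but not in PHP, and I don't understand why. I really only need the first regex, but it finds no match when the markup contains HTML entities (htmlentities), which is why I added the second pattern. I also don't want to call html_entity_decode on my string: the strings are mostly very long, and I don't want to decode HTML entities that may be needed.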

My PHP code looks like this:

class MyClass {
    const DREIPC_REGEX = '/(?:\<dreipc\ )(?:[^\>]*)(?:data\-type\=\")(.*?)(?:\"\ data\-uid\=\")(.*?)(?:\>)(.*?)(?:\<\/dreipc\>)/';
    const DREIPC_REGEX_HTMLENTITIES = '/(?:&lt;dreipc\ )(?:[^\>]*)(?:data\-type\=\")(.*?)(?:\"\ data\-uid\=\")(.*?)(?:&gt;)(.*?)(?:&lt;\/dreipc&gt;)/';


    public static function pregMatchHTMLNode($string = '')
    {
        $result = [];
        preg_match_all(self::DREIPC_REGEX, $string, $matches, PREG_SET_ORDER, 0);
        preg_match_all(self::DREIPC_REGEX_HTMLENTITIES, $string, $matchesHtmlentities, PREG_SET_ORDER, 0);
        $matches = array_merge($matches, $matchesHtmlentities);

        // ... doing some other things with matches
        return $result;
    }
}

So the best solution would be to get preg_match_all() to work with my second pattern. But how?

Answer 1

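In my frontend output the quotes were not encoded, only the angle brackets of my code. But var_dump() showed that the quotes are encoded as well. So I changed the regex pattern to: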

(?:&lt;dreipc\ )(?:[^\>]*)(?:data\-type\=&quot;)(.*?)(?:&quot;\ data\-uid\=\&quot;)(.*?)(?:&quot;&gt;)(.*?)(?:&lt;\/dreipc&gt;)

Now it works.
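
To illustrate the difference, here is a minimal sketch. It assumes the string PHP actually receives is the list item from the question with the quotes encoded as &quot; too, which is what var_dump() showed. The variable names are made up for this example, and the $either pattern at the end is not part of the original fix; it is just one hypothetical way to accept both the raw and the entity-encoded form with a single expression.

```php
<?php
// Assumed input: the <li> from the question, with quotes encoded as &quot;.
$encoded = '<li>per Post an <strong>&lt;dreipc data-type=&quot;abbreviation&quot; '
    . 'data-uid=&quot;48&quot;&gt;someAbbreviation&lt;/dreipc&gt;, 10106 Berlin</strong> oder</li>';

// Raw markup from the first sample line, for comparison.
$raw = '<dreipc data-type="abbreviation" data-uid="41">DDR</dreipc>';

// Second pattern from the question: it expects literal double quotes.
$literalQuotes = '/(?:&lt;dreipc\ )(?:[^\>]*)(?:data\-type\=\")(.*?)(?:\"\ data\-uid\=\")(.*?)(?:&gt;)(.*?)(?:&lt;\/dreipc&gt;)/';

// Pattern from the answer, wrapped in / delimiters: it expects &quot;.
$encodedQuotes = '/(?:&lt;dreipc\ )(?:[^\>]*)(?:data\-type\=&quot;)(.*?)(?:&quot;\ data\-uid\=\&quot;)(.*?)(?:&quot;&gt;)(.*?)(?:&lt;\/dreipc&gt;)/';

// Finds nothing, because the string contains no literal double quote.
$countLiteral = preg_match_all($literalQuotes, $encoded, $m1, PREG_SET_ORDER);

// Finds the tag; groups 1..3 hold the type, the uid and the inner text.
$countEncoded = preg_match_all($encodedQuotes, $encoded, $m2, PREG_SET_ORDER);

// Hypothetical combined pattern: each delimiter may be raw or encoded.
$either = '/(?:<|&lt;)dreipc\s[^>]*?data-type=(?:"|&quot;)(.*?)(?:"|&quot;)'
    . '\sdata-uid=(?:"|&quot;)(.*?)(?:"|&quot;)(?:>|&gt;)(.*?)(?:<|&lt;)\/dreipc(?:>|&gt;)/';

$countEitherRaw = preg_match_all($either, $raw, $m3, PREG_SET_ORDER);
$countEitherEncoded = preg_match_all($either, $encoded, $m4, PREG_SET_ORDER);

echo "literal quotes on encoded input: $countLiteral\n";
echo "encoded quotes on encoded input: $countEncoded\n";
echo "combined on raw input:           $countEitherRaw\n";
echo "combined on encoded input:       $countEitherEncoded\n";
```

When a pattern matches on regex101.com but not in PHP, it can help to dump the exact string that reaches preg_match_all(), for example with var_dump(), and paste that into the tester rather than the text copied from the rendered page.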

Regex works on regex101 but not in PHP
