Parsing a search string into phrases and keywords

Posted: 2011-10-30 05:13:23

Tags: php regex string parsing

I need to parse a search string into keywords and phrases in PHP. For example,

String 1: value of "measured response" detect goal "method valuation" study

would produce: value,of,measured response,detect,goal,method valuation,study

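I also need it to work if the string has:

  1. no phrases enclosed in quotes,
  2. any number of phrases enclosed in quotes, with any number of keywords outside the quotes,
  3. only phrases in quotes,
  4. only keywords separated by spaces.

I am leaning towards using preg_match with the pattern '/(\".*\")/' to put the phrases into an array, then removing the phrases from the string, and finally putting the keywords into the array. I just can't pull everything together!

I am also thinking of replacing the spaces outside the quotes with commas and then exploding the result into an array. If that is a better option, how would I do it with preg_replace?

Is there a better way to go about this? Help! Many thanks, everyone.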
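
For reference, the two-step approach described above could be sketched roughly as follows. This is only an illustration, not the asker's actual code: the function name splitPhrasesAndKeywords is made up, and the pattern uses [^"]+ instead of the greedy .* from the question, because a greedy .* would match everything from the first quote to the last one. Note that this approach returns phrases and keywords separately, so the original order of the terms is lost.

```php
<?php
// Hypothetical helper: extract quoted phrases first, then split what is left.
function splitPhrasesAndKeywords(string $query): array
{
    // Capture the text between each pair of double quotes.
    preg_match_all('/"([^"]+)"/', $query, $m);
    $phrases = $m[1];

    // Remove the quoted phrases, leaving only the bare keywords.
    $rest = preg_replace('/"[^"]+"/', ' ', $query);

    // Split the remainder on whitespace, dropping empty pieces.
    $keywords = preg_split('/\s+/', $rest, -1, PREG_SPLIT_NO_EMPTY);

    return ['phrases' => $phrases, 'keywords' => $keywords];
}

$parts = splitPhrasesAndKeywords('value of "measured response" detect goal "method valuation" study');
print_r($parts);
```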

3 answers:

Answer 0: (score: 10)

preg_match_all('/(?<!")\b\w+\b|(?<=")\b[^"]+/', $subject, $result, PREG_PATTERN_ORDER);
for ($i = 0; $i < count($result[0]); $i++) {
    # Matched text = $result[0][$i];
}

This should produce the results you are looking for.

Explanation:

# (?<!")\b\w+\b|(?<=")\b[^"]+
# 
# Match either the regular expression below (attempting the next alternative only if this one fails) «(?<!")\b\w+\b»
#    Assert that it is impossible to match the regex below with the match ending at this position (negative lookbehind) «(?<!")»
#       Match the character “"” literally «"»
#    Assert position at a word boundary «\b»
#    Match a single character that is a “word character” (letters, digits, etc.) «\w+»
#       Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
#    Assert position at a word boundary «\b»
# Or match regular expression number 2 below (the entire match attempt fails if this one fails to match) «(?<=")\b[^"]+»
#    Assert that the regex below can be matched, with the match ending at this position (positive lookbehind) «(?<=")»
#       Match the character “"” literally «"»
#    Assert position at a word boundary «\b»
#    Match any character that is NOT a “"” «[^"]+»
#       Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
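
To check the pattern against the four cases listed in the question, it can be wrapped in a small function. The sketch below is an illustration only; the name extractTerms and the sample strings are assumptions, not part of the original answer. One thing worth keeping in mind is that \w+ only matches word characters, so an unquoted keyword such as well-known comes back as two terms.

```php
<?php
// Hypothetical wrapper around the pattern from this answer.
function extractTerms(string $subject): array
{
    preg_match_all('/(?<!")\b\w+\b|(?<=")\b[^"]+/', $subject, $result, PREG_PATTERN_ORDER);
    return $result[0]; // full matches, in the order they appear
}

$cases = [
    'value of study',                                                    // keywords only
    'value of "measured response" detect goal "method valuation" study', // mixed
    '"measured response" "method valuation"',                            // phrases only
];

foreach ($cases as $case) {
    echo implode(',', extractTerms($case)), PHP_EOL;
}
```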

Answer 1: (score: 2)

$s = 'value of "measured response" detect goal "method valuation" study';
preg_match_all('~(?|"([^"]+)"|(\S+))~', $s, $matches);
print_r($matches[1]);

Output:

Array
(
    [0] => value
    [1] => of
    [2] => measured response
    [3] => detect
    [4] => goal
    [5] => method valuation
    [6] => study
)

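The trick here is to use a branch-reset group: (?|...|...). It works like an alternation wrapped in a non-capturing group, (?:...|...), except that within each branch the capturing-group numbering starts at the same number. (For details, see the PCRE docs and search for DUPLICATE SUBPATTERN NUMBERS.)

So the text we are interested in is always captured in group #1. You can retrieve the contents of group #1 for all matches via $matches[1]. (This assumes the PREG_PATTERN_ORDER flag is set; I did not specify it as @FailedDev did because it is the default. See the PHP docs for details.)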
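
To see what the branch-reset group buys you, it may help to compare it with the same alternation in a plain non-capturing group. The following is a sketch for illustration only; the variable names are made up. Without branch reset, quoted phrases land in group 1 and bare keywords in group 2, so the two groups have to be merged by hand.

```php
<?php
$s = 'value of "measured response" detect goal "method valuation" study';

// Branch-reset group: both branches capture into group 1.
preg_match_all('~(?|"([^"]+)"|(\S+))~', $s, $reset);
$withReset = $reset[1];

// Plain non-capturing group: phrases go to group 1, keywords to group 2.
preg_match_all('~(?:"([^"]+)"|(\S+))~', $s, $plain, PREG_SET_ORDER);
$withoutReset = [];
foreach ($plain as $m) {
    // Pick whichever group actually took part in this match.
    $withoutReset[] = (isset($m[2]) && $m[2] !== '') ? $m[2] : $m[1];
}

var_dump($withReset === $withoutReset);
```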

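Answer 2: (score: 1)

There is no need for a regular expression: the built-in function str_getcsv can be used to explode a string given any delimiter, enclosure, and escape character.

It really is that simple.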

// where $string is the string to parse
$array = str_getcsv($string, ' ', '"');
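
A slightly fuller sketch of the same idea is below. It is an illustration rather than part of the original answer: the function name parseSearchTerms is made up, the escape character is passed explicitly, and empty fields are filtered out on the assumption that runs of several spaces should not produce empty terms.

```php
<?php
// Hypothetical wrapper around str_getcsv with a space as the delimiter.
function parseSearchTerms(string $query): array
{
    // Delimiter: space, enclosure: double quote, escape: backslash.
    $fields = str_getcsv(trim($query), ' ', '"', '\\');

    // Consecutive spaces (or an empty query) yield empty fields; drop them.
    $fields = array_filter($fields, function ($term) {
        return $term !== null && $term !== '';
    });

    return array_values($fields); // reindex from 0
}

print_r(parseSearchTerms('value of "measured response" detect goal "method valuation" study'));
```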