Extracting double-quoted strings with escape sequences

Time: 2018-03-17 19:46:09

Tags: php regex parsing

I have some text of the form:

This is some text, and here's some in "double quotes"
"and here's a double quote:\" and some more", "text that follows"

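The text contains strings within double quotes, as shown above. A double quote can be escaped with a backslash (\). In the text above there are three such strings: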

"double quotes"
"and here's a double quote:\" and some more"
"text that follows"

To extract these strings, I tried the regex:

"(?:\\"|.)*?"

However, this gives me the following result:

>>> preg_match_all('%"(?:\\"|.)*?"%', $msg, $matches)
>>> $matches
[
  [ "double quotes",
    "and here's a double quote:\",
    ", "
  ]
]

How can I get the strings correctly?

3 answers:

Answer 0 (score: 2)

One way to do this involves negative lookbehinds:

".*?(?<!\\)"

In PHP this would be:

<?php

$text = <<<TEXT
This is some text, and here's some in "double quotes"
"and here's a double quote:\" and some more", "text that follows"
TEXT;

$regex = '~".*?(?<!\\\\)"~';

if (preg_match_all($regex, $text, $matches)) {
    print_r($matches);
}
?>

This yields

Array
(
    [0] => Array
        (
            [0] => "double quotes"
            [1] => "and here's a double quote:\" and some more"
            [2] => "text that follows"
        )

)

See a demo on regex101.com. To have it span multiple lines, enable dotall mode, for example with the s modifier.

See also a demo for the latter on regex101.com.
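As a minimal sketch of the dotall variant, the lookbehind pattern from this answer can be given the s modifier so that . also matches newlines. The sample text and variable names below are made up for illustration.

```php
<?php
// Hypothetical sample: the quoted string spans two lines and
// contains one escaped double quote.
$text = "before \"first line\nsecond \\\" line\" after";

// Same lookbehind pattern as in the answer; the trailing s enables
// dotall mode, so . also matches newline characters.
$regex = '~".*?(?<!\\\\)"~s';

preg_match_all($regex, $text, $matches);
print_r($matches[0]);
```

Without the s modifier the dot cannot cross the line break, so the string would not be matched as a whole.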

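Answer 1 (score: 2)

If you echo your pattern, you'll see it is indeed passed as %"(?:\"|.)*?"% to the regex parser. The regex parser also treats a single backslash as an escape character, so \" simply matches a plain ".

So if the pattern is inside single quotes, you need to add at least one more backslash so that two backslashes reach the parser (one to escape the other backslash). The pattern then becomes: %"(?:\\"|.)*?"%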

preg_match_all('%"(?:\\\"|.)*?"%', $msg, $matches);
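The escaping can be checked directly. Here is a minimal sketch, with variable names chosen for illustration, that prints what each single-quoted PHP literal hands to the regex engine and then runs the corrected pattern on the question's text:

```php
<?php
// Text from the question; a nowdoc keeps the backslash exactly as typed.
$msg = <<<'TEXT'
This is some text, and here's some in "double quotes"
"and here's a double quote:\" and some more", "text that follows"
TEXT;

$original  = '%"(?:\\"|.)*?"%';   // PHP collapses \\ into one backslash
$corrected = '%"(?:\\\"|.)*?"%';  // \\ followed by \" leaves two backslashes

echo $original, "\n";   // prints %"(?:\"|.)*?"%
echo $corrected, "\n";  // prints %"(?:\\"|.)*?"%

preg_match_all($corrected, $msg, $matches);
print_r($matches[0]);
```

With the corrected literal, the three quoted strings from the question are returned whole.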

This still isn't a very efficient pattern. The question actually seems to be a duplicate of this one.

There is a better pattern available in this answer (what some would call unrolled).

preg_match_all('%"[^"\\\]*(?:\\\.[^"\\\]*)*"%', $msg, $matches);

See the demo at eval.in, or compare the steps with the other pattern in regex101.

Answer 2 (score: 1)
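To illustrate how the unrolled form treats escapes, here is a small sketch run on a made-up input that contains an escaped backslash right before a closing quote. The sample string and variable names are assumptions for illustration. The pattern consumes a backslash together with the character after it, so \\ is never mistaken for the start of \".

```php
<?php
// Hypothetical input: "a\\" ends with an escaped backslash,
// "b\"c" contains an escaped double quote.
$msg = <<<'TEXT'
x "a\\" y "b\"c" z
TEXT;

// Unrolled pattern: runs of ordinary characters, separated by
// backslash-plus-one-character escape pairs.
$unrolled = '%"[^"\\\]*(?:\\\.[^"\\\]*)*"%';

preg_match_all($unrolled, $msg, $matches);
print_r($matches[0]);
```

By contrast, a check that only looks at the single character before a quote would treat the closing quote of "a\\" as escaped.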

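If you let the regex capture the backslash as an ordinary character, it will terminate your capture group at the " of \" (because the preceding \ is treated as a single character on its own). So what you need to do is allow \" to be captured, but not \ or " individually. The result is the following regex: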

"((?:[^"\\]*(?:\\")*)*)"

Try it here!

A detailed explanation follows:

"                begin with a double quote character
(                capture only what follows (within " characters)
  (?:            don't break into separate capture groups
    [^"\\]*      capture any non-" non-\ characters, any number of times
    (?:\\")*     capture any \" escape sequences, any number of times
  )*             allow the previous two groups to occur any number of times, in any order
)                end the capture group
"                make sure it ends with a "

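Note that in many languages, when you pass a regex string to a method that parses some text, you need to escape backslashes, quotes, and so on. In PHP, the above becomes: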

'/"((?:[^"\\\\]*(?:\\\\")*)*)"/'
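A short sketch of how that PHP literal might be used, assuming the question's text is held in a variable named $msg (a name chosen here for illustration). Group 1 holds the string contents without the surrounding quotes:

```php
<?php
// Text from the question; a nowdoc keeps the backslash exactly as typed.
$msg = <<<'TEXT'
This is some text, and here's some in "double quotes"
"and here's a double quote:\" and some more", "text that follows"
TEXT;

$regex = '/"((?:[^"\\\\]*(?:\\\\")*)*)"/';

preg_match_all($regex, $msg, $matches);

// $matches[0] holds the full matches including the quotes,
// $matches[1] holds only the captured contents.
print_r($matches[1]);
```

Note that this pattern only recognizes \" as an escape. A backslash followed by any other character inside the quotes would keep that string from being matched as a whole.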