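PHP regex: match a string up to a given point

Time: 2012-09-07 22:40:09

Tags: php regex preg-match

I have a URL grabber set up, and it works fine. It grabs the URL of a doc from the response header, for example: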

<script type='text/javascript' language='JavaScript'>
document.location.href = 'http\x3a\x2f\x2fcms.example.com\x2fd\x2fd\x2fworkspace\x2fSpacesStore\x2f61d96949-b8fb-43f1-adaf-0233368984e0\x2fFinancial\x2520Agility\x2520Report.pdf\x3fguest\x3dtrue'
</script>   

Here is my grabber script.

<?php

set_time_limit(0);
$target_url = $_POST['to'];
$html =file_get_contents($target_url);

$pattern = "/document.location.href = '([^']*)'/";
preg_match($pattern, $html, $matches, PREG_OFFSET_CAPTURE, 3);

$raw_url = $matches[1][0];
$eval_url = '$url = "'.$raw_url.'";';

eval($eval_url);
echo $url;

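We had to add a variable in our document management system, so every doc URL now needs ?guest=true at the end. Since we did this, my grabber returns the full URL and appends it to the filename. So I tried to make it grab the URL only up to the point where it reaches ?guest=true, using this code: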

<?php

set_time_limit(0);

$target_url = $_POST['to'];
$html =file_get_contents($target_url);

$pattern = "/document.location.href = '([^']*)\x3fguest\x3dtrue'/";

preg_match($pattern, $html, $matches, PREG_OFFSET_CAPTURE, 3);

$raw_url = $matches[1][0];
$eval_url = '$url = "'.$raw_url.'";';

eval($eval_url);
echo $url;

Why won't it return the URL up to the ?guest=true part? In other words, why doesn't this work, and what is the fix?

2 answers:

Answer 0: (score: 1)

Here is the solution. You get the match directly, rather than in a group.

set_time_limit(0);

$target_url = $_POST['to'];
$html = file_get_contents($target_url);

$pattern = '/(?<=document\.location\.href = \').*?(?=\\\\x3fguest\\\\x3dtrue)/';

preg_match($pattern, $html, $matches);

$raw_url = $matches[0];
$eval_url = '$url = "'.$raw_url.'";';

eval($eval_url);
echo $url;

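You can see the result here.

The problem with your regex is that you were not escaping certain characters in the string (. and \) that you want to match literally. Also, you don't need PREG_OFFSET_CAPTURE or the offset of 3. I'm guessing you copied those values from the example on this page.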
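To see why the pattern from the question cannot match, it may help to look at what PHP does to the string before the regex engine receives it. The following is a minimal sketch with a shortened sample string and illustrative variable names ($double, $asker, $fixed are not from the question): inside double quotes PHP decodes \x3f and \x3d itself, so the pattern ends up looking for guest=true rather than for the literal characters \x3fguest\x3dtrue.

```php
<?php
// In a double-quoted string PHP decodes \xNN escapes itself.
$double = "\x3fguest\x3dtrue";
var_dump($double === '?guest=true');        // bool(true)

// Shortened sample; single quotes keep the backslashes as literal characters.
$html = 'document.location.href = \'http\x3a\x2f\x2fcms.example.com\x2fReport.pdf\x3fguest\x3dtrue\'';

// Pattern as written in the question: PCRE sees ([^']*)?guest=true'
$asker = "/document.location.href = '([^']*)\x3fguest\x3dtrue'/";
var_dump(preg_match($asker, $html));        // int(0)

// Single-quoted pattern: \\\\ in PHP source becomes \\ for PCRE,
// which matches one literal backslash.
$fixed = '/document\.location\.href = \'([^\']*?)\\\\x3fguest\\\\x3dtrue\'/';
var_dump(preg_match($fixed, $html, $m));    // int(1)
echo $m[1], "\n";
```

The captured group still contains the undecoded \xNN escapes; only the ?guest=true suffix has been cut off.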

Here is an explanation of the regex pattern:

# (?<=document\.location\.href = ').*?(?=\\x3fguest\\x3dtrue)
# 
# Assert that the regex below can be matched, with the match ending at this position (positive lookbehind) «(?<=document\.location\.href = ')»
#    Match the characters “document” literally «document»
#    Match the character “.” literally «\.»
#    Match the characters “location” literally «location»
#    Match the character “.” literally «\.»
#    Match the characters “href = '” literally «href = '»
# Match any single character that is not a line break character «.*?»
#    Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
# Assert that the regex below can be matched, starting at this position (positive lookahead) «(?=\\x3fguest\\x3dtrue)»
#    Match the character “\” literally «\\»
#    Match the characters “x3fguest” literally «x3fguest»
#    Match the character “\” literally «\\»
#    Match the characters “x3dtrue” literally «x3dtrue»

This answer has been edited to reflect the updates to the question.
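The code above keeps the eval() step from the question to turn the \xNN escapes into real characters. Since the matched text comes from a remote page, one possible alternative is to decode the escapes explicitly. The sketch below assumes that the only escapes present are two-digit \xNN sequences; decode_hex_escapes is a hypothetical helper name, and the sample string is a shortened version of the one in the question.

```php
<?php
// Hypothetical helper: turn literal \xNN sequences into the characters they encode.
function decode_hex_escapes(string $s): string
{
    return preg_replace_callback(
        '/\\\\x([0-9a-fA-F]{2})/',
        function (array $m): string {
            return chr(hexdec($m[1]));
        },
        $s
    );
}

// Shortened sample of the script block from the question.
$html = 'document.location.href = \'http\x3a\x2f\x2fcms.example.com\x2fd\x2fFinancial\x2520Agility\x2520Report.pdf\x3fguest\x3dtrue\'';

// Same lookaround pattern as in the answer.
$pattern = '/(?<=document\.location\.href = \').*?(?=\\\\x3fguest\\\\x3dtrue)/';

$url = null;
if (preg_match($pattern, $html, $matches)) {
    $url = decode_hex_escapes($matches[0]);
    echo $url, "\n";   // http://cms.example.com/d/Financial%20Agility%20Report.pdf
}
```

Note that \x25 decodes to a percent sign, so \x2520 becomes %20 and the URL-encoded space in the filename is preserved.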

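Answer 1: (score: 0)

It looks like your regex is wrong. You have added \?guest=true to the regex, which matches ?guest=true literally.

In your example response header, the URL ends with \x3fguest\x3dtrue, which is different.

Try: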

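// Four backslashes in a double-quoted PHP string reach the regex as \\, i.e. one literal backslash.
$pattern = "/document.location.href = '([^']*)(\?|(\\\\x3f))guest(=|(\\\\x3d))true'/";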

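I just replaced the following subexpressions:

  • \? is now (\?|(\\x3f)), which matches either ? or \x3f literally
  • = is now (=|(\\x3d)), which matches either = or \x3d literally

This way, if the escaped hex representation of ? or = is used, the pattern will still match correctly.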
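As a quick check of this idea, the following sketch runs one alternation-based pattern against both spellings of the suffix. It is written in single quotes so that the backslashes reach the regex engine as \\x3f and \\x3d, and it uses non-capturing groups and a lazy quantifier, which are small variations on the pattern above rather than part of it. Both sample strings are shortened, made-up inputs.

```php
<?php
// Accept either the plain characters or their literal \xNN spellings.
$pattern = '/document\.location\.href = \'([^\']*?)(?:\?|\\\\x3f)guest(?:=|\\\\x3d)true\'/';

// Escaped form, as in the question (single quotes keep the backslashes).
$escaped = 'document.location.href = \'http\x3a\x2f\x2fcms.example.com\x2fa.pdf\x3fguest\x3dtrue\'';
// Plain form, with a real ? and =.
$plain   = "document.location.href = 'http://cms.example.com/a.pdf?guest=true'";

$results = [];
foreach ([$escaped, $plain] as $html) {
    if (preg_match($pattern, $html, $m)) {
        $results[] = $m[1];   // URL without the guest suffix
        echo $m[1], "\n";
    }
}
```

In both cases the first capture group holds the URL up to, but not including, the guest parameter.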