Question

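I'm working on a program that modifies the output of mysqldump as it is being generated. I currently have code that reads mysqldump's output in chunks of a fixed number of bytes. I need to be able to run regular-expression matches and regular-expression replacements on that output (running the regexes on the final text output is not possible, because the final file is several gigabytes in size). I'm writing the code in PHP, but I believe the problem (and its solution) should be language-agnostic.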

Right now I have pseudocode that looks like this:

$previous_chunk = "";
while (!end_of_file($reader)) {
    $chunk = $reader.read() //read in a few thousand characters from the file
    $double_chunk = $previous_chunk + $chunk;
    // do regular expressions on the double chunk (to catch matches that span the chunk boundary)
    $output_file.write($chunk);
    $previous_chunk = $chunk;
}

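I've run aground on two problems. The first is that the regular expressions are evaluated twice on every chunk, so if a match occurs inside a chunk (rather than spanning a chunk boundary), it triggers that match twice even though the matching text appears only once. The second problem is that this still doesn't let me do replacements on the matches. The regex replaces text in $double_chunk, but I only write $chunk to the output file, and $chunk is unaffected by the replacement.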

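One idea I had: I don't expect any of my regular expressions to need to span multiple lines (separated by \n characters), so I could create a second buffer in my program, run the regexes only on completed lines, and then write to the destination file line by line instead of chunk by chunk. Unfortunately, due to the nature of mysqldump's output, there are some very long lines (some are several hundred megabytes), so I don't think this is a viable option.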

How can I read this file using a reasonable amount of memory (say, a few tens of MB) and modify it in-stream with regular expressions?

Answer 1

// $reader and $output_file are assumed to be open stream handles
$chunk = fread($reader, $chunk_length); // read exactly $chunk_length bytes (or fewer if EOF is reached)
while (!feof($reader)) {
    $previous_chunk = $chunk;
    $chunk = fread($reader, $chunk_length); // read $chunk_length bytes (or fewer if EOF is reached)

    $double_chunk = $previous_chunk . $chunk;
    // run the regular-expression replacements on $double_chunk (to catch matches that span the chunk boundary)
    $previous_chunk = substr($double_chunk, 0, $chunk_length);
    $chunk = substr($double_chunk, $chunk_length);
    fwrite($output_file, $previous_chunk);
}

// run the regular-expression replacements on $chunk to process the last one (or the first and only one)
fwrite($output_file, $chunk);

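Problems 1 and 2 are both solved by performing the regular-expression replacement on $double_chunk and then assigning the resulting string back to $previous_chunk and $chunk, assuming that whatever you use as the replacement string does not itself re-trigger a match. The write is changed to use $previous_chunk, so that $chunk can still be modified on the next pass, when a match spanning the next chunk boundary gets its chance to be caught.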
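To make the loop above concrete, here is a minimal runnable sketch. The function name `replace_same_length` and its parameters are hypothetical, not part of the original answer. It assumes the streams are ordinary files (or in-memory streams), that matches are no longer than one chunk, and that the replacement has the same byte length as the match and does not re-trigger the pattern.

```php
<?php
// Hypothetical wrapper around the double-chunk loop; all names are illustrative.
function replace_same_length($in, $out, string $pattern, string $replacement, int $chunkLength): void
{
    $chunk = (string) fread($in, $chunkLength);
    while (!feof($in)) {
        $previous = $chunk;
        $chunk = (string) fread($in, $chunkLength);

        // Run the replacement on both chunks so boundary-spanning matches are seen.
        $double = preg_replace($pattern, $replacement, $previous . $chunk);
        if ($double === null || strlen($double) !== strlen($previous) + strlen($chunk)) {
            // The approach only works while the chunk boundaries stay where they were.
            throw new LengthException('regex failed or replacement changed the length');
        }

        // The first part is final and can be written; the rest is kept for the next pass.
        fwrite($out, substr($double, 0, $chunkLength));
        $chunk = (string) substr($double, $chunkLength);
    }
    // Process the last chunk (or the first and only one).
    fwrite($out, (string) preg_replace($pattern, $replacement, $chunk));
}

// Usage sketch: in-memory streams stand in for the mysqldump pipe and the output file.
$demoIn = fopen('php://memory', 'w+');
fwrite($demoIn, 'hello world, hello there');
rewind($demoIn);
$demoOut = fopen('php://memory', 'w+');
replace_same_length($demoIn, $demoOut, '/hello/', 'HOWDY', 8);
```

With a chunk length of 8, the second "hello" straddles a chunk boundary, so it is only caught when the two neighboring chunks are joined.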

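Importantly, the above assumes that the replacement has the same length as the string being replaced. If it doesn't, the chunk sizes change dynamically after the replacement, and the solution above is too naive to handle that. If your replacement strings differ in length, you will have to track the shifting chunk boundaries somehow.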
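One possible way to cope with replacements of a different length is to stop splitting at fixed offsets and keep a carry-over buffer instead. This is a different technique from the loop above, sketched here under stated assumptions: the helper name `stream_replace` is hypothetical, no match is ever longer than `$maxMatchLen` bytes, the pattern never matches the empty string, and it uses no anchors or lookaround that could see past the match. Any match that starts at least `$maxMatchLen` bytes before the end of the buffer cannot be affected by unread data, so it can be replaced and written immediately; the undecided tail is carried into the next read.

```php
<?php
// Hypothetical helper; $replace receives the matched text and returns its replacement.
function stream_replace($in, $out, string $pattern, callable $replace, int $chunkLen, int $maxMatchLen): void
{
    $buffer = '';
    while (true) {
        $data = fread($in, $chunkLen);
        // Assumes a blocking file or pipe, where an empty read means end of input.
        $atEof = ($data === false || $data === '');
        if (!$atEof) {
            $buffer .= $data;
        }

        // Matches starting before $safe cannot change when more data arrives.
        $safe = $atEof ? strlen($buffer) : strlen($buffer) - $maxMatchLen;
        $pos = 0;
        while ($pos < $safe
            && preg_match($pattern, $buffer, $m, PREG_OFFSET_CAPTURE, $pos) === 1
            && $m[0][1] < $safe) {
            [$text, $start] = $m[0];
            if ($text === '') {
                throw new InvalidArgumentException('pattern must not match the empty string');
            }
            fwrite($out, substr($buffer, $pos, $start - $pos)); // untouched text before the match
            fwrite($out, $replace($text));                      // replacement, any length
            $pos = $start + strlen($text);
        }

        // Flush text known to be match-free, then carry the undecided tail forward.
        if ($pos < $safe) {
            fwrite($out, substr($buffer, $pos, $safe - $pos));
            $pos = $safe;
        }
        $buffer = (string) substr($buffer, $pos);
        if ($atEof) {
            break;
        }
    }
}

// Usage sketch: in-memory streams stand in for the mysqldump pipe and the output file.
$dumpIn = fopen('php://memory', 'w+');
fwrite($dumpIn, 'CREATE TABLE `a`; CREATE TABLE `b`;');
rewind($dumpIn);
$dumpOut = fopen('php://memory', 'w+');
stream_replace($dumpIn, $dumpOut, '/CREATE TABLE/', function (string $matched): string {
    return 'CREATE TABLE IF NOT EXISTS';
}, 16, 12);
```

Because replaced text is written straight to the output and never rescanned, a replacement that contains the pattern (as in the usage sketch) does not re-trigger a match. Memory use stays around `$chunkLen + $maxMatchLen` bytes regardless of line length.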
