PHP - How to read blocks from a text file line by line

Date: 2016-03-15 17:24:11

Tags: php slice fgets

I have an input text file like the following:

BEGIN
#1 
#2 
#3 
#4 
#5 
#6 
1       2015-05-31  2001-11-24  'Name Surname'      ID_1        0 
2       2011-04-01  ?           ?                   ID_2        1 
2       2013-02-24  ?           ?                   ID_3        1 
2       2014-02-28  ?           'Name Surname'      ID_4        2 
END
#7      'value 1'
#8      'value 2'
#9      'value 3'
#10     'value 4'
END

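When the text file contains BEGIN, a loop starts there: each line starting with # is a key, and its values are the corresponding column of every following row, up to END. This produces an array like the following: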

Array ( [#1] => Array ( [0] => 1 [1] => 2 [2] => 2 [3] => 2 ) [#2] => Array ( [0] => 2015-05-31 [1] => 2011-04-01 [2] => 2013-02-24 [3] => 2014-02-28 ) [#3] => Array ( [0] => 2001-11-24 [1] => ? [2] => ? [3] => ? ) [#4] => Array ( [0] => 'Name Surname' [1] => ? [2] => ? [3] => 'Name Surname' ) [#5] => Array ( [0] => ID_1 [1] => ID_2 [2] => ID_3 [3] => ID_4 ) [#6] => Array ( [0] => 0 [1] => 1 [2] => 1 [3] => 2 ) )
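As a rough illustration of the mapping described above (not the asker's code), the sketch below turns the lines of one BEGIN...END block into columns keyed by the # lines. The function name `parse_loop_block` is hypothetical, and the sketch assumes that a value is either a single-quoted string or a run of non-blank characters.

```php
<?php
// Hypothetical helper: turns the lines of one BEGIN...END block into
// an array of columns keyed by the "#n" header lines.
function parse_loop_block(array $lines): array
{
    $keys = [];
    $columns = [];
    foreach ($lines as $line) {
        $line = trim($line);
        if ($line === '' || $line === 'BEGIN' || $line === 'END') {
            continue; // markers and blank lines carry no data
        }
        if ($line[0] === '#') {
            // Header line: register a new, initially empty column.
            $keys[] = $line;
            $columns[$line] = [];
            continue;
        }
        // Data row: a token is a quoted string or a run of non-blanks.
        preg_match_all("/'[^']*'|\S+/", $line, $m);
        foreach ($m[0] as $i => $token) {
            if (isset($keys[$i])) {
                $columns[$keys[$i]][] = $token;
            }
        }
    }
    return $columns;
}

$block = [
    "BEGIN", "#1 ", "#2 ", "#3 ",
    "1   2015-05-31  'Name Surname' ",
    "2   2011-04-01  ?              ",
    "END",
];
print_r(parse_loop_block($block));
```

Because each row is tokenized on its own, the number of keys never has to be baked into one large regular expression.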

Otherwise, where there is no BEGIN but a line starts with #, its value is the text between the single quotes, producing an array like the following:

Array ( [#7] => 'value 1' [#8] => 'value 2' [#9] => 'value 3' [#10] => 'value 4' )
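The second case can be sketched in the same spirit. The helper name `parse_single_entries` is hypothetical, and the sketch assumes each such line has exactly one `#n` key followed by one single-quoted value; the quotes are kept, as in the desired output above.

```php
<?php
// Hypothetical helper: collects "#n 'value'" lines into a key => value array.
function parse_single_entries(array $lines): array
{
    $entries = [];
    foreach ($lines as $line) {
        // "#<digits>", blanks, then one single-quoted value.
        if (preg_match("/^(#\d+)\s+('[^']*')\s*$/", trim($line), $m)) {
            $entries[$m[1]] = $m[2];
        }
    }
    return $entries;
}

$lines = ["#7      'value 1'", "#8      'value 2'", "END"];
print_r(parse_single_entries($lines));
```

Lines that do not match the pattern, such as the stray END, are simply skipped.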

This is what I want to obtain. My current code is the following:

<?php
    $time = microtime();
    $time = explode(' ', $time);
    $time = $time[1] + $time[0];
    $start = $time;

    ini_set("max_execution_time", 300); // 300 seconds = 5 minutes
    ini_set("pcre.backtrack_limit", "100000000"); // default 100k = "100000"
    ini_set("memory_limit", "1024M");

    $txt_path = "./test_2.txt";
    $txt_data = @file_get_contents($txt_path) or die("Could not access file: $txt_path");
    //echo $txt_data;

    /* BEGIN ARRAY FOR LOOP ENTRIES */

    $loop_pattern = "/BEGIN(.*?)END/s";
    preg_match_all($loop_pattern, $txt_data, $matches);
    $loops = $matches[0];
    $loops_count = count($loops);
    //echo("<br><br>".$loops_count."<br><br>");

    foreach ($loops as $key => $value) {
        $value = trim($value);
        $pattern = array("/BEGIN(.*?)/", "/END(.*?)/", "/[[:blank:]]+/");
        $replacement = array("", "", " ");
        $value = preg_replace($pattern, $replacement, $value);
        //echo $value."<br><br>";

        preg_match_all( '/^#\d+/m', $value, $matches );
        $keys = $matches[0];
        //print_r($keys);
        //echo "<br><br>";

        $value = preg_replace( '/^#\d+\s*/m', '', $value );

        $value = str_replace( "\n", " ", $value );

        $pattern = '/'.str_repeat( "('[^']+'|\S+)\s+", count( $keys ) ).'/';

        preg_match_all( $pattern, $value, $matches );
        //print_r($matches);
        //echo "<br><br>";

        $loop_dic = array_combine( $keys, array_slice( $matches, 1 ) );

        print_r( $loop_dic );
        echo("<br><br>");
    }

    /* END ARRAY FOR LOOP ENTRIES */

    /* BEGIN ARRAY FOR NO LOOP ENTRIES */

    $txt_data_without_loops = preg_replace( "/BEGIN(.*?)END/s", "", $txt_data );
    //echo $txt_data_without_loops;

    $pattern = array("/END(.*?)/", "/[[:blank:]]+/");
    $replacement = array("", " ");
    $txt_data_without_loops_clean = preg_replace($pattern, $replacement, $txt_data_without_loops);
    //echo $txt_data_without_loops_clean;
    preg_match_all( '/^#(.*?)\S+/m', $txt_data_without_loops_clean, $matches );
    $keys = $matches[0];
    //print_r($keys);
    $txt_data_without_loops_clean = preg_replace( '/^#(.*?)\S+\s*/m', '', $txt_data_without_loops_clean );
    //print_r($txt_data_without_loops_clean);

    $txt_data_without_loops_clean_no_newline = str_replace( "\n", " ", $txt_data_without_loops_clean );
    //print_r($txt_data_without_loops_clean_no_newline);
    $pattern = '/'.str_repeat( "('[^']+'|\S+)\s+", 1 ).'/';
    preg_match_all( $pattern, $txt_data_without_loops_clean_no_newline, $matches );
    //print_r( $matches[0] );

    $no_loop_dic = array_combine( $keys, $matches[0] );
    print_r( $no_loop_dic );
    echo("<br><br>");

    /* END ARRAY FOR NO LOOP ENTRIES */

    $time = microtime();
    $time = explode(' ', $time);
    $time = $time[1] + $time[0];
    $finish = $time;
    $total_time = round(($finish - $start), 4);
    echo '<br><br><b>Page generated in '.$total_time.' seconds.</b><br><br>';
?>

As a first approach, to get the BEGIN-END loops and the related arrays, I read the input file with:

$txt_path = "./input.txt";
$txt_data = @file_get_contents($txt_path) or die("<b>Could not access file: $txt_path</b><br><br>");

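This works for small files, but with a large input file the browser becomes unresponsive for a while (I tested in Firefox), perhaps because parsing the whole file saturates the RAM (my laptop has 3 GB of RAM).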

I tried the following settings in the PHP file:

ini_set("max_execution_time", 300); // 300 seconds = 5 minutes
ini_set("pcre.backtrack_limit", "100000000"); // default 100k = "100000"
ini_set("memory_limit", "1024M");

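They seem to solve the problem for files that are not too large, while for large files the process completes without errors, only without using many resources at any one moment... so this is not the best solution.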

Searching the web, I found this page, where I read:

  

If you are reading a file, read it line by line instead of reading the complete file into memory. Look at fgets and SplFileObject::fgets.

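So I decided to use fgets to read and parse the whole input file. After building an array of all the lines, I need to extract each loop from it and add it to loops_array, while putting the other no_loop key-value pairs into another array.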
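A minimal sketch of line-by-line reading with fgets, assuming the goal is to hold only the current line in memory. The generator name `read_lines` is hypothetical, and the temporary file exists only to keep the demo self-contained.

```php
<?php
// Hypothetical helper: yields trimmed, non-empty lines one at a time,
// so only the current line is held in memory.
function read_lines(string $path): Generator
{
    $handle = fopen($path, 'rb');
    if ($handle === false) {
        throw new RuntimeException("Could not access file: $path");
    }
    try {
        // fgets() returns false at end of file, which ends the loop.
        while (($line = fgets($handle)) !== false) {
            $line = trim($line);
            if ($line !== '') {
                yield $line;
            }
        }
    } finally {
        fclose($handle);
    }
}

// Demo on a throwaway file.
$tmp = tempnam(sys_get_temp_dir(), 'demo');
file_put_contents($tmp, "BEGIN\n#1 \n\n0\nEND\n");
foreach (read_lines($tmp) as $line) {
    echo $line, PHP_EOL;
}
unlink($tmp);
```

The explicit `$line !== ''` test keeps a line that is exactly `0`, which `array_filter()` without a callback would drop because PHP treats the string "0" as false.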

My idea, which seems fast, is to find the index of every BEGIN in this way:

$txt_path = "./test.txt";
$txt_data = @fopen($txt_path, "rb") or die("<b>Could not access file: $txt_path</b><br/><br/>");

$lines = array();
while ( !feof($txt_data) ) {
    $line = fgets($txt_data, 1024);
    //echo($line."<br/><br/>");
    array_push($lines, trim($line));
}

$lines = array_filter($lines);
//print_r($lines);
//echo("<br/><br/>");

$begins = array_keys($lines, "BEGIN");
//echo("<b>Begins:</b><br/><br/>");
//print_r($begins);
//echo("<br/><br/>");

But now I need to find the index of the first END after each element of the $begins array... If I do this:

$ends = array_keys($lines, "END");
//echo("<b>Ends:</b><br/><br/>");
//print_r($ends);
//echo("<br/><br/>");

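it also counts the END strings in the no_loop areas of the input file, whereas I should find the index of the first END that follows each BEGIN, and then combine them: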

$begins_ends = array_combine($begins, $ends);

and use array_slice to extract all the loops, finally adding each $loop to a new array $loops, in the following way:

$i = 0;
$loops = array();
foreach ($begins_ends as $key => $value) {
    $begin = trim($key);
    $end = trim($value);
    $loop = array_slice( $lines, $begin, ($end - $begin), false );
    $this_loop = array();
    for ($el=$begin; $el < $end+1; $el++) {
        array_push($this_loop, $lines[$el]);
        unset($lines[$el]);
    }
    array_push($loops, $this_loop);
    $loop = array_values($lines);
    //echo("<b>Loops Dictionary $i:</b><br/><br/>");
    //print_r($loop);
    //echo("<br/><br/>");
    $i++;
}
//print_r($loops);
//echo("<br/><br/>");

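The problem is getting the correct $ends array, without counting the END strings in the no_loop areas of the input file, so as to obtain the previous output: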

Array ( [#1] => Array ( [0] => 1 [1] => 2 [2] => 2 [3] => 2 ) [#2] => Array ( [0] => 2015-05-31 [1] => 2011-04-01 [2] => 2013-02-24 [3] => 2014-02-28 ) [#3] => Array ( [0] => 2001-11-24 [1] => ? [2] => ? [3] => ? ) [#4] => Array ( [0] => 'Name Surname' [1] => ? [2] => ? [3] => 'Name Surname' ) [#5] => Array ( [0] => ID_1 [1] => ID_2 [2] => ID_3 [3] => ID_4 ) [#6] => Array ( [0] => 0 [1] => 1 [2] => 1 [3] => 2 ) )

Array ( [#7] => 'value 1' [#8] => 'value 2' [#9] => 'value 3' [#10] => 'value 4' )

with the fastest approach and the lowest memory usage, so that large files no longer make the browser unresponsive.
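One possible way to build the pairs, sketched under the assumption that blocks are never nested: scan the lines once and remember the BEGIN that is still waiting for its END. The function name `pair_begins_with_ends` is hypothetical.

```php
<?php
// Hypothetical helper: maps the index of each "BEGIN" line to the index of
// the first "END" line after it. "END" lines met outside a block are ignored.
function pair_begins_with_ends(array $lines): array
{
    $pairs = [];
    $open = null; // index of the BEGIN currently waiting for its END
    foreach ($lines as $index => $line) {
        if ($line === 'BEGIN' && $open === null) {
            $open = $index;
        } elseif ($line === 'END' && $open !== null) {
            $pairs[$open] = $index;
            $open = null;
        }
    }
    return $pairs;
}

$lines = ['BEGIN', '#1', '1', 'END', "#7 'value 1'", 'END', 'BEGIN', '#2', '2', 'END'];
print_r(pair_begins_with_ends($lines));
```

For the sample above the result maps index 0 to 3 and index 6 to 9; the END at index 5 belongs to a no_loop area and is skipped. The returned array has the same shape as `$begins_ends`, so it could feed the existing array_slice loop.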

Thanks

1 Answer:

Answer 0: (score: 0)

It is worth noting that there is no need to use fgets(); fread() alone is enough. The source of this information is here.

As you can read there, file() is very similar to file_get_contents(), which was used before, so it should not make a difference.

The previous working code only needs a simple adjustment:

  • Contents of the test_2.txt file:
BEGIN
#1 
#2 
#3 
#4 
#5 
#6 
1       2015-05-31  2001-11-24  'Name Surname'      ID_1        0 
2       2011-04-01  ?           ?                   ID_2        1 
2       2013-02-24  ?           ?                   ID_3        1 
2       2014-02-28  ?           'Name Surname'      ID_4        2 
END
#7      'value 1'
#8      'value 2'
#9      'value 3'
#10     'value 4'
END
BEGIN
#11 
#12 
#13 
#14 
#15 
#16 
1       2015-05-31  2001-11-24  'Name Surname'      ID_5        0 
2       2011-04-01  ?           ?                   ID_6        1 
2       2013-02-24  ?           ?                   ID_7        1 
2       2014-02-28  ?           'Name Surname'      ID_8        2 
END
BEGIN
#17 
#18 
#19 
#20 
#21 
#22 
1       2015-05-31  2001-11-24  'Name Surname'      ID_9        0 
2       2011-04-01  ?           ?                   ID_10        1 
2       2013-02-24  ?           ?                   ID_11        1 
2       2014-02-28  ?           'Name Surname'      ID_12        2 
END
  • PHP code:
<?php
$time = microtime();
$time = explode(" ", $time);
$time = $time[1] + $time[0];
$start = $time;

$filename = "./test_2.txt";
$handle = fopen($filename, "rb") or die("<b>Could not access file: $filename</b><br/><br/>");
$contents = fread($handle, filesize($filename));
fclose($handle);

//echo($contents."<br><br>");

$loop_pattern = "/BEGIN(.*?)END/s";
preg_match_all($loop_pattern, $contents, $matches);
$loops = $matches[0];
//print_r($loops);
//echo("<br><br>");
$loops_count = count($loops);
//print_r($loops_count);
//echo "<br><br>";

foreach ($loops as $key => $value) {
    $value = trim($value);
    //echo($value."<br><br>");
    $pattern = array("/[[:blank:]]+/", "/BEGIN(.*)/", "/END(.*)/");
    $replacement = array(" ", "", "");
    $value = preg_replace($pattern, $replacement, $value);
    //echo($value."<br><br>");

    preg_match_all( '/^#\d+/m', $value, $matches );
    $keys = $matches[0];
    //print_r($keys);
    //echo "<br><br>";

    $value = preg_replace( '/^#\d+\s*/m', '', $value );

    $value = str_replace( "\n", " ", $value );

    $pattern = '/'.str_repeat( "('[^']+'|\S+)\s+", count( $keys ) ).'/';
    preg_match_all( $pattern, $value, $matches );
    //print_r($matches);
    //echo "<br><br>";

    $values = array_combine( $keys, array_slice( $matches, 1, count( $keys ), false ) );
    print_r( $values );
    echo "<br><br>";
}

$time = microtime();
$time = explode(" ", $time);
$time = $time[1] + $time[0];
$finish = $time;
$total_time = round(($finish - $start), 4);
echo("<br/><br/><b>Page generated in ".$total_time." seconds.</b><br/><br/>");
?>

I also removed the @, writing:

fopen($filename, "rb") or die("<b>Could not access file: $filename</b><br/><br/>");

instead of the previous:

@fopen($txt_path, "rb") or die("<b>Could not access file: $txt_path</b><br/><br/>");

as suggested here.
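For comparison, a sketch of opening the file without @ and without die(), assuming an exception is an acceptable way to report the failure. The helper name `open_for_reading` and the missing path in the demo are hypothetical.

```php
<?php
// Hypothetical helper: opens a file for reading and reports failure
// through an exception instead of relying on @ and die().
function open_for_reading(string $path)
{
    // Checking first avoids the warning fopen() emits for a missing file.
    if (!is_file($path) || !is_readable($path)) {
        throw new RuntimeException("Could not access file: $path");
    }
    $handle = fopen($path, 'rb');
    if ($handle === false) {
        throw new RuntimeException("Could not open file: $path");
    }
    return $handle;
}

try {
    $handle = open_for_reading('./no_such_file.txt'); // hypothetical missing path
    fclose($handle);
} catch (RuntimeException $e) {
    echo $e->getMessage(), PHP_EOL;
}
```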

Edit 1

Another approach is the following:

$txt_path = "./test_2.txt";
$handle = new SplFileObject($txt_path);

// Loop until we reach the end of the file.
$lines_array = array();
while ( !$handle->eof() ) {
    $line = $handle->fgets();
    //echo($line."<br/><br/>"); // Echo one line from the file.
    array_push($lines_array, trim($line));
}

// Unset the file to call __destruct(), closing the file handle.
$handle = null;

$lines_array = array_filter($lines_array);
//print_r($lines_array);
//echo("<br/><br/>");

$lines_joined = implode("\n", $lines_array);
//echo($lines_joined."<br/><br/>");
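Building on the SplFileObject loop above, the lines could also be classified as they are read, so that neither the joined string nor the regular expression over the whole file is needed. This is only a sketch under the assumptions that blocks are not nested and that values are either single-quoted strings or runs of non-blank characters; the function name `parse_stream` is hypothetical, and SplTempFileObject stands in for the real file to keep the demo self-contained.

```php
<?php
// Hypothetical single-pass parser: $lines may be an array or an SplFileObject.
function parse_stream(iterable $lines): array
{
    $loops = [];   // one array of columns per BEGIN...END block
    $entries = []; // "#n" => 'value' pairs found outside blocks
    $inLoop = false;
    $keys = [];
    $columns = [];
    foreach ($lines as $line) {
        $line = trim($line);
        if ($line === '') {
            continue;
        }
        if ($line === 'BEGIN') {
            $inLoop = true;
            $keys = [];
            $columns = [];
        } elseif ($line === 'END') {
            if ($inLoop) {
                $loops[] = $columns; // close the current block
            }
            $inLoop = false; // an END outside a block is simply skipped
        } elseif ($inLoop && $line[0] === '#') {
            $keys[] = $line;
            $columns[$line] = [];
        } elseif ($inLoop) {
            // Data row: spread its tokens over the registered columns.
            preg_match_all("/'[^']*'|\S+/", $line, $m);
            foreach ($m[0] as $i => $token) {
                if (isset($keys[$i])) {
                    $columns[$keys[$i]][] = $token;
                }
            }
        } elseif (preg_match("/^(#\d+)\s+('[^']*')$/", $line, $m)) {
            $entries[$m[1]] = $m[2];
        }
    }
    return ['loops' => $loops, 'entries' => $entries];
}

$file = new SplTempFileObject();
$file->fwrite("BEGIN\n#1 \n#2 \n1 'a b'\n2 ?\nEND\n#7 'value 1'\nEND\n");
$file->rewind();
print_r(parse_stream($file));
```

With a real file, `parse_stream(new SplFileObject($txt_path))` would be the equivalent call. Memory use then depends on the size of the parsed result rather than on the raw file contents.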