Question

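I have a big file made up of many lines. I need to break the file into chunks based on size (say, one file into 4 parts), but no line may be split across two parts (each line must lie entirely within one chunk). I then want to hand each chunk to a thread for processing and, once processing is done, reassemble all the chunks. Mainly I want to reduce the time it takes to process the file's contents (I am doing some replacements in the text of the file).

What is the best way to approach this? My idea is to jump to the chunk's end byte based on size and, if the character there is not an end-of-line, keep reading until I reach one, then store that part.

Any suggestions, or a better algorithm for the same, are welcome. Thanks for your help.

Edit:

Also, the whole content is in a variable; how can I get to a particular byte in a variable?

Edit: as users suggested, editing once more with proper English and a problem statement:

Problem statement:

I have some data (the entire content of an HTML page) in a Perl scalar variable, say $str. The data consists of many lines (about 1,762,899). I need to break the data in the scalar into smaller chunks of whole lines from the original, based on some length, for example $str1, $str2, $str3, $str4, such that joining these variables gives back the complete content.

Requirement:

I need the strings above so that I can hand them to threads; after all the threads finish, I will join them all to get the whole content back.

My understanding:

I would use substr to take the data from one character offset to another, but I need to make sure the last character I get from substr is a newline. How do I handle that case?

I need a solution. Thanks.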

Answer 1

You may want to use this algorithm, which splits the source HTML into a number of roughly equal parts, splitting on line boundaries.

I am still worried that data split arbitrarily like this may not be processable, but if you run into problems you will have to ask again.

use strict;
use warnings;

my $html;
$html .= $_ x 10 . "\n" for 'A' .. 'Z';

use constant PARTITIONS => 4;

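# Record the offset of every line start, then the end of the data.
my @start;
push @start, $-[0] while $html =~ /^/gm;
push @start, length $html;
my $n = @start;
# Pick PARTITIONS+1 evenly spaced boundaries; the fractional index is truncated.
my @parts = map $start[$_ * ($n-1) / PARTITIONS], 0 .. PARTITIONS;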

for my $i (0 .. $#parts-1) {
  my ($start, $size) = ($parts[$i], $parts[$i+1] - $parts[$i]);
  print substr $html, $start, $size;
  print '-' x 10 . "\n";
}

Output

AAAAAAAAAA
BBBBBBBBBB
CCCCCCCCCC
DDDDDDDDDD
EEEEEEEEEE
FFFFFFFFFF
----------
GGGGGGGGGG
HHHHHHHHHH
IIIIIIIIII
JJJJJJJJJJ
KKKKKKKKKK
LLLLLLLLLL
MMMMMMMMMM
----------
NNNNNNNNNN
OOOOOOOOOO
PPPPPPPPPP
QQQQQQQQQQ
RRRRRRRRRR
SSSSSSSSSS
----------
TTTTTTTTTT
UUUUUUUUUU
VVVVVVVVVV
WWWWWWWWWW
XXXXXXXXXX
YYYYYYYYYY
ZZZZZZZZZZ
----------
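
If it helps, the same boundary calculation can be wrapped in a function so that the pieces can be processed independently and joined back together. This is only a sketch: the names split_on_lines and process_chunk are hypothetical, the substitution is a stand-in for the real replacements, and a plain map takes the place of the threads the question mentions.

```perl
use strict;
use warnings;

# Split $text into $count pieces that each end on a line boundary.
# Same approach as above: collect line-start offsets, then pick evenly
# spaced ones as cut points.
sub split_on_lines {
    my ($text, $count) = @_;
    my @start;
    push @start, $-[0] while $text =~ /^/gm;   # offset of every line start
    push @start, length $text;                 # sentinel: end of the data
    my $n   = @start;
    my @cut = map { $start[ int($_ * ($n - 1) / $count) ] } 0 .. $count;
    return map { substr($text, $cut[$_], $cut[$_ + 1] - $cut[$_]) } 0 .. $count - 1;
}

# Hypothetical per-chunk work; any replacement that stays within a line will do.
sub process_chunk {
    my ($chunk) = @_;
    $chunk =~ s/A/a/g;
    return $chunk;
}

my $html    = join '', map { $_ x 10 . "\n" } 'A' .. 'Z';
my @parts   = split_on_lines($html, 4);
my $rebuilt = join '', map { process_chunk($_) } @parts;   # stand-in for the threads
```

Because every piece keeps its trailing newline, joining the unprocessed pieces reproduces the original string exactly, which is an easy sanity check before any threads are involved.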

Answer 2

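A naive (but possibly efficient enough) solution:

Fork 4 child processes, read the input file line by line, and send each line to one of the children. Tell each child which file name to use for its output.

When the work is done, the parent process can aggregate the results again.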
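
A rough sketch of that idea, under several assumptions of mine: the lines travel to the children over pipes, each child receives a contiguous run of lines so that concatenating the output files in worker order preserves the original order, and process_line, process_with_forks, and the part<N>.out file names are all hypothetical.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical per-line work; replace with the real substitutions.
sub process_line {
    my ($line) = @_;
    $line =~ s/foo/bar/g;
    return $line;
}

# Send contiguous runs of lines to $workers children through pipes.
# Child $w writes its results to "$dir/part$w.out" (naming is an assumption).
sub process_with_forks {
    my ($lines, $workers, $dir) = @_;
    my $per = int((@$lines + $workers - 1) / $workers) || 1;   # lines per child
    my (@writers, @pids);

    for my $w (0 .. $workers - 1) {
        pipe(my $reader, my $writer) or die "pipe failed: $!";
        my $pid = fork();
        die "fork failed: $!" unless defined $pid;

        if ($pid == 0) {                       # child
            close $writer;
            close $_ for @writers;             # write ends inherited from the parent
            open(my $out, '>', "$dir/part$w.out") or die "cannot open output: $!";
            while (my $line = <$reader>) {
                print {$out} process_line($line);
            }
            close $out;
            exit 0;
        }

        close $reader;                         # parent keeps only the write end
        push @writers, $writer;
        push @pids, $pid;
    }

    # Parent: line $i goes to child int($i / $per), so order is preserved.
    for my $i (0 .. $#$lines) {
        print { $writers[ int($i / $per) ] } $lines->[$i];
    }
    close $_ for @writers;                     # EOF tells the children to finish
    waitpid($_, 0) for @pids;

    # Aggregate the results in worker order.
    my $result = '';
    for my $w (0 .. $workers - 1) {
        open(my $in, '<', "$dir/part$w.out") or die "cannot open result: $!";
        local $/;                              # slurp the whole file
        my $part = <$in>;
        $result .= $part if defined $part;
        close $in;
    }
    return $result;
}

my @lines  = map { "foo line $_\n" } 1 .. 10;
my $result = process_with_forks(\@lines, 4, tempdir(CLEANUP => 1));
```

Handing out lines round-robin would balance the load just as well, but the output files could then no longer be concatenated directly, which is why the sketch uses contiguous runs.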

Answer 3

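Your question is not clear enough to me. Still, a few suggestions.

You can use standard Unix tools, for example split --lines=10000.

If you need to use Perl, you can base the split on a while loop like the following: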

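my $lines_per_file = 10000;    # lines per output file
my ($nfh, $count) = (undef, 0);
open(my $fh, "<", "input.txt")
                       or die "cannot open < input.txt: $!";
while ( <$fh> ) {
    # control the count of lines and open a new output FH when needed
    if ($count++ % $lines_per_file == 0) {
        my $name = sprintf "input.part%03d.txt", $count / $lines_per_file;  # hypothetical naming
        open($nfh, ">", $name) or die "cannot open > $name: $!";  # re-open closes the old FH
    }
    print $nfh $_;
}
close($fh);
close($nfh) if $nfh;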

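Regarding your edit: do you need to reach a byte or a character? Your question is about text and strings, so I assume you need characters. In that case you can use substr.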
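
To illustrate the byte versus character distinction, here is a small sketch. It assumes the data is UTF-8 and uses the core Encode module; the sample string and variable names are made up for the example.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $bytes = "na\xC3\xAFve\n";            # raw UTF-8 octets, as read without an encoding layer
my $chars = decode('UTF-8', $bytes);     # decoded character string

my $byte_len   = length $bytes;          # counts octets
my $char_len   = length $chars;          # counts characters
my $third_char = substr $chars, 2, 1;    # the whole character U+00EF
my $third_byte = substr $bytes, 2, 1;    # only the first octet of that character
```

A newline is a single octet in UTF-8 and never appears inside a multi-byte character, so cutting right after a newline is safe in either representation. What matters is that the offsets and the string passed to substr use the same one.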

Answer 4

I tried to come up with code to solve it. Please find the code below.

    #!/usr/bin/perl

    use strict;
    use warnings;

    ### File contents to be broken in pieces ###
    open(my $fh, '<', 'index.html') or die "cannot open index.html: $!";

    ### Slurp whole file in scalar ###
    my $text = do { local $/; <$fh> };
    close($fh);

    ### Length of file ###
    my $length = length $text;
    print "length=$length\n";

    ### We will create 6 threads so divide it into 6 parts ###
    my $parts    = 6;
    my $chunk_sz = int($length / $parts);
    print "chunk size=$chunk_sz\n";

    ### Cut the chunks, extending each one to the next new line(\n) char ###
    my $start = 0;
    my @res;

    for my $i (0 .. $parts - 1)
    {
        my $end;    ### Offset just past the last char of this chunk ###

        if ($i == $parts - 1 || $start + $chunk_sz >= $length)
        {
            ### If it's last chunk (or nothing is left), take all contents ###
            $end = $length;
        }
        else
        {
            ### Find the first new line at or after the nominal chunk end ###
            my $nl = index($text, "\n", $start + $chunk_sz - 1);
            $end = $nl < 0 ? $length : $nl + 1;
        }

        ### Chunk keeps its trailing new line, so join('', @res) eq $text ###
        $res[$i] = substr($text, $start, $end - $start);

        ### Next chunk starts right after this one ###
        $start = $end;
    }

    ### Further code to process the chunk in threads goes here ###

Any suggestions for improvements or corrections?

Answer 5

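This answer may not be of much use to this user, but I had been looking for Perl code to split a million-line file into multiple 100K-line files. After reading multiple posts and some trial and error, I arrived at this code. If you like it, please give it a thumbs up!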

#!/bin/perl -s
#
# $Header$
# $Log$
use strict;
use warnings;
use File::Basename;

$\ = "\n";    # set output record separator

print "Starting program ...";

#
#  Get the interface directory path
#
my $ScriptName = $0;
my $ScriptDirPath = dirname($ScriptName) . "/";    # dirname() returns no trailing slash

my $LOAD_INP_FILE = $ScriptDirPath . "03g_loadInp.txt";
my $LOAD_CHUNK_FILE = $ScriptDirPath . "04g_loadInp_00000000.txt";

my $source = $LOAD_INP_FILE;
my $lines_per_file = 100000;

open (my $FH, "<", $source) or die "Could not open source file. $!";
open (my $OUT, ">", $LOAD_CHUNK_FILE) or die "Could not open destination file. $!";

#this is line counter
my $i = 0;

print "Creating new $LOAD_CHUNK_FILE ...";

my $line;
while ($line = <$FH> ) {
    chomp $line;    # strip only the newline; print adds it back via $\
    print $OUT $line;
    $i++;

    if ($i % $lines_per_file == 0) {
        close($OUT);
        my $FHNEW = sprintf("%08d", $i);
        my $LOAD_CHUNK_FILE_NEW = $ScriptDirPath . "04g_loadInp_${FHNEW}.txt";
        open ($OUT, ">", $LOAD_CHUNK_FILE_NEW) or die "Could not open destination file. $!";
        print "Creating new $LOAD_CHUNK_FILE_NEW ...";
    }
}
close($OUT);
close($FH);

print "Ending program ...";
exit 0;

#
#  End of Main Program
#

Perl: split a file into chunks
