Question

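I have the following problem: I have a file of nearly 500 MB. Its text is all on one line. The text is separated by a virtual line ending called ROW_DEL, which appears in the text like this: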

this is a line ROW_DEL and this is a line

Now I need to split this file into its lines, so that I get a file like this:

this is a line
and this is a line

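The problem is that even opening it with the Windows text editor fails, because the file is so big.

Is it possible to split this file as described in C#, Java, or Python? What would be the best solution that does not overload my CPU?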

Answer 1

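Read this file in chunks, for example with StreamReader.ReadBlock in C#. You can set the maximum number of characters to read there.

For each chunk read, you can replace ROW_DEL with \r\n and append the result to a new file.

Just remember to increase the current index by the number of characters you have just read.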
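
As a rough illustration of this chunked approach, here is a minimal Python sketch (not the C# StreamReader.ReadBlock code itself). The function name split_rows, its parameters, and the 1 MiB default chunk size are assumptions made for the example. It additionally holds back the last len(sep) - 1 characters of each chunk, so that a ROW_DEL cut in two by a chunk boundary is still found on the next pass:

```python
import io


def split_rows(src, dst, sep="ROW_DEL", eol="\n", chunk_chars=1 << 20):
    """Copy text from src to dst in chunks, replacing every sep with eol."""
    keep = len(sep) - 1      # longest separator prefix that may end a chunk
    pending = ""             # converted text not yet written out
    while True:
        block = src.read(chunk_chars)
        if not block:
            break
        pending = (pending + block).replace(sep, eol)
        cut = len(pending) - keep
        if cut > 0:
            # Everything before the last `keep` characters is final.
            dst.write(pending[:cut])
            pending = pending[cut:]
    dst.write(pending)       # whatever is left cannot contain a full sep


# Tiny in-memory demo; a chunk of 5 characters forces ROW_DEL to be split.
demo_out = io.StringIO()
split_rows(io.StringIO("this is a line ROW_DELand this is a line"),
           demo_out, chunk_chars=5)
print(demo_out.getvalue())
```

With real files this would presumably be called as split_rows(f_in, f_out) on two files opened in text mode; writing '\n' in text mode lets Python emit the platform's own line ending. Only one chunk plus a few characters is held in memory at a time.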

Answer 2

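Actually, 500 MB of text is not that big; it's just that Notepad is terrible. You probably don't have sed available since you are on Windows, but at least try the naive solution in Python. I think it will work fine: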

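# In text mode Python already translates '\n' to the platform line ending on
# write, so '\n' is written here; os.linesep would become '\r\r\n' on Windows.
with open('infile.txt') as f_in, open('outfile.txt', 'w') as f_out:
  f_out.write(f_in.read().replace('ROW_DEL ', '\n'))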

Answer 3

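Here is my solution.
The principle is simple (ŁukaszW.pl gave it), but the code is not so easy if one wants to take care of the special cases (which ŁukaszW.pl did not).

The special cases are when the separator ROW_DEL is split across two read chunks (as I4V pointed out) and, more subtly, when there are two contiguous ROW_DEL of which the second is split across two read chunks.

Since ROW_DEL is longer than any of the possible newlines ('\r', '\n', '\r\n'), it can be replaced in place in the file by the newline used by the OS. That is why I chose to rewrite the file in itself. For that I use mode 'r+', which does not create a new file. It is also absolutely mandatory to use binary mode 'b'.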

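The principle is to read a chunk (in real life its size would be 262144, for example) plus x additional characters, where x is the length of the separator minus 1.
Then check whether the separator is present in the end of the chunk + the x characters. Depending on whether it is present or not, the chunk is shortened or not before the ROW_DEL conversion is carried out, and then rewritten in place.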
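
To make the boundary check concrete, here is a small Python 3 sketch that isolates it as a pure function. The name safe_length and its signature are hypothetical, introduced only for illustration; the function returns how many leading bytes of a chunk can be converted immediately, given the x bytes that follow it:

```python
def safe_length(chunk: bytes, lookahead: bytes, sep: bytes) -> int:
    """Return how many leading bytes of chunk can be converted right away.

    If a separator starts in the last len(sep)-1 bytes of chunk and ends in
    lookahead, the chunk is cut just before that separator.
    """
    x = len(sep) - 1
    if x == 0:
        return len(chunk)            # a 1-byte separator can never straddle
    tail = chunk[-x:]                # end of the chunk
    window = tail + lookahead[:x]    # end of the chunk + the x extra bytes
    pt = window.find(sep)
    if pt == -1:
        return len(chunk)            # no straddling separator: keep it all
    return len(chunk) - len(tail) + pt


sep = b"ROW_DEL"
# Separator split across the boundary: cut just before it.
print(safe_length(b"infected ROW_D", b"ELwith", sep))   # 9
# Two contiguous separators, the second one split: keep the first whole.
print(safe_length(b"ied ROW_DELROW", b"_DELla", sep))   # 11
# Nothing straddles the boundary: the whole chunk is usable.
print(safe_length(b"The hospital r", b"oommat", sep))   # 14
```

The second call shows the subtle case: the complete first ROW_DEL stays inside the part that is converted now, and only the split second one is deferred to the next read.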

The bare code (Python 2) is:

text = ('The hospital roommate of a man infected ROW_DEL'
        'with novel coronavirus (NCoV)ROW_DEL'
        '—a SARS-related virus first identified ROW_DELROW_DEL'
        'last year and already linked to 18 deaths—ROW_DEL'
        'has contracted the illness himself, ROW_DEL'
        'intensifying concerns about the ROW_DEL'
        "virus's ability to spread ROW_DEL"
        'from person to person.')

with open('eessaa.txt','w') as f:
    f.write(text)

with open('eessaa.txt','rb') as f:
    ch = f.read()
    print ch.replace('ROW_DEL','ROW_DEL\n')
    print '\nlength of the text : %d chars\n' % len(text)

#==========================================

from os.path import getsize
from os import fsync,linesep

def rewrite(whichfile,sep,chunk_length,OSeol=linesep):
    if chunk_length<len(sep):
        print 'Length of second argument, %d , is '\
              'the minimum value for the third argument'\
              % len(sep)
        return

    x = len(sep)-1
    x2 = 2*x
    file_length = getsize(whichfile)
    with open(whichfile,'rb+') as fR,\
         open(whichfile,'rb+') as fW:
        while True:
            chunk = fR.read(chunk_length)
            pch = fR.tell()
            twelve = chunk[-x:] + fR.read(x)
            ptw = fR.tell()

            if sep in twelve:
                pt = twelve.find(sep)
                m = ("\n   !! %r is "
                     "at position %d in twelve !!" % (sep,pt))
                y = chunk[0:-x+pt].replace(sep,OSeol)
            else:
                pt = x
                m = ''
                y = chunk.replace(sep,OSeol)

            pos = fW.tell()
            fW.write(y)
            fW.flush()
            fsync(fW.fileno())

            if fR.tell()<file_length:
                fR.seek(-x2+pt,1)
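            else:
                # End of file reached: twelve[pt:] has been read but not yet
                # written, so convert and write it before truncating.
                fW.write(twelve[pt:].replace(sep,OSeol))
                fW.truncate()
                break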

rewrite('eessaa.txt','ROW_DEL',14)

with open('eessaa.txt','rb') as f:
    ch = f.read()
    print '\n'.join(repr(line)[1:-1] for line in ch.splitlines(1))
    print '\nlength of the text : %d chars\n' % len(ch)

To follow the execution, here is another version of the code (also Python 2) that prints messages all along:

text = ('The hospital roommate of a man infected ROW_DEL'
        'with novel coronavirus (NCoV)ROW_DEL'
        '—a SARS-related virus first identified ROW_DELROW_DEL'
        'last year and already linked to 18 deaths—ROW_DEL'
        'has contracted the illness himself, ROW_DEL'
        'intensifying concerns about the ROW_DEL'
        "virus's ability to spread ROW_DEL"
        'from person to person.')

with open('eessaa.txt','w') as f:
    f.write(text)

with open('eessaa.txt','rb') as f:
    ch = f.read()
    print ch.replace('ROW_DEL','ROW_DEL\n')
    print '\nlength of the text : %d chars\n' % len(text)

#==========================================

from os.path import getsize
from os import fsync,linesep

def rewrite(whichfile,sep,chunk_length,OSeol=linesep):
    if chunk_length<len(sep):
        print 'Length of second argument, %d , is '\
              'the minimum value for the third argument'\
              % len(sep)
        return

    x = len(sep)-1
    x2 = 2*x
    file_length = getsize(whichfile)
    with open(whichfile,'rb+') as fR,\
         open(whichfile,'rb+') as fW:
        while True:
            chunk = fR.read(chunk_length)
            pch = fR.tell()
            twelve = chunk[-x:] + fR.read(x)
            ptw = fR.tell()

            if sep in twelve:
                pt = twelve.find(sep)
                m = ("\n   !! %r is "
                     "at position %d in twelve !!" % (sep,pt))
                y = chunk[0:-x+pt].replace(sep,OSeol)
            else:
                pt = x
                m = ''
                y = chunk.replace(sep,OSeol)
            print ('chunk  == %r   %d chars\n'
                   ' -> fR now at position  %d\n'
                   'twelve == %r   %d chars   %s\n'
                   ' -> fR now at position  %d'
                   % (chunk ,len(chunk),      pch,
                      twelve,len(twelve),m,   ptw) )

            pos = fW.tell()
            fW.write(y)
            fW.flush()
            fsync(fW.fileno())
            print ('          %r   %d long\n'
                   ' has been written from position %d\n'
                   ' => fW now at position  %d'
                   % (y,len(y),pos,fW.tell()))

            if fR.tell()<file_length:
                fR.seek(-x2+pt,1)
                print ' -> fR moved %d characters back to position %d'\
                       % (x2-pt,fR.tell())
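            else:
                print (" => fR is at position %d == file's size\n"
                       '    File has thoroughly been read'
                       % fR.tell())
                # End of file reached: twelve[pt:] has been read but not yet
                # written, so convert and write it before truncating.
                fW.write(twelve[pt:].replace(sep,OSeol))
                fW.truncate()
                break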

            raw_input('\npress any key to continue')


rewrite('eessaa.txt','ROW_DEL',14)

with open('eessaa.txt','rb') as f:
    ch = f.read()
    print '\n'.join(repr(line)[1:-1] for line in ch.splitlines(1))
    print '\nlength of the text : %d chars\n' % len(ch)

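There is some subtlety in the treatment of the ends of the chunks, in order to detect whether ROW_DEL straddles two chunks and whether there are two contiguous ROW_DEL. That is why it took me a long time to post my solution: in the end I had to write fR.seek(-x2+pt,1), and not just fR.seek(-2*x,1) or fR.seek(-x,1) depending on whether sep straddles or not (2*x is x2 in the code; with ROW_DEL, x and x2 are 6 and 12). Anyone interested in this point can test it by changing the code in the two branches that depend on whether 'ROW_DEL' is in twelve.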
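
For readers on Python 3, the same in-place idea can be sketched as follows. This is not a port of the code above: instead of seeking the reader back, it carries the last len(sep) - 1 bytes over to the next read, which is a swapped-in technique. The name rewrite_in_place, its parameters, and the demo data are assumptions made for the example. It relies on the same observation as above: the replacement is never longer than the separator, so the write position never overtakes the read position.

```python
import os
import tempfile


def rewrite_in_place(path, sep=b"ROW_DEL", eol=os.linesep.encode(),
                     chunk_size=262144):
    """Replace every sep with eol inside the file itself, then truncate it."""
    if len(eol) > len(sep):
        raise ValueError("eol must not be longer than sep for in-place use")
    keep = len(sep) - 1
    # Two handles on the same file: one only reads, the other only writes.
    with open(path, "rb+") as reader, open(path, "rb+") as writer:
        pending = b""
        while True:
            block = reader.read(chunk_size)
            if not block:
                break
            pending = (pending + block).replace(sep, eol)
            cut = len(pending) - keep
            if cut > 0:
                writer.write(pending[:cut])   # always behind the reader
                pending = pending[cut:]
        writer.write(pending)
        writer.flush()
        writer.truncate()                     # drop the now unused tail


# Demo on a temporary file; chunk_size=5 forces separators to straddle reads.
sample = b"this is a line ROW_DELROW_DELand this is a line"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(sample)
rewrite_in_place(tmp.name, eol=b"\n", chunk_size=5)
with open(tmp.name, "rb") as f:
    result = f.read()
os.remove(tmp.name)
print(result)
```

Because the file is overwritten as it is read, an interruption midway leaves it half converted, so working on a copy seems prudent.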

Read a very large single-line text file and split it
