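Most efficient way to modify the last line of a large text file in Python

Time: 2015-11-19 18:31:02

Tags: python io

I need to update the last line of several files larger than 2 GB. They consist of lines of text and are too big to read with readlines(). Currently it works fine by looping through the file line by line. However, I'm wondering whether there is a compiled library that can do this more efficiently? Thanks!

Current approach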

    myfile = open("large.XML")
    for line in myfile:
        do_something()

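2 answers:

Answer 0 (score: 6)

If this really is line based (so a true XML parser isn't the best solution), mmap can help here.

mmap the file, then call .rfind('\n') on the resulting object (possibly with adjustments to handle a file that ends with a newline, when you really want the non-empty line before it rather than the empty "line" after it). You can then slice out the last line on its own. If you need to modify the file in place, you can resize the file to shave off (or add) the number of bytes corresponding to the difference between the line you sliced and the new line, then write the new line back. This avoids reading or writing any more of the file than you need.

Example code (please comment if I've made a mistake):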

import mmap

# In Python 3.1 and earlier, you'd wrap mmap in contextlib.closing; mmap
# didn't support the context manager protocol natively until 3.2; see example below
with open("large.XML", 'r+b') as myfile, mmap.mmap(myfile.fileno(), 0, access=mmap.ACCESS_WRITE) as mm:
    # len(mm) - 1 handles files ending w/newline by getting the prior line
    # + 1 to avoid catching prior newline (and handle one line file seamlessly)
    startofline = mm.rfind(b'\n', 0, len(mm) - 1) + 1

    # Get the line (with any newline stripped)
    line = mm[startofline:].rstrip(b'\r\n')

    # Do whatever calculates the new line, decoding/encoding to use str
    # in do_something to simplify; this is an XML file, so I'm assuming UTF-8
    new_line = do_something(line.decode('utf-8')).encode('utf-8')

    # Resize to accommodate the new line (or to strip data beyond the new line)
    mm.resize(startofline + len(new_line))  # + 1 if you need to add a trailing newline
    mm[startofline:] = new_line  # Replace contents; add a b"\n" if needed
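
To see why the `len(mm) - 1` bound and the `+ 1` in the example above handle both a trailing newline and a one-line file, here is a minimal sketch that applies the same `rfind` expression to in-memory `bytes` objects instead of a mapped file. The helper name `start_of_last_line` is made up for illustration.

```python
def start_of_last_line(data):
    """Return the index where the last line of `data` begins."""
    # The search stops before the final byte, so a trailing newline is ignored.
    # rfind returns -1 when nothing is found, and -1 + 1 == 0 is the start of the data.
    return data.rfind(b'\n', 0, len(data) - 1) + 1

for sample in (b"one\ntwo\nthree\n", b"one\ntwo\nthree", b"only line\n", b"only line"):
    start = start_of_last_line(sample)
    print(sample, start, sample[start:].rstrip(b'\r\n'))
```

In every case the slice from `start` onward, with the line ending stripped, is the last non-empty line.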

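Apparently mm.resize doesn't work on some systems without mremap (e.g. OSX). To support those systems, you would split the with statement (so the mmap is closed before the file object) and use seek, write and truncate on the file object to fix up the file. The following example also includes the Python 3.1 and earlier adjustment I mentioned above, using contextlib.closing, for completeness: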

import mmap
from contextlib import closing

with open("large.XML", 'r+b') as myfile:
    with closing(mmap.mmap(myfile.fileno(), 0, access=mmap.ACCESS_WRITE)) as mm:
        startofline = mm.rfind(b'\n', 0, len(mm) - 1) + 1
        line = mm[startofline:].rstrip(b'\r\n')
        new_line = do_something(line.decode('utf-8')).encode('utf-8')

    myfile.seek(startofline)  # Move to where old line began
    myfile.write(new_line)  # Overwrite existing line with new line
    myfile.truncate()  # If existing line longer than new line, get rid of the excess
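
One way to package the seek/write/truncate variant above is as a small reusable function. The sketch below is only an illustration under a few assumptions: the name `replace_last_line` is hypothetical, it always ends the file with a newline (the original code leaves that choice to you), and it guards against empty files because `mmap.mmap` raises `ValueError` when asked to map a zero-length file. It maps the file read-only, since the write goes through the file object.

```python
import mmap
import os
import tempfile

def replace_last_line(path, transform, encoding='utf-8'):
    """Rewrite the last line of `path` with transform(old_line). Hypothetical helper."""
    with open(path, 'r+b') as f:
        if os.fstat(f.fileno()).st_size == 0:
            # An empty file cannot be mapped; treat it as a single empty line
            start, old = 0, b''
        else:
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
                start = mm.rfind(b'\n', 0, len(mm) - 1) + 1
                old = mm[start:].rstrip(b'\r\n')
        new = transform(old.decode(encoding)).encode(encoding)
        f.seek(start)          # Move to where the old line began
        f.write(new + b'\n')   # Assumption: always finish with a newline
        f.truncate()           # Drop leftovers if the old line was longer

# Usage on a throwaway file
with tempfile.NamedTemporaryFile('wb', suffix='.txt', delete=False) as tmp:
    tmp.write(b"one\ntwo\nthree\n")
replace_last_line(tmp.name, str.upper)
with open(tmp.name, 'rb') as f:
    result = f.read()
os.remove(tmp.name)
print(result)
```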

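The advantages of mmap over any other approach are:

  1. No need to read any more of the file than the line itself (meaning 1-2 pages of the file; the rest is never read or written)
  2. Using rfind means you can let Python find the newline quickly at the C layer (in CPython); explicit seeks and reads on a file object could match the "only read a page or so" behavior, but you'd have to implement the newline search by hand

Caveat: this approach will not work (at least, not without modifications to avoid mapping more than 2 GB and to handle resizing when the whole file might not be mapped) if you are on a 32-bit system and the file is too large to map into memory. On most 32-bit systems, even in a freshly spawned process, you only have 1-2 GB of contiguous address space available; in certain special cases you might have as much as 3-3.5 GB of user virtual addresses (though you'll lose some of the contiguous space to the heap, stack, executable mappings, etc.). mmap doesn't need much physical RAM, but it does need contiguous address space. One of the big benefits of a 64-bit OS is that you stop worrying about virtual address space in all but the most ridiculous cases, so mmap can solve problems in the general case that it couldn't handle on a 32-bit OS without added complexity. Most modern computers are 64-bit at this point, but it's definitely something to keep in mind if you're targeting 32-bit systems (and on Windows, even if the OS is 64-bit, a 32-bit version of Python may have been installed by mistake, so the same problem applies).

Here is one more example that works on 32-bit Python even for huge files, assuming the last line is no more than 100 MB long (closing and the imports are omitted for brevity):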

    with open("large.XML", 'r+b') as myfile:
        filesize = myfile.seek(0, 2)
        # Get an offset that only grabs the last 100 MB or so of the file aligned properly
        offset = max(0, filesize - 100 * 1024 ** 2) & ~(mmap.ALLOCATIONGRANULARITY - 1)
        with mmap.mmap(myfile.fileno(), 0, access=mmap.ACCESS_WRITE, offset=offset) as mm:
            startofline = mm.rfind(b'\n', 0, len(mm) - 1) + 1
            # If line might be > 100 MB long, probably want to check if startofline
            # follows a newline here
            line = mm[startofline:].rstrip(b'\r\n')
            new_line = do_something(line.decode('utf-8')).encode('utf-8')
    
        myfile.seek(startofline + offset)  # Move to where old line began, adjusted for offset
        myfile.write(new_line)  # Overwrite existing line with new line
        myfile.truncate()  # If existing line longer than new line, get rid of the excess
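
The offset expression in the example above is a round-down-to-alignment trick, because mmap requires the offset to be a multiple of `mmap.ALLOCATIONGRANULARITY`. The sketch below isolates that arithmetic; the function name `aligned_offset` and the sample file size are made up for illustration, and the bit mask assumes the granularity is a power of two (it is on common platforms, e.g. 4096 or 65536).

```python
import mmap

def aligned_offset(filesize, window=100 * 1024 ** 2,
                   granularity=mmap.ALLOCATIONGRANULARITY):
    """Largest multiple of `granularity` that is <= max(0, filesize - window)."""
    # With a power-of-two granularity, clearing the low bits rounds down
    return max(0, filesize - window) & ~(granularity - 1)

filesize = 3 * 1024 ** 3 + 12345   # hypothetical file: 3 GiB plus a few bytes
offset = aligned_offset(filesize)
print(offset, filesize - offset)   # mapping start, and number of bytes mapped
```

Because the offset is rounded down, the mapped region is at least the requested window and at most one granule larger, and files smaller than the window are mapped from offset 0.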
    

Answer 1 (score: 2)

Update: use ShadowRanger's answer. It's shorter and more robust.

For posterity:

Read the last N bytes of the file, then search backwards for the newline.

#!/usr/bin/env python

with open("test.txt", "wb") as testfile:
    # The file is opened in binary mode, so write bytes (required on Python 3)
    testfile.write(b'\n'.join([b"one", b"two", b"three"]) + b'\n')

with open("test.txt", "r+b") as myfile:
    # Read the last 1kiB of the file
    # we could make this be dynamic, but chances are there's
    # a number like 1kiB that'll work 100% of the time for you
    myfile.seek(0,2)
    filesize = myfile.tell()
    blocksize = min(1024, filesize)
    myfile.seek(-blocksize, 2)
    # search backwards for a newline (excluding very last byte
    # in case the file ends with a newline); rindex raises ValueError
    # if the block contains no other newline
    index = myfile.read().rindex(b'\n', 0, blocksize - 1)
    # seek to the character just after the newline
    myfile.seek(index + 1 - blocksize, 2)
    # read in the last line of the file
    lastline = myfile.read()
    # modify lastline
    lastline = b"Brand New Line!\n"
    # seek back to the start of the last line
    myfile.seek(index + 1 - blocksize, 2)
    # write out new version of the last line
    myfile.write(lastline)
    myfile.truncate()
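
The comment in the code above notes that the block size could be made dynamic. A minimal sketch of that idea follows, assuming Python 3 (where `seek` returns the new position) and a file opened in binary mode; the function name `find_last_line_start` is hypothetical. It doubles the block until a newline is found or the whole file has been read, so a last line longer than 1 KiB, or a file with only one line, no longer raises an error.

```python
import os
import tempfile

def find_last_line_start(f, blocksize=1024):
    """Return the byte offset where the last line of binary file `f` starts."""
    filesize = f.seek(0, os.SEEK_END)
    while True:
        blocksize = min(blocksize, filesize)
        f.seek(filesize - blocksize)
        block = f.read(blocksize)
        # Ignore the very last byte in case the file ends with a newline
        index = block.rfind(b'\n', 0, len(block) - 1)
        if index != -1:
            return filesize - blocksize + index + 1
        if blocksize == filesize:
            return 0      # no earlier newline: the file is a single line
        blocksize *= 2    # last line is longer than the block; look further back

# Usage: the last line here is longer than the initial 1 KiB block
with tempfile.TemporaryFile() as f:   # opened in 'w+b' mode by default
    f.write(b"one\ntwo\n" + b"x" * 5000 + b"\n")
    start = find_last_line_start(f)
    f.seek(start)
    f.write(b"Brand New Line!\n")
    f.truncate()
    f.seek(0)
    contents = f.read()
print(contents)
```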