Why does file_object.tell() give the same byte for different positions in the file?

Asked: 2017-05-15 19:15:59

Tags: python python-2.7

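I am just starting out with Python and cannot get my head around the basic file navigation methods.

The tell() tutorial I read states that it returns the position (in bytes) where I am currently sitting in my file.

My reasoning is that each character of the file adds to the byte coordinate, right? A file with newlines is just one string of characters split on the \n character, so my byte coordinate should change after every new line... but that does not seem to hold.

I generate a quick toy text file in bash: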

$ for i in {1..10}; do echo "@ this is the "$i"th line" ; done > toy.txt
$ for i in {11..20}; do echo " this is the "$i"th line" ; done >> toy.txt

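Now I iterate over this file and, on each cycle, print the line, the line count, and the result of the tell() call. The @ marks the lines that delimit blocks of the file, which are the positions I would like to return to (see below).

My guess is that the for loop first iterates over the whole file object and reaches its end, which is why the position stays the same throughout.

This is a toy example. In my real problem the files are gigabytes in size, and with this same method I get tell() results that change in blocks, which I imagine reflects how the for loop iterates over the file object. Is this correct? Could you shed some light on the concepts I am missing?

My final goal is to find specific coordinates in the file and then process these huge files in parallel from distributed starting points, which I cannot keep track of with the approach shown here.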

os.path.getsize("toy.txt")
451

fa = open("toy.txt")
fa.seek(0) # let's double check
fa.tell()
count = 0
for line in fa:
    if line.startswith("@"):
        print line ,
        print "tell {} count {}".format(fa.tell(), count)
    else:
        if count < 32775:
            print line,
            print "tell {} count {}".format(fa.tell(), count)
    count += 1

Output:

@ this is the 1th line
tell 451 count 0
@ this is the 2th line
tell 451 count 1
@ this is the 3th line
tell 451 count 2
@ this is the 4th line
tell 451 count 3
@ this is the 5th line
tell 451 count 4
@ this is the 6th line
tell 451 count 5
@ this is the 7th line
tell 451 count 6
@ this is the 8th line
tell 451 count 7
@ this is the 9th line
tell 451 count 8
@ this is the 10th line
tell 451 count 9
this is the 11th line
tell 451 count 10
this is the 12th line
tell 451 count 11
this is the 13th line
tell 451 count 12
this is the 14th line
tell 451 count 13
this is the 15th line
tell 451 count 14
this is the 16th line
tell 451 count 15
this is the 17th line
tell 451 count 16
this is the 18th line
tell 451 count 17
this is the 19th line
tell 451 count 18
this is the 20th line
tell 451 count 19

2 answers:

Answer 0 (score: 3)

You are using a for loop to read the file line by line:

for line in fa:

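Files do not natively work this way; you read blobs of data, usually in blocks. For Python to give you lines, it has to read up to the next newline. However, reading byte by byte to find that newline is not very efficient.

So a buffer is used: Python reads a big block, finds the newlines in that block, and yields one line for each newline found. Once the buffer is exhausted, it reads a new block.

Your file is not big enough to need more than one block; it is only 451 bytes, while a buffer is usually measured in kilobytes. If you were to create a larger file, you would see the file position jump in large steps as you iterate.

See the file.next documentation (next is the method responsible for producing the next line during iteration, which is what the for loop calls):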
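To make the buffering idea concrete, here is a minimal sketch of a chunked line reader. It is only an illustration of the technique, not how Python's file object is actually implemented; the function name iter_lines and the chunk_size default are made up for this example.

```python
import io


def iter_lines(raw, chunk_size=8192):
    """Yield lines from a binary stream, reading chunk_size bytes at a time.

    Hypothetical helper for illustration; not the real file iterator.
    """
    pending = b""
    while True:
        chunk = raw.read(chunk_size)
        if not chunk:
            break
        pending += chunk
        pieces = pending.split(b"\n")
        # The last piece has no newline yet (or is empty); keep it for later.
        pending = pieces.pop()
        for piece in pieces:
            yield piece + b"\n"
    if pending:
        # Final line without a trailing newline.
        yield pending


stream = io.BytesIO(b"first\nsecond\nthird\n")
positions = []
for line in iter_lines(stream):
    # The stream was consumed in one chunk, so its position is already
    # at the end while the first line is being handed out.
    positions.append(stream.tell())
```

Because the whole 19-byte stream fits in a single chunk, the underlying position is already at the end when the first line is yielded, which is the same effect the question shows with tell() returning 451 on every line.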

  

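In order to make a for loop the most efficient way of looping over the lines of a file (a very common operation), the next() method uses a hidden read-ahead buffer.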

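If you need to track the absolute file position while looping, you have to use binary mode on Windows (to prevent newline translation from happening) and keep track of the line lengths yourself: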

position = 0    
for line in fa:
    position += len(line)
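Building on the manual counter above, the following sketch records the byte offset of every line that starts with the @ marker and then seeks straight back to those offsets. The helper names index_marked_lines and read_line_at and the small temporary file are assumptions made for this example; the file is opened in binary mode so that len(line) counts bytes.

```python
import os
import tempfile


def index_marked_lines(path, marker=b"@"):
    """Return the byte offset of every line that starts with marker."""
    offsets = []
    position = 0
    with open(path, "rb") as handle:
        for line in handle:
            if line.startswith(marker):
                offsets.append(position)
            # In binary mode len(line) is the exact number of bytes read.
            position += len(line)
    return offsets


def read_line_at(path, offset):
    """Seek to a byte offset and return the line that starts there."""
    with open(path, "rb") as handle:
        handle.seek(offset)
        return handle.readline()


# Build a small stand-in for toy.txt: three marked lines, two unmarked ones.
fd, toy_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as out:
    for i in range(1, 4):
        out.write(("@ this is the %dth line\n" % i).encode("ascii"))
    for i in range(4, 6):
        out.write(("this is the %dth line\n" % i).encode("ascii"))

offsets = index_marked_lines(toy_path)
first_lines = [read_line_at(toy_path, off) for off in offsets]
os.remove(toy_path)
```

Once the offsets are known, each worker in a parallel job could open the file on its own and seek() to its assigned starting offset, which seems to be what the question is ultimately after.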

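The alternative is to use the io library; this is the framework that Python 3 uses to handle files. Its file.tell() method takes the buffer into account and produces accurate file positions even while iterating.

Take into account that when you open a file with io.open() in text mode, you get unicode strings. In Python 2, you can use binary mode (open with 'rb') if you must have str bytestrings. In fact, only in binary mode do you have access to IOBase.tell() while iterating; in text mode, calling it after next() raises an exception: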

>>> import io
>>> fa = io.open("toy.txt")
>>> next(fa)
u'@ this is the 1th line\n'
>>> fa.tell()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IOError: telling position disabled by next() call

In binary mode you get accurate output from file.tell():

>>> import os.path
>>> os.path.getsize("toy.txt")
461
>>> fa = io.open("toy.txt", 'rb')
>>> count = 0
>>> for line in fa:
...     if line.startswith("@"):
...         print line ,
...         print "tell {} count {}".format(fa.tell(), count)
...     else:
...         if count < 32775:
...             print line,
...             print "tell {} count {}".format(fa.tell(), count)
...     count += 1
...
@ this is the 1th line
tell 23 count 0
@ this is the 2th line
tell 46 count 1
@ this is the 3th line
tell 69 count 2
@ this is the 4th line
tell 92 count 3
@ this is the 5th line
tell 115 count 4
@ this is the 6th line
tell 138 count 5
@ this is the 7th line
tell 161 count 6
@ this is the 8th line
tell 184 count 7
@ this is the 9th line
tell 207 count 8
@ this is the 10th line
tell 231 count 9
 this is the 11th line
tell 254 count 10
 this is the 12th line
tell 277 count 11
 this is the 13th line
tell 300 count 12
 this is the 14th line
tell 323 count 13
 this is the 15th line
tell 346 count 14
 this is the 16th line
tell 369 count 15
 this is the 17th line
tell 392 count 16
 this is the 18th line
tell 415 count 17
 this is the 19th line
tell 438 count 18
 this is the 20th line
tell 461 count 19
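A different workaround, not taken from the session above, is to avoid the iterator entirely and call readline() in a loop. readline() does not go through the read-ahead buffer that next() uses, so tell() stays in step with the lines even on the built-in Python 2 file object. This is a sketch on a small temporary file created only for the example:

```python
import os
import tempfile

# Small throwaway file; the contents are made up for this example.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as out:
    out.write(b"@ one\n@ two\nthree\n")

seen = []
with open(path, "rb") as handle:
    # iter(callable, sentinel) keeps calling readline() until it
    # returns the empty bytestring at end of file.
    for line in iter(handle.readline, b""):
        seen.append((handle.tell(), line))
os.remove(path)
```

The trade-off is speed: on Python 2 this is usually slower than plain iteration, because the read-ahead buffer exists precisely to make the for loop fast.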

Answer 1 (score: 1)

When you iterate over a file, it uses an internal buffer to minimize expensive IO operations, so the file position is not necessarily at the last character your loop has seen.
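The gap between the two positions can be observed directly with the io module. The sketch below compares the buffered reader's tell(), which accounts for the buffer, with the position of the underlying raw file. It assumes CPython, where the object returned by io.open(..., 'rb') exposes the raw stream as .raw; the file contents are made up for the example.

```python
import io
import os
import tempfile

# Write 20 lines of 16 bytes each to a throwaway file.
fd, demo_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as out:
    for i in range(1, 21):
        out.write(("this is line %02d\n" % i).encode("ascii"))

file_size = os.path.getsize(demo_path)

logical = []   # position as seen through the buffer
physical = []  # position of the underlying raw file
with io.open(demo_path, "rb") as buffered:
    for line in buffered:
        logical.append(buffered.tell())
        physical.append(buffered.raw.tell())

os.remove(demo_path)
```

The logical positions advance by one line at a time, while the raw file has already been read to its end after the first line because the whole file fits into one buffer fill.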