I was recently asked how to do a file slurp in Python, and the accepted answer suggested something like:

with open('x.txt') as x: f = x.read()

How would I go about doing this to read the file in and convert the endian representation of the data?

For example, I have a 1GB binary file that's just a bunch of single-precision floats packed as big-endian, and I want to convert it to little-endian and dump it into a numpy array. Below is the function I wrote to accomplish this, along with some real code that calls it. I use struct.unpack to do the endian conversion and tried to speed everything up with mmap.

My question is, am I using mmap and struct.unpack correctly? Is there a cleaner, faster way to do this? Right now what I have works, but I'd really like to learn how to do it better.

Thanks in advance!
#!/usr/bin/python
from struct import unpack
import mmap
import numpy as np

def mmapChannel(arrayName, fileName, channelNo, line_count, sample_count):
    """
    We need to read in the asf internal file and convert it into a numpy array.
    It is stored as a single row, and is binary. The number of lines (rows),
    samples (columns), and channels all come from the .meta text file.
    Also, internal format files are packed big endian, but most systems use
    little endian, so we need to make that conversion as well.
    Memory mapping seemed to improve the ingestion speed a bit.
    """
    # memory-map the file, size 0 means whole file
    # length = line_count * sample_count * arrayName.itemsize
    print "\tMemory Mapping..."
    with open(fileName, "rb") as f:
        map = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        map.seek(channelNo * line_count * sample_count * arrayName.itemsize)

        for i in xrange(line_count * sample_count):
            arrayName[0, i] = unpack('>f', map.read(arrayName.itemsize))[0]

        # Same method as above, just more verbose for the maintenance programmer.
        # for i in xrange(line_count * sample_count):  # row
        #     be_float = map.read(arrayName.itemsize)  # arrayName.itemsize should be 4 for float32
        #     le_float = unpack('>f', be_float)[0]     # > for big endian, < for little endian
        #     arrayName[0, i] = le_float

        map.close()
    return arrayName

# line_count and sample_count come from the .meta text file (not shown here)
print "Initializing the Amp HH HV, and Phase HH HV arrays..."
HHamp = np.ones((1, line_count*sample_count), dtype='float32')
HHphase = np.ones((1, line_count*sample_count), dtype='float32')
HVamp = np.ones((1, line_count*sample_count), dtype='float32')
HVphase = np.ones((1, line_count*sample_count), dtype='float32')
print "Ingesting HH_Amp..."
HHamp = mmapChannel(HHamp, 'ALPSRP042301700-P1.1__A.img', 0, line_count, sample_count)
print "Ingesting HH_phase..."
HHphase = mmapChannel(HHphase, 'ALPSRP042301700-P1.1__A.img', 1, line_count, sample_count)
print "Ingesting HV_AMP..."
HVamp = mmapChannel(HVamp, 'ALPSRP042301700-P1.1__A.img', 2, line_count, sample_count)
print "Ingesting HV_phase..."
HVphase = mmapChannel(HVphase, 'ALPSRP042301700-P1.1__A.img', 3, line_count, sample_count)
print "Reshaping...."
HHamp_orig = HHamp.reshape(line_count, -1)
HHphase_orig = HHphase.reshape(line_count, -1)
HVamp_orig = HVamp.reshape(line_count, -1)
HVphase_orig = HVphase.reshape(line_count, -1)
Answer 0 (score: 7)
import numpy

arr = numpy.fromfile(filename, numpy.dtype('>f4'))
# no byteswap is needed regardless of the endianness of the machine
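Applied to the channel layout from the question, here is a minimal sketch of how this approach could replace mmapChannel entirely (read_channel is a hypothetical helper; line_count and sample_count still come from the .meta file):

import numpy as np

def read_channel(file_name, channel_no, line_count, sample_count):
    # hypothetical helper: pull one channel out of the multi-channel file
    count = line_count * sample_count
    with open(file_name, "rb") as f:
        f.seek(channel_no * count * 4)  # 4 bytes per big-endian float32
        # '>f4' tells numpy the on-disk byte order, so it handles the swap itself
        data = np.fromfile(f, dtype='>f4', count=count)
    return data.reshape(line_count, sample_count)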
Answer 1 (score: 6)
with open(fileName, "rb") as f:
    arrayName = numpy.fromfile(f, numpy.float32)
arrayName.byteswap(True)
Hard to beat for speed and simplicity ;-). For byteswap see here (the True argument means "do it in place"); for fromfile see here.
This works as-is on little-endian machines (since the data is big-endian, the byteswap is needed). You can test for that condition and perform the byteswap only when it applies, changing the last line from an unconditional call into a conditional one, e.g.:
if struct.pack('=f', 2.3) == struct.pack('<f', 2.3):
    arrayName.byteswap(True)
i.e., a call to byteswap conditioned on a test of little-endianness.
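An equivalent test, as a sketch, would use the standard library's sys.byteorder instead of comparing struct packings:

import sys

# byteswap only when the host is little-endian (the file data is big-endian)
if sys.byteorder == 'little':
    arrayName.byteswap(True)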
Answer 2 (score: 0)
You could cobble together an ASM-based solution using CorePy. I do wonder, though, whether you could gain enough performance from some other part of your algorithm; I/O and manipulation of 1GB chunks of data are going to take a while whichever way you slice it.
One other thing you might find helpful is to switch to C once you've prototyped the algorithm in Python. I did this for manipulations of a whole-world DEM (elevation) dataset once. The whole thing was much more tolerable once I got away from the interpreted script.
Answer 3 (score: 0)
I would expect something like this to be faster:
arrayName[0] = unpack('>' + 'f' * line_count * sample_count, map.read(arrayName.itemsize * line_count * sample_count))
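If unpacking everything in one call proves too memory-hungry for a 1GB file (roughly 268 million floats, so both the format string and the result tuple get huge), a chunked variant of the same idea is a possible middle ground; a sketch, reusing the question's map and arrayName:

CHUNK = 65536  # floats per unpack call; an arbitrary tuning choice
total = line_count * sample_count
offset = 0
while offset < total:
    n = min(CHUNK, total - offset)
    # unpack one block of big-endian float32s and drop it into the array
    arrayName[0, offset:offset + n] = unpack('>%df' % n, map.read(4 * n))
    offset += n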
And please don't use map as a variable name; it shadows the built-in function.