Question

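The basic goal of my program is to read images and produce HDF5 files. I split the HDF5 data into 1000 parts so that it is easier to manage.

The program reads each image, resizes it, and then writes it to a file.

I do not think multithreading would speed this up, but I may be wrong.

My dataset has about 15 million images.

I am using a powerful machine with a 4 GB GPU, 32 GB of RAM and an Intel(R) Xeon(R) CPU E5-2687W v3 @ 3.10GHz.

P.S. I could try other image-processing packages such as OpenCV, but I have no basis for comparison.

As of now, the program has been running for 3 days straight and is about 80% done. I would like to avoid this problem the next time I do something similar.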

import os

import caffe
import h5py
import numpy as np

ipfldr = "/path/to/img/fldr"
os.chdir(ipfldr)
SIZE = 58  # fixed size for all images
nof = 16   # number of frames (images) per sample

with open('/path/to/txtfile', 'r') as T:
    lines = T.readlines()

# If you do not have enough memory split data into
# multiple batches and generate multiple separate h5 files
print(len(lines))
# float32, because caffe.io.load_image returns floats in [0, 1]
X = np.zeros((1000, nof * 3, SIZE, SIZE), dtype=np.float32)
y = np.zeros((1000, 1), dtype=np.int64)
for i, l in enumerate(lines):
    sp = l.split()  # split the line into its image paths (17 categories)
    cla = int(sp[0].split("/")[0])  # class label = leading directory name
    for fr in range(nof):
        img = caffe.io.load_image(sp[fr])  # H x W x 3
        img = caffe.io.resize_image(img, (SIZE, SIZE))  # resize to fixed size
        # you may apply other input transformations here...
        # each frame fills 3 channels, so frame fr starts at channel fr * 3
        X[i % 1000, fr * 3:fr * 3 + 3] = img.transpose(2, 0, 1)
    y[i % 1000] = cla
    if (i + 1) % 1000 == 0:  # buffer is full: write one h5 file per 1000 samples
        name = 'val' + str(i // 1000) + '.h5'
        with h5py.File('val/' + name, 'w') as H:
            H.create_dataset('data', data=X)   # note the name 'data' given to the dataset!
            H.create_dataset('label', data=y)  # note the name 'label' given to the dataset!
        with open('val_h5_list.txt', 'a') as L:  # append, so earlier entries survive
            L.write(name + '\n')  # list all h5 files you are going to use
        if len(lines) - (i + 1) < 1000:
            break  # an incomplete final batch is not written

Answer 1

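I am fairly sure you can improve performance with a multithreaded approach. You did not spend 3 days loading data from disk (you would need an unrealistic amount of disk space to be reading for that long), so it seems you are waiting on the resize step on the CPU.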

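You could do it like this: 1 reader reads data from disk in large chunks and puts individual images into a queue. A number of workers take images from that queue, resize them and put them into a second queue. 1 writer takes the resized images from the second queue and, once it has collected a good batch, writes them to disk. (The reader and the writer can be the same process without loss of efficiency, assuming you read from and write to the same disk anyway.)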
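A minimal sketch of that reader / workers / writer layout, using only the standard library. The names `run_pipeline`, `load`, `resize` and `flush` are hypothetical placeholders, not part of caffe or h5py: you would pass in your own image loader, resize function and HDF5 writer. It uses threads as described above; note that in CPython, CPU-bound pure-Python code is serialized by the GIL, so if the resize does not release it you may need processes instead.

```python
import queue
import threading

_STOP = object()  # sentinel that marks the end of a queue


def run_pipeline(paths, load, resize, flush,
                 n_workers=14, batch_size=1000, queue_size=2000):
    """1 reader -> n_workers resizers -> 1 writer that flushes in batches."""
    raw = queue.Queue(maxsize=queue_size)   # bounded: reader cannot run far ahead
    done = queue.Queue(maxsize=queue_size)

    def reader():
        for idx, path in enumerate(paths):
            raw.put((idx, load(path)))
        for _ in range(n_workers):          # one stop marker per worker
            raw.put(_STOP)

    def worker():
        while True:
            item = raw.get()
            if item is _STOP:
                done.put(_STOP)             # tell the writer this worker is finished
                return
            idx, img = item
            done.put((idx, resize(img)))

    def writer():
        batch, finished = [], 0
        while finished < n_workers:
            item = done.get()
            if item is _STOP:
                finished += 1
                continue
            batch.append(item)
            if len(batch) == batch_size:    # write many items per disk access
                flush(batch)
                batch = []
        if batch:                           # write the incomplete final batch
            flush(batch)

    threads = [threading.Thread(target=reader), threading.Thread(target=writer)]
    threads += [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Results reach the writer out of order, which is why each item carries its index `idx`. The sketch has no error handling: an exception inside `load` or `resize` would leave the other threads waiting.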

My guess is that 1 worker per HW thread (16 in your case), minus 2 to leave cores for the reader and the writer (so 14), should be a good starting point.
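If you would rather not hard-code that number, something like the following could compute the same starting point on any machine. `suggest_worker_count` is a hypothetical helper name; `os.cpu_count()` reports logical CPUs (hardware threads) and can return `None`, hence the fallback.

```python
import os


def suggest_worker_count(reserved=2):
    """Hardware threads minus the ones reserved for reader and writer."""
    hw_threads = os.cpu_count() or 1  # cpu_count() may return None
    return max(1, hw_threads - reserved)
```

Treat the result as a first guess only, and time a few thousand images with different worker counts before committing to a multi-day run.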

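This way you decouple the I/O work from waiting on the CPU, and you minimize I/O access overhead by doing a lot of work every time you initiate a read or write.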

Increase the speed of I/O operations? HDF5 data creation for caffe
