Question

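I'm trying to write data into an h5py dataset, using a high-memory, 12-core GCE instance that writes to an SSD disk, but it has been running for 13 hours with no end in sight. I'm running a Jupyter Notebook on the GCE instance to unpickle a large number of small files (stored on a second, non-SSD disk) and then add them to an h5py dataset in a file stored on the SSD disk.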

Max shape = (29914, 251328)
Chunks = (59, 982)
compression = gzip
dtype = float64

My code is listed below:

#Get a sample
minsample = 13300
sampleWithOutReplacement = random.sample(ListOfPickles,minsample)

print(h5pyfile)
with h5py.File(h5pyfile, 'r+') as hf:
    GroupToStore = hf.get('group')
    DatasetToStore = GroupToStore.get('ds1')
    #Unpickle the contents and add them to the h5py file
    for idx,files in enumerate(sampleWithOutReplacement):
        #Sample the minimum number of examples
        %time FilePath = os.path.join(SourceOfPickles,files)
        #Use this method to auto close the file
        with open(FilePath,"rb") as f:
            %time DatasetToStore[idx:] = pickle.load(f)
            #print("Processing file ",idx)

print("File Closed")

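The h5py file on disk appears to have grown by 1.4 GB for the dataset I populated with the code above. Below is the code I used to create the dataset in the h5py file: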

group.create_dataset(labels, dtype='float64',shape= (maxSize, 251328),maxshape=(maxSize,251328),compression="gzip")

What improvements can I make to my configuration, my code, or both, to reduce the time needed to populate the h5py file?

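Update 1: I added some Jupyter notebook magics to time the process. I'd welcome any advice on speeding up the loading into the datastore, which was reported as taking 8 hours: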

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 14.1 µs
CPU times: user 8h 4min 11s, sys: 1min 18s, total: 8h 5min 30s
Wall time: 8h 5min 29s

Answer 1

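This looks very wrong: DatasetToStore[idx:]

You probably want: DatasetToStore[idx, ...]

I think your version overwrites every row from idx onward with the unpickled data on each iteration. This version overwrites only a single row of the dataset per iteration.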
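To make the difference concrete, here is a minimal sketch that uses a small NumPy array as a stand-in for the dataset (as far as I know, h5py broadcasts a 1D row the same way on this kind of slice assignment). The sizes and the helper names `fill`, `assign_tail` and `assign_single` are made up for illustration; they are not from the question.

```python
import numpy as np

n_rows, n_cols = 6, 4     # tiny stand-in for the (29914, 251328) dataset
rows_to_write = 3         # stand-in for the 13300 sampled pickles


def assign_tail(data, idx, row):
    """The question's form: broadcasts row into every row from idx onward."""
    data[idx:] = row
    return data.shape[0] - idx   # number of rows touched


def assign_single(data, idx, row):
    """The suggested form: touches exactly one row."""
    data[idx, ...] = row
    return 1


def fill(assign):
    data = np.zeros((n_rows, n_cols))
    rows_touched = 0
    for idx in range(rows_to_write):
        row = np.full(n_cols, idx + 1.0)   # stand-in for pickle.load(f)
        rows_touched += assign(data, idx, row)
    return data, rows_touched


tail_data, tail_count = fill(assign_tail)
single_data, single_count = fill(assign_single)

print("rows touched with [idx:]     :", tail_count)
print("rows touched with [idx, ...] :", single_count)
print(tail_data)
print(single_data)
```

Both variants end up with the same values in the rows that were actually loaded, but the `[idx:]` form also fills every later row with copies of the most recent data and touches far more rows along the way. With the sizes from the question, that extra work would be repeated on each of the 13300 iterations.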

Answer 2

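JRoose is right, something seems wrong with the code.
By default, h5py uses a chunk cache of only 1 MB, which is not enough for your problem. You can change the cache settings through the low-level API, or use h5py_cache. https://pypi.python.org/pypi/h5py-cache/1.0

Change the line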
```
with h5py.File(h5pyfile, 'r+') as hf:
```
to
```
with h5py_cache.File(h5pyfile, 'r+', chunk_cache_mem_size=500*1024**2) as hf:
```
to increase the chunk cache to, for example, 500 MB.
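A rough back-of-the-envelope check, using the chunk shape and dtype from the question, shows why a 1 MB cache is too small: one chunk is a bit under half a megabyte, and a single row spans 256 chunks. The sketch below also opens a file with the `rdcc_nbytes` keyword that newer h5py releases accept directly on `h5py.File` (since 2.9, as far as I know), as a possible alternative to h5py_cache. The scratch file and the small dataset shape are hypothetical and only there to make the snippet runnable.

```python
import math
import os
import tempfile

import h5py
import numpy as np

# Figures taken from the question.
chunk_shape = (59, 982)
n_cols = 251328
itemsize = np.dtype("float64").itemsize

chunk_bytes = chunk_shape[0] * chunk_shape[1] * itemsize
chunks_per_row = math.ceil(n_cols / chunk_shape[1])
bytes_touched_per_row = chunk_bytes * chunks_per_row

print(f"one chunk: {chunk_bytes} bytes")
print(f"chunks touched by one row: {chunks_per_row}")
print(f"uncompressed data touched per row write: "
      f"{bytes_touched_per_row / 1024**2:.1f} MiB")

# Hypothetical scratch file, only to show the keyword argument.
demo_path = os.path.join(tempfile.mkdtemp(), "cache_demo.h5")
with h5py.File(demo_path, "w", rdcc_nbytes=500 * 1024**2) as hf:
    ds = hf.create_dataset("ds1", shape=(118, 1964), dtype="float64",
                           chunks=chunk_shape, compression="gzip")
    ds[0, :] = np.arange(1964, dtype="float64")
    first_row = ds[0, :]
```

If the chunks that one row touches do not fit in the cache, each row write is likely to force chunks to be read, decompressed, modified and recompressed again and again, which would fit the very high CPU time reported in the question.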
I assume pickle.load(f) yields a 1D array, while your dataset is 2D. In that case there is nothing wrong with writing
```
%time DatasetToStore[idx,:] = pickle.load(f)
```
but in my experience this will be rather slow. To speed it up, create a 2D array before passing the data to the dataset.
```
%time DatasetToStore[idx:idx+1,:] = np.expand_dims(pickle.load(f), axis=0)
```
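I don't really know why this is faster, but in my scripts this version was about 20 times faster than the one above. The same holds for reading from an HDF5 file.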
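One way to take the 2D idea a step further is to collect several rows in a NumPy buffer and write them as one block, so that each chunk is touched once per batch rather than once per row. This is only a sketch under assumptions: the tiny shapes, the `batch_rows` value, the scratch file and the `load_row` helper (a stand-in for `pickle.load(f)`) are all made up, and I have not measured it on data like yours.

```python
import os
import tempfile

import h5py
import numpy as np

n_rows, n_cols = 10, 8
batch_rows = 4  # hypothetical; a multiple of the chunk height may be a good fit


def load_row(i):
    """Stand-in for pickle.load(f): returns one 1D row."""
    return np.full(n_cols, float(i))


demo_path = os.path.join(tempfile.mkdtemp(), "batch_demo.h5")
with h5py.File(demo_path, "w") as hf:
    ds = hf.create_dataset("ds1", shape=(n_rows, n_cols), dtype="float64",
                           chunks=(batch_rows, n_cols), compression="gzip")
    buffer = np.empty((batch_rows, n_cols), dtype="float64")
    filled = 0  # rows currently held in the buffer
    start = 0   # dataset row where the buffer will be written
    for i in range(n_rows):
        buffer[filled] = load_row(i)
        filled += 1
        # Flush when the buffer is full or the last row has been loaded.
        if filled == batch_rows or i == n_rows - 1:
            ds[start:start + filled, :] = buffer[:filled]
            start += filled
            filled = 0
    result = ds[...]

print(result)
```

With the dataset from the question, a buffer of 59 rows (one chunk height) would hold about 59 * 251328 * 8 bytes, roughly 113 MiB, which should be manageable on a high-memory instance.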

Writing data to h5py on an SSD disk seems slow: what can I do to speed it up?
