Question

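I am writing data into a three-dimensional dataset and I have noticed a very disturbing issue. I want to write 20 matrices of 2000x2000 into a dataset. I noticed that writing to a 2000x2000x20 dataset is much slower than writing to a 20x2000x2000 dataset. Does anyone know why?

Slow - time: 66.4123821259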

import h5py
import numpy as np

# Open explicitly for writing; recent h5py versions default to read-only mode
file1 = h5py.File('/home/serra/Documents/Software/Store_testdata/TestDataset.h5', 'w')
a = file1.create_group('run')
b = a.create_dataset('seq_1',(2000,2000,20))

for i in range(20):
    b[:,:,i] = np.random.rand(2000,2000)

file1.close()

Fast - time: 3.72713208199

import h5py
import numpy as np

# Open explicitly for writing; recent h5py versions default to read-only mode
file1 = h5py.File('/home/serra/Documents/Software/Store_testdata/TestDataset.h5', 'w')
a = file1.create_group('run')
b = a.create_dataset('seq_1',(20,2000,2000))

for i in range(20):
    b[i,:,:] = np.random.rand(2000,2000)

file1.close()

Answer 1

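The performance difference has nothing to do with the size of the matrices, but with the order in which you fill in the data: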

b[i,:,:] = np.random.rand(2000,2000)
b[:,:,i] = np.random.rand(2000,2000)

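In the first case, you are filling cells that are contiguous in memory. In the second case, the cells are scattered across memory.

When items are contiguous in memory, fetching the first one will likely bring its neighbors into the cache as well. In the other case, fetching one item still caches its neighbors, but most of them will not be used before they are evicted.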
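To make "contiguous" versus "scattered" concrete, here is a minimal sketch that computes which flat positions each kind of slice touches, assuming a row-major (C order) layout, which is what NumPy uses by default. The helper `flat_offset` and the small shapes are hypothetical stand-ins for the 2000x2000x20 and 20x2000x2000 datasets; they are not part of h5py or NumPy.

```python
def flat_offset(index, shape):
    """Row-major (C order) offset of a multi-dimensional index."""
    offset = 0
    for i, n in zip(index, shape):
        offset = offset * n + i
    return offset


# Small stand-ins for the real shapes (assumption, for readability only).
slow_shape = (3, 3, 2)  # plays the role of (2000, 2000, 20)
fast_shape = (2, 3, 3)  # plays the role of (20, 2000, 2000)

# Positions touched by b[:, :, 0] in the "slow" layout.
slow = [flat_offset((i, j, 0), slow_shape) for i in range(3) for j in range(3)]

# Positions touched by b[0, :, :] in the "fast" layout.
fast = [flat_offset((0, j, k), fast_shape) for j in range(3) for k in range(3)]

print("slow:", slow)  # every second position: a strided, scattered pattern
print("fast:", fast)  # one unbroken run of positions
```

With the real shapes the stride in the slow case would be 20 elements instead of 2, so each written value lands far from the previous one.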

For illustration purposes, consider the two-dimensional case and assume that two items fit in the cache. The following matrix:

numpy.array([[10, 20, 30], [40, 50, 60]])

is stored in memory like this:

10 20 30 40 50 60

Let's see what happens when we iterate in row order:

a[0][0] → fetch 10 from memory (cached: 10 20)
a[0][1] → read 20 from cache
a[0][2] → fetch 30 from memory (cached: 30 40)
a[1][0] → read 40 from cache
a[1][1] → fetch 50 from memory (cached: 50 60)
a[1][2] → read 60 from cache

Now let's iterate in column order:

a[0][0] → fetch 10 from memory (cached: 10 20)
a[1][0] → fetch 40 from memory (cached: 30 40)
a[0][1] → fetch 20 from memory (cached: 10 20)
a[1][1] → fetch 50 from memory (cached: 50 60)
a[0][2] → fetch 30 from memory (cached: 30 40)
a[1][2] → fetch 60 from memory (cached: 50 60)

So in the first case you can traverse the whole matrix with only three memory accesses, while in the second case you need six. As a rule of thumb, reading a value from memory is roughly 200 times slower than reading it from cache.
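The two traces can be reproduced with a toy model. This is only a sketch under the same simplifying assumption as the example (the cache holds a single aligned pair of items); real caches are larger and more complex, and `count_fetches` is a hypothetical helper written for this illustration.

```python
def count_fetches(addresses, line_size=2):
    """Count memory fetches for a toy cache that holds one line of `line_size` items."""
    cached_line = None
    fetches = 0
    for addr in addresses:
        line = addr // line_size  # which aligned pair this address belongs to
        if line != cached_line:   # miss: load the pair from memory
            fetches += 1
            cached_line = line
    return fetches


rows, cols = 2, 3  # the 2x3 matrix [[10, 20, 30], [40, 50, 60]]

# Flat memory positions visited in each iteration order.
row_order = [r * cols + c for r in range(rows) for c in range(cols)]
col_order = [r * cols + c for c in range(cols) for r in range(rows)]

print(count_fetches(row_order))  # 3
print(count_fetches(col_order))  # 6
```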

Answer 2

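My guess is that writing to the 20x2000x2000 dataset is faster because fewer comparison and increment/decrement operations are performed. Think of it as a for loop like this (2000x2000x20):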

for (int i = 0; i < 2000; i++)
{
    for (int j = 0; j < 2000; j++)
    {
        for (int k = 0; k < 20; k++)
        {
            dataset[i][j][k] = data;
        }
    }
}

Number of comparison operations: 88,004,001

Number of increment operations: 84,002,000

In the next loop (20x2000x2000):

for (int i = 0; i < 20; i++)
{
    for (int j = 0; j < 2000; j++)
    {
        for (int k = 0; k < 2000; k++)
        {
            dataset[i][j][k] = data;
        }
    }
}

Number of comparison operations: 80,080,041

Number of increment operations: 80,040,020

These counts come from this handy function, which I put together with the help of this link: http://umencs.blogspot.com/2013/04/optimization-of-nested-for-loops.html

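#include <stdio.h>

/* Counts every loop-condition check and every loop-variable increment of a triple nested loop. */
void ComparisonAndIncrementCount(int nOuterLoop, int nMiddleLoop, int nInnerLoop)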
{
    int nComparisonCount = 0;
    int nIncrementCount = 0;

    for (int i = 0; (++nComparisonCount) && i < nOuterLoop; i++, ++nIncrementCount)
    {
        for (int j = 0; (++nComparisonCount) && j < nMiddleLoop; j++, ++nIncrementCount)
        {
            for (int k = 0; (++nComparisonCount) && k < nInnerLoop; k++, ++nIncrementCount) {}
        }
    }

    printf("\n#No. of Increment Operations of Nested For Loop: %d", nIncrementCount);
    printf("\n#No. of Comparison Operations of Nested For Loop: %d", nComparisonCount);
}
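The same counts can be obtained without running the loops. The sketch below is a closed-form equivalent of the counting function above, written in Python; `loop_counts` is a hypothetical name. Each loop level checks its condition once per iteration plus once more when it exits, and increments once per iteration.

```python
def loop_counts(n_outer, n_middle, n_inner):
    """Closed-form comparison and increment counts for a triple nested for loop."""
    comparisons = (
        (n_outer + 1)                           # outer loop checks
        + n_outer * (n_middle + 1)              # middle loop checks
        + n_outer * n_middle * (n_inner + 1)    # inner loop checks
    )
    increments = (
        n_outer
        + n_outer * n_middle
        + n_outer * n_middle * n_inner
    )
    return comparisons, increments


print(loop_counts(2000, 2000, 20))  # (88004001, 84002000)
print(loop_counts(20, 2000, 2000))  # (80080041, 80040020)
```

The two orderings differ by roughly 8 million comparisons and 4 million increments out of about 80 million of each.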
