Question

I am trying to compare the performance of boost::multi_array with a native, dynamically allocated array, using the following test program:

#include <windows.h>
#include <cstdio>  // printf
#define _SCL_SECURE_NO_WARNINGS
#define BOOST_DISABLE_ASSERTS 
#include <boost/multi_array.hpp>

int main(int argc, char* argv[])
{
    const int X_SIZE = 200;
    const int Y_SIZE = 200;
    const int ITERATIONS = 500;
    unsigned int startTime = 0;
    unsigned int endTime = 0;

    // Create the boost array
    typedef boost::multi_array<double, 2> ImageArrayType;
    ImageArrayType boostMatrix(boost::extents[X_SIZE][Y_SIZE]);

    // Create the native array
    double *nativeMatrix = new double [X_SIZE * Y_SIZE];

    //------------------Measure boost----------------------------------------------
    startTime = ::GetTickCount();
    for (int i = 0; i < ITERATIONS; ++i)
    {
        for (int y = 0; y < Y_SIZE; ++y)
        {
            for (int x = 0; x < X_SIZE; ++x)
            {
                boostMatrix[x][y] = 2.345;
            }
        }
    }
    endTime = ::GetTickCount();
    printf("[Boost] Elapsed time: %6.3f seconds\n", (endTime - startTime) / 1000.0);

    //------------------Measure native-----------------------------------------------
    startTime = ::GetTickCount();
    for (int i = 0; i < ITERATIONS; ++i)
    {
        for (int y = 0; y < Y_SIZE; ++y)
        {
            for (int x = 0; x < X_SIZE; ++x)
            {
                nativeMatrix[x + (y * X_SIZE)] = 2.345;
            }
        }
    }
    endTime = ::GetTickCount();
    printf("[Native]Elapsed time: %6.3f seconds\n", (endTime - startTime) / 1000.0);

    return 0;
}

I get the following results:

[Boost] Elapsed time: 12.500 seconds
[Native]Elapsed time:  0.062 seconds

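I can't believe multi_array could be that much slower. Can anyone spot what I am doing wrong?

I assume caching is not an issue, since I am only writing to memory.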

EDIT: This was a debug build. Per Laserallan's suggestion I did a release build:

[Boost] Elapsed time:  0.266 seconds
[Native]Elapsed time:  0.016 seconds

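Much closer. But 16 to 1 still seems high to me.

Well, there is no definitive answer, but I'm going to move on and leave my real code with native arrays for now.

Accepting Laserallan's answer because it was the biggest flaw in my test.

Thanks to all.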

Answer 1

On my machine, using

g++ -O3 -march=native -mtune=native --fast-math -DNDEBUG test.cpp -o test && ./test

I get

[Boost] Elapsed time:  0.020 seconds
[Native]Elapsed time:  0.020 seconds

However, changing const int ITERATIONS to 5000, I get

[Boost] Elapsed time:  0.240 seconds
[Native]Elapsed time:  0.180 seconds

Then, with ITERATIONS back at 500 but X_SIZE and Y_SIZE set to 400, I get a much larger difference

[Boost] Elapsed time:  0.460 seconds
[Native]Elapsed time:  0.070 seconds

Finally, inverting the inner loop for the [Boost] case so that it looks like

    for (int x = 0; x < X_SIZE; ++x)
    {
        for (int y = 0; y < Y_SIZE; ++y)
        {

and keeping ITERATIONS, X_SIZE and Y_SIZE at 500, 400 and 400, I get

[Boost] Elapsed time:  0.060 seconds
[Native]Elapsed time:  0.080 seconds

If I also invert the inner loop for the [Native] case (so that it is in the wrong order for that case), I get, unsurprisingly,

[Boost] Elapsed time:  0.070 seconds
[Native]Elapsed time:  0.450 seconds

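I am using gcc (Ubuntu/Linaro 4.4.4-14ubuntu5) 4.4.5 on Ubuntu 10.10

In conclusion:

With proper optimization, boost::multi_array does its job as expected
The order in which you access your data matters a great deal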
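To make the access-order point concrete, here is a minimal sketch that uses only the standard library (no Boost). It assumes a row-major layout, which is also the default storage order of boost::multi_array, so the last index is the contiguous one. The helper names row_major_offset, fill_contiguous and fill_strided are made up for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Offset of element (x, y) in a row-major X-by-Y array.
// Stepping y moves by 1 element; stepping x moves by y_size elements.
inline std::size_t row_major_offset(std::size_t x, std::size_t y, std::size_t y_size) {
    return x * y_size + y;
}

// Cache-friendly order: the inner loop walks the contiguous (last) index.
inline void fill_contiguous(std::vector<double>& data, std::size_t x_size,
                            std::size_t y_size, double value) {
    assert(data.size() == x_size * y_size);
    for (std::size_t x = 0; x < x_size; ++x)
        for (std::size_t y = 0; y < y_size; ++y)
            data[row_major_offset(x, y, y_size)] = value;
}

// Same result, but the inner loop jumps y_size elements per step.
// This mirrors writing boostMatrix[x][y] with x as the innermost loop.
inline void fill_strided(std::vector<double>& data, std::size_t x_size,
                         std::size_t y_size, double value) {
    assert(data.size() == x_size * y_size);
    for (std::size_t y = 0; y < y_size; ++y)
        for (std::size_t x = 0; x < x_size; ++x)
            data[row_major_offset(x, y, y_size)] = value;
}
```

Both functions write the same values; only the memory access pattern differs. In the question's program the native loop indexes x + (y * X_SIZE) with x innermost, which is contiguous, while the Boost loop indexes [x][y] with x innermost, which is strided. That would explain why swapping the loops reverses the timings above.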

Answer 2

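Your test is flawed.

In a DEBUG build, boost::MultiArray lacks the optimization pass that it sorely needs. (Much more so than a native array would.)
In a RELEASE build, your compiler will look for code that can be removed outright, and most of your code falls into that category.

What you are likely seeing is the result of your optimizing compiler noticing that most or all of your "native array" loops can be removed. The same is theoretically true of your boost::MultiArray loops, but MultiArray is probably complex enough to defeat your optimizer.

Make this small change to your testbed and you will see more true-to-life results: change both occurrences of "= 2.345" to "*= 2.345" and compile again with optimizations. This prevents your compiler from discovering that the outer loop of each test is redundant.

I did it and got a speed comparison closer to 2:1.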
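One way to sketch the same idea with only the standard library: make every pass depend on the previous one, and return a value computed from the data so the caller can print it. The names scale_and_sum and seconds_taken are hypothetical helpers for this illustration, and std::chrono::steady_clock stands in for GetTickCount as a portable timer.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Multiplies every element by `factor`, `iterations` times, then returns
// the sum of all elements. Because each pass reads what the previous pass
// wrote, the outer loop is no longer redundant, and because the sum is
// returned, the stores cannot simply be discarded.
inline double scale_and_sum(std::vector<double>& data, int iterations, double factor) {
    for (int i = 0; i < iterations; ++i)
        for (std::size_t k = 0; k < data.size(); ++k)
            data[k] *= factor;
    double sum = 0.0;
    for (double v : data) sum += v;
    return sum;  // the caller should print or otherwise use this value
}

// Runs `work()` once and returns the elapsed wall-clock time in seconds.
template <typename F>
double seconds_taken(F&& work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

A benchmark would call seconds_taken with a lambda that stores the result of scale_and_sum into a variable and prints that variable afterwards. The data should start from nonzero values, otherwise every product stays zero.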

Answer 3

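Are you building release or debug?

If you are running in debug mode, the boost array might be really slow because its template magic isn't inlined properly, which adds a lot of function-call overhead. I'm not sure how multi_array is implemented, though, so this might be totally off :)

Perhaps there is also some difference in storage order, so you might have your image stored column by column while writing it row by row. That would give poor cache behavior and may slow things down.

Try switching the order of the X and Y loops and see if you gain anything. There is some information on storage ordering here: http://www.boost.org/doc/libs/1_37_0/libs/multi_array/doc/user.html

EDIT: Since you seem to be using the two-dimensional array for image processing, you might be interested in checking out Boost's image processing library, gil.

It might have arrays with less overhead that fit your situation perfectly.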

Answer 4

Consider using Blitz++ instead. I tried out Blitz, and its performance is on par with C-style arrays!

Check out your code with Blitz added below:

#include <windows.h>
#include <cstdio>  // printf
#define _SCL_SECURE_NO_WARNINGS
#define BOOST_DISABLE_ASSERTS 
#include <boost/multi_array.hpp>
#include <blitz/array.h>

int main(int argc, char* argv[])
{
    const int X_SIZE = 200;
    const int Y_SIZE = 200;
    const int ITERATIONS = 500;
    unsigned int startTime = 0;
    unsigned int endTime = 0;

    // Create the boost array
    typedef boost::multi_array<double, 2> ImageArrayType;
    ImageArrayType boostMatrix(boost::extents[X_SIZE][Y_SIZE]);


    //------------------Measure boost----------------------------------------------
    startTime = ::GetTickCount();
    for (int i = 0; i < ITERATIONS; ++i)
    {
        for (int y = 0; y < Y_SIZE; ++y)
        {
            for (int x = 0; x < X_SIZE; ++x)
            {
                boostMatrix[x][y] = 2.345;
            }
        }
    }
    endTime = ::GetTickCount();
    printf("[Boost] Elapsed time: %6.3f seconds\n", (endTime - startTime) / 1000.0);

    //------------------Measure blitz-----------------------------------------------
    blitz::Array<double, 2> blitzArray( X_SIZE, Y_SIZE );
    startTime = ::GetTickCount();
    for (int i = 0; i < ITERATIONS; ++i)
    {
        for (int y = 0; y < Y_SIZE; ++y)
        {
            for (int x = 0; x < X_SIZE; ++x)
            {
                blitzArray(x,y) = 2.345;
            }
        }
    }
    endTime = ::GetTickCount();
    printf("[Blitz] Elapsed time: %6.3f seconds\n", (endTime - startTime) / 1000.0);


    //------------------Measure native-----------------------------------------------
    // Create the native array
    double *nativeMatrix = new double [X_SIZE * Y_SIZE];

    startTime = ::GetTickCount();
    for (int i = 0; i < ITERATIONS; ++i)
    {
        for (int y = 0; y < Y_SIZE; ++y)
        {
            for (int x = 0; x < X_SIZE; ++x)
            {
                nativeMatrix[x + (y * X_SIZE)] = 2.345;
            }
        }
    }
    endTime = ::GetTickCount();
    printf("[Native]Elapsed time: %6.3f seconds\n", (endTime - startTime) / 1000.0);



    return 0;
}

Here are the results in debug and release.

DEBUG:

Boost  2.093 secs 
Blitz  0.375 secs 
Native 0.078 secs

RELEASE:

Boost  0.266 secs
Blitz  0.016 secs
Native 0.015 secs

I used the MSVC 2008 SP1 compiler for this.

Can we now say goodbye to C-style arrays? =P

Answer 5

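I am wondering about two things:

1) Bounds checking: define the BOOST_DISABLE_ASSERTS preprocessor macro before including multi_array.hpp in your application. This turns off bounds checking. I am not sure whether it is also disabled when NDEBUG is defined.

2) Base index: MultiArray can index arrays from bases different from 0. That means multi_array stores a base number (in each dimension) and uses a more complicated formula to obtain the exact location in memory. I am wondering if that is what this is all about.

Otherwise I don't understand why multi_array should be slower than C arrays.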
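As a rough illustration of the base-index point, the sketch below spells out an offset formula with per-dimension bases and strides, and shows that the base terms can be folded into one constant. The struct Layout2D and the three functions are hypothetical names for this example; this is not Boost's actual implementation.

```cpp
#include <cstddef>

// Hypothetical descriptor for a 2-D array with arbitrary index bases.
struct Layout2D {
    std::ptrdiff_t bases[2];    // first valid index in each dimension
    std::ptrdiff_t strides[2];  // elements skipped per unit step in each dimension
};

// Straightforward formula: subtract the base in every dimension.
inline std::ptrdiff_t offset_with_bases(const Layout2D& l, std::ptrdiff_t i, std::ptrdiff_t j) {
    return (i - l.bases[0]) * l.strides[0] + (j - l.bases[1]) * l.strides[1];
}

// The base terms do not depend on (i, j), so they can be computed once...
inline std::ptrdiff_t folded_origin(const Layout2D& l) {
    return -(l.bases[0] * l.strides[0] + l.bases[1] * l.strides[1]);
}

// ...leaving the usual zero-based formula plus a single addition per access.
inline std::ptrdiff_t offset_zero_based(const Layout2D& l, std::ptrdiff_t i, std::ptrdiff_t j) {
    return i * l.strides[0] + j * l.strides[1];
}
```

For any layout, folded_origin(l) + offset_zero_based(l, i, j) equals offset_with_bases(l, i, j). Under these assumptions, supporting nonzero bases costs one extra addition per access compared with a hand-written index calculation, which on its own would not account for a 16:1 slowdown.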

Answer 6

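I was looking at this question because I had the same question. I had some thoughts on how to make the test more rigorous.

As rodrigob pointed out, there are flaws in the loop order, so any results from the code you originally attached will be misleading
Also, the arrays are rather small and their sizes are set using constants. The compiler may be optimizing out the loops, whereas in real code the compiler would not know the size of the arrays. The array sizes and the number of iterations should be runtime inputs, just in case.

On a Mac, the following code is set up to give more meaningful answers. There are 4 tests here.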

#define BOOST_DISABLE_ASSERTS
#include "boost/multi_array.hpp"
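#include <sys/time.h>
#include <stdint.h>
#include <cstdio>   // printf
#include <cstdlib>  // rand, srand
#include <ctime>    // time
#include <string>   // std::stoi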

uint64_t GetTimeMs64()
{
  struct timeval tv;

  gettimeofday( &tv, NULL );

  uint64_t ret = tv.tv_usec;
  /* Convert from micro seconds (10^-6) to milliseconds (10^-3) */
  ret /= 1000;

  /* Adds the seconds (10^0) after converting them to milliseconds (10^-3) */
  ret += ( tv.tv_sec * 1000 );

  return ret;

}


void function1( const int X_SIZE, const int Y_SIZE, const int ITERATIONS )
{

  double nativeMatrix1add[X_SIZE*Y_SIZE];

  for( int x = 0 ; x < X_SIZE ; ++x )
  {
    for( int y = 0 ; y < Y_SIZE ; ++y )
    {
      nativeMatrix1add[y + ( x * Y_SIZE )] = rand();
    }
  }

  // Create the native array
  double* __restrict const nativeMatrix1p = new double[X_SIZE * Y_SIZE]();  // zero-initialized so += reads defined values
  uint64_t startTime = GetTimeMs64();
  for( int i = 0 ; i < ITERATIONS ; ++i )
  {
    for( int xy = 0 ; xy < X_SIZE*Y_SIZE ; ++xy )
    {
      nativeMatrix1p[xy] += nativeMatrix1add[xy];
    }
  }
  uint64_t endTime = GetTimeMs64();
  printf( "[Native Pointer]    Elapsed time: %6.3f seconds\n", ( endTime - startTime ) / 1000.0 );

}

void function2( const int X_SIZE, const int Y_SIZE, const int ITERATIONS )
{

  double nativeMatrix1add[X_SIZE*Y_SIZE];

  for( int x = 0 ; x < X_SIZE ; ++x )
  {
    for( int y = 0 ; y < Y_SIZE ; ++y )
    {
      nativeMatrix1add[y + ( x * Y_SIZE )] = rand();
    }
  }

  // Create the native array
  double* __restrict const nativeMatrix1 = new double[X_SIZE * Y_SIZE]();  // zero-initialized so += reads defined values
  uint64_t startTime = GetTimeMs64();
  for( int i = 0 ; i < ITERATIONS ; ++i )
  {
    for( int x = 0 ; x < X_SIZE ; ++x )
    {
      for( int y = 0 ; y < Y_SIZE ; ++y )
      {
        nativeMatrix1[y + ( x * Y_SIZE )] += nativeMatrix1add[y + ( x * Y_SIZE )];
      }
    }
  }
  uint64_t endTime = GetTimeMs64();
  printf( "[Native 1D Array]   Elapsed time: %6.3f seconds\n", ( endTime - startTime ) / 1000.0 );

}


void function3( const int X_SIZE, const int Y_SIZE, const int ITERATIONS )
{

  double nativeMatrix2add[X_SIZE][Y_SIZE];

  for( int x = 0 ; x < X_SIZE ; ++x )
  {
    for( int y = 0 ; y < Y_SIZE ; ++y )
    {
      nativeMatrix2add[x][y] = rand();
    }
  }

  // Create the native array
  double nativeMatrix2[X_SIZE][Y_SIZE];
  uint64_t startTime = GetTimeMs64();
  for( int i = 0 ; i < ITERATIONS ; ++i )
  {
    for( int x = 0 ; x < X_SIZE ; ++x )
    {
      for( int y = 0 ; y < Y_SIZE ; ++y )
      {
        nativeMatrix2[x][y] += nativeMatrix2add[x][y];
      }
    }
  }
  uint64_t endTime = GetTimeMs64();
  printf( "[Native 2D Array]   Elapsed time: %6.3f seconds\n", ( endTime - startTime ) / 1000.0 );

}



void function4( const int X_SIZE, const int Y_SIZE, const int ITERATIONS )
{

  boost::multi_array<double, 2> boostMatrix2add( boost::extents[X_SIZE][Y_SIZE] );

  for( int x = 0 ; x < X_SIZE ; ++x )
  {
    for( int y = 0 ; y < Y_SIZE ; ++y )
    {
      boostMatrix2add[x][y] = rand();
    }
  }

  // Create the boost array
  boost::multi_array<double, 2> boostMatrix( boost::extents[X_SIZE][Y_SIZE] );
  uint64_t startTime = GetTimeMs64();
  for( int i = 0 ; i < ITERATIONS ; ++i )
  {
    for( int x = 0 ; x < X_SIZE ; ++x )
    {
      for( int y = 0 ; y < Y_SIZE ; ++y )
      {
        boostMatrix[x][y] += boostMatrix2add[x][y];
      }
    }
  }
  uint64_t endTime = GetTimeMs64();
  printf( "[Boost Array]       Elapsed time: %6.3f seconds\n", ( endTime - startTime ) / 1000.0 );

}

int main( int argc, char* argv[] )
{

  srand( time( NULL ) );

  const int X_SIZE = std::stoi( argv[1] );
  const int Y_SIZE = std::stoi( argv[2] );
  const int ITERATIONS = std::stoi( argv[3] );

  function1( X_SIZE, Y_SIZE, ITERATIONS );
  function2( X_SIZE, Y_SIZE, ITERATIONS );
  function3( X_SIZE, Y_SIZE, ITERATIONS );
  function4( X_SIZE, Y_SIZE, ITERATIONS );

  return 0;
}

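single-dimensional array using [] with integer math and a double loop
the same single-dimensional array using pointer incrementing
multi-dimensional C array
boost multi_array

Run it from the command line as

./test_array xsize ysize iterations

and you get a good idea of how these approaches perform. Here is what I got with the following compiler flags: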

g++4.9.2 -O3 -march=native -funroll-loops -mno-avx --fast-math -DNDEBUG  -c -std=c++11


./test_array 51200 1 20000
[Native 1-Loop ]    Elapsed time:  0.537 seconds
[Native 1D Array]   Elapsed time:  2.045 seconds
[Native 2D Array]   Elapsed time:  2.749 seconds
[Boost Array]       Elapsed time:  1.167 seconds

./test_array 25600 2 20000
[Native 1-Loop ]    Elapsed time:  0.531 seconds
[Native 1D Array]   Elapsed time:  1.241 seconds
[Native 2D Array]   Elapsed time:  1.631 seconds
[Boost Array]       Elapsed time:  0.954 seconds

./test_array 12800 4 20000
[Native 1-Loop ]    Elapsed time:  0.536 seconds
[Native 1D Array]   Elapsed time:  1.214 seconds
[Native 2D Array]   Elapsed time:  1.223 seconds
[Boost Array]       Elapsed time:  0.798 seconds

./test_array 6400 8 20000
[Native 1-Loop ]    Elapsed time:  0.540 seconds
[Native 1D Array]   Elapsed time:  0.845 seconds
[Native 2D Array]   Elapsed time:  0.878 seconds
[Boost Array]       Elapsed time:  0.803 seconds

./test_array 3200 16 20000
[Native 1-Loop ]    Elapsed time:  0.537 seconds
[Native 1D Array]   Elapsed time:  0.661 seconds
[Native 2D Array]   Elapsed time:  0.673 seconds
[Boost Array]       Elapsed time:  0.708 seconds

./test_array 1600 32 20000
[Native 1-Loop ]    Elapsed time:  0.532 seconds
[Native 1D Array]   Elapsed time:  0.592 seconds
[Native 2D Array]   Elapsed time:  0.596 seconds
[Boost Array]       Elapsed time:  0.764 seconds

./test_array 800 64 20000
[Native 1-Loop ]    Elapsed time:  0.546 seconds
[Native 1D Array]   Elapsed time:  0.594 seconds
[Native 2D Array]   Elapsed time:  0.606 seconds
[Boost Array]       Elapsed time:  0.764 seconds

./test_array 400 128 20000
[Native 1-Loop ]    Elapsed time:  0.536 seconds
[Native 1D Array]   Elapsed time:  0.560 seconds
[Native 2D Array]   Elapsed time:  0.564 seconds
[Boost Array]       Elapsed time:  0.746 seconds

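So, I think it is safe to say that boost multi_array performs pretty well. Nothing beats a single-loop evaluation, but depending on the dimensions of the array, boost::multi_array may beat a standard C array with a double loop.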

Answer 7

Another thing to try is to use iterators instead of direct indexing for the boost array.

Answer 8

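I would have expected multi_array to be just as efficient. But I'm getting similar results on a PPC Mac using gcc. I also tried multi_array_ref, so that both versions were using the same storage, and it made no difference. This is good to know, since I use multi_array in some of my code and just assumed it was similar to hand-coding.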

Answer 9

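I think I know what the problem is... maybe.

In order for the boost implementation to have a syntax like matrix[x][y], matrix[x] has to return a reference to an object that acts like a 1D array column, at which point applying [y] to that reference gives you your element.

The problem here is that you are iterating in row-major order (which is typical in C/C++, since native arrays are row-major, IIRC). The compiler has to re-execute matrix[x] for each y in this case. If you iterated in column-major order when using the boost matrix, you might see better performance.

Just a theory.

EDIT: On my Linux system (with some minor changes) I tested my theory, and switching x and y did show some performance improvement, but it was still slower than a native array. This might simply be a matter of the compiler not being able to optimize away the temporary reference type.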
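The mechanism described here can be sketched with a tiny stand-in class. This is not Boost's code: Matrix2D, its nested Row proxy and fill_rows are hypothetical names, chosen only to show how a matrix[x][y] syntax implies a temporary object per matrix[x].

```cpp
#include <cstddef>
#include <vector>

// Minimal row-major 2-D array whose operator[] returns a proxy object.
class Matrix2D {
public:
    // Lightweight view of one row; applying [y] yields the element.
    class Row {
    public:
        explicit Row(double* first) : first_(first) {}
        double& operator[](std::size_t y) const { return first_[y]; }
    private:
        double* first_;
    };

    Matrix2D(std::size_t x_size, std::size_t y_size)
        : y_size_(y_size), data_(x_size * y_size, 0.0) {}

    // matrix[x] computes the start of row x and wraps it in a temporary Row.
    Row operator[](std::size_t x) { return Row(data_.data() + x * y_size_); }

private:
    std::size_t y_size_;
    std::vector<double> data_;
};

// With x as the outer loop, the Row proxy is built once per row
// instead of once per element.
inline void fill_rows(Matrix2D& m, std::size_t x_size, std::size_t y_size, double value) {
    for (std::size_t x = 0; x < x_size; ++x) {
        Matrix2D::Row row = m[x];  // hoisted out of the inner loop
        for (std::size_t y = 0; y < y_size; ++y)
            row[y] = value;
    }
}
```

An optimizer that inlines both operator[] calls reduces m[x][y] to plain pointer arithmetic. If it does not inline them, every element access pays for constructing the proxy, and hoisting it as in fill_rows is only possible when x is the outer loop.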

Answer 10

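Build in release mode, use objdump, and look at the assembly. The two loops may be doing completely different things, and you will be able to see which optimizations the compiler is applying.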

Answer 11

A similar question was asked and answered here:

http://www.codeguru.com/forum/archive/index.php/t-300014.html

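The short answer is that it is easiest for the compiler to optimize simple arrays, and not so easy to optimize the Boost version. Hence, a particular compiler may not give the Boost version all the same optimization benefits.

Compilers also vary in how aggressively they optimize versus how conservative they are (e.g. with templated code or other complications).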

Answer 12

I tested on Snow Leopard Mac OS using gcc 4.2.1

Debug:
[Boost] Elapsed time:  2.268 seconds
[Native]Elapsed time:  0.076 seconds

Release:
[Boost] Elapsed time:  0.065 seconds
[Native]Elapsed time:  0.020 seconds

Here is the code (modified so that it can be compiled on Unix):

#define BOOST_DISABLE_ASSERTS
#include <boost/multi_array.hpp>
#include <cstdio>  // printf
#include <ctime>   // clock, CLOCKS_PER_SEC

int main(int argc, char* argv[])
{
    const int X_SIZE = 200;
    const int Y_SIZE = 200;
    const int ITERATIONS = 500;
    unsigned int startTime = 0;
    unsigned int endTime = 0;

    // Create the boost array
    typedef boost::multi_array<double, 2> ImageArrayType;
    ImageArrayType boostMatrix(boost::extents[X_SIZE][Y_SIZE]);

    // Create the native array
    double *nativeMatrix = new double [X_SIZE * Y_SIZE];

    //------------------Measure boost----------------------------------------------
    startTime = clock();
    for (int i = 0; i < ITERATIONS; ++i)
    {
        for (int y = 0; y < Y_SIZE; ++y)
        {
            for (int x = 0; x < X_SIZE; ++x)
            {
                boostMatrix[x][y] = 2.345;
            }
        }
    }
    endTime = clock();
    printf("[Boost] Elapsed time: %6.3f seconds\n", (endTime - startTime) / (double)CLOCKS_PER_SEC);

    //------------------Measure native-----------------------------------------------
    startTime = clock();
    for (int i = 0; i < ITERATIONS; ++i)
    {
        for (int y = 0; y < Y_SIZE; ++y)
        {
            for (int x = 0; x < X_SIZE; ++x)
            {
                nativeMatrix[x + (y * X_SIZE)] = 2.345;
            }
        }
    }
    endTime = clock();
    printf("[Native]Elapsed time: %6.3f seconds\n", (endTime - startTime) / (double)CLOCKS_PER_SEC);

    return 0;
}

Answer 13

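Looking at the assembly generated by g++ 4.8.2 with -O3 -DBOOST_DISABLE_ASSERTS, using both the operator() and the [][] ways to access elements, it is evident that the only extra operation compared to native arrays with manual index calculation is the addition of a base. I didn't measure the cost of this, though.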

Answer 14

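I modified the above code in Visual Studio 2008 v9.0.21022 and applied the container routines from the Numerical Recipes routines for C and C++

http://www.nrbook.com/nr3/ using their licensed routines dmatrix and MatDoub respectively

dmatrix uses the out-of-date malloc syntax and is not recommended... MatDoub uses the new operator

The speed in seconds, in the release build:

Boost: 0.437

Native: 0.032

Numerical Recipes C: 0.031

Numerical Recipes C++: 0.031

So from the above, Blitz looks to be the best free alternative.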

Answer 15

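I compiled the code (with slight modifications) under VC++ 2010 with optimization turned on ("Maximize Speed" together with inlining "Any Suitable" functions and "Favor Fast Code") and got times of 0.015/0.391. I generated an assembly listing and, though I'm a terrible assembly noob, there is one line inside the boost-measuring loop that doesn't look good to me: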

call    ??A?$multi_array_ref@N$01@boost@@QAE?AV?$sub_array@N$00@multi_array@detail@1@H@Z ; boost::multi_array_ref<double,2>::operator[]

One of the [] operators didn't get inlined! The called procedure makes another call, this time to multi_array::value_accessor_n<...>::access<...>():

call    ??$access@V?$sub_array@N$00@multi_array@detail@boost@@PAN@?$value_accessor_n@N$01@multi_array@detail@boost@@IBE?AV?$sub_array@N$00@123@U?$type@V?$sub_array@N$00@multi_array@detail@boost@@@3@HPANPBIPBH3@Z ; boost::detail::multi_array::value_accessor_n<double,2>::access<boost::detail::multi_array::sub_array<double,1>,double *>

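Altogether, the two procedures are quite a lot of code for simply accessing a single element of the array. My general impression is that the library is so complex and high-level that Visual Studio is unable to optimize it as much as we would like (posters using gcc have apparently gotten better results).

IMHO, a good compiler really should have inlined and optimized the two procedures - both are pretty short and straightforward and don't contain any loops. A lot of time may be wasted simply on passing their arguments and results.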

Answer 16

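As rodrigob answered, activating the proper optimization (GCC's default is -O0) is the key to getting good performance. In addition, I also tested with Blaze DynamicMatrix, which yielded an additional factor-of-2 performance improvement with the exact same optimization flags. https://bitbucket.org/account/user/blaze-lib/projects/BLAZE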

Boost::multi_array performance question

16 answers: