Question

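As a personal project, I've been developing a simple 2D game engine with real-time collision physics in C++. My collisions are handled by calculating the time to collision between each unique pair of objects. To do this, I've built my own contiguous 2D matrix class on top of std::vector<float> to store these collision times.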
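For readers who want something concrete, a minimal sketch of such a class might look like the following. The class name Matrix2dSketch, the row-major layout, and setElementValue are assumptions made for illustration; only the method names addConstValue, getElementValue, and getMatrixMin appear in the profiles in this post, and the real class may well differ.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the matrix class described above.
// Assumption: elements live in one contiguous std::vector<float>, row-major.
class Matrix2dSketch
{
public:
    Matrix2dSketch(int rows, int cols, float initial = 0.0f)
        : rows_(rows), cols_(cols),
          data_(static_cast<std::size_t>(rows) * static_cast<std::size_t>(cols), initial) {}

    int rows() const { return rows_; }
    int cols() const { return cols_; }

    // Bounds-checked element access, mirroring the use of at() in this post.
    float getElementValue(int row, int col) const { return data_.at(index(row, col)); }
    void setElementValue(int row, int col, float value) { data_.at(index(row, col)) = value; }

    // Adds the same constant to every stored element.
    void addConstValue(float value)
    {
        for (float& element : data_)
            element += value;
    }

    // Smallest stored value; assumes the matrix is not empty.
    float getMatrixMin() const { return *std::min_element(data_.begin(), data_.end()); }

private:
    // Maps a (row, col) pair to an offset in the flat vector.
    std::size_t index(int row, int col) const
    {
        return static_cast<std::size_t>(row) * static_cast<std::size_t>(cols_)
             + static_cast<std::size_t>(col);
    }

    int rows_;
    int cols_;
    std::vector<float> data_;
};
```

Because every element sits in one flat vector, adding a constant to the whole matrix is a single linear pass over memory.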

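Part of my main physics loop involves adding a constant to every element of the collision matrix, which is done by calling Matrix2d::addConstValue(float). For some reason, on some systems gprof reports this function as using a massive amount of CPU time. As a result, the program runs considerably slower on those systems than on others. For example, on a good system, a large number of simultaneous collisions causes a small dip in frame rate. On a bad system, the exact same set of collisions drops the frame rate into single figures and slows the simulation down significantly.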

These are the systems I've run the program on:

PC1:

OS: Windows 7
CPU: AMD Phenom II x4 960T
GPU: AMD Radeon HD6850
RAM: 8GB
Program performance: Good

PC2:

OS: Windows 10
CPU: Intel i5 2500K
GPU: AMD Radeon HD7970
RAM: 8GB
Program Performance: Bad

PC3 (laptop):

OS: Windows 10 + Xubuntu 16.04 (Dual boot)
CPU: Intel i5 5600u
GPU: Intel HD5000
RAM: 12GB
Program Performance: Good in Xubuntu, bad in Windows 10

PC4:

OS: Windows 10
CPU: AMD FX-6300
GPU: nVidia GTX 970
RAM: 8GB
Program Performance: Good

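I would have expected PC2 to outperform PC1, but PC2 reports far higher CPU usage from calls to the matrix function mentioned above. Here are the gprof results for PC1 and PC2.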

PC1:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 14.44      0.66     0.66 81222460     0.00     0.00  Ball::getDistance(Ball&)
 12.47      1.23     0.57 319194829     0.00     0.00  sfVectorMath::dot(sf::Vector2<float>, sf::Vector2<float>)
 12.47      1.80     0.57 55453088     0.00     0.00  Collisions::timeToCollision(Ball&, Ball&)
 11.16      2.31     0.51 81222460     0.00     0.00  Ball::getGPE(Ball&)
  6.78      2.62     0.31 153865899     0.00     0.00  Matrix2d::getElementValue(int, int)

PC2:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 77.83     23.49    23.49     8332     0.00     0.00  Matrix2d::addConstValue(float)
  7.59     25.78     2.29                             _mcount_private
  4.67     27.19     1.41 40603954     0.00     0.00  Collisions::timeToCollision(Ball&, Ball&)
  1.29     27.58     0.39                             pow
  1.19     27.94     0.36    11466     0.00     0.00  Matrix2d::getMatrixMin()
  0.99     28.24     0.30 206105049     0.00     0.00  sfVectorMath::dot(sf::Vector2<float>, sf::Vector2<float>)
  0.93     28.52     0.28                             internal_modf
  0.83     28.77     0.25 122492898     0.00     0.00  Matrix2d::getElementValue(int, int)

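I'm really at a loss as to what's happening. Some other details: both the Linux and Windows versions were compiled with GCC 6.1.0 and SFML 2.4.2. Compiling natively on Windows 10 makes no difference to the performance.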

Edit: Also, here is the implementation of addConstValue:

void Matrix2d::addConstValue(float value)
{
    for(unsigned int i=0; i<matrix.size(); ++i)
        matrix.at(i) += value;
}

Answer 1

TL;DR: Don't store NaNs in a vector, and certainly don't try to read them! Also try to avoid doing arithmetic on NaNs, just in case.

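I tested the performance of my matrix class by setting up a 242*242 matrix and filling it with either zeros or std::numeric_limits<float>::quiet_NaN(). I then ran addConstValue(float) on the matrix. Listed below is the average time taken per call. 50000 calls were made when the matrix was filled with zeros, and 500 when it was filled with NaNs: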

W10 2500k, filled with zeros: 34.54µs
W10 2500k, filled with NaNs: 6121.64µs
W7 960T, filled with zeros: 52.73µs
W7 960T, filled with NaNs: 62.4µs
W10 i5 5600u, filled with zeros: 27.50µs
W10 i5 5600u, filled with NaNs: 7062.63µs
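A measurement of this kind can be sketched roughly as below. This is not the exact harness used for the numbers above: the helper names makeFilledMatrix and averageAddMicroseconds are hypothetical, a plain std::vector<float> stands in for the matrix class, and std::chrono::steady_clock is an assumed choice of timer.

```cpp
#include <chrono>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical helper: a side*side matrix stored flat, filled with zeros or quiet NaNs.
std::vector<float> makeFilledMatrix(int side, bool fillWithNaN)
{
    const float fill = fillWithNaN ? std::numeric_limits<float>::quiet_NaN() : 0.0f;
    return std::vector<float>(static_cast<std::size_t>(side) * static_cast<std::size_t>(side), fill);
}

// Hypothetical helper: average time in microseconds for one pass that adds
// `value` to every element, measured over `calls` passes.
double averageAddMicroseconds(std::vector<float>& matrix, float value, int calls)
{
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    for (int call = 0; call < calls; ++call)
        for (std::size_t i = 0; i < matrix.size(); ++i)
            matrix[i] += value;
    const auto stop = clock::now();
    const std::chrono::duration<double, std::micro> elapsed = stop - start;
    return elapsed.count() / calls;
}
```

Calling averageAddMicroseconds once on makeFilledMatrix(242, false) and once on makeFilledMatrix(242, true) gives the two figures to compare on a given machine. The absolute values depend on compiler flags and hardware.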

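So it's clear that operating on NaNs is roughly 200 times slower on PC2 and PC3. Oddly, this bottleneck isn't present on the AMD machine. I then added a quick check inside addConstValue(float), using std::isnan(), to see whether each vector element is a NaN. Below are the execution times per call: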

W10 2500k, filled with zeros: 70.05µs
W10 2500k, filled with NaNs: 70.05µs
W10 i5 5600u, filled with zeros: 93.75µs
W10 i5 5600u, filled with NaNs: 62.50µs

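This at least doubles the execution time for matrices filled with zeros (34.54µs to 70.05µs on the 2500k), but cuts the time for matrices filled with NaNs by about two orders of magnitude.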
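The post does not show the modified function, so the following is only a sketch of what a std::isnan() check of this kind could look like. The free function name addConstValueSkippingNaN is hypothetical, and it takes the vector as a parameter instead of being a member of the matrix class.

```cpp
#include <cmath>
#include <vector>

// Hypothetical variant of addConstValue: NaN elements are left untouched,
// so no arithmetic is ever performed on a NaN.
void addConstValueSkippingNaN(std::vector<float>& matrix, float value)
{
    for (float& element : matrix)
    {
        if (!std::isnan(element))
            element += value;
    }
}
```

Since NaN plus any constant is still NaN, skipping those elements leaves the stored values exactly as the unchecked loop would. The trade-off is one extra test per element.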

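To investigate further, I set up one loop that adds a constant float to a bare NaN, and another that adds it to an std::vector<float> containing a single NaN, each running ten million times. This is the program: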

#include <iostream>
#include <limits>
#include <chrono>
#include <vector>

using namespace std;
using namespace std::chrono;

int main()
{
    float nan = std::numeric_limits<float>::quiet_NaN();
    std::vector<float> nanvec = {nan};

    int noPasses = 10000000;

    high_resolution_clock::time_point t1 = high_resolution_clock::now();

    for(int i=0; i<noPasses; ++i)
        nan += -1.0f;

    high_resolution_clock::time_point t2 = high_resolution_clock::now();
    auto duration = duration_cast<microseconds>( t2 - t1 ).count();
    cout << "Bare float NaN: " << duration << " microseconds\n" ;


    t1 = high_resolution_clock::now();

    for(int i=0; i<noPasses; ++i)
        nanvec[0] += -1.0f;

    t2 = high_resolution_clock::now();
    duration = duration_cast<microseconds>( t2 - t1 ).count();
    cout << "Vector NaN: " << duration << " microseconds\n" ;

    return 0;
}

My output (W10, i5 2500k):

Bare float NaN: 0 microseconds
Vector NaN: 1122833 microseconds

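So it looks like the CPU knows to ignore operations on a bare NaN. Is it possible that retrieving NaNs from a container is what causes the long execution times? I also still don't know why this problem only occurs on some systems.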
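One caveat about the 0 microsecond figure: it may simply mean the compiler removed the bare-float loop because its result is never used. That is an assumption, since the build flags are not stated here. A minimal sketch of a way to check is to route the bare float through a volatile variable, so every addition has to be performed. The helper name timeBareNaNAdds is hypothetical.

```cpp
#include <chrono>
#include <limits>

// Hypothetical helper: times `passes` additions onto a NaN held in a volatile
// float. The volatile qualifier forces a real load, add, and store on each
// pass, so the loop cannot be optimized away as dead code.
long long timeBareNaNAdds(int passes)
{
    volatile float value = std::numeric_limits<float>::quiet_NaN();

    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < passes; ++i)
        value = value + -1.0f;
    const auto stop = std::chrono::steady_clock::now();

    return std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
}
```

If timeBareNaNAdds(10000000) turns out to be as slow as the vector loop, the difference lies in the NaN arithmetic itself and not in reading from the container.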

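Anyway, I've integrated the quick NaN check into my game engine, and the speedup is incredible. There is no longer any bottleneck associated with pulling NaNs out of the vector (checked with gprof). I'll probably try to find a way to avoid needing the check at all, for the sake of the extra 50% performance per call.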

std::vector operations slow on some systems