Question

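So, I'm writing a math library that uses SSE intrinsics, for use with my OpenGL application. Right now I'm implementing some of the more important functions, such as lookAt, and using the glm library to check them for correctness, but for some reason my implementation of lookAt isn't working properly.

Here's the source code: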

inline void lookAt(__m128 position, __m128 target, __m128 up)
{
    /* Get the target vector relative to the camera position */
    __m128 t = vec4::normalize3(_mm_sub_ps(target, position));
    __m128 u = vec4::normalize3(up);
    /* Get the right vector by crossing target and up. */
    __m128 r = vec4::normalize3(vec4::cross(t, u));
    /* Correct the up vector by crossing right and target. */
    u = vec4::cross(r, t);
    /* Negate the target vector. */
    t = _mm_sub_ps(_mm_setzero_ps(), t);

    /* Treat the right, up, and target vector as a matrix, and transpose it. */
    /* Conveniently, this also sets the w component of all four to 0.0f */
    _MM_TRANSPOSE4_PS(r, u, t, _mm_setr_ps(0.0f, 0.0f, 0.0f, 1.0f));

    vec4 pos = _mm_sub_ps(_mm_setzero_ps(), position);
    pos.w = 1.0f;

    /* Multiply our matrix by the transposed vectors. */
    mat4 temp;
    temp.col0 = r;
    temp.col1 = u;
    temp.col2 = t;
    temp.col3 = _mm_setr_ps(0.0f, 0.0f, 0.0f, 1.0f);

    multiply(temp);
    translate(pos);
}

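My matrices are column-major, stored internally as "__m128 col0, col1, col2, col3;".

I wrote this after reading the gluLookAt man page. Once I realized that the right, up, and target vectors look very much like a row-major matrix, it was a simple matter of transposing them so they could be assigned to the rotation matrix.

The code for normalize3, in case it helps: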

inline static __m128 normalize3(const __m128& vec)
{
    __m128 v = _mm_mul_ps(vec, vec);
    v = _mm_add_ps(
        _mm_add_ps(
            _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 0, 0, 0)),
            _mm_shuffle_ps(v, v, _MM_SHUFFLE(1, 1, 1, 1))),
        _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 2, 2, 2)));

    return _mm_mul_ps(vec, _mm_rsqrt_ps(v));
}

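It saves a few calls by ignoring the w component of the vector.

What am I doing wrong?

Here's some sample output. Using position (5.0, 5.0, 0.0), target (10.0, 20.0, 55.0), and up (0.0, 1.0, 0.0), I get:

From GLM: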

[-0.9959] [0.0000] [0.0905] [4.9795]
[-0.0237] [0.9650] [-0.2610] [-4.7065]
[-0.0874] [-0.2621] [-0.9611] [1.7474]
[0.0000] [0.0000] [0.0000] [1.0000]

From my lookAt():

[-0.9959] [0.0000] [0.0905] [-5.0000]
[-0.0237] [0.9651] [-0.2610] [-5.0000]
[-0.0874] [-0.2621] [-0.9611] [0.0000]
[0.0000] [0.0000] [0.0000] [1.0000]

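It seems the only real difference is in the last column (the translation), but I'm honestly not sure which of the two is correct. I'm inclined to say GLM's is, since it is designed to behave the same as the glu version.

EDIT: I found something interesting. If I call "translate(pos);" before calling "multiply(temp);", the matrix I get is exactly the same as glm's. Which one is right? According to the OpenGL man page for gluLookAt, this (and glm) is doing it backwards. Was I doing it correctly before, or is it correct now?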

Answer 1

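One possible problem is _mm_rsqrt_ps(v). It is not very accurate. Replace it with _mm_div_ps(_mm_set1_ps(1.0f), _mm_sqrt_ps(v)). If that fixes the problem, you can win the speed back with some kind of root polishing; see "Newton Raphson with SSE2 - can someone explain me these 3 lines".

Another suggestion: you can make your function more SIMD-friendly by not doing horizontal operations (which is what your normalize function does). Instead of normalizing the vectors before the transpose, do the transpose first. That takes you from (x, y, z, w) to (x, x, x, x), (y, y, y, y), (z, z, z, z), (w, w, w, w), that is, from an array of structs (AoS) to a struct of arrays (SoA). Then you only need to compute 1.0f / sqrt(r*r + u*u + t*t) to normalize.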
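To make the two options concrete, here is a minimal sketch of both: a full-precision normalize that uses a real square root and divide, and a faster one that refines the _mm_rsqrt_ps estimate with a single Newton-Raphson step. Intel documents the maximum relative error of _mm_rsqrt_ps as 1.5 × 2^-12, which is why results can differ around the fourth decimal place. The names lengthSq3, normalize3Exact, normalize3Refined and length3 are made up for this illustration and are not part of the asker's library.

```cpp
#include <xmmintrin.h>  // SSE intrinsics
#include <cmath>        // std::sqrt for the scalar check helper

// Sum of squares of x, y, z, broadcast to all four lanes (w is ignored).
static inline __m128 lengthSq3(__m128 v)
{
    __m128 sq = _mm_mul_ps(v, v);
    return _mm_add_ps(
        _mm_add_ps(_mm_shuffle_ps(sq, sq, _MM_SHUFFLE(0, 0, 0, 0)),
                   _mm_shuffle_ps(sq, sq, _MM_SHUFFLE(1, 1, 1, 1))),
        _mm_shuffle_ps(sq, sq, _MM_SHUFFLE(2, 2, 2, 2)));
}

// Full-precision variant: 1 / sqrt(x) via a real square root and a divide.
static inline __m128 normalize3Exact(__m128 v)
{
    __m128 x = lengthSq3(v);
    return _mm_mul_ps(v, _mm_div_ps(_mm_set1_ps(1.0f), _mm_sqrt_ps(x)));
}

// Faster variant: start from the rsqrt estimate y0 and apply one
// Newton-Raphson step, y1 = y0 * (1.5 - 0.5 * x * y0 * y0).
static inline __m128 normalize3Refined(__m128 v)
{
    __m128 x       = lengthSq3(v);
    __m128 y0      = _mm_rsqrt_ps(x);
    __m128 halfXy0 = _mm_mul_ps(_mm_mul_ps(_mm_set1_ps(0.5f), x), y0);
    __m128 y1      = _mm_mul_ps(
        y0, _mm_sub_ps(_mm_set1_ps(1.5f), _mm_mul_ps(halfXy0, y0)));
    return _mm_mul_ps(v, y1);
}

// Scalar helper used only to check the result: length of the xyz part.
static inline float length3(__m128 v)
{
    float f[4];
    _mm_storeu_ps(f, v);
    return std::sqrt(f[0] * f[0] + f[1] * f[1] + f[2] * f[2]);
}
```

Both variants should return a vector whose xyz length is 1 to within float rounding; whether the refined version is actually faster than sqrt plus divide depends on the CPU, so it is worth measuring.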

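__m128 t = _mm_sub_ps(target, position);
__m128 u = up;
__m128 r = vec4::cross(t, u);
u = vec4::cross(r, t);
t = _mm_sub_ps(_mm_setzero_ps(), t);
// The macro assigns to all four arguments, so pass a named variable for the last row.
__m128 w = _mm_setr_ps(0.0f, 0.0f, 0.0f, 1.0f);
_MM_TRANSPOSE4_PS(r, u, t, w);  //AoS to SoA

//now normalize
__m128 den = _mm_add_ps(_mm_add_ps(_mm_mul_ps(r,r),_mm_mul_ps(u,u)), _mm_mul_ps(t,t));
// Caution: the w lane of den is 0 here, so the w lane of norm is inf and
// 0 * inf yields NaN in the w lane of r, u and t; overwrite or mask it if it is used.
__m128 norm = _mm_div_ps(_mm_set1_ps(1.0f), _mm_sqrt_ps(den));
r = _mm_mul_ps(norm,r); u = _mm_mul_ps(norm,u); t = _mm_mul_ps(norm,t);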

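Note that norm is not a single scalar. It holds four different normalization factors (n1, n2, n3, n4), so norm * r = (n1*x1, n2*x2, n3*x3, n4*x4). For an efficient way to do matrix multiplication with SSE, see this link: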

Efficient 4x4 matrix vector multiplication with SSE: horizontal add and dot product - what's the point?

Answer 2

I figured out the problem. My multiply function was multiplying the matrices in the wrong order.

My SSE implementation of lookAt doesn't work
