How can I quickly compute the normalized l1 and l2 norms of a vector in C++?

Asked: 2016-12-30 07:57:43

Tags: c++ algorithm vector time-complexity

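I have a matrix X that holds n column data vectors in d-dimensional space. For a given vector xj, v[j] is its l1 norm (the sum of all abs(xji)), w[j] is the square of its l2 norm (the sum of all xji^2), and pj[i] is a combination of the entries divided by the l1 and l2 norms. In the end I need the outputs pj, v, and w for subsequent applications.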
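Written as formulas, this is a restatement of the definitions above, with $x_{ji}$ denoting the $i$-th entry of column $j$ and $\alpha$ the mixing weight `alpha` used in the code:

```latex
v_j = \sum_{i=1}^{d} |x_{ji}|, \qquad
w_j = \sum_{i=1}^{d} x_{ji}^2, \qquad
p_j[i] = \alpha \, \frac{|x_{ji}|}{v_j} + (1-\alpha) \, \frac{x_{ji}^2}{w_j}
```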

// X = new double [d*n]; is the input.
double alpha = 0.5;
double *pj = new double[d];
double *x_abs = new double[d];
double *x_2 = new double[d];
double *v = new double[n]();
double *w = new double[n]();
for (unsigned long j=0; j<n; ++j) {
        unsigned long jd = j*d;   // offset of column j inside X
        for (unsigned long i=0; i<d; ++i) {
            x_abs[i] = std::fabs(X[i+jd]);   // std::fabs from <cmath>
            v[j] += x_abs[i];
            x_2[i] = x_abs[i]*x_abs[i];
            w[j] += x_2[i];    
        }
        for (unsigned long i=0; i<d; ++i){
            pj[i] = alpha*x_abs[i]/v[j]+(1-alpha)*x_2[i]/w[j];     
        }

   // functionA(pj){ ... ...}  for subsequent applications
} 
// functionB(v, w){ ... ...} for subsequent applications

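My algorithm above takes O(nd) flops / time complexity. Can anyone help me speed it up, either with a built-in function or with a new implementation in C++? Reducing the constant factor in O(nd) would also be very helpful to me.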

1 answer:

Answer 0: (score: 1)

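Let me guess: since you are running into performance problems, the dimension of your vectors is quite large.
If that is the case, it is worth thinking about "CPU cache locality"; there is some interesting information on this in a cppcon14 presentation.
If the data is not already in the CPU caches, the cost of taking its absolute value or squaring it once it arrives is dwarfed by the time the CPU spends waiting for the data.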

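With this in mind, you may want to try the following solution. There is no guarantee that it will improve performance, since the compiler may already apply these techniques when it optimizes the code.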

for (unsigned long j=0; j<n; ++j) {
        // use pointer arithmetic - at > -O0 the compiler will do it anyway
        double *start=X+j*d, *end=X+(j+1)*d;

        // This part avoids, as much as possible, competition for the
        // CPU caches between X and v/w.
        // Don't store the norms in v/w yet; keep them in registers.
        double l1norm=0, l2norm=0;
        for(double *src=start; src!=end; src++) {
            double val=*src;
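            l1norm+=std::fabs(val);   // std::fabs from <cmath>
            l2norm+= val*val;
        }
        // divide by the norms once here, so the loop below only multiplies
        double pl1=alpha/l1norm, pl2=(1-alpha)/l2norm;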
        for(double *src=start, *dst=pj; src!=end; src++, dst++) {
          // Yes, recomputing abs/sqr may actually save time, because the
          // x_abs and x_2 buffers no longer compete for the CPU caches
          double val=*src;
          *dst = pl1*std::fabs(val) + pl2*val*val;
        }    
        // functionA(pj){ ... ...}  for subsequent applications

        // Think carefully about whether you really need v/w. If you do,
        // at least only two values are sent to memory per vector;
        // meanwhile the CPU can already load the next vector into cache.
        v[j]=l1norm; w[j]=l2norm;
}
// functionB(v, w){ ... ...} for subsequent applications
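
For reference, the same two-pass idea can be packaged as a standalone function that handles one column at a time. This is only a sketch under stated assumptions: the names `Norms` and `mixed_weights` are hypothetical (they do not appear in the question), and returning all-zero weights for an all-zero column is an added assumption, since the loops above would divide by zero in that case.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical result type (not from the original post).
struct Norms {
    double l1;    // sum of |x_i|
    double l2sq;  // sum of x_i^2 (squared l2 norm)
};

// Hypothetical helper: reads one column of length d starting at x,
// writes pj[i] = alpha*|x_i|/l1 + (1-alpha)*x_i^2/l2sq, and returns the norms.
Norms mixed_weights(const double* x, std::size_t d, double alpha, double* pj) {
    // First pass: accumulate both norms in local variables.
    double l1 = 0.0, l2sq = 0.0;
    for (std::size_t i = 0; i < d; ++i) {
        const double val = x[i];
        l1 += std::fabs(val);
        l2sq += val * val;
    }

    // Assumption: an all-zero column yields all-zero weights instead of NaN.
    const double s1 = (l1 > 0.0) ? alpha / l1 : 0.0;
    const double s2 = (l2sq > 0.0) ? (1.0 - alpha) / l2sq : 0.0;

    // Second pass: recompute |x_i| and x_i^2 rather than reading scratch buffers.
    for (std::size_t i = 0; i < d; ++i) {
        const double val = x[i];
        pj[i] = s1 * std::fabs(val) + s2 * val * val;
    }
    return {l1, l2sq};
}
```

Inside the outer loop this would be called as `Norms r = mixed_weights(X + j*d, d, alpha, pj);`, followed by `v[j] = r.l1; w[j] = r.l2sq;`. Whether it is actually faster than the original has to be measured on your data and compiler settings.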