Question

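The following code enlarges an image using bilinear interpolation.

What can be changed in the slow_rescale function to make it more efficient?

I would like to modify it from the perspective of computer organization principles.

I look forward to your answers!

Thanks!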

unsigned char *slow_rescale(unsigned char *src, int src_x, int src_y, int dest_x, int dest_y)
{
 double step_x,step_y;          // Step increase as per instructions above
 unsigned char R1,R2,R3,R4;     // Colours at the four neighbours
 unsigned char G1,G2,G3,G4;
 unsigned char B1,B2,B3,B4;
 double RT1, GT1, BT1;          // Interpolated colours at T1 and T2
 double RT2, GT2, BT2;
 unsigned char R,G,B;           // Final colour at a destination pixel
 unsigned char *dst;            // Destination image - must be allocated here! 
 int x,y;               // Coordinates on destination image
 double fx,fy;              // Corresponding coordinates on source image
 double dx,dy;              // Fractional component of source image    coordinates

 dst=(unsigned char *)calloc(dest_x*dest_y*3,sizeof(unsigned char));   // Allocate and clear   destination image
 if (!dst) return(NULL);                           // Unable to allocate image

 step_x=(double)(src_x-1)/(double)(dest_x-1);
 step_y=(double)(src_y-1)/(double)(dest_y-1);

 for (x=0;x<dest_x;x++)         // Loop over destination image
  for (y=0;y<dest_y;y++)
  {
    fx=x*step_x;
    fy=y*step_y;
    dx=fx-(int)fx;
    dy=fy-(int)fy;   
    getPixel(src,floor(fx),floor(fy),src_x,&R1,&G1,&B1);    // get N1 colours
    getPixel(src,ceil(fx),floor(fy),src_x,&R2,&G2,&B2); // get N2 colours
    getPixel(src,floor(fx),ceil(fy),src_x,&R3,&G3,&B3); // get N3 colours
    getPixel(src,ceil(fx),ceil(fy),src_x,&R4,&G4,&B4);  // get N4 colours
   // Interpolate to get T1 and T2 colours
   RT1=(dx*R2)+(1-dx)*R1;
   GT1=(dx*G2)+(1-dx)*G1;
   BT1=(dx*B2)+(1-dx)*B1;
   RT2=(dx*R4)+(1-dx)*R3;
   GT2=(dx*G4)+(1-dx)*G3;
   BT2=(dx*B4)+(1-dx)*B3;
   // Obtain final colour by interpolating between T1 and T2
   R=(unsigned char)((dy*RT2)+((1-dy)*RT1));
   G=(unsigned char)((dy*GT2)+((1-dy)*GT1));
   B=(unsigned char)((dy*BT2)+((1-dy)*BT1));
  // Store the final colour
  setPixel(dst,x,y,dest_x,R,G,B);
 }
  return(dst);
}
void getPixel(unsigned char *image, int x, int y, int sx, unsigned char *R, unsigned char *G, unsigned char *B)
{
 // Get the colour at pixel x,y in the image and return it using the provided RGB pointers
 // Requires the image size along the x direction!
 *(R)=*(image+((x+(y*sx))*3)+0);
 *(G)=*(image+((x+(y*sx))*3)+1);
 *(B)=*(image+((x+(y*sx))*3)+2);
}

void setPixel(unsigned char *image, int x, int y, int sx, unsigned char R, unsigned char G, unsigned char B)
{
 // Set the colour of the pixel at x,y in the image to the specified R,G,B
 // Requires the image size along the x direction!
 *(image+((x+(y*sx))*3)+0)=R;
 *(image+((x+(y*sx))*3)+1)=G;
 *(image+((x+(y*sx))*3)+2)=B;
}

Answer 1

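Image processing performance is something I have always cared about. Here are some obvious considerations to keep in mind:

Numerical precision:

The first thing that jumps out from the code is the use of doubles for the step sizes, colour values, and coordinates. Do you really need that much precision for these quantities? If not, you could do some profiling to check how the code performs with fixed-point or single-precision floating-point numbers instead.

Keep in mind that this is a hardware-dependent question. Performance may or may not be an issue, depending on whether your hardware implements doubles, only floats, or neither (in which case both are implemented in software). Discussion along these lines also covers memory alignment, coalesced memory access, and so on. These topics certainly touch on "computer organization principles", and more has been written about them elsewhere.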
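
To make the precision question concrete, here is a minimal sketch that interpolates one colour channel both with a double weight and with an 8-bit fixed-point weight. The function names lerp_double and lerp_fixed8 and the choice of 8 fractional bits are assumptions made for this illustration; they do not appear in the question's code.

```c
#include <assert.h>

/* Reference version: weight t is a double in [0,1]. */
unsigned char lerp_double(unsigned char a, unsigned char b, double t)
{
    assert(t >= 0.0 && t <= 1.0);
    return (unsigned char)(a + t * (b - a));
}

/* Fixed-point version: weight w = t*256 is an integer in [0,256]. */
unsigned char lerp_fixed8(unsigned char a, unsigned char b, int w)
{
    assert(w >= 0 && w <= 256);
    /* a*(256-w) + b*w is never negative and is at most 255*256,
       so it fits in an int and the shift is well defined. */
    return (unsigned char)((a * (256 - w) + b * w) >> 8);
}
```

For weights that are exact multiples of 1/256 the two versions should agree. For arbitrary weights the fixed-point result can be off by one colour level, which is often acceptable for 8-bit images, but that is a judgment to make for your own data.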

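Loop unrolling:

Have you also considered manual loop unrolling? It may or may not help, since your compiler may already be attempting this optimization, but it is at least worth considering because you have a double loop over potentially large array sizes.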
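
As an illustration of what manual unrolling looks like, here is a sketch that blends two rows of bytes (the vertical half of a bilinear interpolation) four bytes per iteration. The function blend_rows_unrolled, the unroll factor of 4, and the 8-bit integer weight are all assumptions chosen for this example, not part of the question's code.

```c
#include <assert.h>
#include <stddef.h>

/* out[i] = (top[i]*(256-w) + bottom[i]*w) / 256, with w in [0,256]. */
void blend_rows_unrolled(const unsigned char *top, const unsigned char *bottom,
                         unsigned char *out, size_t n, int w)
{
    assert(w >= 0 && w <= 256);
    int iw = 256 - w;
    size_t i = 0;

    /* Main loop: four bytes per iteration, so the loop test and the
       increment are executed n/4 times instead of n times. */
    for (; i + 4 <= n; i += 4)
    {
        out[i]     = (unsigned char)((top[i]     * iw + bottom[i]     * w) >> 8);
        out[i + 1] = (unsigned char)((top[i + 1] * iw + bottom[i + 1] * w) >> 8);
        out[i + 2] = (unsigned char)((top[i + 2] * iw + bottom[i + 2] * w) >> 8);
        out[i + 3] = (unsigned char)((top[i + 3] * iw + bottom[i + 3] * w) >> 8);
    }

    /* Remainder loop for the last n % 4 bytes. */
    for (; i < n; i++)
        out[i] = (unsigned char)((top[i] * iw + bottom[i] * w) >> 8);
}
```

Whether this beats the plain loop depends on the compiler and flags, so it is worth timing both versions before keeping the unrolled one.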

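Numerical redundancy:

In your getPixel() function you also compute image+((x+(y*sx))*3) for each RGB component, and it does not appear to change. Why not compute this quantity once at the start of the function?

Vector processing:

It is hard to think about optimizing code like this without first wondering whether you can take advantage of vector processing. Do you have access to a vectorized instruction set, such as SSE?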

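Parallel processing:

Most systems have OpenMP installed. If yours does, you could consider restructuring the code to take advantage of your processor's multiple cores. This is quite simple to implement using pragmas, and it is certainly worth a try.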
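
A minimal sketch of the pragma approach, assuming packed 8-bit RGB and a destination buffer allocated by the caller. The function name rescale_omp and its signature are made up for this example. It also loops over rows in the outer loop, which is my own choice so that each thread writes a contiguous block of memory.

```c
#include <assert.h>

/* Bilinear rescale of a packed RGB image; dst must hold dest_x*dest_y*3 bytes. */
void rescale_omp(const unsigned char *src, int src_x, int src_y,
                 unsigned char *dst, int dest_x, int dest_y)
{
    assert(src_x > 0 && src_y > 0 && dest_x > 1 && dest_y > 1);
    double step_x = (double)(src_x - 1) / (dest_x - 1);
    double step_y = (double)(src_y - 1) / (dest_y - 1);

    /* Destination rows are independent, so they can be split across threads.
       If OpenMP is not enabled (e.g. gcc without -fopenmp) the pragma is
       ignored and the loop simply runs serially. */
    #pragma omp parallel for
    for (int y = 0; y < dest_y; y++)
    {
        double fy = y * step_y;
        int y0 = (int)fy;
        int y1 = (y0 + 1 < src_y) ? y0 + 1 : y0;   /* clamp at the bottom edge */
        double dy = fy - y0;

        for (int x = 0; x < dest_x; x++)
        {
            double fx = x * step_x;
            int x0 = (int)fx;
            int x1 = (x0 + 1 < src_x) ? x0 + 1 : x0; /* clamp at the right edge */
            double dx = fx - x0;

            const unsigned char *p00 = src + (x0 + y0 * src_x) * 3;
            const unsigned char *p10 = src + (x1 + y0 * src_x) * 3;
            const unsigned char *p01 = src + (x0 + y1 * src_x) * 3;
            const unsigned char *p11 = src + (x1 + y1 * src_x) * 3;
            unsigned char *q = dst + (x + y * dest_x) * 3;

            for (int c = 0; c < 3; c++)
            {
                double t1 = p00[c] + dx * (p10[c] - p00[c]);
                double t2 = p01[c] + dx * (p11[c] - p01[c]);
                q[c] = (unsigned char)(t1 + dy * (t2 - t1));
            }
        }
    }
}
```

With gcc this would be built with something like gcc -std=c99 -O3 -fopenmp. For small images the cost of starting threads can outweigh the gain, so measure on realistic sizes.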

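Compiler flags:

Also, although you did not mention it directly, compilation flags affect the performance of C code. For example, if you are using gcc, you can compare the performance difference between: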

gcc -std=c99 -o main main.c

versus

gcc -std=c99 -O3 -o main main.c

Answer 2

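Here are some ideas:

Use fixed-point arithmetic instead of floating point. This makes the floor and ceil computations faster (and possibly the multiplications too, but I am not sure).
Replace ceil(x) with floor(x)+1 (the two differ only when x is an integer, where the extra neighbour has weight 0, provided the index stays inside the image)
Use strength reduction to replace the multiplication in fx=x*step_x with an addition
If you know the layout of the pixels in memory, replace getPixel with something more efficient
Reduce two multiplications to one with the following transformation: (dx*R2)+(1-dx)*R1 ==> R1+dx*(R2-R1)
Unroll the inner loop
(Last, but possibly with the most potential) use a vectorizing compiler, or hand-edit the code to use SSE or other SIMD techniques (if available on your platform)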
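
Several of the ideas in this list can be combined in one routine. The following is a sketch under stated assumptions, not a tuned implementation: it assumes packed 8-bit RGB, 16.16 fixed point, and image dimensions of at most 16384 so that the coordinates fit in an int. The names fixed_rescale, lerp_fp, FP_SHIFT and FP_ONE are invented for this example. Unrolling and SIMD are not shown.

```c
#include <assert.h>
#include <stdlib.h>

#define FP_SHIFT 16
#define FP_ONE   (1 << FP_SHIFT)

/* One-multiply interpolation a + w*(b-a), with w a 16.16 fraction in [0,FP_ONE). */
static int lerp_fp(int a, int b, int w)
{
    /* (a << FP_SHIFT) + (b - a) * w equals a*(FP_ONE-w) + b*w, which is never
       negative for a, b in [0,255], so the right shift is well defined. */
    return ((a << FP_SHIFT) + (b - a) * w) >> FP_SHIFT;
}

unsigned char *fixed_rescale(const unsigned char *src, int src_x, int src_y,
                             int dest_x, int dest_y)
{
    assert(src_x > 0 && src_y > 0 && dest_x > 1 && dest_y > 1);
    assert(src_x <= 16384 && src_y <= 16384 && dest_x <= 16384 && dest_y <= 16384);

    unsigned char *dst = malloc((size_t)dest_x * dest_y * 3);
    if (!dst) return NULL;

    /* Source step per destination pixel in 16.16 fixed point, rounded up so
       that the last destination pixel lands on the last source pixel. */
    int step_x = (int)((((long long)(src_x - 1) << FP_SHIFT) + dest_x - 2) / (dest_x - 1));
    int step_y = (int)((((long long)(src_y - 1) << FP_SHIFT) + dest_y - 2) / (dest_y - 1));

    int fy = 0;
    for (int y = 0; y < dest_y; y++, fy += step_y)   /* strength reduction: add instead of multiply */
    {
        int y0 = fy >> FP_SHIFT;                     /* floor(fy) */
        int y1 = (y0 + 1 < src_y) ? y0 + 1 : y0;     /* floor+1 instead of ceil, clamped */
        int wy = fy & (FP_ONE - 1);                  /* fractional part of fy */
        const unsigned char *row0 = src + (size_t)y0 * src_x * 3;
        const unsigned char *row1 = src + (size_t)y1 * src_x * 3;
        unsigned char *out = dst + (size_t)y * dest_x * 3;

        int fx = 0;
        for (int x = 0; x < dest_x; x++, fx += step_x)
        {
            int x0 = fx >> FP_SHIFT;
            int x1 = (x0 + 1 < src_x) ? x0 + 1 : x0;
            int wx = fx & (FP_ONE - 1);

            for (int c = 0; c < 3; c++)
            {
                /* Direct indexing replaces the getPixel/setPixel calls. */
                int t1 = lerp_fp(row0[x0 * 3 + c], row0[x1 * 3 + c], wx);
                int t2 = lerp_fp(row1[x0 * 3 + c], row1[x1 * 3 + c], wx);
                *out++ = (unsigned char)lerp_fp(t1, t2, wy);
            }
        }
    }
    return dst;
}
```

The caller owns the returned buffer and should free() it. Because the intermediate results are truncated to integers, the output can differ slightly from the double-precision version.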

Answer 3

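The number of multiplications in this code can be reduced considerably.

dx can be computed in the outer loop, and we can prepare a multiplication table for the subsequent operations such as RT1=(dx*R2)+(1-dx)*R1, because the multiplicands (R2, R1, etc.) are only 1 byte in size.

The following code runs ~10 times faster than the original code on my machine (Mac OS, Mac C++ compiler with -O3):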

#include <stdio.h>
#include <math.h>
#include <stdlib.h>

static inline void fast_getPixel(unsigned char *image, int x, int y, int sx, unsigned char *R, unsigned char *G, unsigned char *B)
{
    // Get the colour at pixel x,y in the image and return it using the provided RGB pointers
    // Requires the image size along the x direction!
    unsigned char *ptr = image+((x+(y*sx))*3);
    *R=ptr[0];
    *G=ptr[1];
    *B=ptr[2];
}

static inline void fast_setPixel(unsigned char *image, int x, int y, int sx, unsigned char R, unsigned char G, unsigned char B)
{
    // Set the colour of the pixel at x,y in the image to the specified R,G,B
    // Requires the image size along the x direction!
    unsigned char *ptr = image+((x+(y*sx))*3);
    ptr[0]=R;
    ptr[1]=G;
    ptr[2]=B;
}

// Products d*dx for every possible difference d of two byte values, d in [-255,255].
// The entry for d is stored at table[d+DX_TABLE_MID].
#define DX_TABLE_MID 255
#define DX_TABLE_LEN (2*DX_TABLE_MID+1)

void build_dx_table(double* table,double dx)
{
    table[DX_TABLE_MID] = 0;
    for (int i=1;i<=DX_TABLE_MID;i++)
    {
        table[DX_TABLE_MID+i] = table[DX_TABLE_MID+i-1]+dx;   // i*dx by repeated addition
        table[DX_TABLE_MID-i] = -table[DX_TABLE_MID+i];       // negative differences
    }
}

unsigned char *fast_rescale(unsigned char *src, int src_x, int src_y, int dest_x, int dest_y)
{
    double step_x,step_y;          // Step increase as per instructions above
    unsigned char R1,R2,R3,R4;     // Colours at the four neighbours
    unsigned char G1,G2,G3,G4;
    unsigned char B1,B2,B3,B4;
    double RT1, GT1, BT1;          // Interpolated colours at T1 and T2
    double RT2, GT2, BT2;
    unsigned char R,G,B;           // Final colour at a destination pixel
    unsigned char *dst;            // Destination image - must be allocated here!
    int x,y;                       // Coordinates on destination image
    double fx,fy;                  // Corresponding coordinates on source image
    double dx,dy;                  // Fractional component of source image coordinates
    double dxtable[DX_TABLE_LEN];
    const double *dxt = dxtable+DX_TABLE_MID;   // dxt[d] is valid for d in [-255,255]

    dst=(unsigned char *)calloc(dest_x*dest_y*3,sizeof(unsigned char));   // Allocate and clear destination image
    if (!dst) return(NULL);                           // Unable to allocate image

    step_x=(double)(src_x-1)/(double)(dest_x-1);
    step_y=(double)(src_y-1)/(double)(dest_y-1);

    for (x=0,fx=0;x<dest_x;x++,fx+=step_x)         // Loop over destination image
    {
        dx=fx-(int)fx;
        build_dx_table(dxtable,dx);                // dx is constant for this x, so build the table once
        for (y=0,fy=0;y<dest_y;y++,fy+=step_y)
        {
            dy=fy-(int)fy;
            fast_getPixel(src,floor(fx),floor(fy),src_x,&R1,&G1,&B1);    // get N1 colours
            fast_getPixel(src,ceil(fx),floor(fy),src_x,&R2,&G2,&B2); // get N2 colours
            fast_getPixel(src,floor(fx),ceil(fy),src_x,&R3,&G3,&B3); // get N3 colours
            fast_getPixel(src,ceil(fx),ceil(fy),src_x,&R4,&G4,&B4);  // get N4 colours
            // Interpolate to get T1 and T2 colours (table lookup replaces dx*(R2-R1) etc.)
            RT1=dxt[R2-R1]+R1;
            GT1=dxt[G2-G1]+G1;
            BT1=dxt[B2-B1]+B1;
            RT2=dxt[R4-R3]+R3;
            GT2=dxt[G4-G3]+G3;
            BT2=dxt[B4-B3]+B3;
            // Obtain final colour by interpolating between T1 and T2
            R=(unsigned char)(dy*(RT2-RT1)+RT1);
            G=(unsigned char)(dy*(GT2-GT1)+GT1);
            B=(unsigned char)(dy*(BT2-BT1)+BT1);
            // Store the final colour
            fast_setPixel(dst,x,y,dest_x,R,G,B);
        }
    }
    return(dst);
}

Answer 4

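GPUs have hardware that does bilinear interpolation for you. Doing it on the CPU is like doing floating-point arithmetic in software instead of using floating-point hardware (e.g. x87 or SSE/AVX). My best advice is to consider optimizing algorithms such as bicubic interpolation or general image filters, which may give better visual results and which most GPUs do not support in hardware. Graphics Gems, although old, has a good section on "General Filtered Image Rescaling", covering both magnification and minification.

However, if you still want to do bilinear interpolation on the CPU, you should consider hardware acceleration on the CPU. In that case I would look at SIMD. See the link bilinear-pixel-interpolation-using-sse, which shows how to do bilinear interpolation with SSE. I tested that code, and the SSE version is much faster. You can combine it with OpenMP to use multiple threads on large images.

I also tested fixed-point code and found that it gave better results than the non-SSE code with MSVC2010, but not with MSVC2012. I expect that for most modern compilers fixed-point code will not be better, unless it is running on an embedded system without floating-point hardware.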

How can I make the following bilinear interpolation code more efficient?
