Question

I am using RenderScript to apply a Gaussian blur to an image, but no matter what I do, ScriptIntrinsicBlur is faster. Why is that? Does ScriptIntrinsicBlur use a different method? Here is my RS code:

#pragma version(1)
#pragma rs java_package_name(top.deepcolor.rsimage.utils)

//Gaussian blur algorithm.

//the max radius of gaussian blur
static const int MAX_BLUR_RADIUS = 1024;

//the normalized Gaussian weight of each tap (2 * radius + 1 entries are used)
float blurRatio[(MAX_BLUR_RADIUS << 1) + 1];

//the default blur radius
int blurRadius = 0;

//the width and height of bitmap
uint32_t width;
uint32_t height;

//bind to the input bitmap
rs_allocation input;
//the temp allocation (holds the result of the horizontal pass)
rs_allocation temp;

//set the radius
void setBlurRadius(int radius)
{
    if(1 > radius)
        radius = 1;
    else if(MAX_BLUR_RADIUS < radius)
        radius = MAX_BLUR_RADIUS;

    blurRadius = radius;


    /**
    calculate the blur weights with the Gaussian function
    when a pixel is far away from the center, it contributes almost nothing to the center
    so sigma is taken as blurRadius / 2.57
    */
    float sigma = 1.0f * blurRadius / 2.57f;
    float deno  = 1.0f / (sigma * sqrt(2.0f * M_PI));
    float nume  = -1.0f / (2.0f * sigma * sigma);

    //calculate the gaussian function
    float sum = 0.0f;
    for(int i = 0, r = -blurRadius; r <= blurRadius; ++i, ++r)
    {
        blurRatio[i] = deno * exp(nume * r * r);
        sum += blurRatio[i];
    }

    //normalization to 1
    int len = radius + radius + 1;
    for(int i = 0; i < len; ++i)
    {
        blurRatio[i] /= sum;
    }

}

/**
the gaussian blur is decomposed into two steps:
1.blur in the horizontal direction
2.blur in the vertical direction
*/
uchar4 RS_KERNEL horizontal(uint32_t x, uint32_t y)
{
    //accumulators must start at zero before the weighted sum
    float a = 0.0f, r = 0.0f, g = 0.0f, b = 0.0f;

    for(int k = -blurRadius; k <= blurRadius; ++k)
    {
        int horizontalIndex = x + k;

        if(0 > horizontalIndex) horizontalIndex = 0;
        if(width <= horizontalIndex) horizontalIndex = width - 1;

        uchar4 inputPixel = rsGetElementAt_uchar4(input, horizontalIndex, y);

        int blurRatioIndex = k + blurRadius;
        a += inputPixel.a * blurRatio[blurRatioIndex];
        r += inputPixel.r * blurRatio[blurRatioIndex];
        g += inputPixel.g * blurRatio[blurRatioIndex];
        b += inputPixel.b * blurRatio[blurRatioIndex];
    }

    uchar4 out;

    out.a = (uchar) a;
    out.r = (uchar) r;
    out.g = (uchar) g;
    out.b = (uchar) b;

    return out;
}

uchar4 RS_KERNEL vertical(uint32_t x, uint32_t y)
{
    //accumulators must start at zero before the weighted sum
    float a = 0.0f, r = 0.0f, g = 0.0f, b = 0.0f;

    for(int k = -blurRadius; k <= blurRadius; ++k)
    {
        int verticalIndex = y + k;

        if(0 > verticalIndex) verticalIndex = 0;
        if(height <= verticalIndex) verticalIndex = height - 1;

        uchar4 inputPixel = rsGetElementAt_uchar4(temp, x, verticalIndex);

        int blurRatioIndex = k + blurRadius;
        a += inputPixel.a * blurRatio[blurRatioIndex];
        r += inputPixel.r * blurRatio[blurRatioIndex];
        g += inputPixel.g * blurRatio[blurRatioIndex];
        b += inputPixel.b * blurRatio[blurRatioIndex];
    }

    uchar4 out;

    out.a = (uchar) a;
    out.r = (uchar) r;
    out.g = (uchar) g;
    out.b = (uchar) b;

    return out;
}

Answer 1

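RenderScript intrinsics are implemented very differently from anything you can achieve with a script of your own. There are several reasons for this, but the main one is that they are built by the RS driver developers for each individual device so as to make the best possible use of that particular hardware/SoC configuration. They most likely make low-level calls to the hardware that are simply not available at the RS programming layer.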

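Android does provide a generic implementation of these intrinsics to "fall back" on when no lower-level hardware implementation is available. Seeing how the generic versions are written will give you a better idea of how the intrinsics work. For example, you can read the source code of the generic implementation of the 3x3 convolution intrinsic in rsCpuIntrinsicConvolve3x3.cpp.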

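Take a close look at the code starting at line 98 of that source file, and notice that it uses no for loops at all to do the convolution. This is called loop unrolling: the code explicitly multiplies and adds the 9 corresponding memory locations, which avoids the for loop construct entirely. This is the first rule to keep in mind when optimizing parallel code: you need to get rid of all branching in your kernel. Looking at your code, you have a lot of ifs and fors that cause branching, which means the control flow of the program does not run straight through from beginning to end.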
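To make the idea concrete, here is a minimal sketch in plain C (not RenderScript, and not the AOSP source) that contrasts a loop-based 3x3 convolution with an unrolled one. The function names, the row-major float image layout, and the restriction to interior pixels are all assumptions made for illustration.

```c
#include <assert.h>

/* Loop version: 3x3 convolution at an interior pixel (x, y) of a
   row-major float image. coeff holds the 9 weights, row by row. */
static float convolve3x3_loop(const float *img, int width, int x, int y,
                              const float coeff[9])
{
    assert(x >= 1 && x + 1 < width && y >= 1); /* interior pixels only */
    float sum = 0.0f;
    for (int dy = -1; dy <= 1; ++dy) {
        for (int dx = -1; dx <= 1; ++dx) {
            sum += img[(y + dy) * width + (x + dx)] * coeff[(dy + 1) * 3 + (dx + 1)];
        }
    }
    return sum;
}

/* Unrolled version: the same nine multiply-adds written out explicitly,
   so the per-pixel work contains no loop counter and no loop branch. */
static float convolve3x3_unrolled(const float *img, int width, int x, int y,
                                  const float coeff[9])
{
    assert(x >= 1 && x + 1 < width && y >= 1); /* interior pixels only */
    const float *r0 = img + (y - 1) * width + x; /* row above  */
    const float *r1 = img + y * width + x;       /* current row */
    const float *r2 = img + (y + 1) * width + x; /* row below  */
    return r0[-1] * coeff[0] + r0[0] * coeff[1] + r0[1] * coeff[2]
         + r1[-1] * coeff[3] + r1[0] * coeff[4] + r1[1] * coeff[5]
         + r2[-1] * coeff[6] + r2[0] * coeff[7] + r2[1] * coeff[8];
}
```

Both functions compute the same value; the unrolled one simply trades generality for straight-line code. Whether a given compiler already unrolls the loop version on its own depends on the toolchain, so measuring on the target device is the only reliable check.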

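If you unroll your for loops, you should see a performance boost right away. Note that once you remove the for constructs, you can no longer generalize your kernel to every possible radius. You would then have to write fixed kernels for different radii, which is exactly why there are separate 3x3 and 5x5 convolution intrinsics: that is what they do. (See line 99 of the 5x5 intrinsic in rsCpuIntrinsicConvolve5x5.cpp.)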
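The trade-off between a general kernel and a fixed one can be sketched in plain C as follows. This is only an illustration under assumptions: the function names are hypothetical, the input is a single row of floats, and x is assumed to be far enough from the border that no clamping is needed.

```c
#include <assert.h>

/* Generic 1D pass: works for any radius, but needs a loop.
   weights must hold 2 * radius + 1 entries. */
static float blur_row_generic(const float *row, int x, int radius,
                              const float *weights)
{
    assert(radius >= 0);
    float sum = 0.0f;
    for (int k = -radius; k <= radius; ++k) {
        sum += row[x + k] * weights[k + radius];
    }
    return sum;
}

/* Fixed 1D pass for radius 2 only: the five taps are written out,
   so there is no loop, but the function cannot serve any other radius. */
static float blur_row_radius2(const float *row, int x, const float w[5])
{
    return row[x - 2] * w[0] + row[x - 1] * w[1] + row[x] * w[2]
         + row[x + 1] * w[3] + row[x + 2] * w[4];
}
```

With this approach, the choice of which fixed function to run would be made once, outside the per-pixel code (for example when the radius is set), so the per-pixel path stays free of that decision.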

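Furthermore, the fact that you have two separate kernels does not help. If you are doing a Gaussian blur, the convolution kernel is indeed separable, and you can do 1xN + Nx1 convolutions as you have done there, but I would recommend putting both passes in the same kernel.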
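For reference, the 1xN + Nx1 decomposition mentioned here can be sketched in plain C as below. It is a sketch under assumptions, not the asker's script and not the intrinsic: the function names are hypothetical, the image is a single-channel row-major float buffer, and borders are handled by clamping to the edge pixel as in the question's code.

```c
#include <assert.h>

/* Clamp an index to [0, n - 1] so taps past the border reuse the edge pixel. */
static int clamp_index(int i, int n)
{
    return i < 0 ? 0 : (i >= n ? n - 1 : i);
}

/* 1xN pass: each output pixel is a weighted sum along its row. */
static void blur_horizontal(const float *src, float *dst, int width, int height,
                            const float *w, int radius)
{
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            float sum = 0.0f;
            for (int k = -radius; k <= radius; ++k)
                sum += src[y * width + clamp_index(x + k, width)] * w[k + radius];
            dst[y * width + x] = sum;
        }
    }
}

/* Nx1 pass: each output pixel is a weighted sum along its column. */
static void blur_vertical(const float *src, float *dst, int width, int height,
                          const float *w, int radius)
{
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            float sum = 0.0f;
            for (int k = -radius; k <= radius; ++k)
                sum += src[clamp_index(y + k, height) * width + x] * w[k + radius];
            dst[y * width + x] = sum;
        }
    }
}

/* Separable blur: horizontal pass into tmp, then vertical pass into dst.
   w holds 2 * radius + 1 normalized weights; tmp is caller-provided scratch. */
static void blur_separable(const float *src, float *tmp, float *dst,
                           int width, int height, const float *w, int radius)
{
    assert(radius >= 0 && width > 0 && height > 0);
    blur_horizontal(src, tmp, width, height, w, radius);
    blur_vertical(tmp, dst, width, height, w, radius);
}
```

The two passes cost on the order of 2N multiply-adds per pixel instead of N*N for the full 2D kernel. In this sketch the intermediate buffer is kept because each vertical tap reads a horizontally blurred value from a different row.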

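Keep in mind, though, that even with these tricks you will probably still not get results as fast as the actual intrinsics, because those have probably been highly optimized for your specific device(s).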

Why is ScriptIntrinsicBlur faster than my method?
