Question

How do I implement the fastest possible Gaussian blur algorithm?

I am going to implement it in Java, so GPU solutions are ruled out. My application, planetGenesis, is cross-platform, so I don't want JNI.

Answer 1

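You should use the fact that a Gaussian kernel is separable, i.e. you can express a 2D convolution as a combination of two 1D convolutions.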

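If the filter is large, it may also make sense to use the fact that convolution in the spatial domain is equivalent to multiplication in the frequency (Fourier) domain. This means you can take the Fourier transform of the image and of the filter, multiply the (complex) results, and then take the inverse Fourier transform. The complexity of the FFT (Fast Fourier Transform) is O(n log n), while the complexity of a convolution is O(n^2). Also, if you need to blur many images with the same filter, you only need to take the FFT of the filter once.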
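As a reminder of what the paragraph above relies on, here is the convolution theorem together with a rough cost comparison. The symbols are assumptions introduced for this note: N is the number of pixels and k is the kernel width in pixels. Note that multiplying FFTs yields a circular convolution, so in practice the image and the filter are padded to a common size first.

```latex
(f * g)[x] = \sum_{t} f[t]\, g[x - t],
\qquad
f * g = \mathcal{F}^{-1}\big( \mathcal{F}(f) \cdot \mathcal{F}(g) \big)

\text{direct 2D: } O(N k^2), \qquad
\text{separable (two 1D passes): } O(N k), \qquad
\text{FFT-based: } O(N \log N)
```

The FFT cost does not depend on k, which is why it tends to pay off only for large filters.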

If you decide to go with an FFT, the FFTW library is a good choice.

Answer 2

Math jocks are likely to know this, but for anyone else...

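Due to a nice mathematical property of the Gaussian, you can blur a 2D image quickly by first running a 1D Gaussian blur on each row of the image, then running a 1D blur on each column.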
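The row-then-column idea could be sketched in Java roughly as follows. This is a minimal sketch, not a tuned implementation: the class name SeparableGaussian, the single-channel float image layout (row-major, width w, height h), the 3-sigma kernel cutoff and the clamp-to-edge border handling are all assumptions made for illustration.

```java
final class SeparableGaussian {
    /** Builds a normalized 1D Gaussian kernel covering about +/- 3 sigma (assumed cutoff). */
    static float[] kernel(double sigma) {
        int radius = Math.max(1, (int) Math.ceil(3 * sigma));
        float[] k = new float[2 * radius + 1];
        double sum = 0;
        for (int i = -radius; i <= radius; i++) {
            double v = Math.exp(-(i * i) / (2 * sigma * sigma));
            k[i + radius] = (float) v;
            sum += v;
        }
        for (int i = 0; i < k.length; i++) k[i] /= sum;   // weights now sum to 1
        return k;
    }

    /** One 1D pass over a row-major w*h image, along rows or along columns, clamping at borders. */
    static float[] pass(float[] src, int w, int h, float[] k, boolean horizontal) {
        int radius = k.length / 2;
        float[] dst = new float[src.length];
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                float acc = 0;
                for (int t = -radius; t <= radius; t++) {
                    int xx = horizontal ? Math.min(w - 1, Math.max(0, x + t)) : x;
                    int yy = horizontal ? y : Math.min(h - 1, Math.max(0, y + t));
                    acc += k[t + radius] * src[yy * w + xx];
                }
                dst[y * w + x] = acc;
            }
        }
        return dst;
    }

    /** Full 2D blur: rows first, then columns. */
    static float[] blur(float[] src, int w, int h, double sigma) {
        float[] k = kernel(sigma);
        return pass(pass(src, w, h, k, true), w, h, k, false);
    }
}
```

Calling SeparableGaussian.blur(pixels, width, height, 2.0) returns a new array and leaves the input untouched. Each pixel costs about 2 * (2 * radius + 1) multiplications instead of (2 * radius + 1)^2 for the full 2D kernel.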

Answer 3

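I found Quasimondo : Incubator : Processing : Fast Gaussian Blur. This method contains a lot of approximations, such as using integers and lookup tables instead of floats and floating-point divisions. I don't know how much of a speedup that gives in modern Java code.
Fast Shadows on Rectangles has an approximating algorithm using B-splines.
Fast Gaussian Blur Algorithm in C# claims to have some cool optimizations.
Also, Fast Gaussian Blur (PDF) by David Everly has a fast method for Gaussian blur processing.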

I would try out the various methods, benchmark them and post the results here.

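For my purposes, I have copied from the Internet and implemented the basic method (process the X and Y axes independently) and David Everly's Fast Gaussian Blur method. They differ in their parameters, so I could not compare them directly. However, the latter goes through far fewer iterations for a large blur radius. Also, the latter is an approximate algorithm.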

Answer 4

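The ultimate solution

I was so confused by all the information and implementations out there that I did not know which one to trust. After I figured it out, I decided to write my own article. I hope it will save you a few hours of your time.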

Fastest Gaussian Blur (in linear time)

It contains source code which (I hope) is short, clean and easy to rewrite in any other language. Please vote it up so other people can see it.

Answer 5

You probably want the box blur, which is much faster. See this link for a great tutorial and some copy & paste C code.

Answer 6

For larger blur radii, try applying a box blur three times. This approximates a Gaussian blur very well and is much faster than a true Gaussian blur.

Answer 7

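I would consider using CUDA or some other GPU programming toolkit for this, especially if you want to use a larger kernel. Failing that, there is always hand-tuning your loops in assembly.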

Answer 8

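Step 1: SIMD 1-dimensional Gaussian blur
Step 2: transpose
Step 3: repeat step 1

It is best done on small blocks, because a full-image transpose is slow, while a small-block transpose can be done extremely fast using a chain of PUNPCKs (PUNPCKHBW, PUNPCKHDQ, PUNPCKHWD, PUNPCKLBW, PUNPCKLDQ, PUNPCKLWD).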
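Plain Java has no direct equivalent of the PUNPCK instructions, so the following only illustrates the blocking part of the advice: transposing tile by tile keeps both the reads and the writes within a small, cache-friendly region. It is a sketch under assumptions; the class name TiledTranspose and the tile size of 16 are made up for illustration, and whether the JIT vectorizes the inner loop is not guaranteed.

```java
final class TiledTranspose {
    static final int TILE = 16;   // block edge length; an assumed value, tune per machine

    /** Transposes a row-major w*h image into a row-major h*w image, one tile at a time. */
    static int[] transpose(int[] src, int w, int h) {
        int[] dst = new int[src.length];
        for (int ty = 0; ty < h; ty += TILE) {
            for (int tx = 0; tx < w; tx += TILE) {
                int yEnd = Math.min(h, ty + TILE);
                int xEnd = Math.min(w, tx + TILE);
                for (int y = ty; y < yEnd; y++) {
                    for (int x = tx; x < xEnd; x++) {
                        dst[x * h + y] = src[y * w + x];   // (x, y) becomes (y, x)
                    }
                }
            }
        }
        return dst;
    }
}
```

With this helper the three steps become: blur every row, transpose, blur every row again, and transpose back if the original orientation is needed.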

Answer 9

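In 1D:

Blurring repeatedly with almost any kernel tends towards a Gaussian kernel. This is what is so cool about the Gaussian distribution, and it is why statisticians like it. So choose something that is easy to blur with and apply it several times.

For example, it is easy to blur with a box-shaped kernel. First calculate a cumulative sum: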

y(i) = y(i-1) + x(i)

and then:

blurred(i) = y(i+radius) - y(i-radius)

Repeat several times.
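The cumulative-sum trick could look like this in Java. It is a sketch with a few assumptions that go beyond the two formulas above: the window is taken as the symmetric range i - radius .. i + radius, it is clipped at the array ends, and the difference of sums is divided by the window length so that the overall brightness is preserved. The class and method names are made up for illustration.

```java
final class PrefixSumBoxBlur {
    /** One box-blur pass of the given radius, computed from a cumulative sum. */
    static double[] boxBlur(double[] x, int radius) {
        int n = x.length;
        double[] y = new double[n + 1];                // y[i] = x[0] + ... + x[i - 1]
        for (int i = 0; i < n; i++) y[i + 1] = y[i] + x[i];
        double[] out = new double[n];
        for (int i = 0; i < n; i++) {
            int lo = Math.max(0, i - radius);          // window clipped to the array
            int hi = Math.min(n - 1, i + radius);
            out[i] = (y[hi + 1] - y[lo]) / (hi - lo + 1);   // mean over the window
        }
        return out;
    }

    /** Applies the box blur several times; a few passes already look close to a Gaussian. */
    static double[] repeated(double[] x, int radius, int passes) {
        double[] cur = x;
        for (int p = 0; p < passes; p++) cur = boxBlur(cur, radius);
        return cur;
    }
}
```

The cost per pass is independent of the radius: one addition per sample to build the sum and one subtraction and division per sample to read the window.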

Or you might go back and forth a few times with some variety of IIR filter; these are similarly fast.
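As one possible reading of "back and forth with an IIR filter", here is a first-order recursive smoother that runs forward and then backward over the signal. This is an illustrative assumption, not a specific published filter: the class name RecursiveSmooth and the parameter alpha (0 < alpha <= 1, smaller means stronger blur) are made up for this sketch.

```java
final class RecursiveSmooth {
    /** In-place forward pass followed by a backward pass of a one-pole low-pass filter. */
    static void forwardBackward(double[] x, double alpha) {
        for (int i = 1; i < x.length; i++) {
            x[i] = x[i - 1] + alpha * (x[i] - x[i - 1]);   // depends on the previous output
        }
        for (int i = x.length - 2; i >= 0; i--) {
            x[i] = x[i + 1] + alpha * (x[i] - x[i + 1]);   // same filter, opposite direction
        }
    }

    /** Returns a smoothed copy; more passes give a rounder, more Gaussian-like response. */
    static double[] smooth(double[] x, double alpha, int passes) {
        double[] y = x.clone();
        for (int p = 0; p < passes; p++) forwardBackward(y, alpha);
        return y;
    }
}
```

Running the filter in both directions cancels the one-sided lag of a single pass, so the combined response is symmetric apart from small boundary effects. The work per sample is constant regardless of how wide the blur is.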

In 2D or higher:

Blur in each dimension one after the other, as DarenW said.

Answer 10

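I grappled with this problem for my research and tried an interesting method for a fast Gaussian blur. First, as mentioned, it is best to separate the blur into two 1D blurs. Beyond that, depending on the hardware that actually computes the pixel values, you can pre-compute all possible values and store them in a look-up table.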

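In other words, pre-calculate every combination of Gaussian coefficient * input pixel value. Of course you will need to discretize your coefficients, but I just wanted to add this solution. If you have an IEEE subscription, you can read more in Fast image blurring using Lookup Table for real time feature extraction.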
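A look-up-table version of a 1D pass might be sketched as follows. This is not the method from the cited paper, only an illustration of the idea under assumptions: samples are 8-bit values (0..255), products are stored in 16.16 fixed point, borders are clamped, and the names LookupTableBlur, SHIFT, buildTable and blurRow are hypothetical.

```java
final class LookupTableBlur {
    static final int SHIFT = 16;   // fixed-point fraction bits (an assumption of this sketch)

    /** table[k][v] holds kernel[k] * v in fixed point for every 8-bit value v. */
    static int[][] buildTable(float[] kernel) {
        int[][] table = new int[kernel.length][256];
        for (int k = 0; k < kernel.length; k++) {
            for (int v = 0; v < 256; v++) {
                table[k][v] = Math.round(kernel[k] * v * (1 << SHIFT));
            }
        }
        return table;
    }

    /** 1D blur of 8-bit samples using only table look-ups, additions and one shift. */
    static int[] blurRow(int[] row, int[][] table) {
        int radius = table.length / 2;
        int[] out = new int[row.length];
        for (int i = 0; i < row.length; i++) {
            int acc = 1 << (SHIFT - 1);                      // rounding offset
            for (int k = -radius; k <= radius; k++) {
                int j = Math.min(row.length - 1, Math.max(0, i + k));   // clamp at borders
                acc += table[k + radius][row[j]];
            }
            out[i] = Math.min(255, acc >> SHIFT);
        }
        return out;
    }
}
```

The table is built once per kernel and reused for every row and column. Whether this beats a plain multiply on a modern JVM is something to measure rather than assume.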

Ultimately, I ended up using CUDA though :)

Answer 11

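I have converted Ivan Kuckir's implementation of a fast Gaussian blur, which uses three passes of a linear box blur, to Java. The resulting process is O(n), as he has stated at his own blog. If you would like to learn more about why a three-pass box blur approximates a Gaussian blur (to within about 3%), my friend, you may check out box blur and Gaussian blur.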
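The box sizes used by the implementation in this answer can be motivated by matching variances. A box filter of odd width w has variance (w^2 - 1) / 12, and the variances of successive blurs add up. In the notation below, which is introduced only for this note, w_l corresponds to filterWidth in the code, w_l + 2 to filterWidthU, n to the number of passes, and m to the number of passes that use the smaller width.

```latex
n \cdot \frac{w_{\text{ideal}}^2 - 1}{12} = \sigma^2
\quad\Rightarrow\quad
w_{\text{ideal}} = \sqrt{\frac{12\sigma^2}{n} + 1}

m \cdot \frac{w_l^2 - 1}{12} + (n - m) \cdot \frac{(w_l + 2)^2 - 1}{12} = \sigma^2
\quad\Rightarrow\quad
m = \frac{12\sigma^2 - n w_l^2 - 4 n w_l - 3n}{-4 w_l - 4}
```

For example, with sigma = 12 and n = 3 the ideal width is sqrt(577), about 24.02. The largest odd width not above it is w_l = 23, the formula gives m = 1.5, which rounds to 2, and the three boxes get widths 23, 23 and 25, i.e. radii 11, 11 and 12.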

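Here is the Java implementation. Note that it reads only the low byte of each packed pixel and writes that value to all three colour channels, so it effectively works on greyscale images.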

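import java.awt.image.BufferedImage;
import java.util.ArrayList;

// The methods below are members of an image-filter class.
@Override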
public BufferedImage ProcessImage(BufferedImage image) {
    int width = image.getWidth();
    int height = image.getHeight();

    int[] pixels = image.getRGB(0, 0, width, height, null, 0, width);
    int[] changedPixels = new int[pixels.length];

    FastGaussianBlur(pixels, changedPixels, width, height, 12);

    BufferedImage newImage = new BufferedImage(width, height, image.getType());
    newImage.setRGB(0, 0, width, height, changedPixels, 0, width);

    return newImage;
}

private void FastGaussianBlur(int[] source, int[] output, int width, int height, int radius) {
    ArrayList<Integer> gaussianBoxes = CreateGausianBoxes(radius, 3);
    BoxBlur(source, output, width, height, (gaussianBoxes.get(0) - 1) / 2);
    BoxBlur(output, source, width, height, (gaussianBoxes.get(1) - 1) / 2);
    BoxBlur(source, output, width, height, (gaussianBoxes.get(2) - 1) / 2);
}

private ArrayList<Integer> CreateGausianBoxes(double sigma, int n) {
    double idealFilterWidth = Math.sqrt((12 * sigma * sigma / n) + 1);

    int filterWidth = (int) Math.floor(idealFilterWidth);

    if (filterWidth % 2 == 0) {
        filterWidth--;
    }

    int filterWidthU = filterWidth + 2;

    double mIdeal = (12 * sigma * sigma - n * filterWidth * filterWidth - 4 * n * filterWidth - 3 * n) / (-4 * filterWidth - 4);
    double m = Math.round(mIdeal);

    ArrayList<Integer> result = new ArrayList<>();

    for (int i = 0; i < n; i++) {
        result.add(i < m ? filterWidth : filterWidthU);
    }

    return result;
}

private void BoxBlur(int[] source, int[] output, int width, int height, int radius) {
    System.arraycopy(source, 0, output, 0, source.length);
    BoxBlurHorizontal(output, source, width, height, radius);
    BoxBlurVertical(source, output, width, height, radius);
}

private void BoxBlurHorizontal(int[] sourcePixels, int[] outputPixels, int width, int height, int radius) {
    int resultingColorPixel;
    float iarr = 1f / (radius + radius + 1); // reciprocal of the window length
    for (int i = 0; i < height; i++) {
        int outputIndex = i * width;
        int li = outputIndex;
        int sourceIndex = outputIndex + radius;

        // Only the low byte (blue channel) of each packed pixel is read.
        int fv = Byte.toUnsignedInt((byte) sourcePixels[outputIndex]);
        int lv = Byte.toUnsignedInt((byte) sourcePixels[outputIndex + width - 1]);
        // Seed the running sum as if the row were padded on the left with its first value.
        float val = (radius + 1) * fv;

        for (int j = 0; j < radius; j++) {
            val += Byte.toUnsignedInt((byte) (sourcePixels[outputIndex + j]));
        }

        for (int j = 0; j <= radius; j++) {
            val += Byte.toUnsignedInt((byte) sourcePixels[sourceIndex++]) - fv;
            resultingColorPixel = Byte.toUnsignedInt(((Integer) Math.round(val * iarr)).byteValue());
            outputPixels[outputIndex++] = (0xFF << 24) | (resultingColorPixel << 16) | (resultingColorPixel << 8) | (resultingColorPixel);
        }

        for (int j = (radius + 1); j < (width - radius); j++) {
            val += Byte.toUnsignedInt((byte) sourcePixels[sourceIndex++]) - Byte.toUnsignedInt((byte) sourcePixels[li++]);
            resultingColorPixel = Byte.toUnsignedInt(((Integer) Math.round(val * iarr)).byteValue());
            outputPixels[outputIndex++] = (0xFF << 24) | (resultingColorPixel << 16) | (resultingColorPixel << 8) | (resultingColorPixel);
        }

        for (int j = (width - radius); j < width; j++) {
            val += lv - Byte.toUnsignedInt((byte) sourcePixels[li++]);
            resultingColorPixel = Byte.toUnsignedInt(((Integer) Math.round(val * iarr)).byteValue());
            outputPixels[outputIndex++] = (0xFF << 24) | (resultingColorPixel << 16) | (resultingColorPixel << 8) | (resultingColorPixel);
        }
    }
}

private void BoxBlurVertical(int[] sourcePixels, int[] outputPixels, int width, int height, int radius) {
    int resultingColorPixel;
    float iarr = 1f / (radius + radius + 1);
    for (int i = 0; i < width; i++) {
        int outputIndex = i;
        int li = outputIndex;
        int sourceIndex = outputIndex + radius * width;

        int fv = Byte.toUnsignedInt((byte) sourcePixels[outputIndex]);
        int lv = Byte.toUnsignedInt((byte) sourcePixels[outputIndex + width * (height - 1)]);
        float val = (radius + 1) * fv;

        for (int j = 0; j < radius; j++) {
            val += Byte.toUnsignedInt((byte) sourcePixels[outputIndex + j * width]);
        }
        for (int j = 0; j <= radius; j++) {
            val += Byte.toUnsignedInt((byte) sourcePixels[sourceIndex]) - fv;
            resultingColorPixel = Byte.toUnsignedInt(((Integer) Math.round(val * iarr)).byteValue());
            outputPixels[outputIndex] = (0xFF << 24) | (resultingColorPixel << 16) | (resultingColorPixel << 8) | (resultingColorPixel);
            sourceIndex += width;
            outputIndex += width;
        }
        for (int j = radius + 1; j < (height - radius); j++) {
            val += Byte.toUnsignedInt((byte) sourcePixels[sourceIndex]) - Byte.toUnsignedInt((byte) sourcePixels[li]);
            resultingColorPixel = Byte.toUnsignedInt(((Integer) Math.round(val * iarr)).byteValue());
            outputPixels[outputIndex] = (0xFF << 24) | (resultingColorPixel << 16) | (resultingColorPixel << 8) | (resultingColorPixel);
            li += width;
            sourceIndex += width;
            outputIndex += width;
        }
        for (int j = (height - radius); j < height; j++) {
            val += lv - Byte.toUnsignedInt((byte) sourcePixels[li]);
            resultingColorPixel = Byte.toUnsignedInt(((Integer) Math.round(val * iarr)).byteValue());
            outputPixels[outputIndex] = (0xFF << 24) | (resultingColorPixel << 16) | (resultingColorPixel << 8) | (resultingColorPixel);
            li += width;
            outputIndex += width;
        }
    }
}

Answer 12

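There are several fast methods for Gaussian blur of 2D data. Here is what you should know about:

This is a separable filter, so it requires only two 1D convolutions.
For big kernels you can process a reduced copy of the image and then upscale it back.
A good approximation can be made with multiple box filters (also separable); the number of iterations and the kernel sizes can be tuned.
There is an O(n) complexity algorithm (for any kernel size) for a precise Gaussian approximation by an IIR filter.

Your choice depends on the required speed, precision, and implementation complexity.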

Answer 13

Try using Box Blur the way I did here: Approximating Gaussian Blur Using Extended Box Blur

This is the best approximation.

Using integral images you can make it even faster. If you do, please share your solution.
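For reference, a plain box blur (not the extended box blur linked above) built on an integral image could be sketched like this. The assumptions: a single-channel int image in row-major order, a window clipped at the image borders, and rounding to the nearest integer. The class name IntegralImageBlur and its method names are made up for illustration.

```java
final class IntegralImageBlur {
    /** sat[(y + 1) * (w + 1) + (x + 1)] = sum of img over rows 0..y and columns 0..x. */
    static long[] integral(int[] img, int w, int h) {
        long[] sat = new long[(w + 1) * (h + 1)];      // extra zero row and column
        for (int y = 0; y < h; y++) {
            long rowSum = 0;
            for (int x = 0; x < w; x++) {
                rowSum += img[y * w + x];
                sat[(y + 1) * (w + 1) + (x + 1)] = sat[y * (w + 1) + (x + 1)] + rowSum;
            }
        }
        return sat;
    }

    /** Mean over a (2 * radius + 1)^2 window, using four table reads per pixel. */
    static int[] boxBlur(int[] img, int w, int h, int radius) {
        long[] sat = integral(img, w, h);
        int stride = w + 1;
        int[] out = new int[img.length];
        for (int y = 0; y < h; y++) {
            int y0 = Math.max(0, y - radius);
            int y1 = Math.min(h - 1, y + radius) + 1;
            for (int x = 0; x < w; x++) {
                int x0 = Math.max(0, x - radius);
                int x1 = Math.min(w - 1, x + radius) + 1;
                long sum = sat[y1 * stride + x1] - sat[y0 * stride + x1]
                         - sat[y1 * stride + x0] + sat[y0 * stride + x0];
                int area = (x1 - x0) * (y1 - y0);
                out[y * w + x] = (int) ((sum + area / 2) / area);   // rounded mean
            }
        }
        return out;
    }
}
```

The table is built in one pass over the image, after which the cost per pixel does not depend on the radius. Applying boxBlur several times in a row gives the usual repeated-box approximation to a Gaussian.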

Answer 14

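Answering this old question with the new libraries that are available now (as of 2016), because there have been many advances in GPU technology for Java.

As suggested in a few other answers, CUDA is an alternative. But Java now has CUDA support.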

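IBM CUDA4J library: provides a Java API for managing and accessing GPU devices, libraries, kernels, and memory. Using these new APIs it is possible to write Java programs that manage GPU device characteristics and offload work to the GPU with the convenience of the Java memory model, exceptions, and automatic resource management.

JCuda: Java bindings for NVIDIA CUDA and related libraries. With JCuda it is possible to interact with the CUDA runtime and driver API from Java programs.

Aparapi: allows Java developers to take advantage of the compute power of GPU and APU devices by executing data-parallel code fragments on the GPU rather than being confined to the local CPU.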

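Some Java OpenCL binding libraries

https://github.com/ochafik/JavaCL : Java bindings for OpenCL: an object-oriented OpenCL library, based on auto-generated low-level bindings

http://jogamp.org/jocl/www/ : Java bindings for OpenCL: an object-oriented OpenCL library, based on auto-generated low-level bindings

http://www.lwjgl.org/ : Java bindings for OpenCL: auto-generated low-level bindings and object-oriented convenience classes

http://jocl.org/ : Java bindings for OpenCL: low-level bindings that are a 1:1 mapping of the original OpenCL API

All of the libraries above will help in implementing a Gaussian blur faster than any implementation in Java on the CPU.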

Answer 15

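I have seen several answers in different places and am collecting them here so that I can try to wrap my mind around them and remember them for later:

Regardless of which approach you use, filter horizontal and vertical dimensions separately with 1D filters rather than using a single square filter.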

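The standard "slow" approach: a convolution filter
A hierarchical pyramid of images with reduced resolution, as in SIFT
Repeated box blurs, motivated by the Central Limit Theorem. The box blur is central to Viola and Jones's face detection, where they call it an integral image, if I remember correctly. I think Haar-like features use something similar, too.
Stack Blur: a queue-based alternative somewhere between the convolutional and box blur approaches
IIR filters
- Deriche filter (Wikipedia): a second-order IIR filter
- van Vliet filter: I know nothing about this one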
- Bessel filters (nor about these)

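After reviewing all of these, I am reminded that simple, poor approximations often work well in practice. In a different field, Alex Krizhevsky found ReLUs to be faster than the classic sigmoid function in his ground-breaking AlexNet, even though they appear at first glance to be a terrible approximation to the sigmoid.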

Answer 16

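Dave Hale from CWP has a minejtk package, which includes a recursive Gaussian filter (Deriche method and Van Vliet method). The Java routine can be found at https://github.com/dhale/jtk/blob/0350c23f91256181d415ea7369dbd62855ac4460/core/src/main/java/edu/mines/jtk/dsp/RecursiveGaussianFilter.java

Deriche's method seems to be a very good one for Gaussian blur (and also for derivatives of Gaussians).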
