Question

我目前正在测试朱莉娅（我曾与Matlab合作）

在matlab中，N ^ 3的计算速度比NxNxN慢。 N ^ 2和NxN不会发生这种情况。他们使用不同的算法来计算高阶指数，因为他们更喜欢准确而不是速度。

我认为朱莉娅做同样的事情。

我想问一下是否有办法迫使Julia使用乘法而不是默认算法来计算N的指数，至少对于立方指数来说。

前段时间我在matlab上做了一些测试。我把那段代码翻译成了朱莉娅。

代码链接： http://pastebin.com/bbeukhTc （我无法上传所有链接:(）

Matlab 2014上的脚本结果：

Exponente1

经过的时间是68.293793秒。（最小的17.7倍）

Exponente2

经过的时间是24.236218秒。（小的6.3倍）

Exponente3

经过的时间是3.853348秒。

Julia 0.46上的脚本结果：

Exponente1

18.423204秒（8.22 k分配：372.563 KB）（最小的51.6倍）

Exponente2

13.746904秒（9.02 k分配：407.332 KB）（最小的38.5倍）

Exponente3

0.356875秒（10.01 k分配：450.441 KB）

在我的测试中，julia比Matlab快，但我使用的是相对较旧的版本。我无法测试其他版本。

Answer 1

检查Julia的源代码：

julia/base/math.jl：

^(x::Float64, y::Integer) =
    box(Float64, powi_llvm(unbox(Float64,x), unbox(Int32,Int32(y))))
^(x::Float32, y::Integer) =
    box(Float32, powi_llvm(unbox(Float32,x), unbox(Int32,Int32(y))))

julia/base/fastmath.jl：

pow_fast{T<:FloatTypes}(x::T, y::Integer) = pow_fast(x, Int32(y))
pow_fast{T<:FloatTypes}(x::T, y::Int32) =
    box(T, Base.powi_llvm(unbox(T,x), unbox(Int32,y)))

我们可以看到Julia使用powi_llvm

检查llvm's source code：

define double @powi(double %F, i32 %power) {
; CHECK: powi:
; CHECK: bl __powidf2
        %result = call double @llvm.powi.f64(double %F, i32 %power)
    ret double %result
}

现在，__powidf2是一个有趣的功能：

COMPILER_RT_ABI double
__powidf2(double a, si_int b)
{
    const int recip = b < 0;
    double r = 1;
    while (1)
    {
        if (b & 1)
            r *= a;
        b /= 2;
        if (b == 0)
            break;
        a *= a;
    }
    return recip ? 1/r : r;
}

示例1：给定a = 2; b = 7：

 -              r          =   1
 - iteration 1: r = 1 *  2 =   2; b = (int)(7/2) = 3; a = 2 * 2 =  4
 - iteration 2: r = 2 *  4 =   8; b = (int)(3/2) = 1; a = 4 * 4 = 16
 - iteration 3: r = 8 * 16 = 128;

实施例2：给定a = 2; b = 8：

 -              r           =   1
 - iteration 1: r           =   1; b = (int)(8/2) = 4; a =  2 *  2 =   4
 - iteration 2: r           =   1; b = (int)(4/2) = 2; a =  4 *  4 =  16
 - iteration 3: r           =   1; b = (int)(2/2) = 1; a = 16 * 16 = 256
 - iteration 4: r = 1 * 256 = 256; b = (int)(1/2) = 0;

整数幂总是作为序列乘法实现。这就是为什么N ^ 3比N ^ 2慢。

jl_powi_llvm（在fastmath.jl中调用。“jl_”由宏扩展连接），另一方面，将指数转换为浮点并调用pow（）。 C源代码：

JL_DLLEXPORT jl_value_t *jl_powi_llvm(jl_value_t *a, jl_value_t *b)
{
    jl_value_t *ty = jl_typeof(a);
    if (!jl_is_bitstype(ty))
        jl_error("powi_llvm: a is not a bitstype");
    if (!jl_is_bitstype(jl_typeof(b)) || jl_datatype_size(jl_typeof(b)) != 4)
        jl_error("powi_llvm: b is not a 32-bit bitstype");
    jl_value_t *newv = newstruct((jl_datatype_t*)ty);
    void *pa = jl_data_ptr(a), *pr = jl_data_ptr(newv);
    int sz = jl_datatype_size(ty);
    switch (sz) {
    /* choose the right size c-type operation */
    case 4:
        *(float*)pr = powf(*(float*)pa, (float)jl_unbox_int32(b));
        break;
    case 8:
        *(double*)pr = pow(*(double*)pa, (double)jl_unbox_int32(b));
        break;
    default:
        jl_error("powi_llvm: runtime floating point intrinsics are not implemented for bit sizes other than 32 and 64");
    }
    return newv;
}

Answer 2

Lior的答案很棒。以下是您提出的问题的解决方案：是的，有一种方法可以强制使用乘法。它是@fastmath宏：

julia> @benchmark 1.1 ^ 3
BenchmarkTools.Trial: 
  samples:          10000
  evals/sample:     999
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  16.00 bytes
  allocs estimate:  1
  minimum time:     13.00 ns (0.00% GC)
  median time:      14.00 ns (0.00% GC)
  mean time:        15.74 ns (6.14% GC)
  maximum time:     1.85 μs (98.16% GC)

julia> @benchmark @fastmath 1.1 ^ 3
BenchmarkTools.Trial: 
  samples:          10000
  evals/sample:     1000
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  0.00 bytes
  allocs estimate:  0
  minimum time:     2.00 ns (0.00% GC)
  median time:      3.00 ns (0.00% GC)
  mean time:        2.59 ns (0.00% GC)
  maximum time:     20.00 ns (0.00% GC)

请注意，使用@fastmath时，效果要好得多。

指数计算速度

2 个答案: