Question

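I am thinking about how to implement the conversion of an integer (4 bytes, unsigned) to a string using SSE instructions. The usual routine is to repeatedly divide the number and store the digits in a local buffer, then reverse the string (the reversal routine is missing in this example):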

char *convert(unsigned int num, int base) {
    static char buff[33];  

    char *ptr;    
    ptr = &buff[sizeof(buff) - 1];    
    *ptr = '\0';

    do {
        *--ptr="0123456789abcdef"[num%base];
        num /= base;
    } while(num != 0);

    return ptr;
}

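But the reversal takes extra time. Is there any other algorithm, preferably one that can use SSE instructions, to parallelize the function?

Answer 1

Terje Mathisen invented a very fast itoa() that does not require lookup tables. If you are not interested in the explanation of how it works, skip down to Performance or Implementation.

More than 15 years ago Terje Mathisen came up with a parallelized itoa() for base 10. The idea is to take a 32-bit value and break it into two chunks of 5 digits. (A quick Google search for "Terje Mathisen itoa" turned up this post: http://computer-programming-forum.com/46-asm/7aa4b50bce8dd985.htm)

We start like so: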

void itoa(char *buf, uint32_t val)
{
    uint32_t lo = val % 100000;
    uint32_t hi = val / 100000;
    itoa_half(&buf[0], hi);
    itoa_half(&buf[5], lo);
}

Now we just need an algorithm that can convert any integer in the domain [0, 99999] to a string. A naive way to do that might be:

// 0 <= val <= 99999
void itoa_half(char *buf, uint32_t val)
{
    // Move all but the first digit to the right of the decimal point.
    float tmp = val / 10000.0;

    for(size_t i = 0; i < 5; i++)
    {
        // Extract the next digit.
        int digit = (int) tmp;

        // Convert to a character.
        buf[i] = '0' + (char) digit;

        // Remove the lead digit and shift left 1 decimal place.
        tmp = (tmp - digit) * 10.0;
    }
}

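Rather than use floating-point, we will use 4.28 fixed-point math because it is significantly faster in our case. That is, we fix the binary point at the 28th bit position so that 1.0 is represented as 2^28. To convert into fixed-point, we simply multiply by 2^28. We can easily round down to the nearest integer by masking with 0xf0000000, and we can extract the fractional portion by masking with 0x0fffffff.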

(Note: Terje's algorithm differs slightly in the choice of fixed-point format.)
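To make the 4.28 format concrete, here is a minimal sketch of the three operations just described. The helper names (fix_from_uint, fix_int_part, fix_frac_part) and the two macros are hypothetical and are not part of Terje's code.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t fix4_28;             /* 4 integer bits, 28 fractional bits */

#define FIX_ONE       (1u << 28)      /* 1.0 in 4.28 fixed-point */
#define FIX_FRAC_MASK 0x0fffffffu     /* low 28 bits: the fractional part */

/* Convert a small integer to 4.28 by multiplying by 2^28. */
static fix4_28 fix_from_uint(uint32_t n)
{
    assert(n < 16);                   /* only 4 integer bits are available */
    return n * FIX_ONE;
}

/* Integer part: the top 4 bits (mask with 0xf0000000, then shift down). */
static uint32_t fix_int_part(fix4_28 x)
{
    return x >> 28;
}

/* Fractional part: the low 28 bits. */
static fix4_28 fix_frac_part(fix4_28 x)
{
    return x & FIX_FRAC_MASK;
}
```

For example, 2.5 is represented as 0x28000000: the integer part is 2 and the fractional part is 0x08000000, which is 0.5 * 2^28.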

So now we have:

typedef uint32_t fix4_28;

// 0 <= val <= 99999
void itoa_half(char *buf, uint32_t val)
{
    // Convert `val` to fixed-point and divide by 10000 in a single step.
    // N.B. we would overflow a uint32_t if not for the parentheses.
    fix4_28 tmp = val * ((1 << 28) / 10000);

    for(size_t i = 0; i < 5; i++)
    {
        int digit = (int)(tmp >> 28);
        buf[i] = '0' + (char) digit;
        tmp = (tmp & 0x0fffffff) * 10;
    }
}

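The only problem with this code is that 2^28 / 10000 = 26843.5456, which is truncated to 26843. This causes inaccuracies for certain values. For example, itoa_half(buf, 83492) produces the string "83490". If we apply a small correction in our conversion to 4.28 fixed-point, then the algorithm works for all numbers in the domain [0, 99999]: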

// 0 <= val <= 99999
void itoa_half(char *buf, uint32_t val)
{
    fix4_28 const f1_10000 = (1 << 28) / 10000;

    // 2^28 / 10000 is 26843.5456, but 26843.75 is sufficiently close.
    fix4_28 tmp = val * (f1_10000 + 1) - (val / 4);

    for(size_t i = 0; i < 5; i++)
    {
        int digit = (int)(tmp >> 28);
        buf[i] = '0' + (char) digit;
        tmp = (tmp & 0x0fffffff) * 10;
    }
}
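The claim that the corrected conversion is exact over the whole domain can be checked exhaustively. The sketch below assumes snprintf() as the reference; the names itoa_half_checked and count_mismatches are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef uint32_t fix4_28;

/* Corrected conversion from the text: multiply by 26844, then subtract val / 4. */
static void itoa_half_checked(char *buf, uint32_t val)
{
    fix4_28 const f1_10000 = (1u << 28) / 10000;   /* 26843 */
    fix4_28 tmp;

    assert(val <= 99999);
    tmp = val * (f1_10000 + 1) - (val / 4);

    for (size_t i = 0; i < 5; i++) {
        buf[i] = (char)('0' + (tmp >> 28));
        tmp = (tmp & 0x0fffffffu) * 10;
    }
}

/* Returns how many inputs in [0, 99999] produce output that differs from snprintf(). */
static unsigned count_mismatches(void)
{
    unsigned bad = 0;

    for (uint32_t v = 0; v <= 99999; v++) {
        char got[5], want[6];

        itoa_half_checked(got, v);
        snprintf(want, sizeof want, "%05u", (unsigned)v);
        if (memcmp(got, want, 5) != 0)
            bad++;
    }
    return bad;
}
```

The correction works because the approximation error is never negative and, even after being multiplied by 10 four times, stays below 2^28, so it cannot change any extracted digit.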

Terje interleaves the itoa_half part for the low and high halves:

void itoa(char *buf, uint32_t val)
{
    fix4_28 const f1_10000 = (1 << 28) / 10000;
    fix4_28 tmplo, tmphi;

    uint32_t lo = val % 100000;
    uint32_t hi = val / 100000;

    tmplo = lo * (f1_10000 + 1) - (lo / 4);
    tmphi = hi * (f1_10000 + 1) - (hi / 4);

    for(size_t i = 0; i < 5; i++)
    {
        buf[i + 0] = '0' + (char)(tmphi >> 28);
        buf[i + 5] = '0' + (char)(tmplo >> 28);
        tmphi = (tmphi & 0x0fffffff) * 10;
        tmplo = (tmplo & 0x0fffffff) * 10;
    }
}

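There is an additional trick that makes the code slightly faster if the loop is fully unrolled. The multiply by 10 is implemented as either a LEA+SHL or a LEA+ADD sequence. We can save 1 instruction by multiplying by 5 instead, which requires only a single LEA. This has the same effect as shifting tmphi and tmplo right by 1 position on each pass through the loop, but we can compensate by adjusting our shift counts and masks like this: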

uint32_t mask = 0x0fffffff;
uint32_t shift = 28;

for(size_t i = 0; i < 5; i++)
{
    buf[i + 0] = '0' + (char)(tmphi >> shift);
    buf[i + 5] = '0' + (char)(tmplo >> shift);
    tmphi = (tmphi & mask) * 5;
    tmplo = (tmplo & mask) * 5;
    mask >>= 1;
    shift--;
}

This only helps if the loop is fully unrolled, because then you can precalculate the values of shift and mask for each iteration.
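Unrolled by hand, the multiply-by-5 variant for one 5-digit half might look like the sketch below. The function name is hypothetical; the shift counts and masks are simply the per-iteration values of shift and mask from the loop above, written out as constants.

```c
#include <assert.h>
#include <stdint.h>

/* Converts val (0..99999) to 5 zero-padded ASCII digits in buf[0..4]. */
static void itoa_half_unrolled(char *buf, uint32_t val)
{
    uint32_t tmp;

    assert(val <= 99999);
    tmp = val * 26844u - (val / 4);          /* corrected 4.28 conversion */

    /* Each multiply by 5 (instead of 10) moves the binary point down one bit,
     * so the shift count drops by 1 and the mask loses its top bit each step. */
    buf[0] = (char)('0' + (tmp >> 28)); tmp = (tmp & 0x0fffffffu) * 5;
    buf[1] = (char)('0' + (tmp >> 27)); tmp = (tmp & 0x07ffffffu) * 5;
    buf[2] = (char)('0' + (tmp >> 26)); tmp = (tmp & 0x03ffffffu) * 5;
    buf[3] = (char)('0' + (tmp >> 25)); tmp = (tmp & 0x01ffffffu) * 5;
    buf[4] = (char)('0' + (tmp >> 24));
}
```

No precision is lost compared with multiplying by 10: after k steps the multiply-by-10 value is always divisible by 2^k, so the multiply-by-5 value is exactly that value shifted right by k bits.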

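Finally, this routine produces zero-padded results. You can eliminate the padding by returning a pointer to the first character that is not 0, or to the last character if val == 0: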

char *itoa_unpadded(char *buf, uint32_t val)
{
    char *p;
    itoa(buf, val);

    p = buf;

    // Note: will break on GCC, but you can work around it by using memcpy() to dereference p.
    if (*((uint64_t *) p) == 0x3030303030303030)
        p += 8;

    if (*((uint32_t *) p) == 0x30303030)
        p += 4;

    if (*((uint16_t *) p) == 0x3030)
        p += 2;

    if (*((uint8_t *) p) == 0x30)
        p += 1;

    // buf holds 10 digits, so clamp to the last one; val == 0 then yields "0".
    return min(p, &buf[9]);
}
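The memcpy() workaround mentioned in the code comment could be sketched as follows. This is an illustration under stated assumptions, not the author's code: the function name is hypothetical, the buffer is assumed to contain 10 digits followed by a '\0' with at least 12 readable bytes, and the clamp to buf[9] reflects that 10-digit layout.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* buf: 10 ASCII digits followed by '\0', with at least 12 readable bytes.
 * Returns a pointer to the first non-'0' digit, or to the last digit if all are '0'. */
static char *skip_leading_zeros(char *buf)
{
    char *p = buf;
    uint64_t w8;
    uint32_t w4;
    uint16_t w2;

    assert(buf[10] == '\0');

    memcpy(&w8, p, sizeof w8);            /* unaligned-safe, aliasing-safe load */
    if (w8 == 0x3030303030303030ULL) p += 8;

    memcpy(&w4, p, sizeof w4);
    if (w4 == 0x30303030u) p += 4;

    memcpy(&w2, p, sizeof w2);
    if (w2 == 0x3030u) p += 2;

    if (*p == '0') p += 1;

    return p < &buf[9] ? p : &buf[9];     /* never skip the final digit */
}
```

Because every byte of each comparison constant is the same, the checks do not depend on byte order. Compilers generally turn these fixed-size memcpy() calls into plain loads.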

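There is one additional trick applicable to 64-bit (i.e. AMD64) code. The extra, wider registers make it efficient to accumulate each 5-digit group in a register; after the last digit has been calculated, you can smash the groups together with SHRD, OR them with 0x3030303030303030, and store them to memory. This improved performance for me by about 12.3%.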
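In C, the accumulate-then-OR idea for a single 5-digit group could be sketched like this. The function name is hypothetical, and the sketch assumes a little-endian target such as x86-64, so that byte 0 of the accumulator ends up in buf[0]; it does not attempt the SHRD merge of the two groups.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Converts val (0..99999) to 5 zero-padded ASCII digits in buf[0..4].
 * Assumes a little-endian target: byte i of acc is stored to buf[i]. */
static void itoa_half_packed(char *buf, uint32_t val)
{
    uint32_t tmp;
    uint64_t acc = 0;

    assert(val <= 99999);
    tmp = val * 26844u - (val / 4);               /* corrected 4.28 conversion */

    for (int i = 0; i < 5; i++) {
        acc |= (uint64_t)(tmp >> 28) << (8 * i);  /* raw digit i goes into byte i */
        tmp = (tmp & 0x0fffffffu) * 10;
    }

    acc |= 0x3030303030ULL;                       /* add '0' to all 5 digits at once */
    memcpy(buf, &acc, 5);                         /* copy the packed digits out */
}
```

The digits stay in a register until the end, so the five separate byte stores and the five separate additions of '0' collapse into one OR and one copy.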

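Vectorization

We could execute the above algorithm as-is on the SSE units, but there is almost no gain in performance. However, if we split the value into smaller chunks, we can take advantage of the SSE4.1 32-bit multiply instructions. I tried three different splits:

2 groups of 5 digits
3 groups of 4 digits
4 groups of 3 digits

The fastest variant was 4 groups of 3 digits. See below for the results.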
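To show what the 4-groups-of-3 split looks like, here is a scalar sketch in which each array element stands in for one 32-bit SSE lane. It is not the SSE implementation that was benchmarked: the function name is hypothetical, and the multiplier 2684355 (2^28 / 100, rounded up) is an assumption chosen here because it keeps three digits exact.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Writes exactly 10 zero-padded digits plus '\0' into buf (at least 11 bytes). */
static void itoa_grouped(char *buf, uint32_t val)
{
    uint32_t lane[4];
    char digits[12];

    /* Split into 4 groups of 3 digits, most significant group in lane[0]. */
    lane[3] = val % 1000; val /= 1000;
    lane[2] = val % 1000; val /= 1000;
    lane[1] = val % 1000; val /= 1000;
    lane[0] = val;
    assert(lane[0] <= 4);                 /* 2^32 - 1 = 4,294,967,295 */

    /* Every lane runs the same steps; in an SSE version these four
     * iterations would be one 4 x 32-bit vector operation per step. */
    for (int k = 0; k < 4; k++) {
        uint32_t tmp = lane[k] * 2684355u;   /* lane / 100 in 4.28 fixed-point */

        for (int i = 0; i < 3; i++) {
            digits[3 * k + i] = (char)('0' + (tmp >> 28));
            tmp = (tmp & 0x0fffffffu) * 10;
        }
    }

    /* The top group only ever needs 1 digit; drop its two leading zeros. */
    memcpy(buf, digits + 2, 10);
    buf[10] = '\0';
}
```

Since the top group contributes only one digit, the effective grouping is 1/3/3/3, and no per-lane correction term is needed because a 3-digit group accumulates far less error than a 5-digit one.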

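Performance

I tested many variants of Terje's algorithm in addition to the algorithms suggested by vitaut and Inge Henriksen. I verified through exhaustive testing of inputs that each algorithm's output matches itoa().

My numbers are taken from a Westmere E5640 running Windows 7 64-bit. I benchmark at real-time priority and locked to core 0. I execute each algorithm 4 times to force everything into the cache. I time 2^24 calls using RDTSCP to remove the effect of any dynamic clock speed changes.

I timed 5 different patterns of inputs:

itoa(0 .. 9) - nearly best-case performance
itoa(1000 .. 1999) - longer output, no branch mispredicts
itoa(100000000 .. 999999999) - longest output, no branch mispredicts
itoa(256 random values) - varying output lengths
itoa(65536 random values) - varying output lengths and thrashes the L1/L2 caches

The data: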

ALG        TINY     MEDIUM   LARGE    RND256   RND64K   NOTES
NULL         7 clk    7 clk    7 clk    7 clk    7 clk  Benchmark overhead baseline
TERJE_C     63 clk   62 clk   63 clk   57 clk   56 clk  Best C implementation of Terje's algorithm
TERJE_ASM   48 clk   48 clk   50 clk   45 clk   44 clk  Naive, hand-written AMD64 version of Terje's algorithm
TERJE_SSE   41 clk   42 clk   41 clk   34 clk   35 clk  SSE intrinsic version of Terje's algorithm with 1/3/3/3 digit grouping
INGE_0      12 clk   31 clk   71 clk   72 clk   72 clk  Inge's first algorithm
INGE_1      20 clk   23 clk   45 clk   69 clk   96 clk  Inge's second algorithm
INGE_2      18 clk   19 clk   32 clk   29 clk   36 clk  Improved version of Inge's second algorithm
VITAUT_0     9 clk   16 clk   32 clk   35 clk   35 clk  vitaut's algorithm
VITAUT_1    11 clk   15 clk   33 clk   31 clk   30 clk  Improved version of vitaut's algorithm
LIBC        46 clk  128 clk  329 clk  339 clk  340 clk  MSVCRT12 implementation

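My compiler (VS 2013 Update 4) produced surprisingly bad code; the assembly version of Terje's algorithm is just a naive translation, and it is a full 21% faster. I was also surprised at the performance of the SSE implementation, which I expected to be slower. The big surprise was how fast INGE_2, VITAUT_0, and VITAUT_1 were. Bravo to vitaut for coming up with a portable solution that beats even my best effort at the assembly level.

Note: INGE_1 is a modified version of Inge Henriksen's second algorithm because the original has a bug.

INGE_2 is based on the second algorithm that Inge Henriksen gave. Rather than storing pointers to the precalculated strings in a char *[] array, it stores the strings themselves in a char[][5] array. The other big improvement is in how it stores characters in the output buffer. It stores more characters than necessary and uses pointer arithmetic to return a pointer to the first non-zero character. The result is substantially faster, competitive even with the SSE-optimized version of Terje's algorithm. It should be noted that the microbenchmark favors this algorithm a bit, because in real-world applications the 600K data set will constantly blow the caches.

VITAUT_1 is based on vitaut's algorithm with two small changes. The first change is that it copies character pairs in the main loop, reducing the number of store instructions. The second, similar to INGE_2, is that VITAUT_1 copies both final characters and uses pointer arithmetic to return a pointer to the string.

Implementation

Here I give code for the 3 most interesting algorithms.

TERJE_ASM: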

; char *itoa_terje_asm(char *buf<rcx>, uint32_t val<edx>)
;
; *** NOTE ***
; buf *must* be 8-byte aligned or this code will break!
itoa_terje_asm:
    MOV     EAX, 0xA7C5AC47
    ADD     RDX, 1
    IMUL    RAX, RDX
    SHR     RAX, 48          ; EAX = val / 100000

    IMUL    R11D, EAX, 100000
    ADD     EAX, 1
    SUB     EDX, R11D        ; EDX = (val % 100000) + 1

    IMUL    RAX, 214748      ; RAX = (val / 100000) * 2^31 / 10000
    IMUL    RDX, 214748      ; RDX = (val % 100000) * 2^31 / 10000

    ; Extract buf[0] & buf[5]
    MOV     R8, RAX
    MOV     R9, RDX
    LEA     EAX, [RAX+RAX]   ; RAX = (RAX * 2) & 0xFFFFFFFF
    LEA     EDX, [RDX+RDX]   ; RDX = (RDX * 2) & 0xFFFFFFFF
    LEA     RAX, [RAX+RAX*4] ; RAX *= 5
    LEA     RDX, [RDX+RDX*4] ; RDX *= 5
    SHR     R8, 31           ; R8 = buf[0]
    SHR     R9, 31           ; R9 = buf[5]

    ; Extract buf[1] & buf[6]
    MOV     R10, RAX
    MOV     R11, RDX
    LEA     EAX, [RAX+RAX]   ; RAX = (RAX * 2) & 0xFFFFFFFF
    LEA     EDX, [RDX+RDX]   ; RDX = (RDX * 2) & 0xFFFFFFFF
    LEA     RAX, [RAX+RAX*4] ; RAX *= 5
    LEA     RDX, [RDX+RDX*4] ; RDX *= 5
    SHR     R10, 31 - 8
    SHR     R11, 31 - 8
    AND     R10D, 0x0000FF00 ; R10 = buf[1] << 8
    AND     R11D, 0x0000FF00 ; R11 = buf[6] << 8
    OR      R10D, R8D        ; R10 = buf[0] | (buf[1] << 8)
    OR      R11D, R9D        ; R11 = buf[5] | (buf[6] << 8)

    ; Extract buf[2] & buf[7]
    MOV     R8, RAX
    MOV     R9, RDX
    LEA     EAX, [RAX+RAX]   ; RAX = (RAX * 2) & 0xFFFFFFFF
    LEA     EDX, [RDX+RDX]   ; RDX = (RDX * 2) & 0xFFFFFFFF
    LEA     RAX, [RAX+RAX*4] ; RAX *= 5
    LEA     RDX, [RDX+RDX*4] ; RDX *= 5
    SHR     R8, 31 - 16
    SHR     R9, 31 - 16
    AND     R8D, 0x00FF0000  ; R8 = buf[2] << 16
    AND     R9D, 0x00FF0000  ; R9 = buf[7] << 16
    OR      R8D, R10D        ; R8 = buf[0] | (buf[1] << 8) | (buf[2] << 16)
    OR      R9D, R11D        ; R9 = buf[5] | (buf[6] << 8) | (buf[7] << 16)

    ; Extract buf[3], buf[4], buf[8], & buf[9]
    MOV     R10, RAX
    MOV     R11, RDX
    LEA     EAX, [RAX+RAX]   ; RAX = (RAX * 2) & 0xFFFFFFFF
    LEA     EDX, [RDX+RDX]   ; RDX = (RDX * 2) & 0xFFFFFFFF
    LEA     RAX, [RAX+RAX*4] ; RAX *= 5
    LEA     RDX, [RDX+RDX*4] ; RDX *= 5
    SHR     R10, 31 - 24
    SHR     R11, 31 - 24
    AND     R10D, 0xFF000000 ; R10 = buf[3] << 24
    AND     R11D, 0xFF000000 ; R11 = buf[8] << 24
    AND     RAX, 0x80000000  ; RAX = buf[4] << 31
    AND     RDX, 0x80000000  ; RDX = buf[9] << 31
    OR      R10D, R8D        ; R10 = buf[0] | (buf[1] << 8) | (buf[2] << 16) | (buf[3] << 24)
    OR      R11D, R9D        ; R11 = buf[5] | (buf[6] << 8) | (buf[7] << 16) | (buf[8] << 24)
    LEA     RAX, [R10+RAX*2] ; RAX = buf[0] | (buf[1] << 8) | (buf[2] << 16) | (buf[3] << 24) | (buf[4] << 32)
    LEA     RDX, [R11+RDX*2] ; RDX = buf[5] | (buf[6] << 8) | (buf[7] << 16) | (buf[8] << 24) | (buf[9] << 32)

    ; Compact the character strings
    SHL     RAX, 24          ; RAX = (buf[0] << 24) | (buf[1] << 32) | (buf[2] << 40) | (buf[3] << 48) | (buf[4] << 56)
    MOV     R8, 0x3030303030303030
    SHRD    RAX, RDX, 24     ; RAX = buf[0] | (buf[1] << 8) | (buf[2] << 16) | (buf[3] << 24) | (buf[4] << 32) | (buf[5] << 40) | (buf[6] << 48) | (buf[7] << 56)
    SHR     RDX, 24          ; RDX = buf[8] | (buf[9] << 8)

    ; Store 12 characters. The last 2 will be null bytes.
    OR      R8, RAX
    LEA     R9, [RDX+0x3030]
    MOV     [RCX], R8
    MOV     [RCX+8], R9D

    ; Convert RCX into a bit pointer.
    SHL     RCX, 3

    ; Scan the first 8 bytes for a non-zero character.
    OR      EDX, 0x00000100
    TEST    RAX, RAX
    LEA     R10, [RCX+64]
    CMOVZ   RAX, RDX
    CMOVZ   RCX, R10

    ; Scan the next 4 bytes for a non-zero character.
    TEST    EAX, EAX
    LEA     R10, [RCX+32]
    CMOVZ   RCX, R10
    SHR     RAX, CL          ; N.B. RAX >>= (RCX % 64); this works because buf is 8-byte aligned.

    ; Scan the next 2 bytes for a non-zero character.
    TEST    AX, AX
    LEA     R10, [RCX+16]
    CMOVZ   RCX, R10
    SHR     EAX, CL          ; N.B. RAX >>= (RCX % 32)

    ; Convert back to byte pointer. N.B. this works because the AMD64 virtual address space is 48-bit.
    SAR     RCX, 3

    ; Scan the last byte for a non-zero character.
    TEST    AL, AL
    MOV     RAX, RCX
    LEA     R10, [RCX+1]
    CMOVZ   RAX, R10

    RETN

INGE_2:

uint8_t len100K[100000];
char str100K[100000][5];

void itoa_inge_2_init()
{
    memset(str100K, '0', sizeof(str100K));

    for(uint32_t i = 0; i < 100000; i++)
    {
        char buf[6];
        itoa(i, buf, 10);
        len100K[i] = strlen(buf);
        memcpy(&str100K[i][5 - len100K[i]], buf, len100K[i]);
    }
}

char *itoa_inge_2(char *buf, uint32_t val)
{
    char *p = &buf[10];
    uint32_t prevlen;

    *p = '\0';

    do
    {
        uint32_t const old = val;
        uint32_t mod;

        val /= 100000;
        mod = old - (val * 100000);

        prevlen = len100K[mod];
        p -= 5;
        memcpy(p, str100K[mod], 5);
    }
    while(val != 0);

    return &p[5 - prevlen];
}

VITAUT_1:

static uint16_t const str100p[100] = {
    0x3030, 0x3130, 0x3230, 0x3330, 0x3430, 0x3530, 0x3630, 0x3730, 0x3830, 0x3930,
    0x3031, 0x3131, 0x3231, 0x3331, 0x3431, 0x3531, 0x3631, 0x3731, 0x3831, 0x3931,
    0x3032, 0x3132, 0x3232, 0x3332, 0x3432, 0x3532, 0x3632, 0x3732, 0x3832, 0x3932,
    0x3033, 0x3133, 0x3233, 0x3333, 0x3433, 0x3533, 0x3633, 0x3733, 0x3833, 0x3933,
    0x3034, 0x3134, 0x3234, 0x3334, 0x3434, 0x3534, 0x3634, 0x3734, 0x3834, 0x3934,
    0x3035, 0x3135, 0x3235, 0x3335, 0x3435, 0x3535, 0x3635, 0x3735, 0x3835, 0x3935,
    0x3036, 0x3136, 0x3236, 0x3336, 0x3436, 0x3536, 0x3636, 0x3736, 0x3836, 0x3936,
    0x3037, 0x3137, 0x3237, 0x3337, 0x3437, 0x3537, 0x3637, 0x3737, 0x3837, 0x3937,
    0x3038, 0x3138, 0x3238, 0x3338, 0x3438, 0x3538, 0x3638, 0x3738, 0x3838, 0x3938,
    0x3039, 0x3139, 0x3239, 0x3339, 0x3439, 0x3539, 0x3639, 0x3739, 0x3839, 0x3939, };

char *itoa_vitaut_1(char *buf, uint32_t val)
{
    char *p = &buf[10];

    *p = '\0';

    while(val >= 100)
    {
        uint32_t const old = val;

        p -= 2;
        val /= 100;
        memcpy(p, &str100p[old - (val * 100)], sizeof(uint16_t));
    }

    p -= 2;
    memcpy(p, &str100p[val], sizeof(uint16_t));

    return &p[val < 10];
}
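The uint16_t table above encodes each digit pair in little-endian byte order. A byte-order-neutral variant can store the pairs as plain characters instead. The sketch below is one way to do that, not vitaut's code; digit_pairs, init_digit_pairs and itoa_pairs are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static char digit_pairs[200];   /* "00" "01" ... "99", stored as plain bytes */

static void init_digit_pairs(void)
{
    for (int i = 0; i < 100; i++) {
        digit_pairs[2 * i]     = (char)('0' + i / 10);
        digit_pairs[2 * i + 1] = (char)('0' + i % 10);
    }
}

/* buf must hold at least 11 bytes; returns a pointer to the first digit. */
static char *itoa_pairs(char *buf, uint32_t val)
{
    char *p = &buf[10];

    assert(digit_pairs[0] == '0');   /* init_digit_pairs() must have been called */
    *p = '\0';

    while (val >= 100) {
        uint32_t const old = val;

        p -= 2;
        val /= 100;
        memcpy(p, &digit_pairs[2 * (old - val * 100)], 2);
    }

    p -= 2;
    memcpy(p, &digit_pairs[2 * val], 2);

    return &p[val < 10];   /* skip the leading '0' of a one-digit final pair */
}
```

The table costs the same 200 bytes as the uint16_t version, and the two-byte memcpy() still typically compiles to a single 16-bit load and store.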

Answer 2

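The first step to optimizing your code is getting rid of the arbitrary base support. This is because dividing by a constant is almost surely compiled to a multiplication, whereas dividing by base is a real division, and because '0'+n is faster than "0123456789abcdef"[n] (no memory access is involved in the former).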
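Both points can be seen in a base-10-only routine. In the sketch below, div10 spells out the multiply-and-shift that compilers typically generate for unsigned division by the constant 10; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* n / 10 as a multiply and a shift: 0xCCCCCCCD is ceil(2^35 / 10),
 * which gives the exact quotient for every 32-bit unsigned n. */
static uint32_t div10(uint32_t n)
{
    return (uint32_t)(((uint64_t)n * 0xCCCCCCCDu) >> 35);
}

/* Writes the digits of n backwards from `end` and returns the start of the string.
 * `end` must point at the last byte of a buffer with room for 10 digits before it. */
static char *utoa10(char *end, uint32_t n)
{
    assert(end != 0);
    *end = '\0';

    do {
        uint32_t q = div10(n);

        *--end = (char)('0' + (n - q * 10));   /* computed, not looked up */
        n = q;
    } while (n != 0);

    return end;
}
```

With a runtime base, neither shortcut is available: the divisor is unknown, so the compiler must emit a division instruction, and digits above 9 force a table lookup or a branch.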

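If you need to go beyond that, you could make lookup tables for each byte in the base you care about (e.g. 10), then vector-add the (e.g. decimal) results for each byte. As in: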

00 02 00 80 (input)

 0000000000 (place3[0x00])
+0000131072 (place2[0x02])
+0000000000 (place1[0x00])
+0000000128 (place0[0x80])
 ==========
 0000131200 (result)
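A scalar sketch of this table approach is shown below. The table layout (one row of 10 decimal digits per byte value and byte position) and the function names are assumptions made for illustration; the digit-wise addition is the part that would map onto a vector add.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* place[k][b] holds the 10 decimal digits (most significant first) of b << (8 * k). */
static uint8_t place[4][256][10];

static void init_place_tables(void)
{
    for (int k = 0; k < 4; k++) {
        for (uint32_t b = 0; b < 256; b++) {
            uint32_t v = b << (8 * k);

            for (int d = 9; d >= 0; d--) {
                place[k][b][d] = (uint8_t)(v % 10);
                v /= 10;
            }
        }
    }
}

/* Writes 10 zero-padded digits plus '\0' into buf (at least 11 bytes). */
static void itoa_bytes(char *buf, uint32_t val)
{
    uint8_t sum[10];
    unsigned carry = 0;

    /* Add the four table rows digit by digit; each sum is at most 4 * 9 = 36. */
    memset(sum, 0, sizeof sum);
    for (int k = 0; k < 4; k++) {
        const uint8_t *row = place[k][(val >> (8 * k)) & 0xff];

        for (int d = 0; d < 10; d++)
            sum[d] = (uint8_t)(sum[d] + row[d]);
    }

    /* Propagate decimal carries from the least significant digit upward. */
    for (int d = 9; d >= 0; d--) {
        unsigned t = sum[d] + carry;

        buf[d] = (char)('0' + t % 10);
        carry = t / 10;
    }
    assert(carry == 0);   /* a 32-bit value always fits in 10 digits */
    buf[10] = '\0';
}
```

The tables take 4 * 256 * 10 = 10240 bytes. The carry propagation at the end is sequential, which is the step a real vectorized version would have to handle with care.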

Answer 3

http://sourceforge.net/projects/itoa/

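It uses a big static const array of all 4-digit integers and uses it for 32-bit or 64-bit conversion to string.

Portable, with no need for a specific instruction set.

The only faster version I could find was in assembly code and limited to 32 bits.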

Answer 4

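This post compares several methods of integer-to-string conversion, aka itoa. The fastest method reported there is fmt::FormatInt from the fmt library, which is about 8 times faster than sprintf/std::stringstream and 5 times faster than the naive ltoa/itoa implementation (the actual numbers may of course vary depending on platform).

Unlike most other methods, fmt::FormatInt does one pass over the digits. It also minimizes the number of integer divisions using the idea from Alexandrescu's talk Three Optimization Tips for C++. The implementation is available here.

This is of course only if C++ is an option and you are not restricted by itoa's API.

Disclaimer: I am the author of this method and of the fmt library.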

Answer 5

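Interesting problem. If you are only interested in a radix-10 itoa(), then I have made one example that is 10 times as fast as the typical itoa() implementation and one that is 3 times as fast.

First example (3x performance)

The first, which is 3 times as fast as itoa(), uses a single-pass, non-reversing software design pattern and is based on the open source itoa() implementation found in groff.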

// itoaSpeedTest.cpp : Defines the entry point for the console application.
//

#pragma comment(lib, "Winmm.lib") 
#include "stdafx.h"
#include "Windows.h"

#include <iostream>
#include <time.h>

using namespace std;

#ifdef _WIN32
/** a signed 32-bit integer value type */
#define _INT32 __int32
#else
/** a signed 32-bit integer value type */
#define _INT32 long int // Guess what a 32-bit integer is
#endif

/** minimum allowed value in a signed 32-bit integer value type */
#define _INT32_MIN -2147483647

/** maximum allowed value in a signed 32-bit integer value type */
#define _INT32_MAX 2147483647

/** maximum allowed number of characters in a signed 32-bit integer value type including a '-' */
#define _INT32_MAX_LENGTH 11

#ifdef _WIN32

/** Use to init the clock */
#define TIMER_INIT LARGE_INTEGER frequency;LARGE_INTEGER t1, t2;double elapsedTime;QueryPerformanceFrequency(&frequency);

/** Use to start the performance timer */
#define TIMER_START QueryPerformanceCounter(&t1);

/** Use to stop the performance timer and output the result to the standard stream */
#define TIMER_STOP QueryPerformanceCounter(&t2);elapsedTime=(t2.QuadPart-t1.QuadPart)*1000.0/frequency.QuadPart;wcout<<elapsedTime<<L" ms."<<endl;
#else
/** Use to init the clock */
#define TIMER_INIT 

/** Use to start the performance timer */
#define TIMER_START clock_t start;double diff;start=clock();

/** Use to stop the performance timer and output the result to the standard stream */
#define TIMER_STOP diff=(clock()-start)/(double)CLOCKS_PER_SEC;wcout<<fixed<<diff<<endl;
#endif

/** Array used for fast number character lookup */
const char numbersIn10Radix[10] = {'0','1','2','3','4','5','6','7','8','9'};

/** Array used for fast reverse number character lookup */
const char reverseNumbersIn10Radix[10] = {'9','8','7','6','5','4','3','2','1','0'};
const char *reverseArrayEndPtr = &reverseNumbersIn10Radix[9];

/*!
\brief Converts a 32-bit signed integer to a string
\param i [in] Integer
\par Software design pattern
Uses a single pass non-reversing algorithm and is 3x as fast as \c itoa().
\returns Integer as a string
\copyright GNU General Public License
\copyright 1989-1992 Free Software Foundation, Inc.
\date 1989-1992, 2013
\author James Clark<jjc@jclark.com>, 1989-1992
\author Inge Eivind Henriksen<inge@meronymy.com>, 2013
\note Function was originally a part of \a groff, and was refactored & optimized in 2013.
\relates itoa()
*/
const char *Int32ToStr(_INT32 i) 
{   
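    // Make room for a 32-bit signed integer's digits, the '-' and the '\0'.
    // static, so the returned pointer stays valid after the function returns (not thread-safe).
    static char buf[_INT32_MAX_LENGTH + 2];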
    char *p = buf + _INT32_MAX_LENGTH + 1;

    *--p = '\0';

    if (i >= 0) 
    {
        do 
        {
            *--p = numbersIn10Radix[i % 10];
            i /= 10;
        } while (i);
    }
    else
    {
        // Negative integer
        do
        {
            *--p = reverseArrayEndPtr[i % 10];
            i /= 10;
        } while (i);

        *--p = '-';
    }

    return p;
}

int _tmain(int argc, _TCHAR* argv[])
{
    TIMER_INIT

    // Make sure we are playing fair here
    if (sizeof(int) != sizeof(_INT32))
    {
        cerr << "Error: integer size mismatch; test would be invalid." << endl;
        return -1;
    }

    const int steps = 100;
    {
        char intBuffer[20];
        cout << "itoa() took:" << endl;
        TIMER_START;

        for (int i = _INT32_MIN; i < i + steps ; i += steps)
            itoa(i, intBuffer, 10);

        TIMER_STOP;
    }
    {
        cout << "Int32ToStr() took:" << endl;
        TIMER_START;

        for (int i = _INT32_MIN; i < i + steps ; i += steps)
            Int32ToStr(i);

        TIMER_STOP;
    }

    cout << "Done" << endl;
    int wait;
    cin >> wait;
    return 0;
}

Running this example on 64-bit Windows gives:

itoa() took:
2909.84 ms.
Int32ToStr() took:
991.726 ms.
Done

Running this example on 32-bit Windows gives:

itoa() took:
3119.6 ms.
Int32ToStr() took:
1031.61 ms.
Done
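The single-pass, non-reversing idea can also be written against a caller-supplied buffer, which avoids handing back a pointer into the function's own storage. The sketch below is a plain C illustration with a hypothetical name, not the groff-derived code above; it computes each digit directly instead of using the two lookup arrays.

```c
#include <assert.h>
#include <stdint.h>

/* buf must hold at least 12 bytes: "-2147483648" plus '\0'.
 * Returns a pointer into buf at the first character of the result. */
static const char *int32_to_str(int32_t i, char *buf)
{
    char *p = buf + 11;

    assert(buf != 0);
    *p = '\0';

    if (i >= 0) {
        do {
            *--p = (char)('0' + i % 10);
            i /= 10;
        } while (i);
    } else {
        /* For negative i, i % 10 lies in [-9, 0] (C99 truncating division),
         * so i never has to be negated and INT32_MIN is handled correctly. */
        do {
            *--p = (char)('0' - i % 10);
            i /= 10;
        } while (i);
        *--p = '-';
    }

    return p;
}
```

Digits are written from the end of the buffer toward the front, so the string comes out in the right order without a reversal pass.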

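Second example (10x performance)

If you do not mind spending some time initializing some buffers, then it is possible to optimize the function above to be 10 times as fast as the typical itoa() implementation. What you need to do is create string buffers rather than character buffers, like this: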

// itoaSpeedTest.cpp : Defines the entry point for the console application.
//

#pragma comment(lib, "Winmm.lib") 
#include "stdafx.h"
#include "Windows.h"

#include <iostream>
#include <time.h>

using namespace std;

#ifdef _WIN32
/** a signed 32-bit integer value type */
#define _INT32 __int32

/** a signed 8-bit integer value type */
#define _INT8 __int8

/** an unsigned 8-bit integer value type */
#define _UINT8 unsigned _INT8
#else
/** a signed 32-bit integer value type */
#define _INT32 long int // Guess what a 32-bit integer is

/** a signed 8-bit integer value type */
#define _INT8 char

/** an unsigned 8-bit integer value type */
#define _UINT8 unsigned _INT8
#endif

/** minimum allowed value in a signed 32-bit integer value type */
#define _INT32_MIN -2147483647

/** maximum allowed value in a signed 32-bit integer value type */
#define _INT32_MAX 2147483647

/** maximum allowed number of characters in a signed 32-bit integer value type including a '-' */
#define _INT32_MAX_LENGTH 11

#ifdef _WIN32

/** Use to init the clock */
#define TIMER_INIT LARGE_INTEGER frequency;LARGE_INTEGER t1, t2;double elapsedTime;QueryPerformanceFrequency(&frequency);

/** Use to start the performance timer */
#define TIMER_START QueryPerformanceCounter(&t1);

/** Use to stop the performance timer and output the result to the standard stream. Less verbose than \c TIMER_STOP_VERBOSE */
#define TIMER_STOP QueryPerformanceCounter(&t2);elapsedTime=(t2.QuadPart-t1.QuadPart)*1000.0/frequency.QuadPart;wcout<<elapsedTime<<L" ms."<<endl;
#else
/** Use to init the clock to get better precision that 15ms on Windows */
#define TIMER_INIT timeBeginPeriod(10);

/** Use to start the performance timer */
#define TIMER_START clock_t start;double diff;start=clock();

/** Use to stop the performance timer and output the result to the standard stream. Less verbose than \c TIMER_STOP_VERBOSE */
#define TIMER_STOP diff=(clock()-start)/(double)CLOCKS_PER_SEC;wcout<<fixed<<diff<<endl;
#endif


 /* Set this as large or small as you want, but has to be in the form 10^n where n >= 1, setting it smaller will
 make the buffers smaller but the performance slower. If you want to set it larger than 100000 then you 
must add some more cases to the switch blocks. Try to make it smaller to see the difference in 
performance. It does however seem to become slower if larger than 100000 */
static const _INT32 numElem10Radix = 100000;

/** Array used for fast lookup number character lookup */
const char *numbersIn10Radix[numElem10Radix] = {};
_UINT8 numbersIn10RadixLen[numElem10Radix] = {};

/** Array used for fast lookup number character lookup */
const char *reverseNumbersIn10Radix[numElem10Radix] = {};
_UINT8 reverseNumbersIn10RadixLen[numElem10Radix] = {};

void InitBuffers()
{
    char intBuffer[20];

    for (int i = 0; i < numElem10Radix; i++)
    {
        itoa(i, intBuffer, 10);
        size_t numLen = strlen(intBuffer);
        char *intStr = new char[numLen + 1];
        strcpy(intStr, intBuffer);
        numbersIn10Radix[i] = intStr;
        numbersIn10RadixLen[i] = numLen;
        reverseNumbersIn10Radix[numElem10Radix - 1 - i] = intStr;
        reverseNumbersIn10RadixLen[numElem10Radix - 1 - i] = numLen;
    }
}

/*!
\brief Converts a 32-bit signed integer to a string
\param i [in] Integer
\par Software design pattern
Uses a single pass non-reversing algorithm with string buffers and is 10x as fast as \c itoa().
\returns Integer as a string
\copyright GNU General Public License
\copyright 1989-1992 Free Software Foundation, Inc.
\date 1989-1992, 2013
\author James Clark<jjc@jclark.com>, 1989-1992
\author Inge Eivind Henriksen, 2013
\note This file was originally a part of \a groff, and was refactored & optimized in 2013.
\relates itoa()
*/
const char *Int32ToStr(_INT32 i) 
{   
    /* Room for INT_DIGITS digits, - and '\0'.
       static, so the returned pointer stays valid after the function returns (not thread-safe). */
    static char buf[_INT32_MAX_LENGTH + 2];
    char *p = buf + _INT32_MAX_LENGTH + 1;
    _INT32 modVal;

    *--p = '\0';

    if (i >= 0) 
    {
        do 
        {
            modVal = i % numElem10Radix;

            switch(numbersIn10RadixLen[modVal])
            {
                case 5:
                    *--p = numbersIn10Radix[modVal][4];
                case 4:
                    *--p = numbersIn10Radix[modVal][3];
                case 3:
                    *--p = numbersIn10Radix[modVal][2];
                case 2:
                    *--p = numbersIn10Radix[modVal][1];
                default:
                    *--p = numbersIn10Radix[modVal][0];
            }

            i /= numElem10Radix;
        } while (i);
    }
    else
    {
        // Negative integer
        const char **reverseArray = &reverseNumbersIn10Radix[numElem10Radix - 1];
        const _UINT8 *reverseArrayLen = &reverseNumbersIn10RadixLen[numElem10Radix - 1];

        do
        {
            modVal = i % numElem10Radix;

            switch(reverseArrayLen[modVal])
            {
                case 5:
                    *--p = reverseArray[modVal][4];
                case 4:
                    *--p = reverseArray[modVal][3];
                case 3:
                    *--p = reverseArray[modVal][2];
                case 2:
                    *--p = reverseArray[modVal][1];
                default:
                    *--p = reverseArray[modVal][0];
            }

            i /= numElem10Radix;
        } while (i);

        *--p = '-';
    }

    return p;
}

int _tmain(int argc, _TCHAR* argv[])
{
    InitBuffers();

    TIMER_INIT

    // Make sure we are playing fair here
    if (sizeof(int) != sizeof(_INT32))
    {
        cerr << "Error: integer size mismatch; test would be invalid." << endl;
        return -1;
    }

    const int steps = 100;
    {
        char intBuffer[20];
        cout << "itoa() took:" << endl;
        TIMER_START;

        for (int i = _INT32_MIN; i < i + steps ; i += steps)
            itoa(i, intBuffer, 10);

        TIMER_STOP;
    }
    {
        cout << "Int32ToStr() took:" << endl;
        TIMER_START;

        for (int i = _INT32_MIN; i < i + steps ; i += steps)
            Int32ToStr(i);

        TIMER_STOP;
    }

    cout << "Done" << endl;
    int wait;
    cin >> wait;
    return 0;
}

Running this example on 64-bit Windows gives:

itoa() took:
2914.12 ms.
Int32ToStr() took:
306.637 ms.
Done

Running this example on 32-bit Windows gives:

itoa() took:
3126.12 ms.
Int32ToStr() took:
299.387 ms.
Done

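Why use the reverse string lookup buffers?

It is possible to do this without the reverse string lookup buffers (thus saving half of the internal memory), but this makes it significantly slower (timed at about 850 ms on 64-bit and 380 ms on 32-bit systems). It is not clear to me exactly why it is so much slower, especially on 64-bit systems. To test this further yourself, you can simply change the following code: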

#define _UINT32 unsigned _INT32
...
static const _UINT32 numElem10Radix = 100000;
...
void InitBuffers()
{
    char intBuffer[20];

    for (int i = 0; i < numElem10Radix; i++)
    {
        _itoa(i, intBuffer, 10);
        size_t numLen = strlen(intBuffer);
        char *intStr = new char[numLen + 1];
        strcpy(intStr, intBuffer);
        numbersIn10Radix[i] = intStr;
        numbersIn10RadixLen[i] = numLen;
    }
}
...
const char *Int32ToStr(_INT32 i)
{
    char buf[_INT32_MAX_LENGTH + 2];
    char *p = buf + _INT32_MAX_LENGTH + 1;
    _UINT32 modVal;

    *--p = '\0';

    _UINT32 j = i;

    do
    {
        modVal = j % numElem10Radix;

        switch(numbersIn10RadixLen[modVal])
        {
            case 5:
                *--p = numbersIn10Radix[modVal][4];
            case 4:
                *--p = numbersIn10Radix[modVal][3];
            case 3:
                *--p = numbersIn10Radix[modVal][2];
            case 2:
                *--p = numbersIn10Radix[modVal][1];
            default:
                *--p = numbersIn10Radix[modVal][0];
        }

        j /= numElem10Radix;
    } while (j);

    if (i < 0) *--p = '-';

    return p;
}

Answer 6

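Here is my piece of code in asm. It works only for the range 0 to 255. It could be faster, but here you can find the direction and the main idea.

4 imuls, 1 memory read, 1 memory write

You can try to get rid of 2 of the imuls and use lea's instead; however, you will not find anything faster in C/C++/Python ;)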

void itoa_asm(unsigned char inVal, char *str)
{
    __asm
    {
        // eax=100's      -> (some_integer/100) = (some_integer*41) >> 12
        movzx esi,inVal
        mov eax,esi
        mov ecx,41
        imul eax,ecx
        shr eax,12

        mov edx,eax
        imul edx,100
        mov edi,edx

        // ebx=10's       -> (some_integer/10) = (some_integer*205) >> 11
        mov ebx,esi
        sub ebx,edx
        mov ecx,205
        imul ebx,ecx
        shr ebx,11

        mov edx,ebx
        imul edx,10

        // ecx = 1
        mov ecx,esi
        sub ecx,edx    // -> sub 10's
        sub ecx,edi    // -> sub 100's

        add al,'0'
        add bl,'0'
        add cl,'0'
        //shl eax,
        shl ebx,8
        shl ecx,16
        or eax,ebx
        or eax,ecx

        mov edi,str
        mov [edi],eax

    }

}
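For readers who do not use inline assembly, the same multiply-and-shift arithmetic can be written in C. This is a sketch with a hypothetical function name; the constants 41 >> 12 and 205 >> 11 are the ones from the assembly comments, and they are only exact over the small ranges noted below.

```c
#include <assert.h>
#include <stdint.h>

/* Writes 3 zero-padded digits plus '\0' into str (at least 4 bytes). */
static void u8_to_str(uint8_t v, char *str)
{
    uint32_t hundreds = (v * 41u) >> 12;       /* v / 100, exact for 0..255 */
    uint32_t rest     = v - hundreds * 100;    /* 0..99 */
    uint32_t tens     = (rest * 205u) >> 11;   /* rest / 10, exact for 0..99 */
    uint32_t ones     = rest - tens * 10;

    assert(hundreds <= 2 && tens <= 9 && ones <= 9);

    str[0] = (char)('0' + hundreds);
    str[1] = (char)('0' + tens);
    str[2] = (char)('0' + ones);
    str[3] = '\0';
}
```

As in the assembly version, the result is always three characters wide, so 7 comes out as "007".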

Answer 7

@Inge Henriksen

I believe your code has a bug:

IntToStr(2701987) == "2701987" //Correct
IntToStr(27001987) == "2701987" //Incorrect

This is why your code is wrong:

modVal = i % numElem10Radix;
switch (reverseArrayLen[modVal])
{
    case 5:
        *--p = reverseArray[modVal][4];
    case 4:
        *--p = reverseArray[modVal][3];
    case 3:
        *--p = reverseArray[modVal][2];
    case 2:
        *--p = reverseArray[modVal][1];
    default:
        *--p = reverseArray[modVal][0];
}

i /= numElem10Radix;

There should be a leading 0 before "1987", making it "01987". But after the first iteration, you get 4 digits instead of 5.

So,

IntToStr(27000000) = "2700" //Incorrect
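The rule that fixes this is that only the most significant chunk may drop its leading zeros; every lower chunk must emit all 5 digits. The sketch below illustrates that rule with plain digit arithmetic rather than the lookup tables from the answer being discussed; the function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* buf must hold at least 11 bytes; returns a pointer to the first digit. */
static char *u32_to_str_chunked(uint32_t v, char *buf)
{
    char *p = buf + 10;

    assert(buf != 0);
    *p = '\0';

    for (;;) {
        uint32_t chunk = v % 100000;

        v /= 100000;

        if (v == 0) {
            /* Most significant chunk: emit only its significant digits. */
            do {
                *--p = (char)('0' + chunk % 10);
                chunk /= 10;
            } while (chunk);
            return p;
        }

        /* Any lower chunk: always emit all 5 digits, including leading zeros. */
        for (int k = 0; k < 5; k++) {
            *--p = (char)('0' + chunk % 10);
            chunk /= 10;
        }
    }
}
```

With this rule, 27001987 is split into 270 and 01987, and 27000000 into 270 and 00000, so both round-trip correctly.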

Answer 8

For unsigned values from 0 to 9,999,999, with a terminating null (up to 99,999,999 without one).

void itoa(uint64_t u, char *out) // up to 9,999,999 with terminating zero
{
    *out = 0;
    do { 
        uint64_t n0 = u;
        *((uint64_t *)out) = (*((uint64_t *)out) << 8) | (n0 + '0' - (u /= 10) * 10);
    } while (u);
}
