Question

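I have a file with millions of lines; each line has 3 floats separated by spaces. Reading the file takes a lot of time, so I tried reading it through a memory-mapped file, only to find that the problem is not the speed of the IO but the speed of the parsing.

My current parsing takes the stream (called file) and does the following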

float x,y,z;
file >> x >> y >> z;

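Someone on Stack Overflow recommended Boost.Spirit, but I couldn't find any simple tutorial that explains how to use it.

I'm trying to find a simple and efficient way to parse a line that looks like this: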

"134.32 3545.87 3425"

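I would be really grateful for some help. I wanted to use strtok to split the line, but I don't know how to convert the strings to floats, and I'm not sure it's the best way.

I don't mind whether the solution uses Boost or not. I don't mind if it isn't the most efficient solution ever, but I'm sure the parsing can be made faster.

Thanks in advance.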

Answer 1

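UPDATE

Since Spirit X3 is available for testing, I've updated the benchmarks. Meanwhile I've used Nonius to get statistically sound benchmarks.

All charts below are available interactive online.
The benchmark CMake project and the testdata used are on github: https://github.com/sehe/bench_float_parsing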

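Summary:

Spirit parsers are fastest. If you can use C++14, consider the experimental version Spirit X3:

The above are measurements using memory-mapped files. Using IOstreams, it will be slower across the board,

but not as slow as scanf using C/POSIX FILE* function calls:

What follows is parts from the OLD answer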

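I implemented the Spirit version and ran a benchmark comparing it to the other suggested answers.

Here are my results; all tests ran on the same body of input (515Mb of input.txt). See below for exact specs.

(wall clock time in seconds, averaged over 2 runs)

To my own surprise, Boost Spirit turned out to be the fastest, and the most elegant:

handles/reports errors
supports +/-Inf and NaN and variable whitespace
no problems at all detecting the end of input (as opposed to the other mmap answer)
looks nice: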
bool ok = phrase_parse(f,l,               // source iterators
     (double_ > double_ > double_) % eol, // grammar
     blank,                               // skipper
     data);                               // output attribute
Note that boost::spirit::istreambuf_iterator was unspeakably much slower (15s+). I hope this helps!

Benchmark details

All parsing was done into a vector of struct float3 { float x,y,z; }.

The input file was generated using

od -f -A none --width=12 /dev/urandom | head -n 11000000

This results in a 515Mb file containing data like

     -2627.0056   -1.967235e-12  -2.2784738e+33
  -1.0664798e-27  -4.6421956e-23   -6.917859e+20
  -1.1080849e+36   2.8909405e-33   1.7888695e-12
  -7.1663235e+33  -1.0840628e+36   1.5343362e-12
  -3.1773715e-17  -6.3655537e-22   -8.797282e+31
    9.781095e+19   1.7378472e-37        63825084
  -1.2139188e+09  -5.2464635e-05  -2.1235992e-38
   3.0109424e+08   5.3939846e+30  -6.6146894e-20

The program was compiled using:

g++ -std=c++0x -g -O3 -isystem -march=native test.cpp -o test -lboost_filesystem -lboost_iostreams

Wall clock time was measured.

Environment:

Linux desktop 4.2.0-42-generic #49-Ubuntu SMP x86_64
Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz
32GiB RAM

Full Code

The full code for the old benchmark is in the edit history of this post; the newest version is on github

Answer 2

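If the conversion is the bottleneck (which is quite possible), you should start by trying the different possibilities in the standard. Logically, one would expect them to be very close, but in practice they aren't always:

You've already determined that std::ifstream is too slow.
Converting your memory-mapped data to a std::istringstream is almost certainly not a good solution; you'll first have to create a string, which will copy all of the data.
Writing your own streambuf to read directly from the memory, without copying (or using the deprecated std::istrstream), might be a solution, although if the problem really is the conversion, this still uses the same conversion routines.
You can always try fscanf, or scanf on your memory-mapped stream. Depending on the implementation, they might be faster than the various istream implementations.
Probably faster than any of these is strtod. There is no need to tokenize for this: strtod skips leading white space (including '\n'), and has an out parameter where it puts the address of the first character not read. The end condition is a bit tricky; your loop should probably look a bit like: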

    char* begin;    //  Set to point to the mmap'ed data...
                    //  You'll also have to arrange for a '\0'
                    //  to follow the data.  This is probably
                    //  the most difficult issue.
    char* end;
    errno = 0;
    double tmp = strtod( begin, &end );
    while ( errno == 0 && end != begin ) {
        //  do whatever with tmp...
        begin = end;
        tmp = strtod( begin, &end );
    }
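To make the end condition concrete, here is a minimal sketch of the same loop wrapped in a function. It assumes the buffer is already NUL-terminated, which is exactly the part that is hard to arrange for mmap'ed data. The name parse_all_strtod is made up for illustration. Unlike the loop above, it resets nothing in errno and does not stop on range errors; it stops only when strtod converts nothing.

```cpp
#include <cstdlib>
#include <vector>

// Hypothetical helper: converts every whitespace-separated number in a
// NUL-terminated buffer. Stops at the first position where strtod cannot
// convert anything (end of data, or something that is not a number).
std::vector<double> parse_all_strtod(const char* text) {
    std::vector<double> values;
    const char* begin = text;
    char* end = nullptr;
    for (;;) {
        const double v = std::strtod(begin, &end);
        if (end == begin) {
            break;              // nothing was consumed: we are done
        }
        values.push_back(v);
        begin = end;            // continue right after the number just read
    }
    return values;
}
```

Every three consecutive values in the result form one x, y, z record. A real loader would probably also verify that the count is a multiple of three.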

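If none of these are fast enough, you'll have to consider the actual data. It probably has some additional constraints, which means that you can potentially write a conversion routine that is faster than the more general ones; e.g. strtod has to handle both fixed and scientific notation, and it has to be 100% accurate even when there are 17 significant digits. It also has to be locale specific. All of this is added complexity, which means added code to execute. But beware: writing an efficient and correct conversion routine, even for a restricted set of inputs, is non-trivial; you really do have to know what you are doing.

EDIT:

Just out of curiosity, I've run some tests. In addition to the solutions mentioned above, I wrote a simple custom converter, which only handles fixed point (no scientific notation), with at most five digits after the decimal point, and the value before the decimal point must fit in an int: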

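//  Fixed point only: no exponent, at most five digits after the
//  decimal point, and the integral part must fit in an int.
//  As written it is only correct for non-negative input.
double
convert( char const* source, char const** endPtr )
{
    char* end;
    int left = strtol( source, &end, 10 );
    double results = left;
    if ( *end == '.' ) {
        char* start = end + 1;
        int right = strtol( start, &end, 10 );
        static double const fracMult[]
            = { 0.0, 0.1, 0.01, 0.001, 0.0001, 0.00001 };
        //  end - start is the number of fractional digits (0 to 5).
        results += right * fracMult[ end - start ];
    }
    if ( endPtr != nullptr ) {
        *endPtr = end;
    }
    return results;
}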

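(If you actually use this, you should definitely add some error handling. This was just knocked up quickly for experimental purposes, to read the test file I'd generated, and nothing else.)

The interface is exactly that of strtod, to simplify coding.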
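For readers who want to experiment beyond non-negative test data, here is a hedged variant of the same idea. It is a sketch, not the code that was benchmarked: the name convert_fixed is made up, it handles an optional sign, and it accumulates all digits and divides once by a power of ten, which keeps the result exact as long as the number has at most 15 digits in total. It still has no exponent, Inf/NaN, locale or overflow handling.

```cpp
#include <cstddef>

// Hypothetical fixed-point-only converter with a strtod-like interface.
// On failure (no digits found) it returns 0.0 and sets *endPtr to s.
double convert_fixed(const char* s, const char** endPtr) {
    static const double kPow10[] = {1e0, 1e1, 1e2,  1e3,  1e4,  1e5,  1e6,  1e7,
                                    1e8, 1e9, 1e10, 1e11, 1e12, 1e13, 1e14, 1e15};
    const char* p = s;
    while (*p == ' ' || *p == '\t' || *p == '\n' || *p == '\r') ++p;

    const bool negative = (*p == '-');
    if (*p == '-' || *p == '+') ++p;

    double mantissa = 0.0;   // all digits, with the decimal point removed
    int fracDigits = 0;      // how many of those digits came after the '.'
    bool anyDigit = false;
    for (; *p >= '0' && *p <= '9'; ++p) {
        mantissa = mantissa * 10.0 + (*p - '0');
        anyDigit = true;
    }
    if (*p == '.') {
        ++p;
        for (; *p >= '0' && *p <= '9'; ++p) {
            anyDigit = true;
            if (fracDigits < 15) {           // further digits are ignored
                mantissa = mantissa * 10.0 + (*p - '0');
                ++fracDigits;
            }
        }
    }
    if (endPtr != nullptr) *endPtr = anyDigit ? p : s;
    if (!anyDigit) return 0.0;
    const double value = mantissa / kPow10[fracDigits];
    return negative ? -value : value;
}
```

Because it reports the end position the same way strtod does, it can be dropped into a strtod-style loop that stops when the end pointer does not advance.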

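I ran the benchmarks in two environments (on different machines, so the absolute values of the times aren't comparable). I got the following results:

Under Windows 7, compiled with VC 11 (/O2):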

Testing Using fstream directly (5 iterations)...
    6.3528e+006 microseconds per iteration
Testing Using fscan directly (5 iterations)...
    685800 microseconds per iteration
Testing Using strtod (5 iterations)...
    597000 microseconds per iteration
Testing Using manual (5 iterations)...
    269600 microseconds per iteration

Under Linux 2.6.18, compiled with g++ 4.4.2 (-O2, IIRC):

Testing Using fstream directly (5 iterations)...
    784000 microseconds per iteration
Testing Using fscanf directly (5 iterations)...
    526000 microseconds per iteration
Testing Using strtod (5 iterations)...
    382000 microseconds per iteration
Testing Using strtof (5 iterations)...
    360000 microseconds per iteration
Testing Using manual (5 iterations)...
    186000 microseconds per iteration

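In all cases, I'm reading 554000 lines, each with 3 randomly generated floating-point numbers in the range [0...10000).

The most striking thing is the enormous difference between fstream and fscanf under Windows (and the relatively small difference between fscanf and strtod). The second thing is just how much the simple custom conversion function gains, on both platforms. The necessary error handling would slow it down a little, but the difference is still significant. I expected some improvement, since it doesn't handle a lot of things the standard conversion routines do (like scientific format, very, very small numbers, Inf and NaN, i18n, etc.), but not this much.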

Answer 3

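Before you start, verify that this is the slow part of your application, and get a test harness around it so you can measure improvements.

boost::spirit would be overkill for this, in my opinion. Try fscanf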

FILE* f = fopen("yourfile", "r");   // fopen requires a mode argument
if (NULL == f) {
   printf("Failed to open 'yourfile'\n");
   return;
}
float x,y,z;
int nItemsRead = fscanf(f,"%f %f %f\n", &x, &y, &z);
if (3 != nItemsRead) {
   printf("Oh dear, items aren't in the right format.\n");
   return;
}
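The snippet above reads a single record. To read the whole file, the call can be put in a loop that runs until fscanf stops returning 3. A minimal sketch follows; the names float3 and read_triples are assumptions chosen for illustration.

```cpp
#include <cstdio>
#include <vector>

struct float3 { float x, y, z; };

// Hypothetical helper: reads "x y z" records until end of file or until
// the first record that does not contain three numbers.
std::vector<float3> read_triples(std::FILE* f) {
    std::vector<float3> data;
    float3 p;
    // Whitespace in the format (and before each %f) matches any run of
    // spaces and newlines, so no explicit "\n" is needed.
    while (std::fscanf(f, "%f %f %f", &p.x, &p.y, &p.z) == 3) {
        data.push_back(p);
    }
    return data;
}
```

After the loop, feof(f) can be used to tell a clean end of file from a malformed record.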

Answer 4

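I would check out the related posts Using ifstream to read floats or How do I tokenize a string in C++, particularly the answers about the C++ String Toolkit Library. I've used C strtok, C++ streams and the Boost tokenizer, and the best of them for ease of use is the C++ String Toolkit Library.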

Answer 5

Using C is going to be the fastest solution. ~~Split into tokens using strtok and then~~ convert to float with strtof. Or, if you know the exact format, use fscanf.
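As a rough illustration of the strtof route for one line of the question's format, here is a small sketch. The function name parse_line is made up, and the line is assumed to be NUL-terminated.

```cpp
#include <cstdlib>

// Hypothetical helper: parses three floats from a NUL-terminated line such
// as "134.32 3545.87 3425". Returns false if fewer than three numbers
// could be converted.
bool parse_line(const char* line, float& x, float& y, float& z) {
    float* out[3] = {&x, &y, &z};
    for (float* dst : out) {
        char* end = nullptr;
        *dst = std::strtof(line, &end);   // skips leading whitespace itself
        if (end == line) {
            return false;                 // nothing converted
        }
        line = end;                       // continue after the number
    }
    return true;
}
```

No tokenizing step is needed because strtof reports where it stopped through its second argument.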

Answer 6

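A brute-force solution would be to throw more cores at the problem by spawning multiple threads. If the bottleneck is just the CPU, you can roughly halve the running time by spawning two threads (on multicore CPUs).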
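The part of the multi-threaded approach that needs care is cutting the input so that no line is shared between two workers. The sketch below shows only that step, under the assumption that the whole file is available as one memory range (for example through mmap); the name split_at_newlines is made up. Each resulting range could then be handed to its own std::thread together with whichever conversion routine is used.

```cpp
#include <cstddef>
#include <cstring>
#include <utility>
#include <vector>

// Hypothetical helper: splits [begin, end) into at most n pieces. Every
// piece except possibly the last ends just after a '\n', so no line
// straddles two pieces.
std::vector<std::pair<const char*, const char*>>
split_at_newlines(const char* begin, const char* end, std::size_t n) {
    std::vector<std::pair<const char*, const char*>> chunks;
    const std::size_t total = static_cast<std::size_t>(end - begin);
    const char* chunkBegin = begin;
    for (std::size_t i = 1; i <= n && chunkBegin < end; ++i) {
        // Tentative cut at i/n of the data, never before the chunk start.
        const char* cut = (i == n) ? end : begin + total / n * i;
        if (cut < chunkBegin) cut = chunkBegin;
        // Move the cut forward to just past the end of the current line.
        const void* nl = std::memchr(cut, '\n', static_cast<std::size_t>(end - cut));
        cut = nl ? static_cast<const char*>(nl) + 1 : end;
        chunks.push_back(std::make_pair(chunkBegin, cut));
        chunkBegin = cut;
    }
    return chunks;
}
```

The pieces are only approximately equal in size, which is usually good enough when all lines have a similar length.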

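Some other tips:

Try to avoid parsing functions from libraries such as Boost and/or std. They are bloated with error-checking conditions, and much of the processing time is spent doing these checks. For just a couple of conversions they are fine, but they fail miserably when it comes to processing millions of values. If you already know that your data is well-formatted, you can write (or find) a custom optimized C function which does only the data conversion.
Use a large memory buffer (let's say 10 Mbytes) into which you load chunks of your file, and do the conversion there.
Divide et impera: split your problem into smaller, easier ones. Preprocess your file so that each line holds a single float, split each line at the "." character and convert integers instead of floats, then merge the two integers to create the floating-point number.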
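The large-buffer tip above can be sketched as follows. This is an illustration under stated assumptions, not a tuned implementation: parse_in_chunks and its bufSize parameter are made-up names, strtod stands in for whichever conversion routine is chosen, and no single line may be longer than bufSize.

```cpp
#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <vector>

// Hypothetical helper: reads f in blocks of up to bufSize bytes and converts
// every number. An incomplete line at the end of a block is carried over to
// the front of the buffer and completed by the next read.
std::vector<double> parse_in_chunks(std::FILE* f, std::size_t bufSize) {
    std::vector<double> values;
    std::vector<char> buf(bufSize + 1);   // +1 leaves room for the '\0' strtod needs
    std::size_t used = 0;                 // bytes carried over from the previous block
    for (;;) {
        const std::size_t got = std::fread(buf.data() + used, 1, bufSize - used, f);
        const std::size_t len = used + got;
        if (len == 0) break;
        const bool last = (got == 0);     // nothing more to read: parse the remainder
        std::size_t parseEnd = len;
        if (!last) {                      // otherwise stop after the last full line
            while (parseEnd > 0 && buf[parseEnd - 1] != '\n') --parseEnd;
        }
        const char saved = buf[parseEnd];
        buf[parseEnd] = '\0';             // terminate the region being parsed
        const char* p = buf.data();
        char* end = nullptr;
        for (double v = std::strtod(p, &end); end != p; v = std::strtod(p, &end)) {
            values.push_back(v);
            p = end;
        }
        buf[parseEnd] = saved;
        used = len - parseEnd;            // move the partial line to the front
        std::memmove(buf.data(), buf.data() + parseEnd, used);
        if (last) break;
    }
    return values;
}
```

In practice bufSize would be something like the 10 Mbytes suggested above; the tests for this sketch use a tiny buffer only to exercise the carry-over path.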

Answer 7

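I believe the single most important rule in string processing is "read only once, one character at a time". I think it is always simpler, faster and more reliable.

I made a simple benchmark program to show how simple it is. My test shows this code running 40% faster than the strtod version.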

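#include <iostream>
#include <sstream>
#include <string>
#include <iomanip>
#include <errno.h>      // errno
#include <limits.h>     // UINT_MAX
#include <stdio.h>      // printf
#include <stdlib.h>
#include <math.h>
#include <time.h>
#include <sys/time.h>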

using namespace std;

string test_generate(size_t n)
{
    srand((unsigned)time(0));
    double sum = 0.0;
    ostringstream os;
    os << std::fixed;
    for (size_t i=0; i<n; ++i)
    {
        unsigned u = rand();
        int w = 0;
        if (u > UINT_MAX/2)
            w = - (u - UINT_MAX/2);
        else
            w = + (u - UINT_MAX/2);
        double f = w / 1000.0;
        sum += f;

        os << f;
        os << " ";
    }
    printf("generated %f\n", sum);
    return os.str();
}

void read_float_ss(const string& in)
{
    double sum = 0.0;
    const char* begin = in.c_str();
    char* end = NULL;
    errno = 0;
    double f = strtod( begin, &end );
    sum += f;

    while ( errno == 0 && end != begin )
    {
        begin = end;
        f = strtod( begin, &end );
        sum += f;
    }
    printf("scanned %f\n", sum);
}

double scan_float(const char* str, size_t& off, size_t len)
{
    static const double bases[13] = {
        0.0, 10.0, 100.0, 1000.0, 10000.0,
        100000.0, 1000000.0, 10000000.0, 100000000.0,
        1000000000.0, 10000000000.0, 100000000000.0, 1000000000000.0,
    };

    bool begin = false;
    bool fail = false;
    bool minus = false;
    int pfrac = 0;

    double dec = 0.0;
    double frac = 0.0;
    for (; !fail && off<len; ++off)
    {
        char c = str[off];
        if (c == '+')
        {
            if (!begin)
                begin = true;
            else
                fail = true;
        }
        else if (c == '-')
        {
            if (!begin)
                begin = true;
            else
                fail = true;
            minus = true;
        }
        else if (c == '.')
        {
            if (!begin)
                begin = true;
            else if (pfrac)
                fail = true;
            pfrac = 1;
        }
        else if (c >= '0' && c <= '9')
        {
            if (!begin)
                begin = true;
            if (pfrac == 0)
            {
                dec *= 10;
                dec += c - '0';
            }
            else if (pfrac < 13)
            {
                frac += (c - '0') / bases[pfrac];
                ++pfrac;
            }
        }
        else
        {
            break;
        }
    }

    if (!fail)
    {
        double f = dec + frac;
        if (minus)
            f = -f;
        return f;
    }

    return 0.0;
}

void read_float_direct(const string& in)
{
    double sum = 0.0;
    size_t len = in.length();
    const char* str = in.c_str();
    for (size_t i=0; i<len; ++i)
    {
        double f = scan_float(str, i, len);
        sum += f;
    }
    printf("scanned %f\n", sum);
}

int main()
{
    const int n = 1000000;
    printf("count = %d\n", n);

    string in = test_generate(n);    
    {
        struct timeval t1;
        gettimeofday(&t1, 0);
        printf("scan start\n");

        read_float_ss(in);

        struct timeval t2;
        gettimeofday(&t2, 0);
        double elapsed = (t2.tv_sec - t1.tv_sec) * 1000.0;   // seconds -> ms
        elapsed += (t2.tv_usec - t1.tv_usec) / 1000.0;       // microseconds -> ms
        printf("elapsed %.2fms\n", elapsed);
    }

    {
        struct timeval t1;
        gettimeofday(&t1, 0);
        printf("scan start\n");

        read_float_direct(in);

        struct timeval t2;
        gettimeofday(&t2, 0);
        double elapsed = (t2.tv_sec - t1.tv_sec) * 1000.0;   // seconds -> ms
        elapsed += (t2.tv_usec - t1.tv_usec) / 1000.0;       // microseconds -> ms
        printf("elapsed %.2fms\n", elapsed);
    }
    return 0;
}

Below is the console output from an i7 Mac Book Pro (compiled in XCode 4.6).

count = 1000000
generated -1073202156466.638184
scan start
scanned -1073202156466.638184
elapsed 83.34ms
scan start
scanned -1073202156466.638184
elapsed 53.50ms

Answer 8

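Here is a more complete (although not "official" by any standard) high-speed string-to-double routine, since at the time of writing the nice C++17 from_chars() solution only works on MSVC (not clang or gcc).

Meet crack_atof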

https://gist.github.com/oschonrock/a410d4bec6ec1ccc5a3009f0907b3d15

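Not my work; I just refactored it slightly and changed the signature. The code is very easy to understand, and it's obvious why it's fast. And it is very, very fast; see the benchmarks here: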

https://www.codeproject.com/Articles/1130262/Cplusplus-string-view-Conversion-to-Integral-Types

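I ran it on 11,000,000 lines of 3 floats each (15-digit precision in the csv, which matters!). On my aged 2nd-generation Core i7 2600 it ran in 1.327s. Compiled with clang V8.0.0 -O2 on Kubuntu 19.04.

Full code below. I am using mmap because, thanks to crack_atof, str->float is no longer the only bottleneck. I have wrapped the mmap stuff in a class to ensure RAII release of the map.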


#include <cstdint>
#include <iomanip>
#include <iostream>
#include <stdexcept>

// for mmap:
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

class MemoryMappedFile {
public:
  MemoryMappedFile(const char* filename) {
    int fd = open(filename, O_RDONLY);
    if (fd == -1) throw std::logic_error("MemoryMappedFile: couldn't open file.");

    // obtain file size
    struct stat sb;
    if (fstat(fd, &sb) == -1) {
      close(fd);
      throw std::logic_error("MemoryMappedFile: cannot stat file size");
    }
    m_filesize = sb.st_size;

    m_map = static_cast<const char*>(mmap(NULL, m_filesize, PROT_READ, MAP_PRIVATE, fd, 0u));
    close(fd);  // the mapping stays valid after the descriptor is closed
    if (m_map == MAP_FAILED) throw std::logic_error("MemoryMappedFile: cannot map file");
  }

  ~MemoryMappedFile() {
    if (munmap(static_cast<void*>(const_cast<char*>(m_map)), m_filesize) == -1)
      std::cerr << "Warning: MemoryMappedFile: error in destructor during `munmap()`\n";
  }

  const char* start() const { return m_map; }
  const char* end() const { return m_map + m_filesize; }

private:
  size_t m_filesize = 0;
  const char* m_map = nullptr;
};

// high speed str -> double parser
double pow10(int n) {
  double ret = 1.0;
  double r   = 10.0;
  if (n < 0) {
    n = -n;
    r = 0.1;
  }

  while (n) {
    if (n & 1) {
      ret *= r;
    }
    r *= r;
    n >>= 1;
  }
  return ret;
}

double crack_atof(const char* start, const char* const end) {
  if (!start || !end || end <= start) {
    return 0;
  }

  int sign         = 1;
  double int_part  = 0.0;
  double frac_part = 0.0;
  bool has_frac    = false;
  bool has_exp     = false;

  // +/- sign
  if (*start == '-') {
    ++start;
    sign = -1;
  } else if (*start == '+') {
    ++start;
  }

  while (start != end) {
    if (*start >= '0' && *start <= '9') {
      int_part = int_part * 10 + (*start - '0');
    } else if (*start == '.') {
      has_frac = true;
      ++start;
      break;
    } else if (*start == 'e') {
      has_exp = true;
      ++start;
      break;
    } else {
      return sign * int_part;
    }
    ++start;
  }

  if (has_frac) {
    double frac_exp = 0.1;

    while (start != end) {
      if (*start >= '0' && *start <= '9') {
        frac_part += frac_exp * (*start - '0');
        frac_exp *= 0.1;
      } else if (*start == 'e') {
        has_exp = true;
        ++start;
        break;
      } else {
        return sign * (int_part + frac_part);
      }
      ++start;
    }
  }

  // parsing exponent part
  double exp_part = 1.0;
  if (start != end && has_exp) {
    int exp_sign = 1;
    if (*start == '-') {
      exp_sign = -1;
      ++start;
    } else if (*start == '+') {
      ++start;
    }

    int e = 0;
    while (start != end && *start >= '0' && *start <= '9') {
      e = e * 10 + *start - '0';
      ++start;
    }

    exp_part = pow10(exp_sign * e);
  }

  return sign * (int_part + frac_part) * exp_part;
}

int main() {
  MemoryMappedFile map  = MemoryMappedFile("FloatDataset.csv");
  const char* curr      = map.start();
  const char* start     = map.start();
  const char* const end = map.end();

  uintmax_t lines_n = 0;
  int cnt              = 0;
  double sum           = 0.0;
  while (curr && curr != end) {
    if (*curr == ',' || *curr == '\n') {
      // std::string fieldstr(start, curr);
      // double field = std::stod(fieldstr);
      // m_numLines = 11000000 cnt=33000000 sum=16498294753551.9
      // real 5.998s

      double field = crack_atof(start, curr);
      // m_numLines = 11000000 cnt=33000000 sum=16498294753551.9
      // real 1.327s

      sum += field;
      ++cnt;
      if (*curr == '\n') lines_n++;
      curr++;
      start = curr;
    } else {
      ++curr;
    }
  }

  std::cout << std::setprecision(15) << "m_numLines = " << lines_n << " cnt=" << cnt
            << " sum=" << sum << "\n";
}

The code is also on github:

https://gist.github.com/oschonrock/67fc870ba067ebf0f369897a9d52c2dd
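For comparison, the from_chars() route mentioned at the top of this answer looks roughly like the sketch below. It assumes a toolchain whose standard library implements the floating-point overloads of std::from_chars, and the name parse_all_from_chars is made up for illustration. One property that matters for mmap'ed data is that from_chars takes an explicit end pointer, so no '\0' terminator is needed.

```cpp
#include <charconv>
#include <system_error>
#include <vector>

// Hypothetical helper: converts every whitespace-separated number in
// [first, last). Stops at the first token that from_chars rejects.
std::vector<double> parse_all_from_chars(const char* first, const char* last) {
    std::vector<double> values;
    while (first != last) {
        // from_chars does not skip leading whitespace, so do it here.
        if (*first == ' ' || *first == '\t' || *first == '\n' || *first == '\r') {
            ++first;
            continue;
        }
        double v = 0.0;
        const std::from_chars_result r = std::from_chars(first, last, v);
        if (r.ec != std::errc()) {
            break;              // invalid input or out of range
        }
        values.push_back(v);
        first = r.ptr;          // continue right after the number
    }
    return values;
}
```

Note that from_chars is locale-independent and does not accept a leading '+', so input that relies on either would need extra handling.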

How to quickly parse space-separated floats in C++?
