Question

I'm pretty new to C++ and I'm struggling with the following problem:
I'm parsing syslog messages from iptables. Each message looks like this:
192.168.1.1:20200:Dec 11 15:20:36 SRC=192.168.1.5 DST=8.8.8.8 LEN=250
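I need to parse the string quickly (new messages arrive very fast) to get SRC, DST and LEN.
If this were a simple program, I'd use std::string::find to get the index of the SRC substring, then append each following character to an array in a loop until I hit a space. Then I'd do the same for DST and LEN. For example: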

std::string x = "15:30:20 SRC=192.168.1.1 DST=15.15.15.15 LEN=255";
std::string substr;

std::cout << "Original string: \"" << x << "\"" << std::endl;

// Below "magic number" 4 means length of "SRC=" string 
// which is the same for "DST=" and "LEN="    

// For SRC
auto npos = x.find("SRC");
if (npos != std::string::npos) {
    substr = x.substr(npos + 4, x.find(" ", npos) - (npos+4));
    std::cout << "SRC: " << substr << std::endl;
}

// For DST
npos = x.find("DST");
if (npos != std::string::npos) {
    substr = x.substr(npos + 4, x.find(" ", npos) - (npos + 4));
    std::cout << "DST: " << substr << std::endl;
}

// For LEN (last field, so take everything up to the end of the string)
npos = x.find("LEN");
if (npos != std::string::npos) {
    substr = x.substr(npos + 4);
    std::cout << "LEN: " << substr << std::endl;
}

However, in my situation I need to do this really quickly, ideally in one iteration. Could you give me some advice on this?

Answer 1

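"Quickly, ideally in one iteration" - in reality, the speed of your program does not depend on the number of loops visible in your source code. Regular expressions in particular are a very good way to hide multiple nested loops.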

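Your solution is actually pretty good. It doesn't waste much time before finding "SRC", and it doesn't search further than necessary to retrieve the IP address. Sure, when searching for "SRC" it gets a false positive on the first "S" of a month name such as "Sep", but that is resolved by the very next character comparison. If you know for certain that the first occurrence of "SRC" is somewhere past column 20, you might save a tiny bit of time by skipping those first 20 characters. (Check your logs, I can't tell.)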
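
The skip-ahead idea above could be sketched roughly as follows. This is only an illustration under assumptions: `field_after` is a hypothetical helper name, C++17 `std::string_view` is assumed to be available, and any start offset you pass (such as 20) has to be checked against your real logs.

```cpp
#include <cstddef>
#include <optional>
#include <string_view>

// Hypothetical helper (not from the question): returns the text between
// `key` (e.g. "SRC=") and the next space, searching only from offset `from`.
std::optional<std::string_view> field_after(std::string_view line,
                                            std::string_view key,
                                            std::size_t from = 0)
{
    const std::size_t key_pos = line.find(key, from);
    if (key_pos == std::string_view::npos) {
        return std::nullopt;  // key not present after `from`
    }
    const std::size_t value_begin = key_pos + key.size();
    const std::size_t value_end = line.find(' ', value_begin);
    // substr clamps the count, so value_end == npos means "up to the end".
    return line.substr(value_begin, value_end - value_begin);
}
```

Searching for "SRC=" instead of "SRC" also removes the magic number 4, because the key's own length is used. The returned view points into the original line, so the line must outlive it.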

Answer 2

You can use std::regex (from the <regex> header), e.g.:

std::string x = "15:30:20 SRC=192.168.1.1 DST=15.15.15.15 LEN=255";

std::regex const r(R"(SRC=(\S+) DST=(\S+) LEN=(\S+))");
std::smatch matches;
if(regex_search(x, matches, r)) {
    std::cout << "SRC " << matches.str(1) << '\n';
    std::cout << "DST " << matches.str(2) << '\n';
    std::cout << "LEN " << matches.str(3) << '\n';
}

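Note that matches.str(idx) creates a new string containing the match. With matches[idx] you get a std::sub_match, which holds iterators into the original string, so you can access the substring without creating a new string.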
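
As a rough sketch of the non-copying variant, the iterators in each `std::sub_match` can be turned into a `std::string_view`. The names `view_of`, `Fields` and `parse_with_regex` are made up for this illustration, and C++17 is assumed. Making the regex `static` so it is built only once is an addition of this sketch, not part of the answer above.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <string_view>

// Hypothetical helper: view a capture group without allocating a string.
// The view is valid only while the searched string is alive and unmodified.
std::string_view view_of(const std::ssub_match& m)
{
    if (!m.matched || m.length() == 0) {
        return {};
    }
    return std::string_view(&*m.first, static_cast<std::size_t>(m.length()));
}

struct Fields {  // hypothetical result type
    std::string_view src, dst, len;
};

bool parse_with_regex(const std::string& line, Fields& out)
{
    // Built once; constructing a std::regex per message would be wasteful.
    static const std::regex re(R"(SRC=(\S+) DST=(\S+) LEN=(\S+))");
    std::smatch m;
    if (!std::regex_search(line, m, re)) {
        return false;
    }
    out = {view_of(m[1]), view_of(m[2]), view_of(m[3])};
    return true;
}
```

The views in `Fields` refer to the caller's `line`, so they must not be used after that string is destroyed or changed.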

Answer 3

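If your format is fixed and verified (that is, you can accept undefined behavior as soon as the input string doesn't contain exactly the expected characters), then you might squeeze out some performance by writing larger parts by hand and skipping the string termination tests that are part of all the standard functions.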

#include <cstdlib>  // std::strtol

// buf_ptr will be updated to point to the first character after the " SRC=x.x.x.x" sequence
unsigned long GetSRC(const char*& buf_ptr)
{
    // Don't search like this unless you have a trusted input format that's guaranteed to contain " SRC="!!!
    while (*buf_ptr != ' ' ||
        *(buf_ptr + 1) != 'S' ||
        *(buf_ptr + 2) != 'R' ||
        *(buf_ptr + 3) != 'C' ||
        *(buf_ptr + 4) != '=') 
    {
        ++buf_ptr;
    }
    buf_ptr += 5;
    char* next;

    long part = std::strtol(buf_ptr, &next, 10);
    // part is now the first number of the IP. Depending on your requirements you may want to extract the string instead
    unsigned long result = (unsigned long)part << 24;

    // Don't use 'next + 1' like this unless you have a trusted input format!!!
    part = std::strtol(next + 1, &next, 10);
    // part is now the second number of the IP. Depending on your requirements ...
    result |= (unsigned long)part << 16;

    part = std::strtol(next + 1, &next, 10);
    // part is now the third number of the IP. Depending on your requirements ...
    result |= (unsigned long)part << 8;

    part = std::strtol(next + 1, &next, 10);
    // part is now the fourth number of the IP. Depending on your requirements ...
    result |= (unsigned long)part;

    // update the buf_ptr so searching for the next information ( DST=x.x.x.x) starts at the end of the currently parsed parts
    buf_ptr = next;
    return result;
}

Usage:

const char* x_str = x.c_str();
unsigned long srcIP = GetSRC(x_str);
// now x_str will point to " DST=15.15.15.15 LEN=255" for further processing

std::cout << "SRC=" << (srcIP >> 24) << "." << ((srcIP >> 16) & 0xff) << "." << ((srcIP >> 8) & 0xff) << "." << (srcIP & 0xff) << std::endl;

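Note that I decided to pack the whole extracted source IP into a single 32-bit unsigned value. You can choose a completely different storage model if you want.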

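Even if you can't be that optimistic about your format, using a pointer that is updated as parts are processed, and continuing with the remaining string instead of starting from 0 again, might be a good idea to improve performance.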
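
A bounds-checked take on that "moving cursor" idea might look like the sketch below. It assumes C++17 and that the fields always appear in the order SRC, DST, LEN as in the sample message. `Packet`, `take_field` and `parse_line` are hypothetical names chosen for this illustration.

```cpp
#include <charconv>
#include <cstddef>
#include <string_view>
#include <system_error>

struct Packet {  // hypothetical result type
    std::string_view src;
    std::string_view dst;
    unsigned len = 0;
};

// Finds `key` in `rest`, stores the token that follows it in `value`,
// and advances `rest` past that token so the next search starts there.
bool take_field(std::string_view& rest, std::string_view key,
                std::string_view& value)
{
    const std::size_t key_pos = rest.find(key);
    if (key_pos == std::string_view::npos) {
        return false;
    }
    rest.remove_prefix(key_pos + key.size());
    value = rest.substr(0, rest.find(' '));  // npos means "to the end"
    rest.remove_prefix(value.size());
    return true;
}

bool parse_line(std::string_view line, Packet& out)
{
    std::string_view rest = line;  // cursor over the unparsed remainder
    std::string_view len_text;
    if (!take_field(rest, "SRC=", out.src)) return false;
    if (!take_field(rest, "DST=", out.dst)) return false;
    if (!take_field(rest, "LEN=", len_text)) return false;

    // Convert LEN without allocating or relying on a terminating '\0'.
    const char* first = len_text.data();
    const char* last = first + len_text.size();
    const auto [ptr, ec] = std::from_chars(first, last, out.len);
    return ec == std::errc{} && ptr == last;
}
```

Each search starts where the previous one ended, so the line is scanned once overall, and malformed input produces `false` rather than undefined behavior.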

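Of course, I assume your std::cout << ... lines are only there for development testing, because otherwise all of this micro-optimization becomes useless anyway.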

How to quickly find and extract multiple substrings from a string in C++?
