Question

I have a file in the following format:

[1]
Parameter1=Value1
.
.
.
End
[2]
.
.

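The number in brackets is the ID of the entity. There are about 4500 entities. I need to parse all the entities and select the ones that match my parameters and values. The file is about 20 MB. My first approach was to read the file line by line and store the entities in a list of structs, like: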

struct Component{
    std::string parameter;
    std::string value;
};
struct Entity{
    std::string id;
    std::list<Component> components;
};
std::list<Entity> g_entities;

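But this approach uses a lot of memory and is very slow. I also tried storing only the entities that match my parameters/values, but that is also very slow and still takes quite a lot of memory. Ideally I would like to keep all the data in memory, so that I don't have to load the file every time I need to filter by my parameters/values, provided that is possible with reasonable memory usage.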

Edit 1: I read the file line by line:

            std::ifstream readTemp(filePath);
            std::stringstream dataStream;
            dataStream << readTemp.rdbuf();
            readTemp.close();

            while (std::getline(dataStream, line)){
                    if (line.find('[') != std::string::npos){
                        // Create Entity
                        Entity entity;

                        // Set entity id
                        entity.id = line.substr(line.find('[') + 1, line.find(']') - 1);

                        // Read all lines until EnumEnd=0
                        while (1){
                            std::getline(dataStream, line);
                            // Break loop if end of entity
                            if (line.find("EnumEnd=0") != std::string::npos){
                                if (CheckMatch(entity))
                                    entities.push_back(entity);
                                entity.components.clear();
                                break;
                            }


                            Component comp;
                            int pos_eq = line.find('='); 
                            comp.parameterId = line.substr(0, pos_eq);
                            comp.value = line.substr(pos_eq + 1);

                            entity.components.push_back(comp);
                        }
                    }
                }

Answer 1

PS: Added after your edit and the comments about memory consumption.

500MB / 20MB = 25.

If each line is about 25 characters long, then the memory consumption looks OK.
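To make that ratio a bit more concrete, here is a rough sketch of what one parsed line costs in the question's layout. The function name estimate_bytes_per_line and the default of 16 bytes of allocator bookkeeping per heap allocation are assumptions for illustration, not measurements; real numbers depend on the standard library, the build mode and the allocator.

```cpp
#include <cstddef>
#include <string>

// The question's layout: one Component per "Parameter=Value" line.
struct Component {
    std::string parameter;
    std::string value;
};

// Rough worst-case estimate of the bytes one parsed line occupies inside a
// std::list<Component>. alloc_overhead is an assumed per-allocation
// bookkeeping cost; real allocators differ.
std::size_t estimate_bytes_per_line(std::size_t param_len,
                                    std::size_t value_len,
                                    std::size_t alloc_overhead = 16) {
    // A doubly linked list node carries two pointers plus the element.
    std::size_t node = 2 * sizeof(void*) + sizeof(Component) + alloc_overhead;
    // Worst case: neither string fits a small-string buffer, so each one
    // needs its own heap block (characters plus terminator).
    std::size_t heap = (param_len + 1 + alloc_overhead)
                     + (value_len + 1 + alloc_overhead);
    return node + heap;
}
```

For a line split into a 12-character name and a 12-character value, estimate_bytes_per_line(12, 12) comes out at several times the 25 bytes the line takes on disk, and that is before counting the std::stringstream copy of the whole file.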

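OK, you can use a lookup table to map parameter names to numbers. If the set of names is small, this can cut memory consumption by up to a factor of 2.

Your data structure could look like this: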

std::map<int, std::map<int, std::string> > my_ini_file_data;
std::map<std::string, int> param_to_idx;

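(this pays off if parameter names are not unique, i.e. the same names recur across sections, which you call entities; note that the inner map keeps only one value per parameter within a section)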

Inserting data looks like this:

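std::string param = "Param";
std::string value = "Val";
int entity_id = 0;
if ( param_to_idx.find(param) == param_to_idx.end() )
{
  // Take the size before inserting: prior to C++17 the evaluation
  // order of param_to_idx[param] = param_to_idx.size() is unspecified.
  int idx = param_to_idx.size();
  param_to_idx[param] = idx;
}
my_ini_file_data[entity_id][ param_to_idx[param] ] = value;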

Retrieving data looks like this:

    value = my_ini_file_data[entity_id][ param_to_idx[param] ];

If the set of distinct values is also much smaller than the number of entries, you can even map the values to numbers:

std::map<int, std::map<int, int> > my_ini_file_data;
std::map<std::string, int> param_to_idx;
std::map<std::string, int> value_to_idx;
std::map<int, std::string> idx_to_value;

Inserting data looks like this:

std::string param = "Param";
std::string value = "Val";
int entity_id = 0;
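if ( param_to_idx.find(param) == param_to_idx.end() )
{
      // Take the size before inserting: prior to C++17 the evaluation
      // order of param_to_idx[param] = param_to_idx.size() is unspecified.
      int idx = param_to_idx.size();
      param_to_idx[param] = idx;
}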
if ( value_to_idx.find(value) == value_to_idx.end() )
{  
      int idx = value_to_idx.size();
      value_to_idx[value] = idx;
      idx_to_value[idx] = value;
}

my_ini_file_data[entity_id][ param_to_idx[param] ] = value_to_idx[value];

Retrieving data looks like this:

value = idx_to_value[my_ini_file_data[entity_id][ param_to_idx[param] ] ];
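Since the goal in the question is to select entities by parameter and value, a query over the integer-keyed structure could look roughly like this. The function name find_entities and the alias IniData are hypothetical, and the sketch assumes the maps have been filled as shown above.

```cpp
#include <map>
#include <string>
#include <vector>

// entity id -> (parameter index -> value index)
using IniData = std::map<int, std::map<int, int> >;

// Returns the ids of all entities whose parameter `param` has value `value`.
std::vector<int> find_entities(const IniData& data,
                               const std::map<std::string, int>& param_to_idx,
                               const std::map<std::string, int>& value_to_idx,
                               const std::string& param,
                               const std::string& value) {
    std::vector<int> result;
    auto p = param_to_idx.find(param);
    auto v = value_to_idx.find(value);
    // A string that was never seen while loading cannot match any entity.
    if (p == param_to_idx.end() || v == value_to_idx.end())
        return result;

    for (const auto& entity : data) {
        auto c = entity.second.find(p->second);
        // Only integers are compared here; no string comparison per entity.
        if (c != entity.second.end() && c->second == v->second)
            result.push_back(entity.first);
    }
    return result;
}
```

The two string lookups happen once per query; the loop over the entities then works on integers only.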

Hope this helps.

Initial answer

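Regarding memory, I would not worry about it unless you are on some kind of embedded system with very little memory.

Regarding speed, I can give you a few suggestions:

Find out what the bottleneck is.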

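Use std::list! With std::vector, memory is reallocated and the elements are moved whenever the vector grows beyond its capacity. If for some reason you need a vector in the end, create it with the required number of entries reserved, which you can get by calling list::size().
Write a while loop in which you only call getline. If that alone is already slow, read the whole file in one block, create a reader stream over the char* block, and read line by line from that stream.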
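The getline-only measurement from the last point can be sketched as follows. The function names count_lines and time_getline_only are assumptions for illustration; the code does nothing but read, so timing it gives a baseline for the raw reading cost, independent of parsing.

```cpp
#include <chrono>
#include <cstddef>
#include <fstream>
#include <istream>
#include <sstream>
#include <string>

// Reads `in` line by line without parsing anything and returns the line count.
std::size_t count_lines(std::istream& in) {
    std::string line;
    std::size_t n = 0;
    while (std::getline(in, line))
        ++n;
    return n;
}

// Times count_lines on a file read directly from disk; returns milliseconds.
// lines_out receives the number of lines read (0 if the file cannot be opened).
double time_getline_only(const std::string& path, std::size_t& lines_out) {
    std::ifstream file(path);
    auto start = std::chrono::steady_clock::now();
    lines_out = count_lines(file);
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

If this is already slow for the 20 MB file, the reading strategy is the bottleneck; if it is fast, the time is going into parsing or CheckMatch. Because count_lines accepts any std::istream, the same call also works on a std::istringstream built from a buffer that was read in one go.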

If plain reading is fast enough, optimize your parsing code. You can reduce the number of find calls by storing the positions, e.g.

std::size_t pos_begin = line.find('[');
if (pos_begin != std::string::npos){
    std::size_t pos_end = line.find(']', pos_begin + 1);
    if (pos_end != std::string::npos){
        // substr takes (position, length), so pass the distance between the brackets
        entity.id = line.substr(pos_begin + 1, pos_end - pos_begin - 1);

        // Read all lines until EnumEnd=0 (or until the stream ends)
        while (std::getline(readTemp, line)){
            // Break loop if end of entity
            if (line.find("EnumEnd=0") != std::string::npos){
                if (CheckMatch(entity))
                    entities.push_back(entity);
                break;
            }

            std::size_t pos_eq = line.find('=');
            if (pos_eq == std::string::npos)
                continue; // not a Parameter=Value line

            Component comp;
            comp.parameter = line.substr(0, pos_eq);
            comp.value = line.substr(pos_eq + 1);

            entity.components.push_back(comp);
        }
    }
}

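Depending on the size of your entities, check whether CheckMatch is slow. The smaller the entities, the more often CheckMatch runs for the same amount of input, so the slower the code becomes in that case.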

Answer 2

You can use less memory by interning your parameters and values, so that you do not store multiple copies of them.

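You can map the strings to unique numeric IDs that you create while loading the file, and then use the IDs when querying your data structure. At the cost of possibly slower parsing up front, working with these structures afterwards should be faster, because you only need to match 32-bit integers instead of comparing strings.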

A rough proof of concept for storing each string only once:

#include <unordered_map>
#include <string>
#include <iostream>

using namespace std;

int string_id(const string& s) {
  static unordered_map<string, int> m;
  static int id = 0;

  auto it = m.find(s);
  if (it == m.end()) {
    m[s] = ++id;
    return id;
  } else {
    return it->second;
  }
}

int main() {
  // prints 1 2 2 1
  cout << string_id("hello") << " ";
  cout << string_id("world") << " "; 
  cout << string_id("world") << " ";
  cout << string_id("hello") << endl; 
}

The unordered_map ends up storing each string only once, so you are set as far as memory goes. Depending on your matching function, you can define

struct Component {
    int parameter;
    int value;
};

Your matching can then look like myComponent.parameter == string_id("some_key") or even myComponent.parameter == some_stored_string_id. If you want your strings back, you will need a reverse mapping as well.
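The reverse mapping can be kept right next to the forward one. This is a minimal sketch; the class name StringInterner and its member names are made up for illustration, and unlike string_id above it numbers strings from 0 so that an ID doubles as a vector index.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

class StringInterner {
public:
    // Returns the id of s, assigning the next free id the first time s is seen.
    int id(const std::string& s) {
        auto it = ids_.find(s);
        if (it != ids_.end())
            return it->second;
        int new_id = static_cast<int>(names_.size());
        ids_.emplace(s, new_id);
        names_.push_back(s);
        return new_id;
    }

    // Reverse lookup: the string that was given this id.
    const std::string& name(int id) const {
        assert(id >= 0 && static_cast<std::size_t>(id) < names_.size());
        return names_[static_cast<std::size_t>(id)];
    }

private:
    std::unordered_map<std::string, int> ids_;  // string -> id
    std::vector<std::string> names_;            // id -> string
};
```

This keeps each distinct string twice, once per direction, which should still be small as long as the number of distinct strings is much lower than the number of lines in the file.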

Working with large text files
