Question

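I have a question. I am writing code to load information from a file into memory, but the time it takes varies widely depending on the file's 'format'.

Let me explain better. The file I am reading contains a kind of table in which random strings are separated by "|". Here is an example of the table (with 5 rows and 5 columns).

Table.txt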

0|42sKuG^uM|24465\lHXP|2996fQo\kN|293cvByiV
1|14772cjZ`SN|28704HxDYjzC|6869xXj\nIe|27530EymcTU
2|9041ByZM]I|24371fZKbNk|24085cLKeIW|16945TuuU\Nc
3|16542M[Uz\|13978qMdbyF|6271ait^h|13291_rBZS
4|4032aFqa|13967r^\\`T|27754k]dOTdh|24947]v_uzg

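My question is this: if the table has, for example, 100,000 rows and 100 columns, or 100 rows and 100,000 columns, the time taken to load the information is very different (much higher in the latter case). In fact, accessing the information afterwards also takes longer in that case.

So the question is: why is the time so different if the table size is the same?

Here is the part of the code that reads this information from the file and stores it in memory.

Code that reads the data from the Table.txt file and stores it in memory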

string ruta_base("C:\\a\\Table.txt"); // Path to my "Table.txt" file

string temp; // Each row from the Table.txt file is first stored here
vector<string> buffer; // Elements of the current row, after splitting it into tokens
vector<ElementSet> RowsCols; // ElementSet is a class I created that simulates a vector; every element of RowsCols is one row of my table (built from vector<string> buffer)

ifstream ifs(ruta_base.c_str());

while(getline( ifs, temp )) // Read and store the file line by line until the end of the ".txt" file.
{
    size_t tokenPosition = temp.find("|"); // The symbol "|" marks the end of an element, so we split the string temp into tokens that are stored in vector<string> buffer

    while (tokenPosition != string::npos)
    {    
        string element;
        tokenPosition = temp.find("|");      

        element = temp.substr(0, tokenPosition);
        buffer.push_back(element);
        temp.erase(0, tokenPosition+1);
    }

    ElementSet ss(0,buffer); 
    buffer.clear();
    RowsCols.push_back(ss); // Store all the elements of this row (held in vector<string> buffer) at a new position in "RowsCols"
}

vector<Table> TablesDescriptor;

Table TablesStorage(RowsCols);
TablesDescriptor.push_back(TablesStorage);

DataBase database(1, TablesDescriptor);

Here I have added the solution I arrived at based on all your feedback.

string ruta_base("C:\\a\\Table.txt"); // Path to my "Table.txt" file

string temp; // Each row from the Table.txt file is first stored here
vector<string> buffer; // Elements of the current row, after splitting it into tokens
vector<ElementSet> RowsCols; // ElementSet is a class I created that simulates a vector; every element of RowsCols is one row of my table

ifstream ifs(ruta_base.c_str());

while(getline( ifs, temp )) // Read and store the file line by line until the end of the ".txt" file.
{
    size_t tokenPosition = temp.find("|"); // The symbol "|" marks the end of an element, so we split the string temp into tokens that are stored in vector<string> buffer

    const char* p = temp.c_str();
    char* p1 = strdup(p); // strtok modifies its input, so work on a copy

    char* pch = strtok(p1, "|");
    while(pch)
    {
        buffer.push_back(string(pch));
        pch = strtok(NULL,"|");
    }
    free(p1);

    ElementSet sss(0,buffer);
    buffer.clear();
    RowsCols.push_back(sss);
}

vector<Table> TablesDescriptor;

Table TablesStorage(RowsCols);
TablesDescriptor.push_back(TablesStorage);

DataBase database(1, TablesDescriptor);

Answer 1

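You haven't posted the implementation for some of the code (e.g. ElementSet), but even in the part we can see, there are operations whose cost grows linearly with the length of the current line in the file, for example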

temp.erase(0, tokenPosition+1);

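It would be more efficient not to keep erasing pieces from the start of the string: doing so forces the entire line of 100,000 fields to be copied through memory again and again as it is compacted back to the start of the string. Instead, keep track of the position you are currently extracting from, start the next search from that offset, and use the same offset in the substr() call. Once you start thinking about what is happening in memory, you will learn to analyze this kind of problem. You can also use a profiler to show you which specific lines of code are slow.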
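The offset-tracking idea described above could look roughly like the following. This is only a sketch: split_fields is a hypothetical helper name, not something from the question, and it assumes a single-character '|' delimiter. Unlike the loop in the question, it also returns one field for a line that contains no delimiter at all.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not from the question): splits `line` on `delim`
// without modifying it, by remembering where the next field starts.
std::vector<std::string> split_fields(const std::string& line, char delim = '|')
{
    std::vector<std::string> fields;
    std::string::size_type start = 0;  // offset of the field being extracted
    while (true) {
        // Search from `start`, not from the beginning of the string.
        const std::string::size_type pos = line.find(delim, start);
        if (pos == std::string::npos) {
            fields.push_back(line.substr(start));  // last field of the line
            break;
        }
        fields.push_back(line.substr(start, pos - start));
        start = pos + 1;  // skip the delimiter; nothing is erased or shifted
    }
    return fields;
}
```

In the question's loop, the body of while(getline(ifs, temp)) could then build buffer from split_fields(temp) instead of calling erase, so each character of the line is scanned once and copied once.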

Answer 2

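I think the problem lies in the statement temp.erase(0, tokenPosition+1);. If the string is small (as in your first case), there is not much data to shift, but in the latter case there is a lot of data to shift, so it is slower. I suggest you try removing the erase and using a range-based find method. You can use the second overload given here. Use c_str() to get the string's const char* and add an offset to it to specify the starting point. Also, if you can use Boost, consider using Boost.Tokenizer to tokenize the string.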
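To put rough numbers on the shifting cost mentioned above, here is a small back-of-the-envelope model. It is a sketch under stated assumptions, not a measurement: every field (including its '|') is taken to be exactly width characters long, and each erase from the front is assumed to move every character that remains. The function names are hypothetical.

```cpp
#include <cstdint>

// Rough cost model (an assumption, not a measurement): each field, including
// its '|' separator, is `width` characters long, and erasing from the front
// of a string moves every character that remains after the erased part.
std::uint64_t chars_moved_per_line(std::uint64_t cols, std::uint64_t width)
{
    // After erasing field i (1-based), (cols - i) fields remain and are shifted.
    // Summing (cols - i) * width for i = 1..cols gives width * cols * (cols - 1) / 2.
    return width * cols * (cols - 1) / 2;
}

// Total characters moved while loading a table of `rows` lines.
std::uint64_t chars_moved_total(std::uint64_t rows, std::uint64_t cols, std::uint64_t width)
{
    return rows * chars_moved_per_line(cols, width);
}
```

With width = 10, this model gives 4,950,000,000 moved characters for 100,000 rows of 100 columns and 4,999,950,000,000 for 100 rows of 100,000 columns. That is roughly a thousand times more work, even though both tables hold the same number of fields, because the cost per line grows with the square of the number of columns.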

Answer 3

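String manipulation is a killer. You are erasing the part of the string that has already been read; every time that happens, the string has to be reallocated and/or moved.

Keep a pointer into the string and avoid any operation that reallocates it.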
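One way to read the pointer advice above is sketched below. The name split_with_pointer is hypothetical and the code assumes a single-character delimiter. The line itself is only read, never modified, so the only copies made are of the individual fields. Note that, unlike strtok, this keeps empty fields between consecutive delimiters.

```cpp
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper: walks the line with a read-only pointer, so the line
// itself is never erased from, shifted, or reallocated.
std::vector<std::string> split_with_pointer(const std::string& line, char delim = '|')
{
    std::vector<std::string> fields;
    const char* cur = line.data();              // start of the current field
    const char* const end = cur + line.size();  // one past the last character
    while (true) {
        // Look for the next delimiter in the unread part [cur, end).
        const void* hit = std::memchr(cur, delim, static_cast<std::size_t>(end - cur));
        const char* stop = hit ? static_cast<const char*>(hit) : end;
        fields.emplace_back(cur, stop);         // copy only this field's characters
        if (stop == end) {
            break;                              // no delimiter left: that was the last field
        }
        cur = stop + 1;                         // advance the pointer past the delimiter
    }
    return fields;
}
```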

Problem loading information from a file
