How to improve performance when parsing a large text file - StreamReader + Regex

Time: 2019-02-18 02:36:52

Tags: c# regex streamreader

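I am developing a Windows Forms application that takes a robot program generated by another piece of software and modifies it. The modification process is as follows: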

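  1. The file is parsed line by line using StreamReader.ReadLine().
  2. A regular expression searches each line for specific keywords. On a match, the matched string is copied to another string and replaced with new lines of robot code.
  3. The modified code is accumulated in a string and finally written to a new file.
  4. All the matched strings collected with Regex are also accumulated in a string and finally written to a new file.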

I have been able to do this successfully:

    private void Form1_Load(object sender, EventArgs e)
    {
        string NextLine = null;
        string CurrLine = null;
        string MoveL_Pos_Data = null;
        string MoveL_Ref_Data = null;
        string MoveLFull = null;
        string ModCode = null;
        string TAB = "\t";
        string NewLine = "\r\n";
        string SavePath = null;
        string ExtCode_1 = null;
        string ExtCode_2 = null;
        string ExtCallMod = null;

        int MatchCount = 0;
        int NumRoutines = 0;

        try
        {
            // Ask the user for the location of the source file.
            // Display an OpenFileDialog so the user can select a .MOD file.
            OpenFileDialog openFileDialog1 = new OpenFileDialog
            {
                Filter = "MOD Files|*.mod",
                Title = "Select an ABB RAPID MOD File"
            };

            // Show the Dialog.  
            // If the user clicked OK in the dialog and  
            // a .MOD file was selected, open it.  
            if (openFileDialog1.ShowDialog() == System.Windows.Forms.DialogResult.OK)
            {
                // Open the selected file and read it line by line.
                using (StreamReader sr = new StreamReader(openFileDialog1.FileName))
                {
                    // define a regular expression to search for extr calls 
                    Regex Extr_Ex = new Regex(@"\bExtr\(-?\d*.\d*\);", RegexOptions.Compiled | RegexOptions.IgnoreCase | RegexOptions.Multiline);
                    Regex MoveL_Ex = new Regex(@"\bMoveL\s+(.*)(z\d.*)", RegexOptions.Compiled | RegexOptions.IgnoreCase | RegexOptions.Multiline);

                    Match MoveLString = null;

                    while (sr.Peek() >= 0)
                    {
                        CurrLine = sr.ReadLine();
                        //Console.WriteLine(sr.ReadLine());

                        // check if the line is a match 
                        if (Extr_Ex.IsMatch(CurrLine))
                        {
                            // Keep a count for total matches
                            MatchCount++;

                            // Save extr calls in a string
                            ExtCode_1 += NewLine + TAB + TAB + Extr_Ex.Match(CurrLine).ToString();


                            // Read next line (always a MoveL) to get Pos data for TriggL
                            NextLine = sr.ReadLine();
                            //Console.WriteLine(NextLine);

                            if (MoveL_Ex.IsMatch(NextLine))
                            {
                                // Next Line contains MoveL
                                // get matched string 
                                MoveLString = MoveL_Ex.Match(NextLine);
                                GroupCollection group = MoveLString.Groups;
                                MoveL_Pos_Data = group[1].Value.ToString();
                                MoveL_Ref_Data = group[2].Value.ToString();
                                MoveLFull = MoveL_Pos_Data + MoveL_Ref_Data;                                

                            }

                            // replace Extr with the following commands
                            ModCode += NewLine + TAB + TAB + "TriggL " + MoveL_Pos_Data + "extr," + MoveL_Ref_Data;
                            ModCode += NewLine + TAB + TAB + "WaitDI DI1_1,1;";
                            ModCode += NewLine + TAB + TAB + "MoveL " + MoveLFull;
                            ModCode += NewLine + TAB + TAB + "Reset DO1_1;";
                            //break;

                        }
                        else
                        {
                            // No extr Match
                            ModCode += "\r\n" + CurrLine;
                        }                     

                    }

                    Console.WriteLine($"Total Matches: {MatchCount}");
                }


            }

            // Write modified code into a new output file
            string SaveDirectoryPath = Path.GetDirectoryName(openFileDialog1.FileName);
            string ModName = Path.GetFileNameWithoutExtension(openFileDialog1.FileName);
            SavePath = SaveDirectoryPath + @"\" + ModName + "_rev.mod";
            File.WriteAllText(SavePath, ModCode);

            //Write Extr matches into new output file 
            //Prepare module
            ExtCallMod = "MODULE ExtruderCalls";

            // All extr calls in one routine
            //Prepare routines
            ExtCallMod += NewLine + NewLine + TAB + "PROC Prg_ExtCall"; // + 1;
                ExtCallMod += ExtCode_1;
                ExtCallMod += NewLine + NewLine + TAB + "ENDPROC";
                ExtCallMod += NewLine + NewLine;

            //}

            ExtCallMod += "ENDMODULE";

            // Write to file
            string ExtCallSavePath = SaveDirectoryPath + @"\ExtrCalls.mod";                
            File.WriteAllText(ExtCallSavePath, ExtCallMod);                

        }

        catch (Exception ex)
        {
            Console.WriteLine(ex.ToString());                
        }

    }                    
}

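While this achieves what I need, the process is very slow. Since I am new to C# programming, I suspect the slowness comes from copying the contents of the original file into a string rather than replacing the content in place (I am not sure whether the content of the original file can be replaced directly). For a 20,000-line input file, the whole process takes a little over 5 minutes.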

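I used to get the following error: Message = Managed Debugging Assistant 'ContextSwitchDeadlock': 'The CLR has been unable to transition from COM context 0xb27138 to COM context 0xb27080 for 60 seconds. The thread that owns the destination context/apartment is most likely either doing a non pumping wait or processing a very long running operation without pumping Windows messages. This situation generally has a negative performance impact and may even lead to the application becoming non responsive or memory usage accumulating continually over time. To avoid this problem, all single threaded apartment (STA) threads should use pumping wait primitives (such as CoWaitForMultipleHandles) and routinely pump messages during long running operations.'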

I was able to get past it by disabling the "ContextSwitchDeadlock" setting in the debugger settings. This may not be best practice.

Can anyone help me improve the performance of my code?

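Edit: I found out that the robot controller limits the number of lines in a MOD file (the output file). The maximum number of lines allowed is 32768. I came up with the following logic to split the contents of the StringBuilder into separate output files: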

// Split modCodeBuilder into separate strings based on final size
        const int maxSize = 32500;
        string result = modCodeBuilder.ToString();
        string[] splitResult = result.Split(new string[] { "\r\n" }, StringSplitOptions.None);
        string[] splitModCode = new string[maxSize]; 

        // Setup destination directory to be same as source directory
        string destDir = Path.GetDirectoryName(fileNames[0]);

        for (int count = 0; ; count++)
        {
            // Get the next batch of text by skipping the amount
            // we've taken so far and then taking the maxSize.
            string modName = $"PrgMOD_{count + 1}";
            string procName = $"Prg_{count + 1}()";

            // Copy the next batch of at most maxSize lines out of splitResult
            int src_start_index = count * maxSize;
            int srcUpperLimit = splitResult.GetUpperBound(0);

            if (src_start_index > srcUpperLimit) break; // Exit loop when there's no text left to take

            // The last batch may be shorter: never read past the end of the source array
            int dataLength = Math.Min(maxSize, splitResult.Length - src_start_index);

            // Use a fresh array per batch so no lines from the previous batch are carried over
            splitModCode = new string[dataLength];
            Array.Copy(splitResult, src_start_index, splitModCode, 0, dataLength);
            string finalModCode = String.Join("\r\n", splitModCode);

            string batch = String.Concat("MODULE ", modName, "\r\n\r\n\tPROC ", procName, "\r\n", finalModCode, "\r\n\r\n\tENDPROC\r\n\r\nENDMODULE");

            //if (batch.Length == 0) break; 

            // Generate file name based on count
            string fileName = $"ABB_R3DP_{count + 1}.mod";

            // Write our file text
            File.WriteAllText(Path.Combine(destDir, fileName), batch);

            // Write status to output textbox
            TxtOutput.AppendText("\r\n");
            TxtOutput.AppendText("\r\n");
            TxtOutput.AppendText($"Modified MOD File: {fileName} is generated successfully! It is saved to location: {Path.Combine(destDir, fileName)}");
        }

1 Answer:

Answer 0 (score: 0)

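String concatenation can take a long time, because every += copies the whole string built so far. Using a StringBuilder instead can improve your performance: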

private static void GenerateNewFile(string sourceFullPath)
{
    string posData = null;
    string refData = null;
    string fullData = null;

    var modCodeBuilder = new StringBuilder();
    var extCodeBuilder = new StringBuilder();

    var extrRegex = new Regex(@"\bExtr\(-?\d*.\d*\);", RegexOptions.Compiled | 
        RegexOptions.IgnoreCase | RegexOptions.Multiline);

    var moveLRegex = new Regex(@"\bMoveL\s+(.*)(z\d.*)", RegexOptions.Compiled | 
        RegexOptions.IgnoreCase | RegexOptions.Multiline);

    int matchCount = 0;
    bool appendModCodeNext = false;

    foreach (var line in File.ReadLines(sourceFullPath))
    {
        if (appendModCodeNext)
        {
            if (moveLRegex.IsMatch(line))
            {
                GroupCollection group = moveLRegex.Match(line).Groups;

                if (group.Count > 2)
                {
                    posData = group[1].Value;
                    refData = group[2].Value;
                    fullData = posData + refData;
                }
            }

            modCodeBuilder.Append("\t\tTriggL ").Append(posData).Append("extr,")
                .Append(refData).Append("\r\n\t\tWaitDI DI1_1,1;\r\n\t\tMoveL ")
                .Append(fullData).AppendLine("\r\n\t\tReset DO1_1;");

            appendModCodeNext = false;
        }
        else if (extrRegex.IsMatch(line))
        {
            matchCount++;
            extCodeBuilder.Append("\t\t").AppendLine(extrRegex.Match(line).ToString());
            appendModCodeNext = true;
        }
        else
        {
            modCodeBuilder.AppendLine(line);
        }
    }

    Console.WriteLine($"Total Matches: {matchCount}");

    string destDir = Path.GetDirectoryName(sourceFullPath);
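    // Append the suffix to the file name itself; passing it as a separate
    // Path.Combine argument would create a "<name>\_rev.mod" subdirectory path.
    var savePath = Path.Combine(destDir, 
        Path.GetFileNameWithoutExtension(sourceFullPath) + "_rev.mod");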

    File.WriteAllText(savePath, modCodeBuilder.ToString());

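    // Each entry in extCodeBuilder ends with a line break, so start the calls
    // on their own line after the PROC header.
    var extCallMod = string.Concat("MODULE ExtruderCalls\r\n\r\n\tPROC Prg_ExtCall\r\n",
        extCodeBuilder.ToString(), "\r\n\tENDPROC\r\n\r\nENDMODULE");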

    File.WriteAllText(Path.Combine(destDir, "ExtrCalls.mod"), extCallMod);
}
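
As for the ContextSwitchDeadlock warning mentioned in the question: it usually shows up when a long operation runs on the UI thread, so rather than disabling the warning, one option is to move the work to a thread-pool thread. Below is a minimal sketch of that idea, not a definitive implementation. The names BackgroundRunner, RunOffUiThreadAsync and BtnProcess_Click are hypothetical, and it assumes the work itself (for example GenerateNewFile) does not touch any controls:

```csharp
using System;
using System.Threading.Tasks;

public static class BackgroundRunner
{
    // Hypothetical helper: runs the given work on a thread-pool thread and
    // returns a Task the caller can await, so the UI thread stays free to
    // pump Windows messages while the file is being processed.
    public static Task RunOffUiThreadAsync(Action longRunningWork)
    {
        if (longRunningWork == null) throw new ArgumentNullException(nameof(longRunningWork));
        return Task.Run(longRunningWork);
    }
}

// Possible use from a form (sketch only; control and handler names are assumed):
//
// private async void BtnProcess_Click(object sender, EventArgs e)
// {
//     BtnProcess.Enabled = false;           // avoid starting the work twice
//     try
//     {
//         await BackgroundRunner.RunOffUiThreadAsync(() => GenerateNewFile(path));
//     }
//     finally
//     {
//         BtnProcess.Enabled = true;        // runs back on the UI thread
//     }
// }
```

Any status output to a text box would then go after the await, where execution is back on the UI thread.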

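You mentioned in the comments that you want to process the text in batches and write it to separate files. One way to do this is to treat the string as a char[] and use the System.Linq extension methods Skip and Take. Skip skips a given number of characters in the string, and Take then takes a given number of characters and returns them in an IEnumerable<char>. We can then use string.Concat to convert that back to a string and write it to a file.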

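If we have a constant representing the maximum size and a counter that starts at 0, we can use a for loop that increments the counter, skips counter * max characters, and then takes max characters from the string. We can also use the counter variable to create the file name, since it is incremented on each iteration: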

const int maxSize = 32500;
string result = modCodeBuilder.ToString();

for (int count = 0;; count++)
{
    // Get the next batch of text by skipping the amount
    // we've taken so far and then taking the maxSize.
    string batch = string.Concat(result.Skip(count * maxSize).Take(maxSize));

    if (batch.Length == 0) break; // Exit loop when there's no text left to take

    // Generate file name based on count
    string fileName = $"filename_{count + 1}.mod";

    // Write our file text
    File.WriteAllText(Path.Combine(destDir, fileName), batch);
}

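Another, possibly faster, way is to use string.Substring with count * maxSize as the start index of the substring. We then only need to make sure that length does not run past the end of the string before writing the substring to a file: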

for (int count = 0;; count++)
{
    // Get the bounds for the substring (startIndex and length)
    var startIndex = count * maxSize;
    var length = Math.Min(result.Length - startIndex, maxSize);

    if (length < 1) break; // Exit loop when there's no text left to take

    // Get the substring and file name
    var batch = result.Substring(startIndex, length);
    string fileName = $"filename_{count + 1}.mod";

    // Write our file text  
    File.WriteAllText(Path.Combine(destDir, fileName), batch);
}

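Note that this splits the text into blocks of exactly 32500 characters (except for the last block). If you want to keep whole lines only, it takes a bit more work, but it is still not hard.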
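
A rough sketch of the whole-line variant, assuming the text uses "\r\n" line endings as in the code above: split the text into lines first, then hand out groups of at most maxLines lines. LineBatcher and SplitIntoBatches are hypothetical names used only for illustration:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LineBatcher
{
    // Yields consecutive groups of at most maxLines whole lines each.
    // Only the last group can be shorter than maxLines.
    public static IEnumerable<string[]> SplitIntoBatches(string text, int maxLines)
    {
        if (maxLines < 1) throw new ArgumentOutOfRangeException(nameof(maxLines));

        // Assumes "\r\n" separates lines, matching the generated code above.
        string[] lines = text.Split(new[] { "\r\n" }, StringSplitOptions.None);

        for (int start = 0; start < lines.Length; start += maxLines)
        {
            // Never read past the end of the source array.
            int count = Math.Min(maxLines, lines.Length - start);

            // A fresh array per batch keeps batches independent of each other.
            var batch = new string[count];
            Array.Copy(lines, start, batch, 0, count);
            yield return batch;
        }
    }
}

// Possible use (sketch; result and destDir as in the loops above):
//
// int fileNumber = 1;
// foreach (string[] batch in LineBatcher.SplitIntoBatches(result, 32500))
// {
//     string fileName = $"filename_{fileNumber++}.mod";
//     File.WriteAllText(Path.Combine(destDir, fileName), string.Join("\r\n", batch));
// }
```

If each output file also gets a MODULE/PROC wrapper, the wrapper lines count toward the controller's limit, so maxLines should leave some headroom below it.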