Question

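I need a fast way to process a large text file.

I have 2 files: a large text file (~20 GB) and another text file containing a list of about 12 million combo words.

I want to find every combo word in the first text file and replace it with another combo word (the same combo word joined with an underscore).

Example: "Computer Information" > replaced with > "Computer_Information"

I use this code, but performance is very poor (I tested on an HP G7 server with 16 GB RAM and 16 cores).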

using System;
using System.Collections.Generic;
using System.IO;
using System.Windows.Forms;

public partial class Form1 : Form
{
    HashSet<string> wordlist = new HashSet<string>();

    private void loadComboWords()
    {
        using (StreamReader ff = new StreamReader(txtComboWords.Text))
        {
            string line;
            while ((line = ff.ReadLine()) != null)
            {
                wordlist.Add(line);
            }
        }
    }

    private void replacewords(ref string str)
    {

        foreach (string wd in wordlist)
        {
          //  ReplaceEx(ref str,wd,wd.Replace(" ","_"));
            if (str.IndexOf(wd) > -1)
                str = str.Replace(wd, wd.Replace(" ", "_")); // strings are immutable: keep the result
        }
    }

    private void button3_Click(object sender, EventArgs e)
    {
        string line;
        using (StreamReader fread = new StreamReader(txtFirstFile.Text))
        {
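            // Write the output next to the input file, e.g. "big.txt" -> "big_ReplaceComboWords.txt".
            string fullPath = Path.GetFullPath(txtFirstFile.Text);
            string writefile = Path.Combine(Path.GetDirectoryName(fullPath), Path.GetFileNameWithoutExtension(fullPath) + "_ReplaceComboWords.txt");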
            StreamWriter sw = new StreamWriter(writefile);
            long intPercent;
            label3.Text = "initialing";
            loadComboWords();

            while ((line = fread.ReadLine()) != null)
            {
                replacewords(ref line);
                sw.WriteLine(line);

                intPercent = (fread.BaseStream.Position * 100) / fread.BaseStream.Length;
                Application.DoEvents();
                label3.Text = intPercent.ToString();
            }
            sw.Close();
            fread.Close();
            label3.Text = "Finished";
        }
    }
}

Any ideas for getting this job done in a reasonable time?

Thanks

Answer 1

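At first glance the approach you've taken looks fine - it should work OK, and there's nothing obvious that would cause, for example, a lot of garbage collection.

I think the main thing is that you'll only be using one of those 16 cores: there's nothing in place to share the load across the other 15.

I think the easiest way to do this is to split the large 20 GB file into 16 chunks, analyse all of the chunks in parallel, and then merge the chunks back together again. The extra time taken splitting and reassembling the file should be minimal compared to the ~16x gain from scanning the 16 chunks at the same time.

In outline, one way to do this might be: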

    private List<string> SplitFileIntoChunks(string baseFile)
    {
        // Split the file into chunks, and return a list of the filenames.
        throw new NotImplementedException();
    }

    private void AnalyseChunk(string filename)
    {
        // Analyses the file and performs replacements,
        // perhaps writing to the same filename with a different
        // file extension
        throw new NotImplementedException();
    }

    private void CreateOutputFileFromChunks(string outputFile, List<string> splitFileNames)
    {
        // Combines the rewritten chunks created by AnalyseChunk back into
        // one large file, outputFile.
        throw new NotImplementedException();
    }

    public void AnalyseFile(string inputFile, string outputFile)
    {
        List<string> splitFileNames = SplitFileIntoChunks(inputFile);

        var tasks = new List<Task>();
        foreach (string chunkName in splitFileNames)
        {
            var task = Task.Factory.StartNew(() => AnalyseChunk(chunkName));
            tasks.Add(task);
        }

        Task.WaitAll(tasks.ToArray());

        CreateOutputFileFromChunks(outputFile, splitFileNames);
    }
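The splitting step is left as a stub above. As a minimal sketch of what it could look like, the following cuts the input only at line boundaries so that no line is torn between two chunks. The class name `ChunkSplitter`, the `chunkCount` parameter and the `.chunkN` file suffix are assumptions made for illustration; they are not part of the outline above.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class ChunkSplitter
{
    // Splits inputFile into at most chunkCount pieces, cutting only at line
    // boundaries, and returns the chunk file names in their original order.
    public static List<string> SplitFileIntoChunks(string inputFile, int chunkCount)
    {
        var names = new List<string>();
        long targetSize = Math.Max(1, new FileInfo(inputFile).Length / chunkCount);

        using (var reader = new StreamReader(inputFile))
        {
            StreamWriter writer = null;
            long written = 0;
            try
            {
                string line;
                while ((line = reader.ReadLine()) != null)
                {
                    // Open a new chunk when none is open yet, or when the current
                    // one is full and we are still allowed to create another.
                    if (writer == null || (written >= targetSize && names.Count < chunkCount))
                    {
                        if (writer != null) writer.Dispose();
                        string name = inputFile + ".chunk" + names.Count; // hypothetical naming scheme
                        names.Add(name);
                        writer = new StreamWriter(name);
                        written = 0;
                    }
                    writer.WriteLine(line);
                    // Approximate size: counts characters, not encoded bytes.
                    written += line.Length + Environment.NewLine.Length;
                }
            }
            finally
            {
                if (writer != null) writer.Dispose();
            }
        }
        return names;
    }
}
```

Because the names come back in order, the merge step can simply append the processed chunks one after another. Note that the split is itself one sequential pass over the whole file, so it only pays off if the per-line replacement work dominates the I/O.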

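One small thing: move the stream length calculation out of the loop; you only need to get it once.

EDIT: Also incorporate @Pavel Gatilov's idea of inverting the logic of the inner loop, so that the words of each line are looked up in the 12-million-entry list instead of scanning the line once for every entry.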
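To make the inverted loop concrete, here is a rough sketch. It assumes that every combo word is exactly two words separated by a single space (as in the "Computer Information" example) and that words in the text are separated by spaces; `PairReplacer` and `ReplaceCombos` are hypothetical names. Each line then costs one hash lookup per adjacent word pair rather than 12 million substring searches.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class PairReplacer
{
    // Assumption: each entry in combos is exactly two words joined by one space.
    public static string ReplaceCombos(string line, HashSet<string> combos)
    {
        string[] words = line.Split(' ');
        var sb = new StringBuilder(line.Length);
        int i = 0;
        while (i < words.Length)
        {
            if (i > 0) sb.Append(' ');
            // Look the adjacent pair up in the set instead of scanning the set.
            if (i + 1 < words.Length && combos.Contains(words[i] + " " + words[i + 1]))
            {
                sb.Append(words[i]).Append('_').Append(words[i + 1]);
                i += 2; // both words have been consumed
            }
            else
            {
                sb.Append(words[i]);
                i += 1;
            }
        }
        return sb.ToString();
    }
}
```

Unlike the `IndexOf`/`Replace` version in the question, this only matches whole words, so a combo word embedded inside a longer token is left alone. Whether that is the desired behaviour depends on the data.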

Answer 2

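A few ideas:

I think it will be more efficient to split each line into words and check whether several consecutive words appear in your word list. Ten lookups in a hash set beat millions of substring searches. If you have composite keywords, build appropriate indexes: one containing every single word that occurs in the real keywords, and another containing all the real keywords themselves.
Perhaps loading the strings into a StringBuilder is better for replacing.
Update the progress after, say, every 10000 lines processed, not after every line.
Process in a background thread. It won't be faster, but the application will stay responsive.
Parallelize the code, as Jeremy has suggested.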
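The progress suggestion could look roughly like the sketch below. It reads the stream length once before the loop and only invokes the callback every `reportEvery` lines. The names `ThrottledProcessor`, `Process`, `transform` and `reportPercent` are made up for this illustration; in the WinForms code from the question the callback would be whatever updates `label3`.

```csharp
using System;
using System.IO;

public static class ThrottledProcessor
{
    // Applies transform to every line of inputPath, writes the result to
    // outputPath, and reports progress only once every reportEvery lines.
    public static void Process(string inputPath, string outputPath,
                               Func<string, string> transform,
                               Action<int> reportPercent, int reportEvery = 10000)
    {
        using (var reader = new StreamReader(inputPath))
        using (var writer = new StreamWriter(outputPath))
        {
            // Read the length once, outside the loop (at least 1 to avoid dividing by zero).
            long length = Math.Max(1, reader.BaseStream.Length);
            long lineNo = 0;
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                writer.WriteLine(transform(line));
                if (++lineNo % reportEvery == 0)
                    reportPercent((int)(reader.BaseStream.Position * 100 / length));
            }
            reportPercent(100); // final update once everything is written
        }
    }
}
```

`BaseStream.Position` reflects how far the reader has buffered ahead, not the exact line being processed, so the percentage is approximate. For a progress label on a 20 GB file that is normally good enough.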

UPDATE

Here is sample code that demonstrates the word index idea:

static void ReplaceWords()
{
  // Placeholders: supply real file paths and a populated index before running.
  string inputFileName = null;
  string outputFileName = null;

  // this dictionary maps each single word that can be found
  // in any keyphrase to a list of the keyphrases that contain it.
  IDictionary<string, IList<string>> singleWordMap = null;

  using (var source = new StreamReader(inputFileName))
  {
    using (var target = new StreamWriter(outputFileName))
    {
      string line;
      while ((line = source.ReadLine()) != null)
      {
        // first, we split each line into single words - the unit of search
        var singleWords = SplitIntoWords(line);

        var result = new StringBuilder(line);
        // for each single word in the line
        foreach (var singleWord in singleWords)
        {
          // check if the word exists in any keyphrase we should replace
          // and if so, get the list of the related original keyphrases
          IList<string> interestingKeyPhrases;
          if (!singleWordMap.TryGetValue(singleWord, out interestingKeyPhrases))
            continue;

          Debug.Assert(interestingKeyPhrases != null && interestingKeyPhrases.Count > 0);

          // then process each of the keyphrases
          foreach (var interestingKeyphrase in interestingKeyPhrases)
          {
            // and replace it in the processed line if it exists
            result.Replace(interestingKeyphrase, GetTargetValue(interestingKeyphrase));
          }
        }

        // now, save the processed line
        target.WriteLine(result);
      }
    }
  }
}

private static string GetTargetValue(string interestingKeyword)
{
  throw new NotImplementedException();
}

static IEnumerable<string> SplitIntoWords(string keyphrase)
{
  throw new NotImplementedException();
}

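The code shows the basic idea:

We split both the keyphrases and the processed lines into equivalent units that can be compared efficiently: words.
We keep a dictionary that, for any word, quickly gives us references to all the keyphrases that contain that word.
Then we apply your original logic. However, we don't do it for all 12 million keyphrases, but only for the small subset of keyphrases that share at least one word with the line being processed.

I'll leave the rest of the implementation to you.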
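As one possible starting point for the remaining pieces, the sketch below builds the word-to-keyphrases dictionary from a list of keyphrases. It uses the simplest splitting rule (split on spaces, case-sensitive), which is an assumption; `KeyphraseIndexBuilder` and `BuildSingleWordMap` are hypothetical names, not part of the sample above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class KeyphraseIndexBuilder
{
    // Simplest possible splitting: on spaces, ignoring empty entries.
    public static IEnumerable<string> SplitIntoWords(string text)
    {
        return text.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
    }

    // Maps each single word to the list of keyphrases that contain it.
    public static IDictionary<string, IList<string>> BuildSingleWordMap(IEnumerable<string> keyphrases)
    {
        var map = new Dictionary<string, IList<string>>();
        foreach (string phrase in keyphrases)
        {
            // Distinct() keeps a phrase from being listed twice under a word it repeats.
            foreach (string word in SplitIntoWords(phrase).Distinct())
            {
                IList<string> phrases;
                if (!map.TryGetValue(word, out phrases))
                {
                    phrases = new List<string>();
                    map[word] = phrases;
                }
                phrases.Add(phrase);
            }
        }
        return map;
    }
}
```

With 12 million keyphrases this dictionary is large, so it is worth measuring its memory footprint on the target machine before committing to the design.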

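But the code has several issues:

SplitIntoWords actually has to normalize the words to some canonical form. This depends on the logic you need. In the simplest case you can probably get away with splitting on whitespace and lowercasing. However, you may need morphological matching, which would be harder (it is very close to a full-text search task).
For the sake of speed, it will probably be better to call GetTargetValue once for each keyphrase before processing the input.
If many of your keyphrases share the same words, you will still have a significant amount of extra work. In that case you will need to keep the positions of the keywords within the keyphrases, so that you can use word-distance calculations to exclude irrelevant keyphrases while processing an input line.
Also, I'm not sure whether StringBuilder is actually faster in this particular case. You should try both StringBuilder and string to find out.
This is a sample after all. The design is not very good. I'd consider extracting some classes with consistent interfaces (e.g. KeywordsIndex).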
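For the second point, precomputing the replacement values might be sketched as follows. It assumes the target is simply the keyphrase with spaces turned into underscores, as in the question; `TargetCache` and `BuildTargets` are names invented for this example.

```csharp
using System;
using System.Collections.Generic;

public static class TargetCache
{
    // Computes the replacement for every keyphrase once, up front, so the
    // per-line loop only does a dictionary lookup.
    public static Dictionary<string, string> BuildTargets(IEnumerable<string> keyphrases)
    {
        var targets = new Dictionary<string, string>();
        foreach (string phrase in keyphrases)
        {
            // Assumption: the target is the phrase with spaces replaced by underscores.
            targets[phrase] = phrase.Replace(' ', '_');
        }
        return targets;
    }
}
```

GetTargetValue could then be reduced to `targets[interestingKeyphrase]`. The trade-off is memory: the cache roughly doubles the storage needed for the keyphrase strings.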

Replacing a long list of words in a large text file
