Storing large amounts of ngrams in a SQL Server database as fast as possible

Date: 2015-02-22 14:53:40

Tags: c# performance nlp sql-server-2014-express

I have 14 files of prepared word-based 3-grams; the total size of the txt files is 75 GB. The ngrams are separated by ";", and within each ngram the word that follows the word sequence is separated by "|". Now I want to count how often a word follows a given 3-word sequence. Because of the amount of data, I need to do this as fast as possible.
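The splitting this format requires can be sketched as below; the sample line is hypothetical (the tokens are made up for illustration), assuming the layout `sequence|word;sequence|word;…`:

```csharp
using System;

class ParseDemo
{
    static void Main()
    {
        // Hypothetical sample line: each ngram is "three word sequence|following word",
        // with ngrams separated by ';'.
        string line = "much the same|like;much the same|as";

        foreach (string ngram in line.Split(';'))
        {
            string[] parts = ngram.Split('|');
            if (parts.Length > 1)
            {
                string sequence = parts[0]; // the 3-word sequence
                string word = parts[1];     // the word that follows it
                Console.WriteLine(sequence + " -> " + word);
            }
        }
    }
}
```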

My approach is:

  1. Split the files line by line
  2. Split each line by the delimiter ";"
  3. Split each ngram by the delimiter "|"
  4. Store the ngrams in the two tables sequences and words, and count in words how often the word follows that sequence

I have SQL Server 2014 Express, and my tables have the following structure:

• [dbo].[sequences]: Id | Sequence
• [dbo].[words]: Id | sid | word | count

The sequences table should be self-explanatory. In the words table, sid is the id of the related sequence, word is the word string, and count is an int that counts how often the word occurs after that sequence.
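A minimal sketch of this schema — the column types are my assumptions (the question does not state them); identity keys and nvarchar columns are guesses:

```sql
CREATE TABLE [dbo].[sequences] (
    Id       INT IDENTITY(1,1) PRIMARY KEY,
    Sequence NVARCHAR(400) NOT NULL       -- the 3-word sequence
);

CREATE TABLE [dbo].[words] (
    Id    INT IDENTITY(1,1) PRIMARY KEY,
    sid   INT NOT NULL REFERENCES [dbo].[sequences](Id), -- related sequence id
    word  NVARCHAR(200) NOT NULL,                        -- the following word
    count INT NOT NULL                                   -- occurrences after that sequence
);
```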

My solution below takes about 1 second per line at the start, which is far too slow. I tried to use Parallel, but then I got a SQL error — I guess because the table is locked while another process is inserting something.

My program:

        static void Main(string[] args)
        {
            DateTime begin = DateTime.Now;
            SqlConnection myConnection = new SqlConnection(@"Data Source=(localdb)\Projects;Database=ngrams;Integrated Security=True;Connect Timeout=30;Encrypt=False;TrustServerCertificate=False");
            myConnection.Open();
            for (int i = 0; i < 14; i++)
            {
                using (FileStream fs = File.Open(@"F:\Documents\ngrams\prepared_" + i + ".txt", FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
                using (BufferedStream bs = new BufferedStream(fs))
                using (StreamReader sr = new StreamReader(bs))
                {
                    string line;
                    int a = 0;
                    while ((line = sr.ReadLine()) != null)
                    {
                        string[] ngrams = line.Split(new char[] { ';' });
                        foreach (string ngram in ngrams)
                        {
                            string[] gram = ngram.Split(new Char[] { '|' });
                            if (gram.Length > 1)
                            {
                                string sequence = gram[0];
                                string word = gram[1];
                                storeNgrams(myConnection, sequence, word);
                            }
                        }
                        Console.WriteLine(DateTime.Now.Subtract(begin).TotalMinutes);
                        a++;
                    }
                }
            }
    
            Console.WriteLine("Processed 75 Gigabyte in hours: " + DateTime.Now.Subtract(begin).TotalHours);
        }
    
        // Looks up the id for this 3-word sequence; if it does not exist yet, inserts it
        // and uses the new identity value. Then inserts or updates the word row.
        private static void storeNgrams(SqlConnection myConnection, string sequence, string word)
        {
            SqlCommand insSeq = new SqlCommand("INSERT INTO sequences (sequence) VALUES (@sequence); SELECT SCOPE_IDENTITY()", myConnection);
            SqlCommand insWord = new SqlCommand("INSERT INTO words (sid, word, count) VALUES (@sid, @word, @count)", myConnection);
            SqlCommand updateWordCount = new SqlCommand("UPDATE words SET count = @count WHERE sid = @sid AND word = @word", myConnection);
            SqlCommand searchSeq = new SqlCommand("SELECT Id from sequences WHERE sequence = @sequence", myConnection);
            SqlCommand getWordCount = new SqlCommand("Select count from words WHERE sid = @sid AND word = @word", myConnection);
            searchSeq.Parameters.AddWithValue("@sequence", sequence);
            object searchSeq_obj = searchSeq.ExecuteScalar();
            if (searchSeq_obj != null)
            {
                // sequence already exists: reuse its id
                insNgram(insWord, updateWordCount, getWordCount, searchSeq_obj, word).ExecuteNonQuery();
            }
            else
            {
                // new sequence: insert it and use the returned identity as sid
                insSeq.Parameters.AddWithValue("@sequence", sequence);
                object sid_obj = insSeq.ExecuteScalar();
                if (sid_obj != null)
                {
                    insNgram(insWord, updateWordCount, getWordCount, sid_obj, word).ExecuteNonQuery();
                }
            }
        }
    
        private static SqlCommand insNgram(SqlCommand insWord, SqlCommand updateWordCount, SqlCommand getWordCount, object sid_obj, string word)
        {
            int sid = Convert.ToInt32(sid_obj);
            getWordCount.Parameters.AddWithValue("@sid", sid);
            getWordCount.Parameters.AddWithValue("@word", word);
            object wordCount_obj = getWordCount.ExecuteScalar();
            if (wordCount_obj != null)
            {
                int wordCount = Convert.ToInt32(wordCount_obj) + 1;
                return storeWord(updateWordCount, sid, word, wordCount);
            }
            else
            {
                int wordCount = 1;
                return storeWord(insWord, sid, word, wordCount);
            }
        }
    
        private static SqlCommand storeWord(SqlCommand updateWord, int sid, string word, int wordCount)
        {
            updateWord.Parameters.AddWithValue("@sid", sid);
            updateWord.Parameters.AddWithValue("@word", word);
            updateWord.Parameters.AddWithValue("@count", wordCount);
            return updateWord;
        }
    

How can I process the ngrams faster, so that it doesn't take an excessive amount of time?

P.S.: I am completely new to C# and natural language processing.

Edit 1: A sample ngram as requested — there are about 4 or 5 per line (but of course with different word combinations): much the same|like;

Edit 2: When I change my code to the following, I get the error System.AggregateException: One or more errors occurred ---> System.InvalidOperationException: There is already an open DataReader associated with this Command which must be closed first., just like here.

     Parallel.For(0, 14, i => sqlaction(myConnection, i, begin));
    

Edit 3: When I add MultipleActiveResultSets=true to the connection string, I don't get any errors with Parallel anymore. I replaced all the relevant loops with their Parallel equivalents and iterated over all the files, only counting lines (169521628 lines in total). I also measured the average time needed per line: 0.051502946 seconds. Even so, that works out to 169521628 × 0.0515 s ≈ 8.7 million seconds — I would still need about 101 days!

0 Answers:

No answers