Question

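Suppose a document contains many repeated statements (such as log messages). For example, take (a b d c e a d), where each letter stands for one sentence.

We need to find all possible unique sequences and their counts. For example (abd = 1, bd = 1, ad = 1, and so on).

We are given the minimum and maximum number of sentences a sequence may contain.

How can we do this most efficiently in terms of space and time?

I tried framing it as a tree problem in two steps (find the possible combinations, then count them). I looked at suffix trees, but given that we are dealing with whole sentences, the space complexity could be large.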

Answer 1

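I would do the following:

Map all sentences to integers (using a hash map).
Build a suffix tree / suffix array / suffix automaton over the resulting integer array.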
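
A minimal sketch of these two steps, assuming Python and a naive suffix array (sorted suffix start positions, compared only on their first `max_len` items) rather than a linear-time construction. The function name `count_sequences` and its parameters are made up for illustration:

```python
from itertools import groupby


def count_sequences(sentences, min_len, max_len):
    """Count every contiguous run of min_len..max_len sentences."""
    # Step 1: map each distinct sentence to a small integer id.
    ids = {}
    seq = [ids.setdefault(s, len(ids)) for s in sentences]
    names = list(ids)  # id -> sentence (dicts keep insertion order)
    n = len(seq)

    # Step 2: naive suffix array. Suffixes are compared only on their
    # first max_len items, since longer runs are never reported.
    order = sorted(range(n), key=lambda i: seq[i:i + max_len])

    # Step 3: in sorted order, suffixes sharing a prefix of length k are
    # adjacent, so each distinct run is one group of neighbours.
    counts = {}
    for k in range(min_len, max_len + 1):
        starts = [i for i in order if i + k <= n]
        for key, group in groupby(starts, key=lambda i: tuple(seq[i:i + k])):
            counts[tuple(names[x] for x in key)] = sum(1 for _ in group)
    return counts


if __name__ == "__main__":
    result = count_sequences("a b d c e a d".split(), 1, 3)
    for run, count in sorted(result.items()):
        print(" ".join(run), "=", count)
```

Working on integer ids keeps comparisons cheap no matter how long the sentences are, which is the point of the mapping step. For large inputs a proper suffix array or suffix automaton construction would replace the `sorted` call.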

Answer 2

For every possible string (one line / two lines / ... / all lines), compute a hash and count how often it occurs.

Pseudocode:

    for (firstline=0; firstline<N_LINES; firstline++) {
       for (lastline=firstline; lastline<N_LINES; lastline++) {
          newhash=calculateHash(firstline until lastline); # hash of lines firstline..lastline
          count_and_store(newhash);
       }
    }
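
The brute-force loop above could be sketched in Python roughly as follows. This is only an illustration: `count_runs_by_hash`, the optional `min_len`/`max_len` bounds (taken from the question), and the choice of SHA-256 are my assumptions, and it assumes no line contains a newline, which is used as the separator. The hash object is extended one line at a time, so a longer run reuses the work done for the shorter one:

```python
import hashlib
from collections import Counter


def count_runs_by_hash(lines, min_len=1, max_len=None):
    """Count every run of consecutive lines, keyed by its hash."""
    n = len(lines)
    max_len = n if max_len is None else max_len
    counts = Counter()
    for first in range(n):
        h = hashlib.sha256()
        for last in range(first, min(first + max_len, n)):
            # Extend the running hash by one line instead of rehashing
            # the whole run from scratch.
            h.update(lines[last].encode("utf-8") + b"\n")
            if last - first + 1 >= min_len:
                counts[h.hexdigest()] += 1
    return counts


if __name__ == "__main__":
    counts = count_runs_by_hash("a b d c e a d".split(), 1, 2)
    print(len(counts), "distinct runs,", sum(counts.values()), "runs in total")
```

To report the runs themselves rather than their hashes, you would also store one `(first, last)` pair per digest.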

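Edit: don't use brute force

A way to skip checks is to look at the 1-line / 2-line hashes first. You start by checking single log lines. You only need to go on checking longer runs from a given start line when the shorter run starting there is not unique.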

    Boolean array array_lineNumber_unique [N_LINES]; # Init this array with all values false
    Hashtype array substringCalculatedHash [N_LINES]; # Hashes for length numberlines-1; do not recalculate when checking
    Hashtype array fullstringCalculatedHash [N_LINES]; # Store hashes for the current length

    function checkRepeatedHashSubstring(startline) {
       # True when the shorter run starting at startline occurs more than once
       return numberOfHashesEqualTo(substringCalculatedHash[startline]) > 1;
    }

    for (numberlines=1; numberlines<N_LINES; numberlines++)
    {
        # Runs must fit inside the document
        for (firstline=0; firstline+numberlines<=N_LINES; firstline++) {
           if (array_lineNumber_unique[firstline]==true) {
              # Unique log combination since the substring was unique as well.
              # Just fill the hashmap for the current numberlines with the unique hash calculated for numberlines-1
              fullstringCalculatedHash[firstline]=substringCalculatedHash[firstline];
              continue; # Skip checking.
           }
           # There is no shorter run to consult at length 1, so always hash there
           if (numberlines==1 || checkRepeatedHashSubstring(firstline)) {
              lastline=firstline+numberlines-1;
              newhash=calculateHash(firstline until lastline);
              fullstringCalculatedHash[firstline]=newhash;
           } else {
              # substring was unique, so complete string is unique as well
              fullstringCalculatedHash[firstline]=substringCalculatedHash[firstline];
              array_lineNumber_unique[firstline]=true;
           }
        }
        # Once per length, after every start line has been processed
        count_and_logSubstringsCalculated(fullstringCalculatedHash);
        copy_fullstringCalculatedHash_to_substringCalculatedHash();
    }
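
The same pruning idea as a small Python sketch, under some assumptions of mine: it groups runs by their tuple of lines instead of by a custom hash (Python hashes the tuple for the dictionary lookup), it reports only runs that occur more than once, and the name `repeated_runs` is made up. Every run not reported occurs exactly once:

```python
from collections import defaultdict


def repeated_runs(lines, max_len=None):
    """Return {run: count} for every run of lines that occurs more than once."""
    n = len(lines)
    max_len = n if max_len is None else max_len
    # Start positions whose run of the previous length was repeated.
    active = list(range(n))
    result = {}
    for length in range(1, max_len + 1):
        groups = defaultdict(list)
        for start in active:
            if start + length <= n:
                groups[tuple(lines[start:start + length])].append(start)
        # A run that is unique at this length stays unique when extended,
        # so only start positions of repeated runs are kept.
        active = []
        for run, starts in groups.items():
            if len(starts) > 1:
                result[run] = len(starts)
                active.extend(starts)
        if not active:
            break
    return result


if __name__ == "__main__":
    print(repeated_runs("a b d c e a d".split()))
```

For the example from the question only `a` and `d` repeat, so the loop stops after length 2. On real logs, where most long runs are unique, the set of active start positions shrinks quickly.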

Finding repeated statements in a document
