Question

是否有任何算法可用于查找字符串中最常见的短语（或子串）？例如，以下字符串将“hello world”作为其最常见的双字短语：

"hello world this is hello world. hello world repeats three times in this string!"

在上面的字符串中，最常见的字符串（在空字符串字符之后，重复无限次）将是空格字符。

有没有办法生成此字符串中常见子串的列表，从最常见到最不常见？

Answer 1

这与Nussinov算法的任务类似，实际上更简单，因为我们不允许在对齐中出现任何间隙，插入或不匹配。

对于长度为N的字符串A，定义F[-1 .. N, -1 .. N]表并使用以下规则填写：

  for i = 0 to N
    for j = 0 to N
      if i != j
        {
          if A[i] == A[j]
            F[i,j] = F [i-1,j-1] + 1;
          else
            F[i,j] = 0;
        }

例如，对于 B A O B A B：

AlgChart

此时间为O(n^2)。表中的最大值现在指向最长自匹配子序列的结束位置（i - 一个出现的结束，j - 另一个）。在开始时，假设该数组是零初始化的。我添加条件来排除最长但可能没有趣的自我匹配的对角线。

更多地考虑，这个表在对角线上是对称的，所以它只能计算一半。此外，该阵列初始化为零，因此分配零是多余的。那仍然是

  for i = 0 to N
    for j = i + 1 to N
      if A[i] == A[j]
         F[i,j] = F [i-1,j-1] + 1;

更短但可能更难理解。计算表包含所有匹配，short和long。您可以根据需要添加进一步的过滤。

在下一步，你需要恢复字符串，从非零单元格向上和向左对角线。在此步骤中，使用一些散列映射来计算同一字符串的自相似性匹配的数量也很简单。使用普通字符串和正常最小长度时，只会通过此映射处理少量表格单元格。

我认为直接使用hashmap实际上需要O（n ^ 3），因为必须以某种方式比较访问结束时的关键字符串是否相等。这种比较可能是O（n）。

Answer 2

的Python。这有点快速和肮脏，数据结构完成了大部分工作。

from collections import Counter
accumulator = Counter()
text = 'hello world this is hello world.'
for length in range(1,len(text)+1):
    for start in range(len(text) - length):
        accumulator[text[start:start+length]] += 1

Counter结构是一个哈希支持的字典，用于计算你看到过多少次。添加到不存在的密钥将创建它，而检索不存在的密钥将给出零而不是错误。所以你要做的就是迭代所有的子串。

Answer 3

只是伪代码，也许这不是最美丽的解决方案，但我会这样解决：

function separateWords(String incomingString) returns StringArray{
  //Code
}

function findMax(Map map) returns String{
  //Code
}

function mainAlgorithm(String incomingString) returns String{
    StringArray sArr = separateWords(incomingString);
    Map<String, Integer> map; //init with no content
    for(word: sArr){
        Integer count = map.get(word);
        if(count == null){
            map.put(word,1);
        } else {
            //remove if neccessary
            map.put(word,count++); 
        }
   }
   return findMax(map);
}

where map可以包含一个键，值对，如Java HashMap。

Answer 4

Perl，O(n²)解决方案

my $str = "hello world this is hello world. hello world repeats three times in this string!";

my @words = split(/[^a-z]+/i, $str);
my ($display,$ix,$i,%ocur) = 10;

# calculate

for ($ix=0 ; $ix<=$#words ; $ix++) {
  for ($i=$ix ; $i<=$#words ; $i++) {
    $ocur{ join(':', @words[$ix .. $i]) }++;
  }
}

# display 

foreach (sort { my $c = $ocur{$b} <=> $ocur{$a} ; return $c ? $c : split(/:/,$b)-split(/:/,$a); } keys %ocur) {
  print "$_: $ocur{$_}\n";
  last if !--$display;
}

显示最常见子字符串的10个最佳分数（如果是平局，则首先显示最长的单词链）。将$display更改为1仅包含结果。
有n(n+1)/2次迭代。

Answer 5

因为对于长度大于＆gt; = 2的字符串的每个子字符串，文本至少包含至少一个长度为2的子字符串，所以我们只需要调查长度为2的子字符串。

val s = "hello world this is hello world. hello world repeats three times in this string!"

val li = s.sliding (2, 1).toList
// li: List[String] = List(he, el, ll, lo, "o ", " w", wo, or, rl, ld, "d ", " t", th, hi, is, "s ", " i", is, "s ", " h", he, el, ll, lo, "o ", " w", wo, or, rl, ld, d., ". ", " h", he, el, ll, lo, "o ", " w", wo, or, rl, ld, "d ", " r", re, ep, pe, ea, at, ts, "s ", " t", th, hr, re, ee, "e ", " t", ti, im, me, es, "s ", " i", in, "n ", " t", th, hi, is, "s ", " s", st, tr, ri, in, ng, g!)

val uniques = li.toSet
uniques.toList.map (u => li.count (_ == u))
// res18: List[Int] = List(1, 2, 1, 1, 3, 1, 5, 1, 1, 3, 1, 1, 3, 2, 1, 3, 1, 3, 2, 3, 1, 1, 1, 1, 1, 3, 1, 3, 3, 1, 3, 1, 1, 1, 3, 3, 2, 4, 1, 2, 2, 1)

uniques.toList(6)
res19: String = "s "

用于查找字符串中最常见子字符串的算法

5 个答案: