Question

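I am trying to adapt the Boyer-Moore C(++) implementation from Wikipedia so that it returns all matches of a pattern in a string. As it stands, the Wikipedia implementation returns only the first match. The main code looks like this: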

char* boyer_moore (uint8_t *string, uint32_t stringlen, uint8_t *pat, uint32_t patlen) {
    int i;
    int delta1[ALPHABET_LEN];
    int *delta2 = malloc(patlen * sizeof(int));
    make_delta1(delta1, pat, patlen);
    make_delta2(delta2, pat, patlen);

    i = patlen-1;
    while (i < stringlen) {
        int j = patlen-1;
        while (j >= 0 && (string[i] == pat[j])) {
            --i;
            --j;
        }
        if (j < 0) {
            free(delta2);
            return (string + i+1);
        }

        i += max(delta1[string[i]], delta2[j]);
    }
    free(delta2);
    return NULL;
}

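I have tried modifying the block after if (j < 0) to add the index to an array/vector and let the outer loop continue, but it doesn't seem to work. When I test the modified code I still get only a single match. Perhaps this implementation was not designed to return all matches, and it needs more than a few quick changes to do so? I don't understand the algorithm itself very well, so I'm not sure how to make this work. If anyone can point me in the right direction I would be grateful.

Note: the functions make_delta1 and make_delta2 are defined earlier in the source (see the Wikipedia page), and the max() call is actually a macro that is also defined earlier in the source.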

Answer 1

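Boyer-Moore's algorithm exploits the fact that when you search for, say, "HELLO WORLD" within a longer string, the letter you find at a given position restricts what can be found around that position if there is to be a match at all. It is a bit like a game of Battleship: if you find open sea four cells from the border, you needn't test the four remaining cells in case a 5-cell carrier is hiding there; it can't be.

If, for example, you find a 'D' in the eleventh position, it might be the last letter of HELLO WORLD. But if you find a 'Q', since 'Q' does not occur anywhere in HELLO WORLD, the searched-for string cannot be anywhere in the first eleven characters, and you can skip searching there altogether. An 'L', on the other hand, might mean that HELLO WORLD is there, starting at position 11-3 (the third letter of HELLO WORLD is an L), 11-4, or 11-10.

While searching, you keep track of these possibilities using the two delta arrays.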
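To make the 'D' / 'Q' / 'L' reasoning above concrete, here is a minimal sketch of how a bad-character table in the spirit of make_delta1 could be built. The name build_bad_char_table and the use of ptrdiff_t are my own assumptions for illustration, not the Wikipedia code itself; the second table (delta2) is not shown.

```c
#include <stddef.h>
#include <stdint.h>

#define ALPHABET_LEN 256

/* Hypothetical helper, modelled on the idea behind make_delta1:
 * for each byte value, how far the window may shift when that byte is
 * found in the text aligned with the last character of the pattern. */
void build_bad_char_table(ptrdiff_t table[ALPHABET_LEN],
                          const uint8_t *pat, size_t patlen)
{
    /* A byte that never occurs in the pattern allows a full-length shift. */
    for (size_t c = 0; c < ALPHABET_LEN; c++)
        table[c] = (ptrdiff_t)patlen;

    /* For bytes that do occur, keep the distance from their rightmost
     * occurrence to the end of the pattern (later writes win). */
    for (size_t k = 0; k < patlen; k++)
        table[pat[k]] = (ptrdiff_t)(patlen - 1 - k);
}
```

For "HELLO WORLD" this gives 11 for 'Q' (skip the whole window), 0 for 'D' (it may be the last letter, so compare further) and 1 for 'L' (its rightmost occurrence is one position before the end).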

So when you find a match, you ought to do something like this:

if (j < 0)
{
    // Found a match occupying positions i+1 .. i+patlen.
    // Store its offset, first checking that the index array still has
    // room for it plus the terminating 0.
    if (index_counter + 1 >= index_size)
    {
        index[index_counter] = 0;
        free(delta2);   // this is an exit path, so release delta2 here
        return index_size;
    }
    index[index_counter++] = i+1;

    // Reinitialize j to restart the search
    // (redundant if j is declared inside the loop, as in the Wikipedia code)
    j = patlen-1;

    // Move i to the last character of the window that starts one
    // position after this match, so overlapping matches are not skipped
    i += patlen + 1;

    // Do not free delta2 here; the search is still running
    // free(delta2);

    // Continue loop without altering i again
    continue;
}
i += max(delta1[string[i]], delta2[j]);
}
free(delta2);
index[index_counter] = 0;
return index_counter;

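If you pass something like size_t *indexes to the function (together with its capacity, index_size), this should fill it with a zero-terminated list of offsets. Since 0 is also a valid offset for a match at the very start of the string, rely on the returned count rather than on the terminator when reading the list back.

The function returns 0 (not found), index_size (too many matches), or the number of matches, between 1 and index_size-1.

This makes it possible, for example, to collect further matches without repeating the whole search for the (index_size-1) matches already found: you increase num_indexes by new_num, realloc the indexes array, then call the function again with the new array at offset old_index_size-1, new_num as the new size, and the haystack string starting from the offset of the match at index old_index_size-1 plus one (not, as I wrote in a previous revision, plus the length of the needle string; see the comments).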
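The "grow the array and resume one past the last match" idea can be sketched roughly as follows. To keep the example self-contained it uses a naive memcmp scan as a stand-in for the bounded Boyer-Moore search; find_from, find_all_growing, batch and total are hypothetical names of mine, and unlike the description above the stand-in takes an absolute start offset instead of a shifted string pointer.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the bounded search: stores up to max_out match
 * offsets found at or after `from` and returns how many were stored. */
static size_t find_from(const char *text, size_t textlen,
                        const char *pat, size_t patlen,
                        size_t from, size_t *out, size_t max_out)
{
    size_t count = 0;
    if (patlen == 0 || patlen > textlen)
        return 0;
    for (size_t s = from; s + patlen <= textlen && count < max_out; s++)
        if (memcmp(text + s, pat, patlen) == 0)
            out[count++] = s;
    return count;
}

/* Collects every match in batches of `batch` offsets. Returns a malloc'ed
 * array (the caller frees it) and writes the number of matches to *total;
 * returns NULL if batch is 0 or allocation fails. */
size_t *find_all_growing(const char *text, size_t textlen,
                         const char *pat, size_t patlen,
                         size_t batch, size_t *total)
{
    size_t *idx = NULL;
    size_t n = 0, from = 0;

    *total = 0;
    if (batch == 0)
        return NULL;
    for (;;) {
        /* Make room for one more batch of offsets. */
        size_t *grown = realloc(idx, (n + batch) * sizeof *grown);
        if (grown == NULL) {
            free(idx);
            return NULL;
        }
        idx = grown;

        size_t got = find_from(text, textlen, pat, patlen, from, idx + n, batch);
        n += got;
        if (got < batch)
            break;              /* batch not filled: no more matches */
        from = idx[n - 1] + 1;  /* resume one past the last match start */
    }
    *total = n;
    return idx;
}
```

Resuming at the last match offset plus one, rather than plus the pattern length, is what keeps overlapping matches from being lost between batches.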

This approach also reports overlapping matches: for example, searching for ana in banana finds both b*ana*na and ban*ana*.
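For reference, here is a self-contained sketch of a search that collects every match, overlaps included. Note that it is not the Wikipedia code: to stay short it uses the Horspool simplification of Boyer-Moore (a single bad-character table, no delta2), and find_all_horspool, out and max_out are names I made up for the example.

```c
#include <stddef.h>
#include <stdint.h>

#define ALPHABET_LEN 256

/* Hypothetical function (not from the Wikipedia code): stores the start
 * offset of every match, overlapping ones included, in out[0..max_out-1]
 * and returns the number of offsets stored. */
size_t find_all_horspool(const uint8_t *text, size_t textlen,
                         const uint8_t *pat, size_t patlen,
                         size_t *out, size_t max_out)
{
    size_t shift[ALPHABET_LEN];
    size_t count = 0;

    if (patlen == 0 || patlen > textlen || max_out == 0)
        return 0;

    /* Horspool shift table: distance from the rightmost occurrence of each
     * byte (ignoring the last pattern position) to the end of the pattern. */
    for (size_t c = 0; c < ALPHABET_LEN; c++)
        shift[c] = patlen;
    for (size_t k = 0; k + 1 < patlen; k++)
        shift[pat[k]] = patlen - 1 - k;

    size_t start = 0;                   /* left edge of the current window */
    while (start + patlen <= textlen) {
        size_t j = patlen;              /* compare right to left */
        while (j > 0 && text[start + j - 1] == pat[j - 1])
            j--;
        if (j == 0) {
            out[count++] = start;       /* record the match */
            if (count == max_out)
                break;                  /* output buffer is full */
        }
        /* Shift by the table entry for the text byte under the window's
         * last position. The shift is at least 1 and never jumps over a
         * possible match, so overlapping matches are still found. */
        start += shift[text[start + patlen - 1]];
    }
    return count;
}
```

Searching for ana in banana with this sketch stores the offsets 1 and 3. Because the function returns a count, it does not need the zero terminator, which would be ambiguous for a match at offset 0.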

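Update

I tested the above and it appears to work. I modified the Wikipedia code by adding these two includes to keep gcc from complaining

#include <stdio.h>
#include <string.h>

Then I modified the if (j < 0) block to simply print what it found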

if (j < 0) {
    printf("Found %s at offset %d: %s\n", pat, i+1, string+i+1);
    //free(delta2);
    // return (string + i+1);
    i += patlen + 1;
    j = patlen - 1;
    continue;
}

Finally I tested with this

int main(void) {
    char *s = "This is a string in which I am going to look for a string I will string along";
    char *p = "string";
    boyer_moore(s, strlen(s), p, strlen(p));
    return 0;
}

and got, as expected:

Found string at offset 10: string in which I am going to look for a string I will string along
Found string at offset 51: string I will string along
Found string at offset 65: string along

If the string contains two overlapping sequences, BOTH are found:

char *s = "This is an andean andeandean andean trouble";
char *p = "andean";

Found andean at offset 11: andean andeandean andean trouble
Found andean at offset 18: andeandean andean trouble
Found andean at offset 22: andean andean trouble
Found andean at offset 29: andean trouble

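To avoid overlapping matches, the quickest way is simply not to store the overlaps. Skipping them could be done inside the search itself, but that would mean reinitializing the first delta vector and updating the string pointer; we would also need to keep a second i index, i2, to stop the saved indexes from going non-monotonic. It isn't worth it. Better: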

if (j < 0) {
    // We have found a patlen match at i+1
    // Is it an overlap with the last stored match?
    if (index && (indexes[index] + patlen > i+1)) {
        // Yes, it is. So we don't store it.
        // We could store the last of several overlaps.
        // It's not exactly trivial, though:
        // searching 'anana' in 'Bananananana'
        // finds FOUR matches, and the fourth is NOT overlapped
        // with the first. So in case of overlap, if we want to keep
        // the LAST of the bunch, we must save info somewhere else,
        // say last_conflicting_overlap, and check twice.
        // Then again, the third match (which is the last to overlap
        // with the first) would overlap with the fourth.
        // So the "return as many non overlapping matches as possible"
        // is actually accomplished by doing NOTHING in this branch of the IF.
    } else {
        // Not an overlap, so store it (indexes[] is filled from slot 1).
        indexes[++index] = i+1;
        if (index == max_indexes) // Too many matches already found?
            break; // Stop searching and return found so far
    }
    // Adapt i and j to keep searching
    i += patlen + 1;
    j = patlen - 1;
    continue;
}
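If you would rather keep the search untouched, the same greedy rule can also be applied afterwards to a list of offsets that was collected with overlaps. This is only a sketch under that assumption: drop_overlaps is a hypothetical helper of mine, and it expects the offsets in ascending order, which is how a left-to-right search produces them.

```c
#include <stddef.h>

/* Hypothetical helper: given match offsets in ascending order, keep only
 * those that do not overlap the previously kept match. Compacts the array
 * in place and returns the new count. */
size_t drop_overlaps(size_t *offsets, size_t count, size_t patlen)
{
    size_t kept = 0;
    for (size_t k = 0; k < count; k++) {
        /* Keep the first match, and any match that starts at or after
         * the end of the last kept one. */
        if (kept == 0 || offsets[k] >= offsets[kept - 1] + patlen)
            offsets[kept++] = offsets[k];
    }
    return kept;
}
```

For the four matches of anana in Bananananana (offsets 1, 3, 5 and 7) this keeps 1 and 7, which agrees with the reasoning in the comments above.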
