aho corasick算法的状态转换表

时间:2014-03-14 06:59:55

标签: algorithm pattern-matching aho-corasick

请帮助我理解Aho-Corasick算法中多个模式的状态转换表的构造。

请提供一个简单而详细的解释,以便我理解。

我正在关注this论文,here就是动画。

感谢。

1 个答案:

答案 0 :(得分:9)

第1阶段

创建关键字树

Starting at the root, follow the path labeled by chars of Pi
        If the path ends before Pi, continue it by adding new edges and ...
                            nodes for the remaining characters of P
        Store identifier i of Pi at the terminal node of the path

我通过一个例子来证明:

假设我们有一组有限的模式P = {他,她,他的,她的}。

enter image description here

阶段2

接下来,我们将关键字树扩展为自动机,以支持线性时间匹配,如下所示。

显然,自动机由两部分组成:

  • 状态
  • 转换

状态:正是关键字树实现的节点;和初始状态=树的根。

enter image description here

转换:使用 转到(q,a) 功能,如下所示。

/*This function gives the state entered from current state q 
     by matching target char a*/
- If edge (q,v) in the tree is labeled by a, then g(q,a)=v; 
- g(0,a)=0 for each a that does not label an edge out of the root
//the automton stays at the initial state while scanning non-matching characters       
- O.W. g(q,a)={}

enter image description here

使用自动机

SearchofTarget T[1..m] //considering the target string T as an 
                     array of chars indexed from 1   to m
q:= 0; // initial state(root)
for i:= 1 to m do
    while g(q,T[i]) == {}  do 
        q:= fail(q); // follow a fail
    q:= goto(q,T[i]); // follow a goto
    if output(q) != {} then print i, out(q);
endfor;

正如您在上面的伪代码中看到的,它调用了两个函数: fail(q) output(q)

失败(q) :// for q!= 0表示输入的状态不匹配

failure(q) is the node labeled by the longest proper suffix  w of L(q) ...
    s.t. w is a prefix of some pattern
//L(q) is the label of node q as the concatenation of edge labels ...
   on the path from the root to q

输出(q)

gives the set of patterns recognized when entering state q

计算完这两个函数后,自动机看起来像这样:

enter image description here

现在您可能想知道何时计算这些方法,因此请查看以下更多官方表格:

enter image description here

希望这有帮助,如果仍然有些含糊不清,请不要犹豫。