Levenshtein distance: inferring the edit operations from the matrix

Time: 2011-05-01 15:00:19

Tags: algorithm levenshtein-distance

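I have written the Levenshtein algorithm in C++.

If I input:
string s: democrat
string t: republican

I get the matrix D filled in, and the number of operations (the Levenshtein distance) can be read from D[10][8] = 8. Beyond filling the matrix, I want to construct the optimal solution. What should this solution look like? I have no idea.
Please only write how it should look for this example.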

7 answers:

Answer 0 (score: 36)

Answer 1 (score: 1)

It has been a while since I played with this, but it seems to me the matrix should look something like this:

 .  .  r  e  p  u  b  l  i  c  a  n
 .  0  1  2  3  4  5  6  7  8  9 10
 d  1  1  2  3  4  5  6  7  8  9 10
 e  2  2  1  2  3  4  5  6  7  8  9
 m  3  3  2  2  3  4  5  6  7  8  9
 o  4  4  3  3  3  4  5  6  7  8  9
 c  5  5  4  4  4  4  5  6  6  7  8
 r  6  5  5  5  5  5  5  6  7  7  8
 a  7  6  6  6  6  6  6  6  7  7  8
 t  8  7  7  7  7  7  7  7  7  8  8

Don't take it for granted, though.
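
A table like this is easy to get wrong by hand, so it can help to regenerate it. The following is a minimal Python sketch, not the asker's C++ code. The names `levenshtein_matrix` and `format_matrix` are made up for this illustration, and rows are assumed to correspond to `democrat` and columns to `republican`, as in the table above. With this orientation the distance sits at `d[8][10]`, which is the transpose of the `D[10][8]` indexing used in the question.

```python
def levenshtein_matrix(s, t):
    """Return the (len(s)+1) x (len(t)+1) edit-distance matrix for s -> t."""
    rows, cols = len(s) + 1, len(t) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # turning s[:i] into "" takes i deletions
    for j in range(cols):
        d[0][j] = j  # turning "" into t[:j] takes j insertions
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # match or substitution
    return d


def format_matrix(s, t, d):
    """Render the matrix with t across the top and s down the side."""
    header = [".", "."] + list(t)
    lines = [" ".join(f"{c:>2}" for c in header)]
    for label, row in zip("." + s, d):
        lines.append(" ".join(f"{c:>2}" for c in [label] + row))
    return "\n".join(lines)


d = levenshtein_matrix("democrat", "republican")
print(format_matrix("democrat", "republican", d))
print("distance:", d[-1][-1])
```

The bottom-right cell is the distance; every other cell is the distance between the corresponding prefixes, which is what makes walking back through the table possible.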

Answer 2 (score: 1)

Here is a VBA algorithm based on mjv's answer. (It is very well explained, but some cases were missing.)

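Answer 3 (score: 0)

I recently did some work with the matrix of the Levenshtein distance algorithm. I needed to produce the operations that would transform one list into another. (This works for strings too.)

Do the following (vows) tests show the kind of functionality you are looking for?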

  , "lev - complex 2"
  : { topic
    : lev.diff([13, 6, 5, 1, 8, 9, 2, 15, 12, 7, 11], [9, 13, 6, 5, 1, 8, 2, 15, 12, 11])
    , "check actions"
    : function(topic) { assert.deepEqual(topic, [{ op: 'delete', pos: 9, val: 7 },
                                                 { op: 'delete', pos: 5, val: 9 },
                                                 { op: 'insert', pos: 0, val: 9 },
                                                ]); }
    }
  , "lev - complex 3"
  : { topic
    : lev.diff([9, 13, 6, 5, 1, 8, 2, 15, 12, 11], [13, 6, 5, 1, 8, 9, 2, 15, 12, 7, 11])
    , "check actions"
    : function(topic) { assert.deepEqual(topic, [{ op: 'delete', pos: 0, val: 9 },
                                                 { op: 'insert', pos: 5, val: 9 },
                                                 { op: 'insert', pos: 9, val: 7 }
                                                ]); }
    }
  , "lev - complex 4"
  : { topic
    : lev.diff([9, 13, 6, 5, 1, 8, 2, 15, 12, 11, 16], [13, 6, 5, 1, 8, 9, 2, 15, 12, 7, 11, 17])
    , "check actions"
    : function(topic) { assert.deepEqual(topic, [{ op: 'delete', pos: 0, val: 9 },
                                                 { op: 'insert', pos: 5, val: 9 },
                                                 { op: 'insert', pos: 9, val: 7 },
                                                 { op: 'replace', pos: 11, val: 17 }
                                                ]); }
    }
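
To make the meaning of these records concrete, here is a small Python sketch that replays them. It rests on an assumption that the tests above are consistent with but do not state: the operations are applied in the listed order, and each `pos` refers to the list as it looks at that moment. `apply_ops` is a hypothetical helper written for this note, not part of the `lev` module.

```python
def apply_ops(source, ops):
    """Replay {'op', 'pos', 'val'} records on a copy of source and return the result."""
    result = list(source)
    for step in ops:
        op, pos, val = step["op"], step["pos"], step["val"]
        if op == "delete":
            # A delete record carries the removed value; use it as a sanity check.
            if result[pos] != val:
                raise ValueError(f"expected {val} at {pos}, found {result[pos]}")
            del result[pos]
        elif op == "insert":
            result.insert(pos, val)
        elif op == "replace":
            result[pos] = val
        else:
            raise ValueError(f"unknown op: {op!r}")
    return result


# The "lev - complex 4" case from the tests above.
source = [9, 13, 6, 5, 1, 8, 2, 15, 12, 11, 16]
target = [13, 6, 5, 1, 8, 9, 2, 15, 12, 7, 11, 17]
ops = [{"op": "delete", "pos": 0, "val": 9},
       {"op": "insert", "pos": 5, "val": 9},
       {"op": "insert", "pos": 9, "val": 7},
       {"op": "replace", "pos": 11, "val": 17}]
print(apply_ops(source, ops) == target)
```

Replaying an edit script like this is also a cheap way to check any backtracking routine: the number of records should equal the distance, and applying them should produce the target.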

Answer 4 (score: 0)

Here is some Matlab code. Is it correct in your opinion? It seems to give the right results :)

clear all

s = char('democrat');
t = char('republican');

% Edit Matrix
m=length(s);
n=length(t);
mat=zeros(m+1,n+1);
for i=1:1:m
    mat(i+1,1)=i;
end
for j=1:1:n
    mat(1,j+1)=j;
end
for i=1:m
    for j=1:n
        if (s(i) == t(j))
            mat(i+1,j+1)=mat(i,j);
        else
            mat(i+1,j+1)=1+min(min(mat(i+1,j),mat(i,j+1)),mat(i,j));
        end
    end
end

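% Edit Sequence: walk back from the bottom-right cell, turning t back into s.
% A step is taken only if it explains the value of the current cell, and the
% loop runs until both indices reach the first row and column.
% (The labels are padded to equal width so the printed columns line up.)
i = m+1;
j = n+1;
display([s ' --> ' t])
while(i ~= 1 || j ~= 1)
    if(i > 1 && j > 1 && s(i-1) == t(j-1) && mat(i-1,j-1) == mat(i,j))
        % characters match: diagonal step at no cost
        i = i - 1;
        j = j - 1;
        disp(strcat(['uunchanged: i=' int2str(i) ' , j=' int2str(j) ' ; ' s ' --> ' t]))
    elseif(i > 1 && j > 1 && mat(i-1,j-1) + 1 == mat(i,j))
        % diagonal step at cost 1
        i = i - 1;
        j = j - 1;
        t(j) = s(i);
        disp(strcat(['substition: i=' int2str(i) ' , j=' int2str(j) ' ; ' s ' --> ' t]))
    elseif(i > 1 && mat(i-1,j) + 1 == mat(i,j))
        % step up: put s(i) back into t
        i = i - 1;
        t = [t(1:j-1) s(i) t(j:end)];
        disp(strcat(['iinsertion: i=' int2str(i) ' , j=' int2str(j) ' ; ' s ' --> ' t]))
    else
        % step left: drop the current character of t
        j = j - 1;
        t(j) = [];
        disp(strcat(['dddeletion: i=' int2str(i) ' , j=' int2str(j) ' ; ' s ' --> ' t]))
    end
end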

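Answer 5 (score: 0)

A C# implementation of JackIsJack's answer, with some changes:

  • Operations are output in "forward" order (JackIsJack's version outputs them in reverse order);
  • The last 'else' clause in the original answer did not work correctly (it looks like a copy-paste error).

Console application code: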

using System;
using System.Collections.Generic;

class Program
{
    static void Main(string[] args)
    {
        Levenshtein("1", "1234567890");
        Levenshtein( "1234567890", "1");

        Levenshtein("kitten", "mittens");
        Levenshtein("mittens", "kitten");
        Levenshtein("kitten", "sitting");
        Levenshtein("sitting", "kitten");
        Levenshtein("1234567890", "12356790");
        Levenshtein("12356790", "1234567890");
        Levenshtein("ceci est un test", "ceci n'est pas un test");
        Levenshtein("ceci n'est pas un test", "ceci est un test");
    }

    static void Levenshtein(string string1, string string2)
    {
        Console.WriteLine("Levenshtein '" + string1 + "' => '" + string2 + "'");

        var string1_length = string1.Length;
        var string2_length = string2.Length;

        int[,] distance = new int[string1_length + 1, string2_length + 1];

        for (int i = 0; i <= string1_length; i++)
        {
            distance[i, 0] = i;
        }


        for (int j = 0; j <= string2_length; j++)
        {
            distance[0, j] = j;
        }


        for (int i = 1; i <= string1_length; i++)
        {
            for (int j = 1; j <= string2_length; j++)
            {
                if (string1[i - 1] == string2[j - 1])
                {
                    distance[i, j] = distance[i - 1, j - 1];
                }
                else
                {
                    distance[i, j] = Math.Min(distance[i - 1, j] + 1, Math.Min(
                       distance[i, j - 1] + 1,
                       distance[i - 1, j - 1] + 1));
                }

            }
        }


        var LevenshteinDistance = distance[string1_length, string2_length];// for information only
        Console.WriteLine($"Levenshtein distance: {LevenshteinDistance}");

        // List of operations
        var current_posx = string1_length;
        var current_posy = string2_length;

        var stack = new Stack<string>(); // for outputting messages in forward direction

        while (current_posx != 0 || current_posy != 0)
        {
            var cc = distance[current_posx, current_posy];
            // edge cases
            if (current_posy - 1 < 0)
            {
                stack.Push("Delete '" + string1[current_posx - 1] + "'");
                current_posx--;
                continue;
            }

            if (current_posx - 1 < 0)
            {
                stack.Push("Insert '" + string2[current_posy - 1] + "'");
                current_posy--;
                continue;
            }

            // Middle cases
            var cc_L = distance[current_posx, current_posy - 1];
            var cc_U = distance[current_posx - 1, current_posy];
            var cc_D = distance[current_posx - 1, current_posy - 1];

            if ((cc_D <= cc_L && cc_D <= cc_U) && (cc_D == cc - 1 || cc_D == cc))
            {
                if (cc_D == cc - 1)
                {
                    stack.Push("Substitute '" + string1[current_posx - 1] + "' by '" + string2[current_posy - 1] + "'");
                    current_posx--;
                    current_posy--;
                }
                else
                {
                    stack.Push("Keep '" + string1[current_posx - 1] + "'");
                    current_posx--;
                    current_posy--;
                }
            }
            else if (cc_L <= cc_D && cc_L == cc - 1)
            {
                stack.Push("Insert '" + string2[current_posy - 1] + "'");
                current_posy--;                   
            }
            else
            {
                stack.Push("Delete '" + string1[current_posx - 1]+"'");
                current_posx--;                   
            }
        }

        while(stack.Count > 0)
        {
            Console.WriteLine(stack.Pop());
        }
    }
}

Answer 6 (score: 0)

A backtracking algorithm, implemented in Python, that infers the moves from the matrix:

    def _backtrack_string(matrix, output_word):
        '''
        Iteratively backtrack DP matrix to get optimal set of moves

        Inputs: DP matrix (list:list:int),
                Output word (str)
        Output: Optimal path (list)
        '''

        # start at the bottom-right cell of the DP matrix
        i = len(matrix) - 1
        j = len(matrix[0]) - 1
        optimal_path = []
        while i > 0 and j > 0:
            diagonal = matrix[i-1][j-1]
            vertical = matrix[i-1][j]
            horizontal = matrix[i][j-1]
            current = matrix[i][j]
            if diagonal <= vertical and diagonal <= horizontal and (diagonal <= current):
                i = i - 1
                j = j - 1
                if diagonal == current - 1:
                    optimal_path.append("Replace " + str(j) + ", " + str(output_word[j]) )
                # diagonal == current: the diagonal step itself records no move, but a
                # left or upper neighbour that is <= current adds one more step
                elif horizontal <= vertical and horizontal <= current:
                    j = j - 1
                    optimal_path.append("Insert " + str(j) + ", " + str(output_word[j]))
                elif vertical <= horizontal and vertical <= current:
                    i = i - 1
                    optimal_path.append("Delete " + str(i))
            elif horizontal <= vertical and horizontal <= current:
                j = j - 1
                optimal_path.append("Insert " + str(j) + ", " + str(output_word[j]))
            else:
                i = i - 1
                optimal_path.append("Delete " + str(i))

        # moves were collected from the end of the word backwards
        return list(reversed(optimal_path))

The output I get when I run the algorithm with the original word "OPERATING" and the desired word "CONSTANTINE" is the following:

    Insert 0, C
    Replace 2, N
    Replace 3, S
    Replace 4, T
    Insert 6, N
    Replace 10, E

       ""  C  O  N  S  T  A  N  T  I  N   E

    "" [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11] 
              <--                              Insert 0, C
    O  [1, 1, 1, 2, 3, 4, 5, 6, 7, 8, 9,  10]
                \                              Replace 2, N
    P  [2, 2, 2, 2, 3, 4, 5, 6, 7, 8, 9,  10]
                   \                           Replace 3, S
    E  [3, 3, 3, 3, 3, 4, 5, 6, 7, 8, 9,  9]
                      \                        Replace 4, T
    R  [4, 4, 4, 4, 4, 4, 5, 6, 7, 8, 9,  10]  No move
                         \ <--                 Insert 6, N
    A  [5, 5, 5, 5, 5, 5, 4, 5, 6, 7, 8,  9]
                               \               No move
    T  [6, 6, 6, 6, 6, 5, 5, 5, 5, 6, 7,  8]
                                  \            No move
    I  [7, 7, 7, 7, 7, 6, 6, 6, 6, 5, 6,  7]
                                     \         No move
    N  [8, 8, 8, 7, 8, 7, 7, 6, 7, 6, 5,  6]
                                        \      Replace 10, E
    G  [9, 9, 9, 8, 8, 8, 8, 7, 7, 7, 6,  6]

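Note that some extra conditions have to be added for the case where the diagonal element is the same as the current element. Depending on the values in the vertical (up) and horizontal (left) positions, there may still be a delete or an insert. We only get a "no operation" or a "replace" operation in the following situations: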

# assume bottom right of a 2x2 matrix is the reference position 
# and has value v
# the following is the situation where we get a replace operation
    [v - 1 , v<]
    [  v<  , v]
# the following is the situation where we get a "no operation"
    [v , v<]
    [v<, v ] 

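I think this is where the algorithm described in the first answer can break. There can be other arrangements of the 2x2 matrix above in which neither operation is correct. The example with input "OPERATING" and output "CONSTANTINE" breaks the algorithm unless this is taken into account.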
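
One way to sidestep the ambiguity described above is to stop comparing neighbours with each other and instead accept a step only when it explains the value of the current cell: a diagonal step at no cost requires matching characters, and every other step requires the neighbour to be exactly one less. The following is a sketch under those assumptions, not a repaired version of any answer in this thread. `levenshtein_matrix` and `backtrack` are hypothetical names, and ties between equally cheap paths are broken in the order match, replace, insert, delete, which is an arbitrary choice.

```python
def levenshtein_matrix(source, target):
    """Standard DP table: d[i][j] is the distance from source[:i] to target[:j]."""
    d = [[0] * (len(target) + 1) for _ in range(len(source) + 1)]
    for i in range(len(source) + 1):
        d[i][0] = i
    for j in range(len(target) + 1):
        d[0][j] = j
    for i in range(1, len(source) + 1):
        for j in range(1, len(target) + 1):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d


def backtrack(matrix, source, target):
    """Walk from the bottom-right cell to the top-left; return the moves in forward order."""
    i, j = len(source), len(target)
    steps = []
    while i > 0 or j > 0:
        current = matrix[i][j]
        if (i > 0 and j > 0 and source[i - 1] == target[j - 1]
                and matrix[i - 1][j - 1] == current):
            i, j = i - 1, j - 1  # characters match: no operation
        elif i > 0 and j > 0 and matrix[i - 1][j - 1] + 1 == current:
            i, j = i - 1, j - 1
            steps.append(("replace", j, target[j]))
        elif j > 0 and matrix[i][j - 1] + 1 == current:
            j -= 1
            steps.append(("insert", j, target[j]))
        else:
            # only the upper neighbour is left to explain the current value
            i -= 1
            steps.append(("delete", i, source[i]))
    steps.reverse()
    return steps


source, target = "OPERATING", "CONSTANTINE"
for step in backtrack(levenshtein_matrix(source, target), source, target):
    print(step)
```

Because the loop runs until both indices reach zero, leftover insertions or deletions along the first row or column are reported too. For replace and insert the position is an index into the target word, and for delete it is an index into the source word.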