Character diff / annotation algorithm

Time: 2011-02-22 14:53:34

Tags: algorithm diff

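I have a set of strings representing the history of a document. Each string is the whole document; no diff analysis has been done yet.

I need a relatively efficient algorithm that lets me annotate substrings of the document with the revision they came from.

For example, if the document history looked like this: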

Rev1: The quiet fox
Rev2: The quiet brown fox
Rev3: The quick brown fox

the algorithm would give:

The quick brown fox
1111111331222222111

That is, "The qui" was added in revision 1, "ck" in revision 3, " " in revision 1, "brown " in revision 2, and finally "fox" in revision 1.
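To make the target format concrete, here is a small Python sketch that turns the per-character annotation back into the spans listed above. It is only an illustration of the desired output; the helper name `owner_runs` is made up for this example and is not part of any library.

```python
from itertools import groupby


def owner_runs(text, owners):
    """Group per-character owners into (revision, substring) runs."""
    if len(text) != len(owners):
        raise ValueError("text and owners must have the same length")
    runs = []
    start = 0
    # groupby yields one group per maximal run of identical owner values
    for revision, group in groupby(owners):
        length = sum(1 for _ in group)
        runs.append((revision, text[start:start + length]))
        start += length
    return runs


print(owner_runs("The quick brown fox", "1111111331222222111"))
# [('1', 'The qui'), ('3', 'ck'), ('1', ' '), ('2', 'brown '), ('1', 'fox')]
```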

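3 answers:

Answer 0 (score: 3)

I have a class library that can do this easily, though I don't know how well it performs with large revisions or with many of them.

The library is here: DiffLib on CodePlex (you can also install it via NuGet).

A script for the example in the question is below (you can run it in LINQPad if you add a reference to the DiffLib assembly):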

void Main()
{
    var revs = new string[]
    {
        "The quiet fox",
        "The quiet brown fox",
        "The quick brown fox",
        "The quick brown fox.",
        "The quick brown fox jumped over the lazy dog.",
        "The quick brown fox jumped over the lazy cat.",
        "The Quick Brown Fox jumped over the Lazy Cat.",
    };

    string current = revs[0];
    List<int> owner = new List<int>();
    foreach (char c in current)
        owner.Add(1); // owner 1 owns entire string

    Action<int> dumpRev = delegate(int rev)
    {
        Debug.WriteLine("rev " + rev);
        Debug.WriteLine(current);
        // one digit per character, so this display only works for owner IDs 1-9
        Debug.WriteLine(new string(owner.Select(i => (char)(48 + i)).ToArray()));
        Debug.WriteLine("");
    };
    dumpRev(0);

    for (int index = 1; index < revs.Length; index++)
    {
        int ownerId = index + 1;
        var diff = new DiffLib.Diff<char>(current, revs[index]).ToArray();
        int position = 0;
        foreach (var part in diff)
        {
            if (part.Equal)
                position += part.Length1;
            else
            {
                // drop the owners of the text that was
                // removed or replaced
                owner.RemoveRange(position, part.Length1);

                // the current revision owns the text that was
                // added or that replaced the old text
                owner.InsertRange(position, Enumerable.Repeat(ownerId, part.Length2));
                position += part.Length2;
            }
        }
        current = revs[index];
        dumpRev(index);
    }
}

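Output (the "rev" labels are zero-based indexes into revs, while the owner digits are one-based revision numbers):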

rev 0
The quiet fox
1111111111111

rev 1
The quiet brown fox
1111111111222222111

rev 2
The quick brown fox
1111111331222222111

rev 3
The quick brown fox.
11111113312222221114

rev 4
The quick brown fox jumped over the lazy dog.
111111133122222211155555555555555555555555554

rev 5
The quick brown fox jumped over the lazy cat.
111111133122222211155555555555555555555556664

rev 6
The Quick Brown Fox jumped over the Lazy Cat.
111171133172222271155555555555555555755557664
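For readers without .NET, the same owner-propagation idea can be sketched in Python. This is an illustrative sketch, not the script above: it swaps DiffLib for the standard library's `difflib.SequenceMatcher`, which uses its own matching heuristic and may split changes differently on other inputs. The function name `annotate` is made up for the example.

```python
from difflib import SequenceMatcher


def annotate(revisions):
    """Return (text, owners); owners[i] is the 1-based revision that introduced text[i]."""
    current = revisions[0]
    owners = [1] * len(current)  # revision 1 owns the whole first string
    for rev_number, new_text in enumerate(revisions[1:], start=2):
        # autojunk=False turns off the popularity heuristic used on long inputs
        matcher = SequenceMatcher(None, current, new_text, autojunk=False)
        new_owners = []
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "equal":
                # unchanged text keeps whoever owned it before
                new_owners.extend(owners[i1:i2])
            else:
                # "replace", "insert" and "delete": old owners are dropped,
                # any new characters belong to the current revision
                new_owners.extend([rev_number] * (j2 - j1))
        current, owners = new_text, new_owners
    return current, owners


if __name__ == "__main__":
    text, owners = annotate(
        ["The quiet fox", "The quiet brown fox", "The quick brown fox"]
    )
    print(text)
    print("".join(str(o) for o in owners))  # 1111111331222222111
```

Rebuilding the owner list once per revision avoids repeated inserts and removals in the middle of a list, so each step costs the diff plus one linear pass.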

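Answer 1 (score: 1)

You want to use the Myers diff algorithm as implemented by Google. It is quite fast, has implementations in several languages, and lets you supply a timeout so that you don't waste too much time searching for a complex diff.

The output should be fairly simple to convert into the kind of annotation you want (assigning credit patch by patch).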
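As a rough sketch of that conversion, assume each revision step is described by a list of `(op, text)` pairs with `-1` for a deletion, `0` for unchanged text and `1` for an insertion, which is the shape Google's diff-match-patch library reports. The diffs below are written by hand to mimic that shape rather than produced by the library, so a real run may split the changes differently; the function name `credit` is hypothetical.

```python
DELETE, EQUAL, INSERT = -1, 0, 1  # op codes in the style of diff-match-patch


def credit(owners, diffs, rev_number):
    """Carry per-character ownership across one revision step."""
    new_owners = []
    pos = 0  # read position in the old owners list
    for op, text in diffs:
        n = len(text)
        if op == EQUAL:
            new_owners.extend(owners[pos:pos + n])  # keep previous credit
            pos += n
        elif op == DELETE:
            pos += n  # deleted text loses its owners
        elif op == INSERT:
            new_owners.extend([rev_number] * n)  # new text belongs to this revision
        else:
            raise ValueError(f"unknown diff op: {op!r}")
    if pos != len(owners):
        raise ValueError("diffs do not cover the whole previous revision")
    return new_owners


owners = [1] * len("The quiet fox")
# hand-written diffs (an assumption, not real library output)
owners = credit(owners, [(EQUAL, "The quiet "), (INSERT, "brown "), (EQUAL, "fox")], 2)
owners = credit(
    owners,
    [(EQUAL, "The qui"), (DELETE, "et"), (INSERT, "ck"), (EQUAL, " brown fox")],
    3,
)
print("".join(str(o) for o in owners))  # 1111111331222222111
```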

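Answer 2 (score: 0)

Doesn't your "history" format already provide that information? If so, you only need to display it. Of course, the most efficient method depends on the format your history is stored in, so without knowing that format, nobody here can really give it to you.

It should be noted that if you are sending the output to some kind of display device (e.g. a screen), then your algorithm would generally have to be really dumb to slow things down more than the display device already does.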