Edit distance: ignoring start/end

Asked: 2017-07-22 12:43:03

Tags: python algorithm perl fuzzy-search

I am looking for an algorithm that computes edit distance, but ignores leading and trailing text in one of the strings, as well as whitespace:

edit("four","foor") = 1
edit("four","noise fo or blur") = 1

Is there an existing algorithm for this? Maybe even a Perl or Python library?

2 Answers:

Answer 0: (score: 4)

The code to do this is simple in concept. What you want to ignore is up to you, and you can add that part yourself:

#!perl
use v5.22;
use feature qw(signatures);
no warnings qw(experimental::signatures);

use Text::Levenshtein qw(distance);

say edit( "four", "foor" );
say edit( "four", "noise fo or blur" );

sub edit ( $start, $target ) {
    # transform strings to ignore what you want
    # ...
    distance( $start, $target )
    }
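
For readers who prefer Python, the same skeleton might look like the sketch below. It is only an illustration under stated assumptions: `levenshtein` is a plain dynamic-programming implementation written out here so that no third-party module is assumed, and `normalize` is a hypothetical placeholder for whatever you decide to ignore (here it only drops whitespace).

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance: insertions, deletions and substitutions cost 1."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def normalize(text: str) -> str:
    """Hypothetical placeholder: remove whatever should be ignored (whitespace only)."""
    return "".join(text.split())


def edit(start: str, target: str) -> int:
    # Transform the target first, then measure the ordinary distance.
    return levenshtein(start, normalize(target))


print(edit("four", "foor"))
print(edit("four", "noise fo or blur"))
```

With only whitespace removed, the second call still reports a large distance, because the noise around the match has not been dealt with yet.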

Perhaps you want to check every substring of the same length:

use v5.22;
use feature qw(signatures);
no warnings qw(experimental::signatures);

use Text::Levenshtein qw(distance);

say edit( "four", "foar" );
say edit( "four", "noise fo or blur" );

sub edit ( $start, $target ) {
    my $start_length = length $start;
    $target =~ s/\s+//g;
    my @all_n_chars = map {
        substr $target, $_, $start_length
        } 0 .. ( length($target) - $start_length );

    my $closest;
    my $closest_distance = $start_length + 1;
    foreach ( @all_n_chars ) {
        my $distance = distance( $start, $_ );
        if( $distance < $closest_distance ) {
            $closest = $_;
            $closest_distance = $distance;
            say "closest: $closest Distance: $distance";
            last if $distance == 0;
            }
        }

    return $closest_distance;
    }

This very simple implementation finds what you want. Be aware, though, that some other random string might accidentally have a lower edit distance.

closest: foar Distance: 1
1
closest: nois Distance: 3
closest: foor Distance: 1
1

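You could extend this to remember the true starting position of each substring so you can find it again in the original, but this should be enough to send you on your way. If you want to use Python, I think the program would look very similar.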
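
As a rough idea of what that extension could look like in Python, here is a minimal sketch. It assumes a hand-written `levenshtein` helper instead of any particular library, and `closest_window` is a name made up for this example. Each non-whitespace character is kept together with its index in the original string, so the best window can be traced back to where it started.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance: insertions, deletions and substitutions cost 1."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def closest_window(start: str, target: str):
    """Return (distance, substring, index in the original target), or None."""
    # Pair every non-whitespace character with its position in the original.
    kept = [(i, ch) for i, ch in enumerate(target) if not ch.isspace()]
    n = len(start)
    if n == 0 or len(kept) < n:
        return None  # nothing sensible to compare
    best = None
    for offset in range(len(kept) - n + 1):
        window = kept[offset:offset + n]
        candidate = "".join(ch for _, ch in window)
        distance = levenshtein(start, candidate)
        if best is None or distance < best[0]:
            best = (distance, candidate, window[0][0])
            if distance == 0:
                break  # cannot do better than an exact match
    return best


print(closest_window("four", "noise fo or blur"))
```

For the example string this finds `foor` at distance 1, starting at index 6 of the original (the `f` after `noise `).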

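Answer 1: (score: 4)

Here is a Perl 6 solution. I use a grammar that knows how to grab four interesting characters despite the interstitial stuff. More complex requirements would need a different grammar, but that is not that hard.

Each time there is a match, the NString::Actions class object gets a chance to inspect it. It does the same high-water-mark thing I was doing before. This looks like a bunch more work, and it is for this trivial example. For more complex examples, it won't be that much worse. My Perl 5 version would have to do a lot of tooling to figure out what to keep and what not to keep.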

use Text::Levenshtein;

my $string = 'The quixotic purple and jasmine butterfly flew over the quick zany dog';

grammar NString {
    regex n-chars      { [<.ignore-chars>* \w]**4 }
    regex ignore-chars { \s }
    }

class NString::Actions {
    # See 
    my subset IntInf where Int:D | Inf;

    has        $.target;
    has Str    $.closest          is rw = '';
    has IntInf $.closest-distance is rw = Inf;

    method n-chars ($/) {
        my $string = $/.subst: /\s+/, '', :g;

        my $distance = distance( $string,  self.target );
        # say "Matched <$/>. Distance for $string is $distance";
        if $distance < self.closest-distance {
            self.closest = $string;
            self.closest-distance = $distance;
            }
        }
    }

my $action =  NString::Actions.new: target => 'Perl';

loop {
    state $from = 0;
    my $match = NString.subparse(
        $string,
        :rule('n-chars'),
        :actions($action),
        :c($from)
        );
    last unless ?$match;

    $from++;
    }

say "Shortest is { $action.closest } with { $action.closest-distance }";
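
The same idea can be approximated in Python with a regular expression instead of a grammar. This is only a sketch, not an equivalent of the grammar above: the pattern `(?:\s*\w){4}` stands in for the `n-chars` rule, a lookahead lets the matches overlap (similar to restarting the parse at each position), and `closest_match` and the hand-written `levenshtein` are names assumed for this example.

```python
import re


def levenshtein(a: str, b: str) -> int:
    """Classic edit distance: insertions, deletions and substitutions cost 1."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


# Four word characters, each optionally preceded by whitespace.
# Wrapping the group in a lookahead makes finditer try every start position.
N_CHARS = re.compile(r"(?=((?:\s*\w){4}))")


def closest_match(target: str, text: str):
    """Track the lowest distance seen so far, like the actions class does."""
    best_string, best_distance = "", float("inf")
    for match in N_CHARS.finditer(text):
        candidate = re.sub(r"\s+", "", match.group(1))
        distance = levenshtein(candidate, target)
        if distance < best_distance:
            best_string, best_distance = candidate, distance
    return best_string, best_distance


text = "The quixotic purple and jasmine butterfly flew over the quick zany dog"
best, distance = closest_match("Perl", text)
print(f"Shortest is {best} with {distance}")
```

Positions that start on whitespace produce the same candidate as the following position, so some candidates are measured twice. That is harmless here, since only a strictly lower distance replaces the current best.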

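(This is a straight port from Perl 5, which I'll leave here.)

I tried the same thing in Perl 6, but I'm sure it is a bit verbose. I was wondering if there is a clever way to grab groups of N characters to compare. Maybe I'll improve it later.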

use Text::Levenshtein;

put edit( "four", "foar" );
put edit( "four", "noise fo or blur" );

sub edit ( Str:D $start, Str:D $target --> Int:D ) {
    my $target-modified = $target.subst: rx/\s+/, '',  :g;

    my $last-position-to-check = [-] map { .chars }, $target-modified, $start;

    my $closest = Any;
    my $closest-distance = $start.chars + 1;
    for 0..$last-position-to-check -> $starting-pos {
        my $substr = $target-modified.substr: $starting-pos, $start.chars;
        my $this-distance = distance( $start, $substr );
        put "So far: $substr -> $this-distance";
        if $this-distance < $closest-distance {
            $closest          = $substr;
            $closest-distance = $this-distance;
            }
        last if $this-distance == 0;
        }

    return $closest-distance // -1;
    }
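
The fixed-length window used above only compares substrings that are exactly as long as the search string, so a match with a missing or extra character is scored worse than it needs to be. One known alternative is the approximate substring matching variant of the dynamic-programming table (often attributed to Sellers): the row for the empty pattern is set to zero so a match may start anywhere, and the minimum over the final row is taken so it may end anywhere. The Python sketch below is an illustration of that idea, not part of either answer; `substring_edit_distance` and its `ignore_whitespace` flag are names assumed for this example.

```python
def substring_edit_distance(pattern: str, text: str,
                            ignore_whitespace: bool = True) -> int:
    """Lowest edit distance between pattern and any substring of text."""
    if ignore_whitespace:
        text = "".join(text.split())
    # column[i]: best distance between pattern[:i] and a substring of the
    # text ending at the current position. Before any text is read,
    # pattern[:i] can only be matched by deleting its i characters.
    column = list(range(len(pattern) + 1))
    best = column[-1]
    for tc in text:
        new = [0]  # empty pattern costs nothing: a match may start anywhere
        for i, pc in enumerate(pattern, start=1):
            cost = 0 if pc == tc else 1
            new.append(min(column[i] + 1,          # extra character in text
                           new[i - 1] + 1,         # pattern character missing
                           column[i - 1] + cost))  # substitute or match
        column = new
        best = min(best, column[-1])  # a match may end anywhere
    return best


print(substring_edit_distance("four", "foor"))
print(substring_edit_distance("four", "noise fo or blur"))
print(substring_edit_distance("four", "noise fur blur"))
```

The last call shows the difference: `fur` is one deletion away from `four`, whereas every four-character window of `noisefurblur` is at least two edits away.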