Finding the length of the common prefix of two strings

Date: 2017-03-13 14:07:09

Tags: perl awk command-line

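For every line in a file (about 30,000 lines), I want to find the number of leading characters that the current line has in common with the previous line. Example input: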

#to
#top
/0linyier
/10000001659/item/1097859586891251/
/10000001659/item/1191085827568626/
/10000121381/item/890759920974460/
/10000154478/item/1118425481552267/
/10897504949/pic/89875494927073741108975049493956/108987352826059/?lang=3
/1175332/item/10150825241495757/
/806123/item/10210653847881125/
/51927642128/item/488930816844251927642128/341878905879428/

I expect:

0   #to
3   #top
0   /0linyier
1   /10000001659/item/1097859586891251/
19  /10000001659/item/1191085827568626/
6   /10000121381/item/890759920974460/
7   /10000154478/item/1118425481552267/
3   /10897504949/pic/89875494927073741108975049493956/108987352826059/?lang=3
2   /1175332/item/10150825241495757/
1   /806123/item/10210653847881125/
1   /51927642128/item/488930816844251927642128/341878905879428/

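I tried doing this in Perl by unpacking the strings into characters and counting up to the first mismatch, but I wonder whether there is a reasonably fast way to do it with built-in functions, for example in awk.

Update: I have added my attempt as an answer.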

6 answers:

Answer 0 (score: 2)

Like this, perhaps?

Written in Perl

use strict;
use warnings 'all';

my $prev = "";

while ( my $line = <DATA> ) {

    chomp $line;

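    # grow $max until the first $max characters of the two lines differ
    # (or $max runs past the end of $line); the common prefix is then $max-1
    my $max = 0;
    ++$max until $max > length($line) or substr($prev, 0, $max) ne substr($line, 0, $max);

    printf "%-2d  %s\n", $max-1, $line;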

    $prev = $line;
}

__DATA__
#to
#top
/0linyier
/10000001659/item/1097859586891251/
/10000001659/item/1191085827568626/
/10000121381/item/890759920974460/
/10000154478/item/1118425481552267/
/10897504949/pic/89875494927073741108975049493956/108987352826059/?lang=3
/1175332/item/10150825241495757/
/806123/item/10210653847881125/
/51927642128/item/488930816844251927642128/341878905879428/

Output

0   #to
3   #top
0   /0linyier
1   /10000001659/item/1097859586891251/
19  /10000001659/item/1191085827568626/
6   /10000121381/item/890759920974460/
7   /10000154478/item/1118425481552267/
3   /10897504949/pic/89875494927073741108975049493956/108987352826059/?lang=3
2   /1175332/item/10150825241495757/
1   /806123/item/10210653847881125/
1   /51927642128/item/488930816844251927642128/341878905879428/

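Answer 1 (score: 1)

There is no built-in function that will do this for you, but rather than comparing one character at a time you can compare half of each string at a time in a kind of binary search, something like this (a rough awk sketch):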

{
    prev     = curr
    lgthPrev = lgthCurr
    curr     = $0
    lgthCurr = length(curr)
    # the common prefix cannot be longer than the shorter string
    lo = 0
    hi = (lgthPrev > lgthCurr ? lgthCurr : lgthPrev)
    while ( lo < hi ) {
        # try a prefix length halfway between the two bounds
        partLgth = int((lo + hi + 1) / 2)
        partCurr = substr(curr,1,partLgth)
        partPrev = substr(prev,1,partLgth)
        if ( partCurr == partPrev ) {
            # this much matches: search the upper half next
            lo = partLgth
        }
        else {
            # mismatch somewhere in this part: search the lower half next
            hi = partLgth - 1
        }
    }
    print lo, $0
}

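The loop above exits when there are no more substrings to compare, and at that point either:

  1. the 2 substrings matched on an earlier iteration, so the length of the longest matching substring is the length of the common prefix, or
  2. the 2 substrings never matched, so the 2 strings have no common prefix.

This would potentially use far fewer iterations than a char-by-char comparison but, as written, it does a string rather than a character comparison on each iteration, so I don't know what the net performance result would be. You could speed it up by first doing a character comparison on each iteration and only doing the string comparison if the characters at the current position match: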

    {
        prev     = curr
        lgthPrev = lgthCurr
        curr     = $0
        lgthCurr = length(curr)
        lo = 0
        hi = (lgthPrev > lgthCurr ? lgthCurr : lgthPrev)
        while ( lo < hi ) {
            partLgth = int((lo + hi + 1) / 2)
            # cheap test first: compare only the char at position partLgth
            if ( substr(curr,partLgth,1) == substr(prev,partLgth,1) ) {
                isMatch = (substr(curr,1,partLgth) == substr(prev,1,partLgth) ? 1 : 0)
            }
            else {
                isMatch = 0
            }
            if ( isMatch ) {
                # this much matches: search the upper half next
                lo = partLgth
            }
            else {
                # mismatch somewhere in this part: search the lower half next
                hi = partLgth - 1
            }
        }
        print lo, $0
    }
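
To put rough numbers on the iteration-count claim, here is a small sketch that counts comparisons for one pair of strings. It is only an illustration: `count_steps` is a made-up helper name, the binary search uses explicit lower and upper bounds (which may differ in detail from what the answer intended), and a comparison count says nothing about wall-clock time, since each substring comparison costs more than a single-character one. Values passed with `-v` have backslash escapes interpreted by awk, so this assumes the strings contain no backslashes.

```shell
# Hypothetical helper: prints "<prefix length> <char comparisons> <substring comparisons>"
count_steps() {
    awk -v prev="$1" -v curr="$2" 'BEGIN {
        # the common prefix cannot be longer than the shorter string
        max = (length(prev) < length(curr) ? length(prev) : length(curr))

        # char-by-char scan: one comparison per position, up to the first mismatch
        n = 0; linear = 0
        while (n < max) {
            linear++
            if (substr(prev, n + 1, 1) != substr(curr, n + 1, 1)) break
            n++
        }

        # binary search on the prefix length: one substring comparison per step
        lo = 0; hi = max; binary = 0
        while (lo < hi) {
            mid = int((lo + hi + 1) / 2)
            binary++
            if (substr(prev, 1, mid) == substr(curr, 1, mid)) lo = mid
            else hi = mid - 1
        }

        print lo, linear, binary
    }'
}
```

For the two `/10000001659/...` lines from the question, `count_steps '/10000001659/item/1097859586891251/' '/10000001659/item/1191085827568626/'` prints `19 20 5`: a common prefix of 19 characters, found with 20 character comparisons or 5 substring comparisons.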
    

Answer 2 (score: 1)

Using awk (either of the following):

awk -v FS="" 'NR>1{
    # NR>1 rather than p, so an empty or "0" previous line is still compared
    pl=0;
    split(p,a,"");
    for(i=1;i in a; i++)
          if(a[i]==$i){ pl++ }else { break }
}
{
   print pl+0,$0; p=$0
}' file

awk -v FS="" 'NR>1{
     pl=0;
     for(i=1;i<=NF; i++)
     if(substr(p,i,1)==$i){ pl++ }else { break }
}
{
   print pl+0,$0; p=$0
}' file

Input

$ cat file
#to
#top
/0linyier
/10000001659/item/1097859586891251/
/10000001659/item/1191085827568626/
/10000121381/item/890759920974460/
/10000154478/item/1118425481552267/
/10897504949/pic/89875494927073741108975049493956/108987352826059/?lang=3
/1175332/item/10150825241495757/
/806123/item/10210653847881125/
/51927642128/item/488930816844251927642128/341878905879428/

Output

$ awk -v FS="" 'NR>1{pl=0; split(p,a,""); for(i=1;i in a; i++)if(a[i]==$i){ pl++ }else { break }}{ print pl+0,$0; p=$0}' file
0 #to
3 #top
0 /0linyier
1 /10000001659/item/1097859586891251/
19 /10000001659/item/1191085827568626/
6 /10000121381/item/890759920974460/
7 /10000154478/item/1118425481552267/
3 /10897504949/pic/89875494927073741108975049493956/108987352826059/?lang=3
2 /1175332/item/10150825241495757/
1 /806123/item/10210653847881125/
1 /51927642128/item/488930816844251927642128/341878905879428/

Explanation

awk -v FS="" '                                  # call awk with FS="" so each char is a field
       NR>1{                                    # for every line after the first
           pl=0;                                # reset variable pl
           split(p,a,"");                       # split previous line p into chars
           for(i=1;i in a; i++)                 # loop through array
                 if(a[i]==$i){                  # compare array element with current field
                     pl++                       # if matched then increment pl
                 }else {
                     break                      # otherwise stop at the first mismatch
                 }
        }
        {
            print pl+0,$0;                      # print count, and current record
            p=$0                                # store current record in variable p
        }
     ' file

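Note that the standard says the results are unspecified if the empty string is assigned to FS. Some versions of awk will produce the output shown above for your example. The version of awk on OS/X issues a warning along with the output:

awk: field separator FS is empty

So the special meaning of setting FS to the empty string does not work in every awk.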
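
Where `FS=""` is not supported, a version that does not depend on field splitting should behave the same in any POSIX awk. A minimal sketch, assuming only `substr()` and `length()`; the wrapper name `common_prefix_lengths` is made up for illustration:

```shell
# Hypothetical helper name; reads lines from the given files, or from stdin.
common_prefix_lengths() {
    awk '
    NR > 1 {
        # the common prefix cannot be longer than the shorter of the two lines
        max = (length(prev) < length($0) ? length(prev) : length($0))
        n = 0
        # advance while the next character of both lines is the same
        while (n < max && substr(prev, n + 1, 1) == substr($0, n + 1, 1))
            n++
    }
    { print n + 0, $0; prev = $0 }    # n + 0 prints 0 for the first line
    ' "$@"
}
```

It can be called as `common_prefix_lengths file`, or with the data piped in on standard input.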

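Answer 3 (score: 0)

Perl script:

#!/usr/bin/perl -ln
$c = [ unpack "C*" ]; #current record
$i = 0;
# count till mismatch; the bounds checks stop the loop at the end of either line
$i++ while $i < @$c && defined $p->[$i] && $p->[$i] == $c->[$i];
print "$i $_";
$p = $c               #save current record for next time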

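The same thing without command-line flags:

#!/usr/bin/perl
while (<>) {
    chomp;
    $c = [ unpack "C*" ];    # current record as a list of byte values
    $i = 0;
    # count till mismatch, stopping at the end of either line
    $i++ while $i < @$c && defined $p->[$i] && $p->[$i] == $c->[$i];
    print "$i $_\n";
    $p = $c                  # save current record for next time
}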

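The same as a one-liner:

perl -lne '$c=[unpack "C*"]; $i=0; $i++ while $i < @$c && defined $p->[$i] && $p->[$i] == $c->[$i]; print "$i $_"; $p = $c'

Pass a file containing the lines as an argument, or pipe the data to the command.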

On my real data it runs about as fast as Borodin's solution:

$ xzcat href.xz |wc -l
33150
$ time xzcat href.xz | ./borodin.pl >borodin.out

real    0m2.437s
user    0m2.684s
sys     0m0.080s
$ time xzcat href.xz | ./pk.pl > pk.out 

real    0m2.305s
user    0m2.564s
sys     0m0.088s
$ diff pk.out borodin.out 

Answer 4 (score: 0)

In awk:

$ awk -F '' '{n=split(p,a,"");for(i=1;i<=(NF<n?NF:n)&&a[i]==$i;i++);print --i,$0; p=$0}' file
0 #to
3 #top
0 /0linyier
1 /10000001659/item/1097859586891251/
19 /10000001659/item/1191085827568626/
6 /10000121381/item/890759920974460/
7 /10000154478/item/1118425481552267/
3 /10897504949/pic/89875494927073741108975049493956/108987352826059/?lang=3
2 /1175332/item/10150825241495757/
1 /806123/item/10210653847881125/
1 /51927642128/item/488930816844251927642128/341878905879428/

Explanation:

awk -F '' '{                                # each char on its own field
    n=split(p,a,"")                         # split prev record p each char in own a cell
    for(i=1;i<=(NF<n?NF:n)&&a[i]==$i;i++);  # compare while $i == a[i]
    print --i,$0                            # print comparison count (--fix)
    p=$0                                    # store record to p(revious)
}' file

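Answer 5 (score: -1)

You can do this directly with gawk. Here it simply compares the current line with the previous one and counts the number of common leading characters: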

BEGIN{
    prev="";
}
{
    curr=$0;    # whole line; $1 would stop at the first whitespace
    n = length(curr);
    m = length(prev);
    s = n<m?n:m;
    cnt = 0;
    for(i = 1;i <= s;i++){
        if(substr(curr, i, 1) == substr(prev, i, 1)){
            cnt++;
        }else{
            break;
        }
    }
    print(cnt, curr);

    prev=curr;
}