Question

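For every line in a file (about 30,000 lines), I want to find how many leading characters the current line has in common with the previous line. For example, given the input: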

#to
#top
/0linyier
/10000001659/item/1097859586891251/
/10000001659/item/1191085827568626/
/10000121381/item/890759920974460/
/10000154478/item/1118425481552267/
/10897504949/pic/89875494927073741108975049493956/108987352826059/?lang=3
/1175332/item/10150825241495757/
/806123/item/10210653847881125/
/51927642128/item/488930816844251927642128/341878905879428/

I expect:

0   #to
3   #top
0   /0linyier
1   /10000001659/item/1097859586891251/
19  /10000001659/item/1191085827568626/
6   /10000121381/item/890759920974460/
7   /10000154478/item/1118425481552267/
3   /10897504949/pic/89875494927073741108975049493956/108987352826059/?lang=3
2   /1175332/item/10150825241495757/
1   /806123/item/10210653847881125/
1   /51927642128/item/488930816844251927642128/341878905879428/

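I tried to get this working in Perl by unpacking the strings into characters and counting until the first mismatch, but I wonder whether there is some not-too-slow way to do it with built-in functions in awk or a similar tool.

Update: I have added my attempt as an answer.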

Answer 1

Something like this, perhaps?

Written in Perl:

use strict;
use warnings 'all';

my $prev = "";

while ( my $line = <DATA> ) {

    chomp $line;

    my $max = 0;
    ++$max until $max > length($line) or substr($prev, 0, $max) ne substr($line, 0, $max);

    printf "%-2d  %s\n", $max-1, $line;

    $prev = $line;
}

__DATA__
#to
#top
/0linyier
/10000001659/item/1097859586891251/
/10000001659/item/1191085827568626/
/10000121381/item/890759920974460/
/10000154478/item/1118425481552267/
/10897504949/pic/89875494927073741108975049493956/108987352826059/?lang=3
/1175332/item/10150825241495757/
/806123/item/10210653847881125/
/51927642128/item/488930816844251927642128/341878905879428/

Output

0   #to
3   #top
0   /0linyier
1   /10000001659/item/1097859586891251/
19  /10000001659/item/1191085827568626/
6   /10000121381/item/890759920974460/
7   /10000154478/item/1118425481552267/
3   /10897504949/pic/89875494927073741108975049493956/108987352826059/?lang=3
2   /1175332/item/10150825241495757/
1   /806123/item/10210653847881125/
1   /51927642128/item/488930816844251927642128/341878905879428/

Answer 2

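There is no built-in function that does this for you, but instead of going one character at a time you could compare half of each string at a time in a kind of binary search, something like this (half-baked awk pseudo-code):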

prev     = curr
lgthPrev = lgthCurr
curr     = $0
lgthCurr = length(curr)
partLgth = (lgthPrev > lgthCurr ? lgthCurr : lgthPrev)
while ( got strings to work with ) {
    partCurr = substr(curr,1,partLgth)
    partPrev = substr(prev,1,partLgth)
    if ( partCurr == partPrev ) {
        # add on half of the rest of each string and try again
        partLgth = partLgth * 1.5
    }
    else {
        # subtract half of these strings and try again
        partLgth = partLgth * 0.5
    }
}

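Exit the loop above when there are no more substrings to compare. At that point, either:

the two substrings matched on the previous iteration, so the previous substring length is the maximum length of the matching prefix, or
the two substrings never matched, so there is no partial match between the two strings.

This would potentially use far fewer iterations than a character-by-character comparison, but as written it does a string comparison rather than a character comparison on each iteration, so I don't know what the net performance result would be. You could speed it up by doing a character comparison first on each iteration, and doing the string comparison only if the characters at the current position match: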

prev     = curr
lgthPrev = lgthCurr
curr     = $0
lgthCurr = length(curr)
partLgth = (lgthPrev > lgthCurr ? lgthCurr : lgthPrev)
while ( got strings to work with ) {
    if ( substr(curr,partLgth,1) == substr(prev,partLgth,1) ) {
        isMatch = (substr(curr,1,partLgth) == substr(prev,1,partLgth) ? 1 : 0)
    }
    else {
        isMatch = 0
    }
    if ( isMatch ) {
        # add on half of the rest of each string and try again
        partLgth = partLgth * 1.5
    }
    else {
        # subtract half of these strings and try again
        partLgth = partLgth * 0.5
    }
}
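
For reference, here is one way the pseudo-code above might be turned into something runnable. This is only a sketch, not the answer's exact method: instead of scaling partLgth by 1.5 or 0.5 it keeps explicit lower and upper bounds (lo and hi, names invented here) so that the search is guaranteed to terminate, and it compares only the slice that has not been verified yet. The wrapper function name prefix_len_bisect is hypothetical, and the script sticks to substr() and length(), so it should not depend on gawk extensions.

```shell
# Hypothetical helper: prints "<common-prefix-length> <line>" for each input line.
# Reads the files given as arguments, or standard input if there are none.
prefix_len_bisect() {
    awk '
    {
        curr = $0
        lo = 0      # a prefix of length lo is known to match
        hi = (length(prev) < length(curr) ? length(prev) : length(curr))
        while (lo < hi) {
            mid = int((lo + hi + 1) / 2)    # round up so the loop always makes progress
            # compare only the untested slice, positions lo+1 .. mid
            if (substr(curr, lo + 1, mid - lo) == substr(prev, lo + 1, mid - lo))
                lo = mid        # slice matches: the prefix is at least mid long
            else
                hi = mid - 1    # mismatch somewhere inside the slice
        }
        print lo, curr
        prev = curr
    }' "$@"
}
```

It could be called as prefix_len_bisect file, or with the data piped in. Whether it actually beats the plain character loop on lines this short would have to be measured.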

Answer 3

Using gawk:

awk -v FS="" 'p{
    pl=0; 
    split(p,a,r); 
    for(i=1;i in a; i++)
          if(a[i]==$i){ pl++ }else { break }
}
{ 
   print pl+0,$0; p=$0
}' file

Or

awk -v FS="" 'p{ pl=0; for(i=1;i<=NF; i++) if(substr(p,i,1)==$i){ pl++ }else { break } } { print pl+0,$0; p=$0 }' file

Input

$ cat file
#to
#top
/0linyier
/10000001659/item/1097859586891251/
/10000001659/item/1191085827568626/
/10000121381/item/890759920974460/
/10000154478/item/1118425481552267/
/10897504949/pic/89875494927073741108975049493956/108987352826059/?lang=3
/1175332/item/10150825241495757/
/806123/item/10210653847881125/
/51927642128/item/488930816844251927642128/341878905879428/

Output

$ awk -v FS="" 'p{pl=0; split(p,a,r); for(i=1;i in a; i++)if(a[i]==$i){ pl++ }else { break }}{ print pl+0,$0; p=$0}' file
0 #to
3 #top
0 /0linyier
1 /10000001659/item/1097859586891251/
19 /10000001659/item/1191085827568626/
6 /10000121381/item/890759920974460/
7 /10000154478/item/1118425481552267/
3 /10897504949/pic/89875494927073741108975049493956/108987352826059/?lang=3
2 /1175332/item/10150825241495757/
1 /806123/item/10210653847881125/
1 /51927642128/item/488930816844251927642128/341878905879428/

Explanation

awk -v FS="" '                  # call awk, set field sep=""
p{
    pl=0;                       # reset variable pl
    split(p,a,r);               # split variable p
    for(i=1;i in a; i++)        # loop through array
        if(a[i]==$i){           # check array element with current field
            pl++                # if matched then increment pl
        }else {
            break               # else its over break loop
        }
}
{
    print pl+0,$0;              # print count, and current record
    p=$0                        # store current record in variable p
}
' file

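Note that the standard says the result is unspecified if an empty string is assigned to FS. Some versions of awk will produce the output shown above for your example. The version of awk on OS X issues a warning along with its output: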

awk: field separator FS is empty

So the special meaning of setting FS to the empty string does not work in every awk.
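
If portability matters, one way to sidestep the empty-FS question is to index into the two lines with substr() instead of splitting them into fields. The following is a minimal sketch along those lines rather than part of the original answer; the function name prefix_len_portable and the variables n and p are chosen here for illustration.

```shell
# Hypothetical helper: same output format as above, but without FS="".
prefix_len_portable() {
    awk '
    {
        n = 0
        # advance while both lines still have a character at position n+1
        # and those characters agree
        while (n < length(p) && n < length($0) &&
               substr(p, n + 1, 1) == substr($0, n + 1, 1))
            n++
        print n, $0
        p = $0      # remember this line for the next comparison
    }' "$@"
}
```

Like the commands above, it takes the file as an argument (prefix_len_portable file) or reads standard input.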

Answer 4

Perl script:

#!/usr/bin/perl -ln
$c = [ unpack "C*" ]; #current record
$i = 0;
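# count until mismatch; the defined() checks stop at the end of either line,
# otherwise two identical lines would loop forever (undef == undef is true)
$i++ while defined($c->[$i]) && defined($p->[$i]) && $p->[$i] == $c->[$i];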
print "$i $_";
$p = $c               #save current record for next time

The same thing without command-line flags:

#!/usr/bin/perl
while (<>) {
    chomp;
    $c = [ unpack "C*" ];
    $i = 0;
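    # the defined() checks stop the count at the end of either line
    $i++ while defined($c->[$i]) && defined($p->[$i]) && $p->[$i] == $c->[$i];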
    print "$i $_\n";
    $p = $c
}

The same as a one-liner:

perl -lne '$c=[unpack "C*"]; $i=0; $i++ while defined($c->[$i]) && defined($p->[$i]) && $p->[$i] == $c->[$i]; print "$i $_"; $p = $c'

Pass the file containing the lines as an argument, or pipe the data to the command.

On my real data it runs about as fast as Borodin's solution:

$ xzcat href.xz |wc -l
33150
$ time xzcat href.xz | ./borodin.pl >borodin.out

real    0m2.437s
user    0m2.684s
sys     0m0.080s
$ time xzcat href.xz | ./pk.pl > pk.out 

real    0m2.305s
user    0m2.564s
sys     0m0.088s
$ diff pk.out borodin.out

Answer 5

In awk:

$ awk -F '' '{n=split(p,a,"");for(i=1;i<=(NF<n?NF:n)&&a[i]==$i;i++);print --i,$0; p=$0}' file
0 #to
3 #top
0 /0linyier
1 /10000001659/item/1097859586891251/
19 /10000001659/item/1191085827568626/
6 /10000121381/item/890759920974460/
7 /10000154478/item/1118425481552267/
3 /10897504949/pic/89875494927073741108975049493956/108987352826059/?lang=3
2 /1175332/item/10150825241495757/
1 /806123/item/10210653847881125/
1 /51927642128/item/488930816844251927642128/341878905879428/

Explanation:

awk -F '' '{                                # each char on its own field
    n=split(p,a,"")                         # split prev record p each char in own a cell
    for(i=1;i<=(NF<n?NF:n)&&a[i]==$i;i++);  # compare while $i == a[i]
    print --i,$0                            # print comparison count (--fix)
    p=$0                                    # store record to p(revious)
}' file

Answer 6

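You can do this directly in gawk. The script below simply compares the current line with the previous one and counts the number of leading characters they have in common: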

BEGIN{
    prev="";
}
{
    curr=$0;    # whole line; $1 would stop at the first whitespace
    n = length(curr);
    m = length(prev);
    s = n<m?n:m;
    cnt = 0;
    for(i = 1;i <= s;i++){
        if(substr(curr, i, 1) == substr(prev, i, 1)){
            cnt++;
        }else{
            break;
        }
    }
    print(cnt, curr);

    prev=curr;
}
