How to match a string with diacritics in Perl?

Date: 2011-09-15 11:21:32

Tags: regex perl unicode collation

For example, match "Nation" in "Îñţérñåţîöñåļîžåţîöñ" without extra modules. Is this possible in newer Perl versions (5.14, 5.15, etc.)?

  

I found the answer! Thanks to tchrist.

The right solution, with UCA matching (thanks to https://stackoverflow.com/users/471272/tchrist):

    # find start/end offsets of every non-overlapping match of the substring
    use 5.014;
    use strict;
    use warnings;
    use utf8;
    use Unicode::Collate;
    binmode STDOUT, ':encoding(UTF-8)';
    my $str  = "Îñţérñåţîöñåļîžåţîöñ" x 2;
    my $look = "Nation";
    # level 1 compares base letters only, ignoring accents and case
    my $Collator = Unicode::Collate->new(
        normalization => undef, level => 1
    );

    my @match = $Collator->match($str, $look);
    if (@match) {
        my $found = $match[0];
        my $f_len = length($found);
        say "match result: $found (length is $f_len)";
        my $offset = 0;
        # index() here only finds exact copies of the first matched substring
        while ((my $start = index($str, $found, $offset)) != -1) {
            my $end = $start + $f_len;
            say sprintf("found at: %s,%s", $start, $end);
            $offset = $end;   # resume right after the match
        }
    }
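The loop above uses the core index() function, so it only finds exact copies of the first matched substring. As a rough sketch of an alternative, the collator's own index method can do the scanning, so differently accented spellings are found as well. This relies on Unicode::Collate's index returning a (position, length) pair in list context; the helper name collated_spans and the sample string are made up for this illustration.

```perl
use 5.014;
use strict;
use warnings;
use utf8;
use Unicode::Collate;

binmode STDOUT, ':encoding(UTF-8)';

# collated_spans is a hypothetical helper, not part of Unicode::Collate.
# It returns [start, end) character offsets of non-overlapping matches.
sub collated_spans {
    my ($collator, $str, $look) = @_;
    my @spans;
    my $pos = 0;
    # In list context index() returns (position, length), or an empty list.
    while (my ($start, $len) = $collator->index($str, $look, $pos)) {
        last unless $len;                 # guard against zero-length matches
        push @spans, [$start, $start + $len];
        $pos = $start + $len;             # continue right after this match
    }
    return @spans;
}

# The searching methods require normalization to be switched off.
my $Collator = Unicode::Collate->new(normalization => undef, level => 1);

# One accented and one plain spelling, so an exact-copy search would miss one.
my $str   = "Îñţérñåţîöñåļîžåţîöñ" . "international";
my @spans = collated_spans($Collator, $str, "Nation");

for my $span (@spans) {
    my ($start, $end) = @$span;
    say "found at: $start,$end (", substr($str, $start, $end - $start), ")";
}
```

Passing the search position back into index keeps the matches non-overlapping without going through an intermediate matched string.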

A wrong (but working) solution, from http://www.perlmonks.org/?node_id=485681:
  

The magic code is:

    $str = Unicode::Normalize::NFD($str); $str =~ s/\pM//g;
  

Code example:

    use 5.014;
    use utf8;
    use Unicode::Normalize;

    binmode STDOUT, ':encoding(UTF-8)';
    my $str  = "Îñţérñåţîöñåļîžåţîöñ";
    my $look = "Nation";
    say "before: $str\n";
    $str = NFD($str);
    # M is short alias for \p{Mark} (http://perldoc.perl.org/perluniprops.html)
    $str =~ s/\pM//g; # remove "marks"
    say "after: $str";
    say "is_match: ", $str =~ /$look/i || 0;
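One reason this approach counts as "wrong" is that it only helps for letters that have a canonical decomposition. The sketch below wraps the same two steps in a helper (strip_marks is a hypothetical name, and the sample words are my own) and runs it on a few letters that NFD leaves untouched.

```perl
use 5.014;
use strict;
use warnings;
use utf8;
use Unicode::Normalize qw(NFD);

binmode STDOUT, ':encoding(UTF-8)';

# strip_marks is a hypothetical helper wrapping the "magic code" above.
sub strip_marks {
    my ($s) = @_;
    $s = NFD($s);      # decompose: "é" becomes "e" followed by U+0301
    $s =~ s/\pM//g;    # drop every combining mark
    return $s;
}

# "Ø", "ß" and "Ł" have no canonical decomposition, so they survive as-is.
for my $word ("café", "Øre", "Straße", "Łódź") {
    say "$word -> ", strip_marks($word);
}
```

A plain /ore/i or /lodz/i would therefore still fail on the stripped text, whereas a collation-based comparison does not depend on decomposition at all.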

2 answers:

Answer 0: (score: 7)

The right solution, using UCA (thanks to tchrist):

    # found start/end offsets for matched s
    use 5.014;
    use utf8;
    use Unicode::Collate;
    binmode STDOUT, ':encoding(UTF-8)';
    my $str  = "Îñţérñåţîöñåļîžåţîöñ" x 2;
    my $look = "Nation";
    my $Collator = Unicode::Collate->new(
        normalization => undef, level => 1
    );

    my @match = $Collator->match($str, $look);
    say "match ok!" if @match;
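The match succeeds because of the level => 1 setting: at that level the collator compares base letters only. The following sketch, which is my own illustration rather than part of the original answer, uses the collator's eq method to show what each level treats as equal. It keeps the default normalization, which is fine for eq (only the searching methods need it disabled).

```perl
use 5.014;
use strict;
use warnings;
use utf8;
use Unicode::Collate;

binmode STDOUT, ':encoding(UTF-8)';

my $primary   = Unicode::Collate->new(level => 1);  # base letters only
my $secondary = Unicode::Collate->new(level => 2);  # base letters and accents

# Level 1 ignores both accents and case.
say "level 1, accents differ: ",
    $primary->eq("Nation", "ñåţîöñ") ? "equal" : "different";

# Level 2 sees the accents.
say "level 2, accents differ: ",
    $secondary->eq("Nation", "ñåţîöñ") ? "equal" : "different";

# Case is only distinguished at level 3, so level 2 still ignores it.
say "level 2, case differs:   ",
    $secondary->eq("Nation", "nation") ? "equal" : "different";
```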

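P.S. "Code that assumes you can remove diacritics to get at base ASCII letters is evil, still, broken, brain-damaged, wrong, and justification for capital punishment." © tchrist, Why does modern Perl avoid UTF-8 by default?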

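Answer 1: (score: 6)

What do you mean by "without extra modules"?

Here is a solution with use Unicode::Normalize; (see the perldoc).

I removed the "ţ" and "ļ" from your string; my Eclipse didn't want to save the script with them.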

    use strict;
    use warnings;
    use utf8;   # pragma names are case-sensitive; the file must be saved as UTF-8
    use Unicode::Normalize;

    my $str = "Îñtérñåtîöñålîžåtîöñ";

    for ( $str ) {  # the variable we work on
       ##  convert to Unicode first
       ##  if your data comes in Latin-1, then uncomment
       ##  (this also needs "use Encode;"):
       #$_ = Encode::decode( 'iso-8859-1', $_ );
       $_ = NFD( $_ );   ##  decompose
       s/\pM//g;         ##  strip combining characters
       s/[^\0-\x80]//g;  ##  clear everything else
    }

    if ($str =~ /nation/) {
      print $str . "\n";
    }

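Output:

    Internationaliation

The "ž" is dropped from the string, as if it did not decompose. Normally it does (U+017E is "z" plus the combining caron U+030C), so this is probably the same encoding problem I had with Eclipse: a script that is not saved and read as UTF-8 does not see the right character.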
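As a quick check, here is a small sketch of my own that writes the character as an escape, so the result does not depend on how the editor saves the file. It suggests that "ž" decomposes like the other accented letters and that stripping marks leaves a plain "z".

```perl
use 5.014;
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

binmode STDOUT, ':encoding(UTF-8)';

my $z_caron = "\x{17E}";        # LATIN SMALL LETTER Z WITH CARON
my $nfd     = NFD($z_caron);    # canonical decomposition

say "code points after NFD: ",
    join " ", map { sprintf "U+%04X", ord } split //, $nfd;

(my $stripped = $nfd) =~ s/\pM//g;   # remove the combining caron
say "after stripping marks: $stripped";
```

If the letter disappears entirely in a real script, the source encoding is the first thing worth checking.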

The code in the for loop is from this page: How to remove diacritic marks from characters

Another interesting read is Joel Spolsky's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

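Update:

As @tchrist pointed out, there is a better-suited algorithm, the UCA (Unicode Collation Algorithm). @nordicdyno has already provided an implementation in his question.

The algorithm is described here: Unicode Technical Standard #10, Unicode Collation Algorithm

The Perl module is described on perldoc.perl.org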