Question

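I need to normalize a string such as "quée", and I can't seem to convert the extended ASCII characters (é, á, í, etc.) into their Roman/English equivalents. I've tried several different approaches, but nothing has worked so far. There is a fair amount of material on this general topic, but I can't seem to find a working answer to this problem.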

Here is my code:

#transliteration solution (works great with standard chars but doesn't find the 
#special ones) - I've tried looking for both \x{130} and é with the same result.
$mystring =~ tr/\\x{130}/e/;

#converting into array, then iterating through and replacing the specific char
#( same result as the above solution )
my @breakdown = split( "",$mystring );

foreach ( @breakdown ) {
    if ( $_ eq "\x{130}" ) {
        $_ = "e";
        print "\nArray Output: @breakdown\n";
    }
    $lowercase = join( "",@breakdown );
}

Answer 1

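1) This article should provide a pretty good (if complicated) way.

It offers a solution that converts every accented Unicode character into a base character plus an accent; once that is done, you can remove the accent characters separately.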

2) Another option is CPAN: Text::Unaccent::PurePerl (an improved pure-Perl version of Text::Unaccent)

3) Also, this SO answer suggests Text::Unidecode:

$ perl -Mutf8 -MText::Unidecode -E 'say unidecode("été")'
  ete
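
For what it's worth, the decompose-then-strip idea from option 1 can be sketched with the core Unicode::Normalize module. This is only a minimal sketch, not the article's code: strip_accents is a hypothetical helper name, and it assumes the input is already a decoded character string.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# strip_accents is a hypothetical helper, named here for illustration only.
sub strip_accents {
    my ($text) = @_;
    my $decomposed = NFD($text);     # "é" becomes "e" followed by U+0301
    $decomposed =~ s/\p{Mn}//g;      # drop the nonspacing combining marks
    return $decomposed;
}

# \x{E9} is é, so this is "été" written with escapes.
print strip_accents("\x{E9}t\x{E9}"), "\n";    # prints "ete"
```

Letters that have no canonical decomposition (ø or ß, for example) pass through unchanged, which is one reason the CPAN modules above may still be the better choice.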

Answer 2

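The reason your original code doesn't work is that \x{130} is not é. It is LATIN CAPITAL LETTER I WITH DOT ABOVE (U+0130, or İ). You meant \x{E9} or just \xE9 (the braces are optional for two-digit values), which is LATIN SMALL LETTER E WITH ACUTE (U+00E9).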

Also, your tr has an extra backslash; it should look like tr/\xE9/e/.

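With those changes, your code will work, although I would still recommend using one of the CPAN modules for this sort of thing. I prefer Text::Unidecode myself, because it handles more than just accented characters.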
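
To illustrate the two fixes, here is a small sketch. The first half uses a string built from escapes, so it does not depend on the source file's encoding. The second half rests on an assumption about your setup: if the text arrives as raw UTF-8 bytes, é is the two bytes C3 A9, and the string has to be decoded before \xE9 can match.

```perl
use strict;
use warnings;
use Encode qw(decode);

# One backslash, code point E9 (é).
my $mystring = "qu\xE9e";
(my $fixed = $mystring) =~ tr/\xE9/e/;
print "$fixed\n";            # prints "quee"

# Assumption: the input is raw UTF-8 bytes, so decode it to characters first.
my $bytes   = "qu\xC3\xA9e";
my $decoded = decode('UTF-8', $bytes);
(my $fixed_decoded = $decoded) =~ tr/\xE9/e/;
print "$fixed_decoded\n";    # prints "quee"
```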

Answer 3

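After much working and reworking, this is what I have now. It does everything I want, except that I would like to keep the spaces in the middle of the input string so that the words stay separate.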

use strict;
use warnings;
use Unicode::Normalize;    # provides NFD

# Assumes funnywords.txt is UTF-8 encoded
open my $fh, '<:encoding(UTF-8)', 'funnywords.txt'
    or die "Cannot open funnywords.txt: $!";
binmode STDOUT, ':encoding(UTF-8)';

# Iterate through funnywords.txt
while ( <$fh> ) {
    chomp;

    # Show initial text from file
    print "In: '$_' -> ";

    my $inputString = $_;

    # The for loop aliases $_ to $inputString. NFD dissects
    # unicode characters ( example: "é" splits into "e" and "´" )
    # so the accent marks can be thrown away. All non-word
    # characters, including spaces, are then removed.
    for ( $inputString ) {
        $_ = NFD( $_ );         # decompose/dissect
        s/^\s+//; s/\s+$//;     # strip begin/end spaces
        s/\pM//g;               # strip combining marks
        s/\W+//g;               # strip non-word chars
    }

    # Convert to lowercase
    my $outputString = "\L$inputString";

    # Output final result
    print "$outputString\n";
}

close $fh;

Not entirely sure why it is coloring some of the regexes and comments red...

Here are a few example lines from "funnywords.txt":

quée

22

?éÉíóñúÑ¿¡

[.this? ]

aquí, aLLí

Answer 4

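As for your second question, about removing any remaining symbols while keeping letters and digits: change your last regex from s/\W+//g to s/[^a-zA-Z0-9 ]+//g. Since you have already normalized the rest of the input, that regex will remove anything that is not a-z, A-Z, 0-9, or a space. Using [] with a ^ at the start means you are matching everything that is not listed in the rest of the bracket.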
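
A quick side-by-side of the two regexes, as a sketch. It assumes the accents have already been stripped, so the sample string here is plain ASCII.

```perl
use strict;
use warnings;

# Assumed to be already normalized (accents removed).
my $normalized = "aqui, aLLi";

# Original regex: removes the comma and the space.
(my $no_spaces = $normalized) =~ s/\W+//g;

# Suggested regex: removes the comma but keeps the space.
(my $with_spaces = $normalized) =~ s/[^a-zA-Z0-9 ]+//g;

print "$no_spaces\n";      # prints "aquiaLLi"
print "$with_spaces\n";    # prints "aqui aLLi"
```

One difference to be aware of: \W treats the underscore as a word character and leaves it in place, while the explicit character class removes it.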

Normalizing ASCII characters
