Where can I find arrays of the (un)assigned Unicode code points for a particular block?

Asked: 2010-05-22 13:52:45

Tags: perl unicode

At the moment, I'm writing these arrays by hand.

For example, the Miscellaneous Mathematical Symbols-A block has a hash entry like this:

my %symbols = (
    ...
    miscellaneous_mathematical_symbols_a => [(0x27C0..0x27CA), 0x27CC,
        (0x27D0..0x27EF)],
    ...
);

The simpler 'contiguous' array

miscellaneous_mathematical_symbols_a => [0x27C0..0x27EF]

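doesn't work because Unicode blocks have holes in them. For example, there is nothing at 0x27CB. Take a look at the code chart [PDF].

Writing these arrays by hand is tedious, error-prone and a bit fun. I get the feeling that someone has already tackled this in Perl!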

3 answers:

Answer 0 (score: 2)

Maybe this?

my @list =
    grep {chr ($_) =~ /^\p{Assigned}$/}
    0x27C0..0x27EF;
@list = map { sprintf "%X", $_ } @list;
print "@list\n";

gives me (the exact result depends on the Unicode version bundled with your perl)

27C0 27C1 27C2 27C3 27C4 27C5 27C6 27C7 27C8 27C9 27CA 27D0 27D1 27D2 27D3 
27D4 27D5 27D6 27D7 27D8 27D9 27DA 27DB 27DC 27DD 27DE 27DF 27E0 27E1 27E2 
27E3 27E4 27E5 27E6 27E7 27E8 27E9 27EA 27EB

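Answer 1 (score: 2)

Maybe you want Unicode::UCD? Use its charblock routine to get the range of any named block. If you want to get the block names themselves, you can use charblocks.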
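As a minimal sketch of the charblock call for a single block: the variable names here are my own, and I'm assuming the block name is spelled exactly as it appears in the Unicode Blocks.txt file.

```perl
use strict;
use warnings;
use Unicode::UCD qw(charblock);

# Assumption: the name matches the spelling used in Blocks.txt.
my $name = 'Miscellaneous Mathematical Symbols-A';

# With a block name, charblock returns a reference to a list of
# ranges; each range is [ start, stop, name ].
my $ranges = charblock($name)
    or die "Unknown block: $name\n";

for my $range (@$ranges) {
    my ( $start, $stop ) = @$range;
    printf "%s: U+%04X..U+%04X\n", $name, $start, $stop;
}
```

This only gives you the block boundaries (0x27C0 to 0x27EF for this block), holes included, so you still have to filter out the unassigned code points yourself.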

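This module is really just an interface to the Unicode database that comes with Perl, so if you have to do something fancier, you can look at lib/5.x.y/unicore/UnicodeData.txt or the various other files in that same directory to get what you need.

Here's what I came up with to create your %symbols. I go through all of the blocks (although in this sample I skip the ones that don't have "Math" in their name). I get the starting and ending code points and check which ones are assigned. From that, I create a custom property that I can use to check whether a character is in the range and assigned.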

use strict;
use warnings;

digest_blocks();

my $property = 'My::InMiscellaneousMathematicalSymbolsA';

foreach ( 0x27BA..0x27F3 )
    {
    my $in = chr =~ m/\p{$property}/;

    printf "%X is %sin $property\n",
        $_, $in ? '' : 'not ';
    }


sub digest_blocks {
    use Unicode::UCD qw(charblocks);

    my $blocks = charblocks();

    foreach my $block ( keys %$blocks )
        {
        next unless $block =~ /Math/; # just to make the output small

        my( $start, $stop ) = @{ $blocks->{$block}[0] };

        $blocks->{$block} = {
            assigned   => [ grep { chr =~ /\A\p{Assigned}\z/ } $start .. $stop ],
            unassigned => [ grep { chr !~ /\A\p{Assigned}\z/ } $start .. $stop ],
            start      => $start,
            stop       => $stop,
            name       => $block,
            };

        define_my_property( $blocks->{$block} );
        }
    }

sub define_my_property {
    my $block = shift;

    (my $subname = $block->{name}) =~ s/\W//g;
    $block->{my_property} = "My::In$subname"; # needs In or Is

    no strict 'refs';
    my $string = join "\n", # can do ranges here too
        map { sprintf "%X", $_ } 
        @{ $block->{assigned} };

    *{"My::In$subname"} = sub { $string };
    }

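If I were going to do this a lot, I'd use the same approach to create a Perl source file that already has the custom properties defined, so I could use them right away in any of my work. None of the data should change until you update your Unicode data.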

sub define_my_property {
    my $block = shift;

    (my $subname = $block->{name}) =~ s/\W//g;
    $block->{my_property} = "My::In$subname"; # needs In or Is

    no strict 'refs';
    my $string = num2range( @{ $block->{assigned} } );

    print <<"HERE";
sub My::In$subname {
    return <<'CODEPOINTS';
$string
CODEPOINTS
    }

HERE
    }

# http://www.perlmonks.org/?node_id=87538
sub num2range {
  local $_ = join ',' => sort { $a <=> $b } @_;
  s/(?<!\d)(\d+)(?:,((??{$++1})))+(?!\d)/$1\t$+/g;
  s/(\d+)/ sprintf "%X", $1/eg;
  s/,/\n/g;
  return $_;
}

This gives me output suitable for a Perl library:

sub My::InMiscellaneousMathematicalSymbolsA {
    return <<'CODEPOINTS';
27C0    27CA
27CC
27D0    27EF
CODEPOINTS
    }

sub My::InSupplementalMathematicalOperators {
    return <<'CODEPOINTS';
2A00    2AFF
CODEPOINTS
    }

sub My::InMathematicalAlphanumericSymbols {
    return <<'CODEPOINTS';
1D400   1D454
1D456   1D49C
1D49E   1D49F
1D4A2
1D4A5   1D4A6
1D4A9   1D4AC
1D4AE   1D4B9
1D4BB
1D4BD   1D4C3
1D4C5   1D505
1D507   1D50A
1D50D   1D514
1D516   1D51C
1D51E   1D539
1D53B   1D53E
1D540   1D544
1D546
1D54A   1D550
1D552   1D6A5
1D6A8   1D7CB
1D7CE   1D7FF
CODEPOINTS
    }

sub My::InMiscellaneousMathematicalSymbolsB {
    return <<'CODEPOINTS';
2980    29FF
CODEPOINTS
    }

sub My::InMathematicalOperators {
    return <<'CODEPOINTS';
2200    22FF
CODEPOINTS
    }
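Once subs like these are compiled, they can be used as ordinary user-defined properties in a regex. Here is a small sketch of that, with one generated sub copied in by hand; the in_math_symbols_a helper is a hypothetical name of mine, not something the generator produces.

```perl
use strict;
use warnings;

# Copied from the generated output; normally this would live in a
# library file that you load before using the property.
sub My::InMiscellaneousMathematicalSymbolsA {
    return <<'CODEPOINTS';
27C0    27CA
27CC
27D0    27EF
CODEPOINTS
}

# Hypothetical helper: true if the code point is assigned in the block,
# according to the list above.
sub in_math_symbols_a {
    my ($code_point) = @_;
    return chr($code_point) =~ /\p{My::InMiscellaneousMathematicalSymbolsA}/
        ? 1
        : 0;
}

for my $code_point ( 0x27CA, 0x27CB, 0x27CC ) {
    printf "%X %s\n", $code_point,
        in_math_symbols_a($code_point) ? 'matches' : 'does not match';
}
```

With the list above, 27CA and 27CC should match and the hole at 27CB should not. Because the property name is package-qualified, the same \p{...} works from any package.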

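Answer 2 (score: -2)

I don't know why you wouldn't just say miscellaneous_mathematical_symbols_a => [0x27C0..0x27EF], because according to the PDF that is how the Unicode standard defines the block.

What do you mean when you say it "doesn't work"? If it gives you some kind of error when you check whether a character exists in the block, why not just weed those characters out of the block when your checker runs into the error?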