Perl's internal representation of Unicode strings

Asked: 2015-05-27 09:14:07

Tags: perl unicode encoding mojolicious

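I am developing a Perl + Mojolicious web application. My front end sends POST requests containing accented characters in a parameter "a" (value "été", charset UTF-8), as I can see in the Chrome network tab. But the server-side script decodes that parameter using a charset I did not expect. I wrote the following script to reproduce the situation.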

use utf8; #script encoded in utf8 without bom
use Mojolicious::Lite; 
use Data::HexDump;
{
    require Mojolicious;
    say "perl $^V, Mojolicious: v", Mojolicious->VERSION, ", ", `chcp` ;
}

post '/' => sub{
        my $self = shift;
        my $params = $self->req->params->to_hash;
        app->log->debug("received data:\n", HexDump( $params->{a} ) );
        use Devel::Peek;
        Dump( $params->{a} );
        $self->render( text => "ok for '$params->{a}'" );
    };

if(my $pid = fork()){
    use Mojo::UserAgent;
    my $t = Mojo::UserAgent->new;
    #simulate front-end query
    my $tx  = $t->post('http://127.0.0.1:3042/' => 
                            { 'Content-Type' => 'application/x-www-form-urlencoded; charset=UTF-8' }, 
                            form => {  a => 'été'} 
                        );
    my $res = $tx->res->body;
    say "result:\n", HexDump($res);
    use Devel::Peek;
    Dump( $res );
    kill 'SIGKILL', $pid;
    exit(0);
}

app->start(qw(daemon --listen http://*:3042 ));

The output of this script is:

perl v5.20.1, Mojolicious: v6.05, Page de codes active : 850

[Tue May 26 12:31:15 2015] [info] Listening at "http://*:3042"
Server available at http://127.0.0.1:3042
[Tue May 26 12:31:16 2015] [debug] Your secret passphrase needs to be changed
[Tue May 26 12:31:16 2015] [debug] POST "/"
[Tue May 26 12:31:16 2015] [debug] Routing to a callback
[Tue May 26 12:31:16 2015] [debug] received data:

          00 01 02 03 04 05 06 07 - 08 09 0A 0B 0C 0D 0E 0F  0123456789ABCDEF

00000000  E9 74 E9                                           .t.

SV = PVMG(0x5a7a198) at 0x4dce730
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  IV = 0
  NV = 0
  PV = 0x5b62c48 "\303\251t\303\251"\0 [UTF8 "\x{e9}t\x{e9}"]
  CUR = 5
  LEN = 10
[Tue May 26 12:31:16 2015] [debug] 200 OK (0.005052s, 197.941/s)
result:
          00 01 02 03 04 05 06 07 - 08 09 0A 0B 0C 0D 0E 0F  0123456789ABCDEF

00000000  6F 6B 20 66 6F 72 20 27 - C3 A9 74 C3 A9 27        ok for '..t..'

SV = PV(0x41a73e8) at 0x4927070
  REFCNT = 1
  FLAGS = (PADMY,POK,IsCOW,pPOK)
  PV = 0x5aa1328 "ok for '\303\251t\303\251'"\0
  CUR = 14
  LEN = 16
  COW_REFCNT = 1

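So we can see that the server receives the parameter "a" as a string flagged UTF8, whose content is reported as "\x{e9}t\x{e9}".

I was expecting "été" with the hex bytes "C3 A9 74 C3 A9".

What is wrong?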

2 answers:

Answer 0: (score: 1)

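Update: there is nothing wrong with your program. You are getting été exactly as you wanted; it has simply been decoded into the Perl Unicode string "\xE9t\xE9". The two are the same thing. At the Perl level a Unicode string is a sequence of Unicode code points (ordinals), not of UTF-8 bytes: the incoming data was decoded from UTF-8 into code points. UTF-8 is just one way of encoding/representing those code points. (The UTF8 flag in your Devel::Peek dump shows that Perl happens to hold this string as UTF-8 internally, but that is an implementation detail that ord and length do not expose.) é is ordinal 233; see the Wikipedia link below (and the updated program).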

Well, été is C3 A9 74 C3 A9 only when encoded as UTF-8; the numbers/ordinals of été are 233 116 233.

As a Perl Unicode string it is \xE9t\xE9, because the number 233 is E9 in hexadecimal.
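The relationship between the ordinals and the UTF-8 bytes can be checked with a minimal sketch that uses only the core Encode module. The variable names are illustrative and not taken from the question's script.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Character string: three code points, as Mojolicious hands it to the handler.
my $chars = "\x{e9}t\x{e9}";

# Encoding turns code points into UTF-8 bytes (the wire/file form).
my $bytes = encode('UTF-8', $chars);

my $ordinals = join ' ', map { ord($_) } split //, $chars;
my $hex      = join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;

print "characters: ", length($chars), " ($ordinals)\n";   # characters: 3 (233 116 233)
print "bytes:      ", length($bytes), " ($hex)\n";        # bytes:      5 (C3 A9 74 C3 A9)

# Decoding the bytes gives back the original character string.
my $round_trip = decode('UTF-8', $bytes);
print "round trip ok\n" if $round_trip eq $chars;
```

So the C3 A9 74 C3 A9 sequence expected in the question only appears once the string is encoded again, for example just before hex-dumping it.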

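Update: previously I created the UTF-8 file 2 with an editor; here it is created with Perl. You can see that it contains exactly the bytes you expected, and you can see the difference between reading it as UTF-8 and reading it raw.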

$ perl -CS -e " print chr(233), chr(116), chr(233) " >2

$ od -tx1 2
0000000 c3 a9 74 c3 a9
0000005

$ type 2
été
$
$ perl -MData::Dump -MPath::Tiny -e " dd ( path(2)->slurp_raw ) "
"\xC3\xA9t\xC3\xA9"

$ perl -MData::Dump -MPath::Tiny -e " dd ( path(2)->slurp_utf8 ) "
"\xE9t\xE9"

$ perl -MData::Dump -MPath::Tiny -e " dd( map { [ $_, ord$_ ] } split //, path(2)->slurp_utf8 ) "
(["\xE9", 233], ["t", 116], ["\xE9", 233])
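The UTF8 flag shown by Devel::Peek in the question describes only how Perl stores the string internally, not what the string contains. The following sketch, which assumes nothing beyond the built-in utf8:: functions, stores the same three characters both ways and compares them. The variable names are made up for illustration.

```perl
use strict;
use warnings;

my $narrow = "\x{e9}t\x{e9}";
my $wide   = $narrow;

utf8::downgrade($narrow);   # internal storage: one byte per character, flag off
utf8::upgrade($wide);       # internal storage: UTF-8, flag on

printf "narrow: flag=%d length=%d\n", (utf8::is_utf8($narrow) ? 1 : 0), length($narrow);
printf "wide:   flag=%d length=%d\n", (utf8::is_utf8($wide)   ? 1 : 0), length($wide);

# Both hold the same code points, so Perl treats them as the same string.
print "same string\n" if $narrow eq $wide;
```

Both variables have length 3 and compare equal; only the internal flag differs, which is why the dump in the question can show a UTF-8 buffer while the string itself is still "\x{e9}t\x{e9}".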

Answer 1: (score: 1)

U+00E9 is the code point of é. c3 a9 is its UTF-8 encoding. To see the UTF-8 encoded form of 'é', you need to UTF-8 encode it. For example:

#!/usr/bin/env perl -l

use utf8;
use strict;
use warnings;
use Unicode::UTF8 qw( encode_utf8 );

binmode STDOUT, ':encoding(UTF-8)';

my $é = "\x{e9}";

print $é;
printf "%v02x\n", encode_utf8($é);

Output:

$ ./u.pl
é
c3.a9
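The opposite direction is what happens to the posted parameter on the server. The sketch below is not Mojolicious's actual code; it is a hand-rolled approximation that assumes the request body on the wire is a=%C3%A9t%C3%A9 (the URL-encoded UTF-8 form of été) and uses only the core Encode module.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Assumed wire format of the form body for a => 'été' sent as UTF-8.
my $wire = 'a=%C3%A9t%C3%A9';

my ($name, $value) = split /=/, $wire, 2;

# Step 1: undo the percent-escaping to recover the raw UTF-8 bytes.
(my $bytes = $value) =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
my $byte_hex = join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;

# Step 2: decode the bytes into a character string of code points.
my $chars = decode('UTF-8', $bytes);
my $code_points = join ' ', map { sprintf 'U+%04X', ord($_) } split //, $chars;

print "bytes: $byte_hex\n";      # bytes: C3 A9 74 C3 A9
print "chars: $code_points\n";   # chars: U+00E9 U+0074 U+00E9
```

The bytes the question expected exist only before step 2; what a handler receives afterwards is the decoded character string.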