Does anyone recognize this MARC JSON format?

Asked: 2018-03-11 11:58:35

Tags: json perl marc

Does anyone recognize this format (see the paste at the bottom)? It comes from the Répertoire de vedettes-matière (RVM). It is neither of these:

I can program in Perl. I have also posted this as https://github.com/LibreCat/Catmandu-MARC/issues/88

I can simply hack it apart with JSON::XS, but I don't know how to handle this weird accent encoding (a few example lines shown, out of 325):

{grave}e
{ring}Z
{ringb}h
{ringb}s
{rlig}a
{rlig}A

Here is the weird MARC JSON:

{
"rows" : [
{
    "RecordNumber" : "1",
    "Tag" : "LDR",
    "Indicators" : "",
    "Content" : "00533nz   2200205n  4500"
}
,
{
    "RecordNumber" : "1",
    "Tag" : "001",
    "Indicators" : "\"  \"",
    "Content" : "201-0000001"
}
,
{
    "RecordNumber" : "1",
    "Tag" : "005",
    "Indicators" : "\"  \"",
    "Content" : "20121025110000.0"
}
,
{
    "RecordNumber" : "1",
    "Tag" : "008",
    "Indicators" : "\"  \"",
    "Content" : "790704\\nfanvnnbabn\\\\\\\\\\\\\\\\\\\\\\b\\ana\\\\\\\\\\\\"
}
,
{
    "RecordNumber" : "1",
    "Tag" : "016",
    "Indicators" : "\\\\",
    "Content" : "$a0509B3366"
}
,
{
    "RecordNumber" : "1",
    "Tag" : "035",
    "Indicators" : "\\\\",
    "Content" : "$a(ISM)8013850"
}
,
{
    "RecordNumber" : "1",
    "Tag" : "035",
    "Indicators" : "9\\",
    "Content" : "$a201-0000001"
}
,
{
    "RecordNumber" : "1",
    "Tag" : "040",
    "Indicators" : "\\\\",
    "Content" : "$aCaQQLa$bfre"
}
,
{
    "RecordNumber" : "1",
    "Tag" : "150",
    "Indicators" : "\\\\",
    "Content" : "$aAlg{grave}ebres de Von Neumann"
}
,
{
    "RecordNumber" : "1",
    "Tag" : "450",
    "Indicators" : "\\\\",
    "Content" : "$wnne$aVon Neumann, Alg{grave}ebres de"
}
,
{
    "RecordNumber" : "1",
    "Tag" : "450",
    "Indicators" : "\\\\",
    "Content" : "$aW*-alg{grave}ebres"
}
,
{
    "RecordNumber" : "1",
    "Tag" : "550",
    "Indicators" : "\\\\",
    "Content" : "$wg$aC*-alg{grave}ebres"
}
,
{
    "RecordNumber" : "1",
    "Tag" : "550",
    "Indicators" : "\\\\",
    "Content" : "$wg$aEspace de Hilbert"
}
,
{
    "RecordNumber" : "1",
    "Tag" : "697",
    "Indicators" : "\\\\",
    "Content" : "$amm."
}
,
{
    "RecordNumber" : "1",
    "Tag" : "750",
    "Indicators" : "\\7",
    "Content" : "$aVon Neumann, Alg{grave}ebres de$2ram"
}
,
{
    "RecordNumber" : "1",
    "Tag" : "750",
    "Indicators" : "\\0",
    "Content" : "$aVon Neumann algebras"
}
]
}
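
For readers who want to experiment with this layout, here is a minimal sketch that regroups the flat "rows" list into records using only the core JSON::PP module. It rests on assumptions the paste suggests but does not confirm: a backslash stands for a blank, control fields (LDR and 00X) have no subfields, and "$" followed by one character starts a subfield and never appears unescaped inside a value. The helper name rows_to_records and the two-row excerpt are illustrative only.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use JSON::PP qw(decode_json);

# A two-row excerpt of the paste above, embedded so the sketch runs on its own.
my $json = <<'JSON';
{ "rows" : [
  { "RecordNumber" : "1", "Tag" : "001", "Indicators" : "\"  \"", "Content" : "201-0000001" },
  { "RecordNumber" : "1", "Tag" : "750", "Indicators" : "\\7", "Content" : "$aVon Neumann, Alg{grave}ebres de$2ram" }
] }
JSON

# Hypothetical helper: group flat rows into { RecordNumber => [ fields ] }.
sub rows_to_records {
    my ($rows) = @_;
    my %records;
    for my $row (@$rows) {
        my $tag     = $row->{Tag};
        my $content = $row->{Content};
        my %field   = (tag => $tag);

        if ($tag eq 'LDR' or $tag lt '010') {
            # Leader and control fields: keep the content as one data string.
            # Assumption: a backslash stands for a blank.
            ($field{data} = $content) =~ tr/\\/ /;
        }
        else {
            # Assumption: a backslash stands for a blank indicator.
            (my $ind = $row->{Indicators}) =~ tr/\\/ /;
            $field{ind1} = substr($ind, 0, 1);
            $field{ind2} = substr($ind, 1, 1);

            # Assumption: "$x" opens subfield x; values contain no bare "$".
            my @subfields;
            while ($content =~ /\$(.)([^\$]*)/g) {
                push @subfields, [ $1, $2 ];
            }
            $field{subfields} = \@subfields;
        }
        push @{ $records{ $row->{RecordNumber} } }, \%field;
    }
    return \%records;
}

my $records = rows_to_records(decode_json($json)->{rows});
for my $field (@{ $records->{1} }) {
    if (exists $field->{data}) {
        print "$field->{tag} $field->{data}\n";
    }
    else {
        print "$field->{tag} [$field->{ind1}$field->{ind2}] ",
            join(' ', map { "\$$_->[0] $_->[1]" } @{ $field->{subfields} }), "\n";
    }
}
```

The mnemonic codes such as {grave} are left untouched here; this sketch only covers the record structure.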

ADDED: This accent encoding comes from MARCmkr. I used the following:

use MARC::File::MARCMaker; # https://metacpan.org/pod/MARC::File::MARCMaker
# for some reason can't be found by module name, so use:
# cpanm http://www.cpan.org/authors/id/E/EI/EIJABB/MARC-File-MARCMaker-0.05.tar.gz
my $marc_charset = MARC::File::MARCMaker::usmarc_default();
$content = MARC::File::MARCMaker::_maker2char ($content, $marc_charset);

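But when I tested it on this text, https://github.com/gmcharlt/marc-perl/blob/e8e0ecc92946d6dcb3c2270706041a30eff0f68d/marc-marcmaker/t/marcmaker.t#L92 , it only converted the accents and ligatures to XML entities. I tried opening the converted text in a browser: some entities are not interpreted, and none of them puts an accent on the following character. So I guess I now need some "XML to Unicode" module to finish the conversion.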

This a test of diacritics like the uppercase Polish L in
Ł´od´z, the uppercase Scandinavia O in &Ostrok;st, the
uppercase D with crossbar in Đuro, the uppercase Icelandic
thorn in Þann, the uppercase digraph AE in Ægir, the
uppercase digraph OE in Œuvres, the soft sign in
rech&softsign;, the middle dot in col·lecci´o, the musical
flat in F♭, the patent mark in Frizbee®, the plus or minus
sign in ±54%, the uppercase O-hook in B&Ohorn;, the
uppercase U-hook in X&Uhorn;A, the alif in
mas&mlrhring;alah, the ayn in &mllhring;arab, the lowercase
Polish l in Włocław, the lowercase Scandinavian o in
K&ostrok;benhavn, the lowercase d with crossbar in đavola,
the lowercase Icelandic thorn in þann, the lowercase digraph
ae in være, the lowercase digraph oe in cœur, the lowercase
hardsign in s&hardsign;ezd, the Turkish dotless i in masalı,
the British pound sign in £5.95, the lowercase eth in
verður, the lowercase o-hook (with pseudo question mark) in
S&hooka;&ohorn;, the lowercase u-hook in T&uhorn; D&uhorn;c,
the pseudo question mark in c&hooka;ui, the grave accent in
tr`es, the acute accent in d´esir´ee, the circumflex in
cˆote, the tilde in ma˜nana, the macron in T¯okyo, the breve
in russki˘i, the dot above in ˙zaba, the dieresis (umlaut)
in L¨owenbr¨au, the caron (hachek) in ˇcrny, the circle
above (angstrom) in ˚arbok, the ligature first and second
halves in d&llig;i&rlig;ad&llig;i&rlig;a, the high comma off
center in rozdel&rcommaa;ovac, the double acute in
id˝oszaki, the candrabindu (breve with dot above) in
Ali&candra;iev, the cedilla in ¸ca va comme ¸ca, the right
hook in viet˛a, the dot below in te&dotb;da, the double dot
below in &under;k&under;hu&dbldotb;tbah, the circle below in
Sa&dotb;msk&ringb;rta, the double underscore in
&dblunder;Ghulam, the left hook in Lech Wał&commab;esa, the
right cedilla (comma below) in khŗong, the upadhmaniya (half
circle below) in &breveb;humantuˇs, double tilde, first and
second halves in &ldbltil;n&rdbltil;galan, high comma
(centered) in g&commaa;eotermika.

1 answer:

Answer 0 (score: 1)

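This is an encoding problem. The record leader indicates that the data is encoded in MARC-8, whereas your JSON data should be encoded in UTF-8. _maker2char(), called with usmarc_default(), maps the mnemonic accent codes to MARC-8 encoded characters. Use MARC::Charset to convert that data to UTF-8. This should work: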

#!/usr/bin/env perl

use 5.014;

use utf8;
use strict;
use autodie;
use warnings;

use MARC::File::MARCMaker;
use MARC::Charset qw(marc8_to_utf8);

my $data = q{This is a test of diacritics like the uppercase Polish L in {Lstrok}{acute}od{acute}z
the uppercase Scandinavia O in {Ostrok}st
the uppercase D with crossbar in {Dstrok}uro
the uppercase Icelandic thorn in {THORN}ann
the uppercase digraph AE in {AElig}gir
the uppercase digraph OE in {OElig}uvres
the soft sign in rech{softsign}
the middle dot in col{middot}lecci{acute}o
the musical flat in F{flat}
the patent mark in Frizbee{reg}
the plus or minus sign in {plusmn}54%
the uppercase O-hook in B{Ohorn}
the uppercase U-hook in X{Uhorn}A
the alif in mas{mlrhring}alah
the ayn in {mllhring}arab
the lowercase Polish l in W{lstrok}oc{lstrok}aw
the lowercase Scandinavian o in K{ostrok}benhavn
the lowercase d with crossbar in {dstrok}avola
the lowercase Icelandic thorn in {thorn}ann
the lowercase digraph ae in v{aelig}re
the lowercase digraph oe in c{oelig}ur
the lowercase hardsign in s{hardsign}ezd
the Turkish dotless i in masal{inodot}
the British pound sign in {pound}5.95
the lowercase eth in ver{eth}ur
the lowercase o-hook (with pseudo question mark) in S{hooka}{ohorn}
the lowercase u-hook in T{uhorn} D{uhorn}c
the pseudo question mark in c{hooka}ui
the grave accent in tr{grave}es
the acute accent in d{acute}esir{acute}ee
the circumflex in c{circ}ote
the tilde in ma{tilde}nana
the macron in T{macr}okyo
the breve in russki{breve}i
the dot above in {dot}zaba
the dieresis (umlaut) in L{uml}owenbr{uml}au
the caron (hachek) in {caron}crny
the circle above (angstrom) in {ring}arbok
the ligature first and second halves in d{llig}i{rlig}ad{llig}i{rlig}a
the high comma off center in rozdel{rcommaa}ovac
the double acute in id{dblac}oszaki
the candrabindu (breve with dot above) in Ali{candra}iev
the cedilla in {cedil}ca va comme {cedil}ca
the right hook in viet{ogon}a
the dot below in te{dotb}da
the double dot below in {under}k{under}hu{dbldotb}tbah
the circle below in Sa{dotb}msk{ringb}rta
the double underscore in {dblunder}Ghulam
the left hook in Lech Wa{lstrok}{commab}esa
the right cedilla (comma below) in kh{rcedil}ong
the upadhmaniya (half circle below) in {breveb}humantu{caron}s
double tilde
first and second halves in {ldbltil}n{rdbltil}galan
high comma (centered) in g{commaa}eotermika.
Alg{grave}ebres de Von Neumann
{grave}e
{ring}Z
{ringb}h
{ringb}s
{rlig}a
{rlig}A
};

my $marc_charset = MARC::File::MARCMaker::usmarc_default();
my $marc8 = MARC::File::MARCMaker::_maker2char($data, $marc_charset);

# prepare STDOUT for UTF-8 output (PerlIO layer names take a leading colon)
binmode(STDOUT, ':encoding(UTF-8)');

# convert marc8 to utf8
my $utf8 = marc8_to_utf8($marc8);

say $utf8;

Output:

This is a test of diacritics like the uppercase Polish L in Łódź
the uppercase Scandinavia O in Øst
the uppercase D with crossbar in Đuro
the uppercase Icelandic thorn in Þann
the uppercase digraph AE in Ægir
the uppercase digraph OE in Œuvres
the soft sign in rechʹ
the middle dot in col·lecció
the musical flat in F♭
the patent mark in Frizbee®
the plus or minus sign in ±54%
the uppercase O-hook in BƠ
the uppercase U-hook in XƯA
the alif in masʼalah
the ayn in ʻarab
the lowercase Polish l in Włocław
the lowercase Scandinavian o in København
the lowercase d with crossbar in đavola
the lowercase Icelandic thorn in þann
the lowercase digraph ae in være
the lowercase digraph oe in cœur
the lowercase hardsign in sʺezd
the Turkish dotless i in masalı
the British pound sign in £5.95
the lowercase eth in verður
the lowercase o-hook (with pseudo question mark) in Sở
the lowercase u-hook in Tư Dưc
the pseudo question mark in củi
the grave accent in très
the acute accent in désirée
the circumflex in côte
the tilde in mañana
the macron in Tōkyo
the breve in russkiĭ
the dot above in żaba
the dieresis (umlaut) in Löwenbräu
the caron (hachek) in črny
the circle above (angstrom) in årbok
the ligature first and second halves in di͡adi͡a
the high comma off center in rozdelo̕vac
the double acute in időszaki
the candrabindu (breve with dot above) in Alii̐ev
the cedilla in ça va comme ça
the right hook in vietą
the dot below in teḍa
the double dot below in k̲h̲ut̤bah
the circle below in Saṃskr̥ta
the double underscore in G̳hulam
the left hook in Lech Wałe̦sa
the right cedilla (comma below) in kho̜ng
the upadhmaniya (half circle below) in ḫumantuš
double tilde
first and second halves in n͠galan
high comma (centered) in ge̓otermika.
Algèbres de Von Neumann
è
Z̊
h̥
s̥
a
A
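
Two points from this answer can be checked with core Perl alone; the following is a rough sketch, not a replacement for MARC::Charset. First, leader position 09 is the character coding scheme in MARC 21: a blank means MARC-8 and "a" means UCS/Unicode. Second, a mnemonic names a combining mark written before its base letter, whereas Unicode puts the combining mark after the base. The four-entry table and the function names leader_encoding and mnemonics_to_unicode are hypothetical and cover only a handful of the codes shown in the question.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);

# MARC 21 leader/09: blank = MARC-8, "a" = UCS/Unicode.
sub leader_encoding {
    my ($leader) = @_;
    my $flag = substr($leader, 9, 1);
    return $flag eq 'a' ? 'Unicode'
         : $flag eq ' ' ? 'MARC-8'
         :                'unknown';
}

# Hypothetical excerpt, NOT the full MARCMaker mnemonic set.
my %combining = (
    grave => "\x{0300}",    # combining grave accent
    acute => "\x{0301}",    # combining acute accent
    ring  => "\x{030A}",    # combining ring above
    ringb => "\x{0325}",    # combining ring below
);

# Move one or more marks from before the base character to after it.
sub _attach {
    my ($marks, $base) = @_;    # copy the captures before matching again
    my @names = $marks =~ /\{(\w+)\}/g;
    return $base . join '', map { $combining{$_} } @names;
}

sub mnemonics_to_unicode {
    my ($text) = @_;
    my $names = join '|', keys %combining;
    # Known mnemonics followed by a base character; unknown ones stay as-is.
    $text =~ s/((?:\{(?:$names)\})+)(.)/_attach($1, $2)/ge;
    return NFC($text);          # compose to precomposed forms where they exist
}

binmode(STDOUT, ':encoding(UTF-8)');
print leader_encoding('00533nz   2200205n  4500'), "\n";
print mnemonics_to_unicode($_), "\n"
    for 'Alg{grave}ebres de Von Neumann', '{ringb}h', '{rlig}a';
```

For real data the route described in the answer (MARCMaker mnemonics to MARC-8, then marc8_to_utf8) remains the safer choice, since it knows the complete character set, including non-combining symbols.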