Question

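I have to convert a number of large files (up to 2 GB) of EBCDIC 500 encoded files to Latin-1. Since I could only find EBCDIC-to-ASCII converters (dd, recode) and the files contain some additional proprietary character codes, I thought I'd write my own converter.

I have the character mapping, so I'm interested in the technical aspects.

This is my approach so far: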

import sys

# char mapping lookup table
EBCDIC_TO_LATIN1 = {
  0xC1:'41', # A
  0xC2:'42', # B
  # and so on...
}

BUFFER_SIZE = 1024 * 64
ebd_file = file(sys.argv[1], 'rb')
latin1_file = file(sys.argv[2], 'wb')

buffer = ebd_file.read(BUFFER_SIZE)
while buffer:
  latin1_file.write(ebd2latin1(buffer))
  buffer = ebd_file.read(BUFFER_SIZE)

ebd_file.close()
latin1_file.close()

This is the function that does the conversion:

def ebd2latin1(ebcdic):

   result = []
   for ch in ebcdic:
     result.append(EBCDIC_TO_LATIN1[ord(ch)])

   return ''.join(result).decode('hex')

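The question is whether this is a sensible approach from an engineering standpoint. Does it have any serious design problems? Is the buffer size OK? And so on...

As for the "proprietary characters" that some people don't believe in: each file contains a year's worth of patent documents in SGML format. The patent office used EBCDIC until they switched to Unicode in 2005. So there are thousands of documents within each file. They are separated by some hex values that are not part of any IBM specification; they were added by the patent office. Also, at the beginning of each file there are a few digits in ASCII that tell you the length of the file. I don't really need that information, but if I want to process the file I have to deal with them.

Also:

$ recode IBM500/CR-LF..Latin1 file.ebc
recode: file.ebc failed: Ambiguous output in step `CR-LF..data'

Thanks for your help.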

Answer 1

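EBCDIC 500, aka Code Page 500, is one of Python's encodings, although you link to cp1047, which isn't. Which one are you using, really? Anyway, this works for cp500 (or any other encoding that you have).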

from __future__ import with_statement
import sys
from contextlib import nested

BUFFER_SIZE = 16384
with nested(open(sys.argv[1], 'rb'), open(sys.argv[2], 'wb')) as (infile, outfile):

    while True:
        buffer = infile.read(BUFFER_SIZE)
        if not buffer:
            break
        outfile.write(buffer.decode('cp500').encode('latin1'))

This way you don't need to keep track of the mappings yourself.
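On Python 3 the same chunked approach might look like the sketch below. It assumes Python 3 (the answer above targets Python 2), and `transcode` is a hypothetical helper name, not something from the answer. Because cp500 is a single-byte encoding, a chunk boundary can never split a character, so the buffer size only affects speed.

```python
BUFFER_SIZE = 64 * 1024  # bytes per read; any positive size is safe for a single-byte codec


def transcode(src_path, dst_path, src_encoding="cp500", dst_encoding="latin-1"):
    """Copy src_path to dst_path, re-encoding chunk by chunk (hypothetical helper)."""
    with open(src_path, "rb") as infile, open(dst_path, "wb") as outfile:
        while True:
            chunk = infile.read(BUFFER_SIZE)
            if not chunk:  # an empty bytes object signals end of file
                break
            outfile.write(chunk.decode(src_encoding).encode(dst_encoding))
```

Calling `transcode("file.ebc", "file.txt")` would then convert one file; the file names here are placeholders.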

Answer 2

If you set up the table correctly, then you just need to do:

translated_chars = ebcdic.translate(EBCDIC_TO_LATIN1)

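where ebcdic contains EBCDIC characters and EBCDIC_TO_LATIN1 is a 256-char string which maps each EBCDIC character to its Latin-1 equivalent. The characters in EBCDIC_TO_LATIN1 are the actual binary values rather than their hex representations. For example, if you are using code page 500, the first 16 bytes of EBCDIC_TO_LATIN1 would be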

'\x00\x01\x02\x03\x9C\x09\x86\x7F\x97\x8D\x8E\x0B\x0C\x0D\x0E\x0F'

using this reference.
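Rather than typing the table by hand, one option is to derive it from a codec. The sketch below assumes Python 3 and assumes that Python's built-in cp500 codec describes the data; any proprietary codes would still have to be patched into the table manually.

```python
# Build the 256-byte table once: index = EBCDIC byte value, entry = Latin-1 byte value.
# Assumption: the built-in cp500 codec matches the files being converted.
EBCDIC_TO_LATIN1 = bytes(range(256)).decode("cp500").encode("latin-1", "replace")


def ebd2latin1(ebcdic):
    """Translate a bytes object from EBCDIC 500 to Latin-1 by table lookup."""
    return ebcdic.translate(EBCDIC_TO_LATIN1)


print(ebd2latin1(b"\xc1\xc2\xc3"))  # b'ABC'
```

The lookup runs in C inside `bytes.translate`, so it avoids the per-character Python loop from the question.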

Answer 3

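While this might not help the original poster anymore, some time ago I released a package for Python 2.6+ and 3.2+ that adds most of the western 8-bit mainframe codecs, including CP1047 (French) and CP1141 (German). Simply import ebcdic to add the codecs and then use open(..., encoding='cp1047') to read or write files.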

Answer 4

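Answer 1:

Yet another silly question: what gave you the impression that recode produces only ASCII as output? AFAICT it will transcode ANY of its repertoire of charsets to ANY of its repertoire, and its repertoire includes IBM cp500 and cp1047, and OF COURSE latin1. Reading the comments, you will note that Lennaert and I have discovered that there aren't any "proprietary" codes in those two IBM character sets. So you may well be able to use recode after all, once you are sure which charset you actually have.

Answer 2:

If you really need/want to transcode IBM cp1047 via Python, you might like to first get the mapping from an authoritative source, processing it via a script with some checks: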

URL = "http://source.icu-project.org/repos/icu/data/trunk/charset/data/ucm/glibc-IBM1047-2.1.2.ucm"
"""
Sample lines:
<U0000>  \x00 |0
<U0001>  \x01 |0
<U0002>  \x02 |0
<U0003>  \x03 |0
<U0004>  \x37 |0
<U0005>  \x2D |0
"""
import urllib, re
text = urllib.urlopen(URL).read()
regex = r"<U([0-9a-fA-F]{4,4})>\s+\\x([0-9a-fA-F]{2,2})\s"
results = re.findall(regex, text)
wlist = [None] * 256
for result in results:
    unum, inum = [int(x, 16) for x in result]
    assert wlist[inum] is None
    assert 0 <= unum <= 255
    wlist[inum] = chr(unum)
assert not any(x is None for x in wlist)
print repr(''.join(wlist))

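Then carefully copy/paste the output into your transcoding script for use with Vinay's buffer.translate(the_mapping) idea, with a buffer size perhaps a bit larger than 16KB and certainly a bit smaller than 2GB :-)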
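The parsing and checking steps can also be sketched in Python 3 without the download. The example below works on a few sample lines copied from the docstring above; `parse_ucm` and `build_table` are hypothetical helper names, and a real run would feed the full .ucm text to them.

```python
import re

# A few sample lines in the .ucm format; the real file has 256 such entries.
SAMPLE_UCM = r"""
<U0000>  \x00 |0
<U0004>  \x37 |0
<U0005>  \x2D |0
"""

UCM_LINE = re.compile(r"<U([0-9a-fA-F]{4})>\s+\\x([0-9a-fA-F]{2})\s")


def parse_ucm(text):
    """Return {ebcdic_byte: unicode_code_point} (hypothetical helper)."""
    mapping = {}
    for uni_hex, byte_hex in UCM_LINE.findall(text):
        code_point, byte_value = int(uni_hex, 16), int(byte_hex, 16)
        if byte_value in mapping:
            raise ValueError("byte 0x%02X mapped twice" % byte_value)
        if code_point > 0xFF:
            raise ValueError("U+%04X is outside Latin-1" % code_point)
        mapping[byte_value] = code_point
    return mapping


def build_table(mapping):
    """Turn a complete mapping into a 256-byte table for bytes.translate."""
    missing = [b for b in range(256) if b not in mapping]
    if missing:
        raise ValueError("unmapped bytes: %r" % missing)
    return bytes(mapping[b] for b in range(256))


print(parse_ucm(SAMPLE_UCM))
```

Raising exceptions instead of using assert keeps the checks active even when Python runs with optimizations enabled.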

Answer 5

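No crystal ball, no info from the OP, so I had a bit of a rummage in the EPO website. Found freely downloadable weekly patent info files, still available in cp500/SGML even though the website says this was to be replaced by utf8/XML in 2006 :-). Got the 2009 week 27 file. It is a zip containing 2 files, s350927[ab].bin. "bin" means "not XML". Got the spec! It looks possible that the "proprietary codes" are actually BINARY fields. Each record has a fixed 252-byte header. The first 5 bytes are the record length in EBCDIC, e.g. hex F0F2F2F0F8 -> 2208 bytes. The last 2 bytes of the fixed header are the BINARY length (redundant) of the following variable part. In the middle are several text fields, two 2-byte binary fields, and one 4-byte binary field. The binary fields are serial numbers within groups, but all I saw were 1. The variable part is SGML.

Example (last record from s350927b.bin):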

Record number: 7266
pprint of header text and binary slices:
['EPB102055619         TXT00000001',
 1,
 '        20090701200927 08013627.8     EP20090528NN    ',
 1,
 1,
 '                                     T *lots of spaces snipped*']
Edited version of the rather long SGML:
<PATDOC FILE="08013627.8" CY=EP DNUM=2055619 KIND=B1 DATE=20090701 STATUS=N>
*snip*
<B541>DE<B542>Windschutzeinheit für ein Motorrad
<B541>EN<B542>Windshield unit for saddle-ride type vehicle
<B541>FR<B542>Unité pare-brise pour motocyclette</B540>
*snip*
</PATDOC>

There are no header or trailer records, just this one record format.
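The two length fields described above can be read with a few lines of Python 3. This is a minimal sketch that only looks at the first 5 and last 2 header bytes and ignores the fields in between; `parse_header` is a hypothetical name, and the demo header is synthetic, not taken from a real file.

```python
import struct

HDRSZ = 252  # fixed header size, per the description above


def parse_header(header):
    """Return (record_size, sgml_size) from one 252-byte header (hypothetical helper)."""
    if len(header) != HDRSZ:
        raise ValueError("expected %d header bytes, got %d" % (HDRSZ, len(header)))
    # First 5 bytes: total record length as EBCDIC digits, e.g. F0F2F2F0F8 -> 02208.
    record_size = int(header[:5].decode("cp500"))
    # Last 2 bytes: big-endian unsigned length of the variable SGML part.
    (sgml_size,) = struct.unpack(">H", header[-2:])
    if sgml_size != record_size - HDRSZ:
        raise ValueError("header lengths disagree")
    return record_size, sgml_size


# Synthetic header: length digits, EBCDIC space padding (0x40), binary SGML length.
demo = bytes.fromhex("F0F2F2F0F8") + b"\x40" * 245 + struct.pack(">H", 2208 - HDRSZ)
print(parse_header(demo))  # (2208, 1956)
```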

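So: if the OP's annual files are anything like this, we might be able to help him out.

Update: the above was the "2 a.m. in my timezone" version. Here's a bit more info:

The OP said: "at the beginning of each file there are a few digits in ASCII that tell you about the length of the file." ... translate that to "at the beginning of each record there are five digits in EBCDIC that tell you exactly the length of the record" and we have a (very fuzzy) match!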

Here is the URL of the documentation page: http://docs.epoline.org/ebd/info.htm
The FIRST file mentioned is the spec.

Here is the URL of the download-weekly-data page: http://ebd2.epoline.org/jsp/ebdst35.jsp

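An observation: the data that I looked at is in the ST.35 series. ST.32 is also available for download; it appears to be a parallel version containing only the SGML content (in "reduced cp437/850", one tag per line). This indicates that the fields in the fixed-length header of the ST.35 records may not be very interesting and can thus be skipped over, which would greatly simplify the transcoding task.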
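If the header fields really can be ignored, the transcoding loop shrinks to something like the following Python 3 sketch. It assumes the record layout described in this answer (252-byte header whose first 5 bytes hold the total record length as EBCDIC digits); `sgml_parts` and `make_record` are hypothetical names, and the demo records are synthetic.

```python
import io

HDRSZ = 252  # fixed header size in bytes


def sgml_parts(stream, output_encoding="latin-1"):
    """Yield the SGML part of each record, re-encoded; headers are skipped."""
    while True:
        header = stream.read(HDRSZ)
        if not header:
            return  # clean end of file
        if len(header) != HDRSZ:
            raise ValueError("truncated header")
        record_size = int(header[:5].decode("cp500"))  # EBCDIC digits
        sgml = stream.read(record_size - HDRSZ)
        if len(sgml) != record_size - HDRSZ:
            raise ValueError("truncated record")
        yield sgml.decode("cp500").encode(output_encoding, "replace")


def make_record(text):
    """Build a synthetic record for the demo only."""
    body = text.encode("cp500")
    digits = ("%05d" % (HDRSZ + len(body))).encode("cp500")
    return digits + b"\x40" * (HDRSZ - 5) + body


demo = io.BytesIO(make_record("<PATDOC>one</PATDOC>") + make_record("<PATDOC>two</PATDOC>"))
for part in sgml_parts(demo):
    print(part)
```

Reading record by record also means the binary header fields never pass through the character translation, which is where the apparent "proprietary codes" would otherwise show up.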

For what it's worth, here is my (investigatory, written after midnight) code:
[Update 2: tidied the code a little; no functionality changes]

from pprint import pprint as pp
import sys
from struct import unpack

HDRSZ = 252

T = '>s' # text
H = '>H' # binary 2 bytes
I = '>I' # binary 4 bytes
hdr_defn = [
    6, T,
    38, H,
    40, T,
    94, I,
    98, H,
    100, T,
    251, H, # length of following SGML text
    HDRSZ + 1
    ]
# above positions as per spec, reduce to allow for counting from 1
for i in xrange(0, len(hdr_defn), 2):
    hdr_defn[i] -= 1

def records(fname, output_encoding='latin1', debug=False):
    xlator=''.join(chr(i).decode('cp500').encode(output_encoding, 'replace') for i in range(256))
    # print repr(xlator)
    def xlate(ebcdic):
        return ebcdic.translate(xlator)
        # return ebcdic.decode('cp500') # use this if unicode output desired
    f = open(fname, 'rb')
    recnum = -1
    while True:
        # get header
        buff = f.read(HDRSZ)
        if not buff:
            return # EOF
        recnum += 1
        if debug: print "\nrecnum", recnum
        assert len(buff) == HDRSZ
        recsz = int(xlate(buff[:5]))
        if debug: print "recsz", recsz
        # split remainder of header into text and binary pieces
        fields = []
        for i in xrange(0, len(hdr_defn) - 2, 2):
            ty = hdr_defn[i + 1]
            piece = buff[hdr_defn[i]:hdr_defn[i+2]]
            if ty == T:
                fields.append(xlate(piece))
            else:
                fields.append(unpack(ty, piece)[0])
        if debug: pp(fields)
        sgmlsz = fields.pop()
        if debug: print "sgmlsz: %d; expected: %d - %d = %d" % (sgmlsz, recsz, HDRSZ, recsz - HDRSZ)
        assert sgmlsz == recsz - HDRSZ
        # get sgml part
        sgml = f.read(sgmlsz)
        assert len(sgml) == sgmlsz
        sgml = xlate(sgml)
        if debug: print "sgml", sgml
        yield recnum, fields, sgml

if __name__ == "__main__":
    maxrecs = int(sys.argv[1]) # dumping out the last `maxrecs` records in the file
    fname = sys.argv[2]
    keep = [None] * maxrecs
    for recnum, fields, sgml in records(fname):
        # do something useful here
        keep[recnum % maxrecs] = (recnum, fields, sgml)
    keep.sort()
    for k in keep:
        if k:
            recnum, fields, sgml = k
            print
            print recnum
            pp(fields)
            print sgml

Answer 6

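Assuming cp500 contains all of your "additional proprietary characters", here is a more concise version based on Lennart's answer, using the codecs module: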

import sys, codecs
BUFFER_SIZE = 64*1024

ebd_file = codecs.open(sys.argv[1], 'r', 'cp500')
latin1_file = codecs.open(sys.argv[2], 'w', 'latin1')

buffer = ebd_file.read(BUFFER_SIZE)
while buffer:
    latin1_file.write(buffer)
    buffer = ebd_file.read(BUFFER_SIZE)

ebd_file.close()
latin1_file.close()

Is this a sensible approach for an EBCDIC (CP500) to Latin-1 converter?
