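Converting hexadecimal representation to ASCII

Asked: 2014-05-09 11:54:36

Tags: python pdf

I am trying to replace the hexadecimal representation (#..) with its ASCII representation in a PDF file: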

import re
with open("C:\\Users\\Suleiman JK\\Desktop\\test\\hello-world-malformed.pdf","rb") as file1:
    stuff = file1.read()
stuff = re.sub("#([0-9A-Fa-f]{2})",lambda m:unichr(int(m.groups()[0],16)),stuff)
with open("C:\\Users\\Suleiman JK\\Desktop\\test\\hello-world-malformed.pdf","wb") as file1:
    file1.write(stuff)
file1 = open("C:\\Users\\Suleiman JK\\Desktop\\test\\hello-world-malformed.pdf")
print file1.read()

When I run it with "Geany", it gives the following error:

Traceback (most recent call last):
  File "testing.py", line 41, in <module>
    main()
  File "testing.py", line 31, in main
    stuff = re.sub("#([0-9A-Fa-f]{2})",lambda m:unichr(int(m.groups()[0],16)),stuff)
  File "C:\Python27\lib\re.py", line 151, in sub
    return _compile(pattern, flags).sub(repl, string, count)
UnicodeDecodeError: 'ascii' codec can't decode byte 0x84 in position 239: ordinal not in range(128)

1 answer:

Answer 0 (score: 0)

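Don't use unichr(); it produces a unicode string containing a single character. Don't mix unicode strings and byte strings (binary data), because doing so triggers implicit encoding or decoding. Here an implicit decode is triggered, and it fails.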
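To see the failing step in isolation: when a byte string meets a unicode string in Python 2, the bytes are decoded with the default codec, which is ASCII unless it has been changed. The sketch below performs that decode explicitly on made-up sample data containing the byte 0x84 named in the traceback. The variable names and the sample bytes are assumptions for illustration, not taken from the question's file.

```python
# Sample bytes are an assumption for illustration; 0x84 is the byte named in the traceback.
data = b"obj \x84 endobj"

try:
    # Python 2 performs this decode implicitly when bytes are mixed with unicode.
    data.decode("ascii")
    error = None
except UnicodeDecodeError as exc:
    error = exc

print(error)
```

Any byte above 0x7F fails the same way, which is why binary PDF content cannot safely pass through this implicit step.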

Your code points are limited to 0-255, so a plain chr() will do:

stuff = re.sub("#([0-9A-Fa-f]{2})", lambda m: chr(int(m.group(1), 16)), stuff)
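As a hedged sketch of the same substitution kept entirely in bytes: the helper name decode_hex_escapes and the sample data below are hypothetical. binascii.unhexlify is used here in place of chr() so that the snippet also behaves the same on Python 3, where chr() returns a text string rather than a byte.

```python
import binascii
import re

# Matches "#" followed by exactly two hex digits. The pattern is a byte string,
# so the substitution never mixes bytes with unicode.
HEX_ESCAPE = re.compile(br"#([0-9A-Fa-f]{2})")


def decode_hex_escapes(data):
    """Replace every #XX escape in a byte string with the single byte 0xXX."""
    # unhexlify turns the two captured hex digits into one raw byte.
    return HEX_ESCAPE.sub(lambda m: binascii.unhexlify(m.group(1)), data)


# Hypothetical sample: one valid escape, one raw high byte, one non-escape.
sample = b"/Hello#20World \x84 #zz"
result = decode_hex_escapes(sample)
print(repr(result))
```

The raw 0x84 byte passes through untouched, and "#zz" is left alone because it is not followed by two hex digits.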