Question

My code works for some PDFs, but for others it fails with this error:

Traceback (most recent call last):
  File "con.py", line 24, in <module>
    print getPDFContent("abc.pdf")
  File "con.py", line 17, in getPDFContent
    f.write(a)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u02dd' in position 64: ordinal not in range(128)

My code is:

import pyPdf

def getPDFContent(path):

    content = ""

    pdf = pyPdf.PdfFileReader(file(path, "rb"))

    for i in range(0, pdf.getNumPages()):
        f=open("xxx.txt",'a')
        content= pdf.getPage(i).extractText() + "\n"
        import string
        c=content.split()
        for a in c:
            f.write(" ")
            f.write(a)
        f.write('\n')
        f.close()

    return content

print getPDFContent("abc.pdf")

Answer 1

Try:

import sys
print getPDFContent("abc.pdf").encode(sys.getfilesystemencoding())

Answer 2

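The problem is that when you call f.write() with a unicode string, Python 2 tries to encode it with the ascii codec before writing. Your PDF contains characters that the ascii codec cannot represent, such as the u'\u02dd' named in the traceback. Try encoding the string explicitly, for example: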

a = a.encode('utf-8')
f.write(a)
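
As an alternative to encoding each word by hand, you could let the file object do the encoding. The following is only a minimal sketch, not a drop-in replacement for your function: append_words is a hypothetical helper name, sample is a made-up stand-in for whatever extractText() returns, and the output goes to a temporary directory rather than your real xxx.txt. It relies on io.open, which accepts an encoding argument on both Python 2.6+ and Python 3.

```python
import io
import os
import tempfile


def append_words(words, out_path):
    """Append the words, space-separated, as one UTF-8 line (hypothetical helper)."""
    # The file object encodes to UTF-8 itself, so the implicit ascii
    # encoding step that raised the error never happens.
    with io.open(out_path, "a", encoding="utf-8") as f:
        f.write(u" ".join(words) + u"\n")


# Made-up stand-in for extracted PDF text; u'\u02dd' is the character
# reported in the traceback.
sample = u"caf\u00e9 \u02dd plain"

# Reproduce the underlying failure: the ascii codec cannot represent
# these characters.
try:
    sample.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Write to a temporary location (assumption: any writable path works).
out_path = os.path.join(tempfile.mkdtemp(), "xxx.txt")
append_words(sample.split(), out_path)

# Read the file back with the same encoding to check the round trip.
with io.open(out_path, "r", encoding="utf-8") as f:
    round_trip = f.read()
```

With this approach the file would be opened once, before the page loop, and every page's words passed to the same handle, instead of reopening xxx.txt for each page.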

Converting a PDF to a text file in Python