My code works for some PDFs, but for others it fails with this error:
Traceback (most recent call last):
  File "con.py", line 24, in <module>
    print getPDFContent("abc.pdf")
  File "con.py", line 17, in getPDFContent
    f.write(a)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u02dd' in position 64: ordinal not in range(128)
My code is:
import pyPdf

def getPDFContent(path):
    content = ""
    pdf = pyPdf.PdfFileReader(file(path, "rb"))
    for i in range(0, pdf.getNumPages()):
        f = open("xxx.txt", 'a')
        content = pdf.getPage(i).extractText() + "\n"
        import string
        c = content.split()
        for a in c:
            f.write(" ")
            f.write(a)
        f.write('\n')
        f.close()
    return content

print getPDFContent("abc.pdf")
Answer 0 (score: 0)
Try:
import sys
print getPDFContent("abc.pdf").encode(sys.getfilesystemencoding())
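Answer 1 (score: 0)
Your problem is that when you call f.write() with a unicode string, it tries to encode it using the ascii codec. Your PDF contains characters that the ascii codec cannot represent. Try encoding the string explicitly, e.g.: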
a = a.encode('utf-8')
f.write(a)
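
If you would rather not call encode() on every word, another option is to open the output file with an explicit encoding and let the file object do the encoding. This is only a rough sketch, not a drop-in replacement: append_words and out_path are names I made up for illustration, and it assumes the extracted text is already a unicode string (which the u'\u02dd' in the traceback suggests). io.open is in the standard library and behaves the same on Python 2.6+ and Python 3.

```python
import io


def append_words(out_path, text):
    """Append the whitespace-separated words of `text` to `out_path` as UTF-8.

    Hypothetical helper: `text` is assumed to be a unicode string.
    """
    # io.open accepts an encoding, so the file object encodes each unicode
    # string as UTF-8 itself instead of falling back to the ascii codec.
    with io.open(out_path, "a", encoding="utf-8") as f:
        # Collapse any run of whitespace into a single space between words.
        f.write(u" ".join(text.split()))
        f.write(u"\n")
```

In the question's loop you would call something like append_words("xxx.txt", pdf.getPage(i).extractText()) once per page, in place of the manual open/write/close. Unlike the original loop, this does not write a leading space before the first word.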