I get the error below when parsing a PDF file with PyPDF2. I have attached the PDF I am trying to parse along with the error.
Can anyone help?
import PyPDF2

def convert(data):
    pdfName = data
    read_pdf = PyPDF2.PdfFileReader(pdfName)
    page = read_pdf.getPage(0)
    page_content = page.extractText()
    print(page_content)
    return page_content
Error:
PyPDF2.utils.PdfReadError: Expected object ID (8 0) does not match actual (7 0); xref table not zero-indexed.
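Answer 0 (score: 0)
There are several open-source OCR tools, such as tesseract or openCV.
If you want to use tesseract, for example, there is a Python wrapper library for it called pytesseract.
Most OCR tools work on images, so you first have to convert the PDF to an image file format such as PNG or JPG. After that, you can load the image and process it with pytesseract.
Here is some sample code using pytesseract, assuming you have already converted the PDF to an image named pdfName.png: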
from PIL import Image
import pytesseract

def ocr_core(filename):
    """
    This function will handle the core OCR processing of images.
    """
    # Use Pillow's Image class to open the image and pytesseract to detect the text in it
    text = pytesseract.image_to_string(Image.open(filename))
    return text

print(ocr_core('pdfName.png'))
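The conversion step described above (PDF to PNG before OCR) could be sketched roughly as follows. This is only a sketch under assumptions: it relies on the third-party pdf2image package and its Poppler dependency, neither of which the answer mentions, and the helper names `page_image_name` and `pdf_to_pngs` are hypothetical.

```python
from pathlib import Path


def page_image_name(pdf_path, page_number):
    """Build an output name such as 'pdfName-1.png' for a 1-based page number."""
    return f"{Path(pdf_path).stem}-{page_number}.png"


def pdf_to_pngs(pdf_path, dpi=300):
    """Render every page of a PDF to a PNG in the current directory.

    Assumption: pdf2image (a wrapper around Poppler) is installed.
    Returns the list of file names that were written.
    """
    # Imported lazily so the naming helper works even without pdf2image.
    from pdf2image import convert_from_path

    names = []
    # convert_from_path returns one PIL image per page of the PDF.
    for number, image in enumerate(convert_from_path(pdf_path, dpi=dpi), start=1):
        name = page_image_name(pdf_path, number)
        image.save(name, "PNG")
        names.append(name)
    return names
```

Each returned file name can then be passed to `ocr_core`, for example `ocr_core(pdf_to_pngs('pdfName.pdf')[0])` for the first page.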
Answer 1 (score: 0)
If online OCR is an option, you can use the free OCR.space API to create a searchable PDF (a "sandwich" PDF).
Python code that uses the API:
import requests

def ocr_space_file(filename, overlay=False, api_key='helloworld', language='eng'):
    """ OCR.space API request with local file.
        Python3.5 - not tested on 2.7
    :param filename: Your file path & name.
    :param overlay: Is OCR.space overlay required in your response.
                    Defaults to False.
    :param api_key: OCR.space API key.
                    Defaults to 'helloworld'.
    :param language: Language code to be used in OCR.
                    List of available language codes can be found on https://ocr.space/OCRAPI
                    Defaults to 'eng'.
    :return: Result in JSON format.
    """
    payload = {'isOverlayRequired': overlay,
               'apikey': api_key,
               'language': language,
               }
    # Upload the local file as multipart form data
    with open(filename, 'rb') as f:
        r = requests.post('https://api.ocr.space/parse/image',
                          files={filename: f},
                          data=payload,
                          )
    return r.content.decode()

def ocr_space_url(url, overlay=False, api_key='helloworld', language='eng'):
    """ OCR.space API request with remote file.
        Python3.5 - not tested on 2.7
    :param url: Image url.
    :param overlay: Is OCR.space overlay required in your response.
                    Defaults to False.
    :param api_key: OCR.space API key.
                    Defaults to 'helloworld'.
    :param language: Language code to be used in OCR.
                    List of available language codes can be found on https://ocr.space/OCRAPI
                    Defaults to 'eng'.
    :return: Result in JSON format.
    """
    payload = {'url': url,
               'isOverlayRequired': overlay,
               'apikey': api_key,
               'language': language,
               }
    r = requests.post('https://api.ocr.space/parse/image',
                      data=payload,
                      )
    return r.content.decode()

# Use examples:
test_file = ocr_space_file(filename='example_image.png', language='pol')
test_url = ocr_space_url(url='http://i.imgur.com/31L5y.jpg')
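Both functions return the raw JSON response as a string, so the recognised text still has to be pulled out of it. A minimal sketch of that step is below. The helper name `extract_parsed_text` is hypothetical, and the field names `IsErroredOnProcessing`, `ErrorMessage`, `ParsedResults`, and `ParsedText` are assumed from the OCR.space response format; check them against the API documentation before relying on them.

```python
import json


def extract_parsed_text(response_text):
    """Return the recognised text from an OCR.space JSON response string."""
    result = json.loads(response_text)
    # Assumed response fields: IsErroredOnProcessing, ErrorMessage,
    # ParsedResults (one entry per page), each with a ParsedText string.
    if result.get("IsErroredOnProcessing"):
        raise RuntimeError(f"OCR failed: {result.get('ErrorMessage')}")
    pages = result.get("ParsedResults") or []
    # Join the text of all pages, one page per line group.
    return "\n".join(page.get("ParsedText", "") for page in pages)
```

With the examples above, this would be called as `extract_parsed_text(test_file)` or `extract_parsed_text(test_url)`.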