Question

I get this error when parsing a PDF file with PyPDF2. I have attached the PDF to be parsed along with the error.

Can anyone help?

import PyPDF2


def convert(data):
    pdfName = data
    read_pdf = PyPDF2.PdfFileReader(pdfName)
    page = read_pdf.getPage(0)  # first page only
    page_content = page.extractText()
    print(page_content)
    return page_content

Error:

PyPDF2.utils.PdfReadError: Expected object ID (8 0) does not match actual (7 0); xref table not zero-indexed.

Answer 1

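There are several open-source OCR tools, such as Tesseract or OpenCV.

If you want to use Tesseract, for example, there is a Python wrapper library called pytesseract.

Most OCR tools work on images, so you first have to convert the PDF to an image file format such as PNG or JPG. After that, you can load the image and process it with pytesseract.

Here is some sample code showing how you could use pytesseract, assuming you have already converted the PDF to an image named pdfName.png: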

from PIL import Image
import pytesseract


def ocr_core(filename):
    """
    This function will handle the core OCR processing of images.
    """
    # Use Pillow's Image class to open the image and pytesseract to detect the text in it
    text = pytesseract.image_to_string(Image.open(filename))
    return text

print(ocr_core('pdfName.png'))
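The sample above assumes the PDF has already been turned into an image. One possible way to do that step in Python is the pdf2image package, which is not mentioned in the original answer and is only an assumption here (it needs Poppler installed, and pytesseract needs the Tesseract binary). The sketch below renders every page and OCRs it; the function name `pdf_to_text` and its `render` / `recognize` parameters are hypothetical and exist only so the two steps can be swapped out:

```python
def pdf_to_text(pdf_path, render=None, recognize=None, dpi=300):
    """OCR every page of a PDF and return the page texts joined by newlines.

    render:    callable taking a path and returning a list of page images
               (assumption: defaults to pdf2image.convert_from_path).
    recognize: callable taking one page image and returning its text
               (defaults to pytesseract.image_to_string).
    """
    if render is None:
        # Imported lazily so the function can be used with another renderer.
        from pdf2image import convert_from_path

        def render(path):
            return convert_from_path(path, dpi=dpi)

    if recognize is None:
        import pytesseract
        recognize = pytesseract.image_to_string

    pages = render(pdf_path)  # one PIL image per PDF page
    return "\n".join(recognize(page) for page in pages)
```

With both packages installed, `print(pdf_to_text('pdfName.pdf'))` would play the same role as the `ocr_core` call above, without saving an intermediate PNG.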

Answer 2

If using an online OCR service is an option, you can use the free OCR API to create a searchable PDF (sandwich PDF).

Python code that uses the API:

import requests


def ocr_space_file(filename, overlay=False, api_key='helloworld', language='eng'):
    """ OCR.space API request with local file.
        Python3.5 - not tested on 2.7
    :param filename: Your file path & name.
    :param overlay: Is OCR.space overlay required in your response.
                    Defaults to False.
    :param api_key: OCR.space API key.
                    Defaults to 'helloworld'.
    :param language: Language code to be used in OCR.
                    List of available language codes can be found on https://ocr.space/OCRAPI
                    Defaults to 'eng'.
    :return: Result in JSON format.
    """

    payload = {'isOverlayRequired': overlay,
               'apikey': api_key,
               'language': language,
               }
    # Upload the local file as multipart form data
    with open(filename, 'rb') as f:
        r = requests.post('https://api.ocr.space/parse/image',
                          files={filename: f},
                          data=payload,
                          )
    return r.content.decode()


def ocr_space_url(url, overlay=False, api_key='helloworld', language='eng'):
    """ OCR.space API request with remote file.
        Python3.5 - not tested on 2.7
    :param url: Image url.
    :param overlay: Is OCR.space overlay required in your response.
                    Defaults to False.
    :param api_key: OCR.space API key.
                    Defaults to 'helloworld'.
    :param language: Language code to be used in OCR.
                    List of available language codes can be found on https://ocr.space/OCRAPI
                    Defaults to 'eng'.
    :return: Result in JSON format.
    """

    payload = {'url': url,
               'isOverlayRequired': overlay,
               'apikey': api_key,
               'language': language,
               }
    r = requests.post('https://api.ocr.space/parse/image',
                      data=payload,
                      )
    return r.content.decode()


# Use examples:
test_file = ocr_space_file(filename='example_image.png', language='pol')
test_url = ocr_space_url(url='http://i.imgur.com/31L5y.jpg')
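
Both functions return the raw JSON response as a string, so the recognized text still has to be pulled out of it. A minimal sketch of that step follows. The field names `ParsedResults`, `ParsedText`, `IsErroredOnProcessing` and `ErrorMessage` are assumptions based on the OCR.space response format and should be checked against the API documentation; the helper name `extract_text` is hypothetical:

```python
import json


def extract_text(response_json):
    """Return the recognized text from an OCR.space JSON response string."""
    result = json.loads(response_json)

    # Assumed error flag: set by the API when processing failed.
    if result.get("IsErroredOnProcessing"):
        raise RuntimeError(result.get("ErrorMessage"))

    # Assumed layout: one entry per page, each carrying its own ParsedText.
    pages = result.get("ParsedResults") or []
    return "\n".join(page.get("ParsedText", "") for page in pages)
```

For example, `print(extract_text(test_file))` would print the text recognized in example_image.png, provided the request succeeded.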

Unable to convert PDF to text format
