Unable to convert PDF to text format

Time: 2019-04-13 19:20:01

Tags: python python-3.x python-2.7 pdf-parsing

I get the error below when parsing a PDF file with PyPDF2. I have attached the PDF to be parsed along with the error.

Can anyone help?

import PyPDF2


def convert(data):
    pdfName = data
    read_pdf = PyPDF2.PdfFileReader(pdfName)
    page = read_pdf.getPage(0)
    page_content = page.extractText()
    print(page_content)
    return page_content

Error:

PyPDF2.utils.PdfReadError: Expected object ID (8 0) does not match actual (7 0); xref table not zero-indexed.

2 answers:

Answer 0 (score: 0)

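There are several open-source OCR tools, such as Tesseract and OpenCV.

If you want to use Tesseract, for example, there is a Python wrapper library called pytesseract.

Most OCR tools work on images, so you first have to convert the PDF to an image file format such as PNG or JPG. After that, you can load the image and process it with pytesseract.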
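The conversion step is not shown in this answer. As a minimal sketch, assuming the third-party pdf2image package (which itself relies on a Poppler installation) is available, it might look like the following. The helper names `page_image_path` and `pdf_to_png` and the `<name>-<page>.png` naming scheme are made up for illustration.

```python
from pathlib import Path


def page_image_path(pdf_path, page_number, out_dir="."):
    """Build an output path such as 'pdfName-1.png' (hypothetical naming scheme)."""
    stem = Path(pdf_path).stem
    return Path(out_dir) / f"{stem}-{page_number}.png"


def pdf_to_png(pdf_path, out_dir=".", dpi=300):
    """Render every page of the PDF to a PNG file and return the file paths."""
    # Imported lazily so the naming helper above works without pdf2image.
    # Assumption: pdf2image and Poppler are installed on this machine.
    from pdf2image import convert_from_path

    paths = []
    # convert_from_path returns one Pillow image per page.
    for number, page in enumerate(convert_from_path(pdf_path, dpi=dpi), start=1):
        target = page_image_path(pdf_path, number, out_dir)
        page.save(target, "PNG")
        paths.append(target)
    return paths
```

Each returned path could then be handed to an OCR function that opens the image with Pillow. A higher `dpi` usually helps recognition quality at the cost of larger images.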

Here is some sample code that uses pytesseract, assuming you have already converted the PDF to an image named pdfName.png:

from PIL import Image
import pytesseract

def ocr_core(filename):
    """
    This function will handle the core OCR processing of images.
    """
    # We'll use Pillow's Image class to open the image and pytesseract to detect the string in the image
    text = pytesseract.image_to_string(Image.open(filename))
    return text

print(ocr_core('pdfName.png'))

Answer 1 (score: 0)

If online OCR is an option, you can use the free OCR API to create a searchable PDF (sandwich PDF).

Python code that uses the API:

import requests


def ocr_space_file(filename, overlay=False, api_key='helloworld', language='eng'):
    """ OCR.space API request with local file.
        Python3.5 - not tested on 2.7
    :param filename: Your file path & name.
    :param overlay: Is OCR.space overlay required in your response.
                    Defaults to False.
    :param api_key: OCR.space API key.
                    Defaults to 'helloworld'.
    :param language: Language code to be used in OCR.
                    List of available language codes can be found on https://ocr.space/OCRAPI
                    Defaults to 'eng'.
    :return: Result in JSON format.
    """

    payload = {'isOverlayRequired': overlay,
               'apikey': api_key,
               'language': language,
               }
    with open(filename, 'rb') as f:
        r = requests.post('https://api.ocr.space/parse/image',
                          files={filename: f},
                          data=payload,
                          )
    return r.content.decode()


def ocr_space_url(url, overlay=False, api_key='helloworld', language='eng'):
    """ OCR.space API request with remote file.
        Python3.5 - not tested on 2.7
    :param url: Image url.
    :param overlay: Is OCR.space overlay required in your response.
                    Defaults to False.
    :param api_key: OCR.space API key.
                    Defaults to 'helloworld'.
    :param language: Language code to be used in OCR.
                    List of available language codes can be found on https://ocr.space/OCRAPI
                    Defaults to 'eng'.
    :return: Result in JSON format.
    """

    payload = {'url': url,
               'isOverlayRequired': overlay,
               'apikey': api_key,
               'language': language,
               }
    r = requests.post('https://api.ocr.space/parse/image',
                      data=payload,
                      )
    return r.content.decode()


# Use examples:
test_file = ocr_space_file(filename='example_image.png', language='pol')
test_url = ocr_space_url(url='http://i.imgur.com/31L5y.jpg')
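
Both functions return the response as a JSON string, so the recognized text still has to be pulled out of it. The sketch below shows one way to do that. It assumes the response carries a top-level `IsErroredOnProcessing` flag and a `ParsedResults` list whose items hold a `ParsedText` field; check the API documentation for the exact shape. The function name `extract_text` and the sample string are made up for illustration and are not real API output.

```python
import json


def extract_text(response_json):
    """Return the recognized text from an OCR.space-style JSON response string."""
    result = json.loads(response_json)
    # Assumed field: the API flags failed requests at the top level.
    if result.get("IsErroredOnProcessing"):
        raise RuntimeError(f"OCR failed: {result.get('ErrorMessage')}")
    # Assumed field: one entry per processed page, each with its text.
    pages = result.get("ParsedResults") or []
    return "\n".join(page.get("ParsedText", "") for page in pages)


# Hand-written sample in the assumed response shape, not real API output.
sample = (
    '{"IsErroredOnProcessing": false, '
    '"ParsedResults": [{"ParsedText": "Hello"}, {"ParsedText": "World"}]}'
)
print(extract_text(sample))
```

With the functions above, this could be used as `extract_text(test_file)` or `extract_text(test_url)`.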