How to extract text from a PDF file?

Asked: 2016-01-17 11:16:53

Tags: python pdf

I'm trying to extract the text contained in this PDF file using Python.

I'm using the PyPDF2 module, and have the following script:

import PyPDF2
pdf_file = open('sample.pdf')
read_pdf = PyPDF2.PdfFileReader(pdf_file)
number_of_pages = read_pdf.getNumPages()
page = read_pdf.getPage(0)
page_content = page.extractText()
print page_content

When I run the code, I get the following output, which is different from the text contained in the PDF document:

!"#$%#$%&%$&'()*%+,-%./01'*23%4
5'%1$#26%3/%7/))/8%&)/26%8#3"%3"*%313/9#&)
%

How do I extract the text from the PDF document?

28 Answers:

Answer 0 (score: 73)

I was looking for a simple solution to use with Python 3.x on Windows. There doesn't seem to be support from textract, which is unfortunate, but if you are looking for a simple solution for Windows/Python 3, check out the tika package; it's really straightforward for reading PDFs:

from tika import parser

raw = parser.from_file('sample.pdf')
print(raw['content'])

Answer 1 (score: 40)

Use textract.

It supports many types of files, including PDFs:

import textract

text = textract.process("path/to/file.pdf")

Answer 2 (score: 38)

Take a look at this code:

import PyPDF2
pdf_file = open('sample.pdf', 'rb')
read_pdf = PyPDF2.PdfFileReader(pdf_file)
number_of_pages = read_pdf.getNumPages()
page = read_pdf.getPage(0)
page_content = page.extractText()
print page_content.encode('utf-8')

The output is:

!"#$%#$%&%$&'()*%+,-%./01'*23%4
5'%1$#26%3/%7/))/8%&)/26%8#3"%3"*%313/9#&)
%

Using the same code to read a PDF from 201308FCR.pdf, the output is normal.

The documentation explains why:

def extractText(self):
    """
    Locate all text drawing commands, in the order they are provided in the
    content stream, and extract the text.  This works well for some PDF
    files, but poorly for others, depending on the generator used.  This will
    be refined in the future.  Do not rely on the order of text coming out of
    this function, as it will change if this function is made more
    sophisticated.
    :return: a unicode string object.
    """

Answer 3 (score: 16)

After trying textract (which seemed to have too many dependencies), pypdf2 (which could not extract text from the PDFs I tested) and tika (which was too slow), I ended up using pdftotext from xpdf (as already suggested in another answer) and calling the binary directly from Python (you may need to adapt the path to pdftotext):

import os, subprocess
SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))
args = ["/usr/local/bin/pdftotext",
        '-enc',
        'UTF-8',
        "{}/my-pdf.pdf".format(SCRIPT_DIR),
        '-']
res = subprocess.run(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
output = res.stdout.decode('utf-8')

pdftotext does basically the same thing, but this assumes pdftotext is in /usr/local/bin, whereas I am using this in AWS Lambda and wanted to use it from the current directory.

By the way: to use this on Lambda, you need to put the binary and the dependency libstdc++.so into your Lambda function. I personally needed to compile xpdf. Since instructions for that would blow up this answer, I put them on my personal blog.

Answer 4 (score: 9)

You may want to use the time-proven xPDF and derived tools to extract text instead, as pyPDF2 still seems to have various issues with its text extraction.

The long answer is that there are lots of variations in how text is encoded inside a PDF: it may require decoding the PDF strings themselves, then mapping them with CMAPs, then analyzing the distances between words and letters, and so on.

If the PDF is damaged (i.e., it displays the correct text but copying it produces garbage) and you really need to extract the text, you may want to consider converting the PDF into an image (using ImageMagick) and then using Tesseract to get the text from the image via OCR.

Answer 5 (score: 7)

I recommend using pymupdf or pdfminer.six.

These packages are unmaintained:

  • PyPDF2,PyPDF3,PyPDF4
  • pdfminer (without .six)

How to read plain text with pymupdf

There are different options that will deliver different results, but the most basic one is:

import fitz  # this is pymupdf

with fitz.open("my.pdf") as doc:
    text = ""
    for page in doc:
        text += page.getText()

print(text)

Answer 6 (score: 5)

The code below is a solution to the question in Python 3. Before running it, make sure the PyPDF2 library is installed in your environment. If it is not, open a command prompt and run the following command:

pip3 install PyPDF2

Solution code:

import PyPDF2
pdfFileObject = open('sample.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObject)
count = pdfReader.numPages
for i in range(count):
    page = pdfReader.getPage(i)
    print(page.extractText())

Answer 7 (score: 5)

You can extract a multi-page PDF to text in one go, rather than passing each page number individually, with the code below:

import PyPDF2
pdf_file = open('samples.pdf', 'rb')
read_pdf = PyPDF2.PdfFileReader(pdf_file)
number_of_pages = read_pdf.getNumPages()
for i in range(number_of_pages):
   page = read_pdf.getPage(i)
   page_content = page.extractText()
   print page_content.encode('utf-8')

Answer 8 (score: 4)

In 2020, the solutions above were not working for the particular PDF I was working with. Below is what did the trick. I am on Windows 10 and Python 3.8.

Test PDF file: https://drive.google.com/file/d/1aUfQAlvq5hA9kz2c9CyJADiY3KpY3-Vn/view?usp=sharing

#pip install pdfminer.six
import io

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage


def convert_pdf_to_txt(path):
    '''Convert pdf content from a file path to text

    :path the file path
    '''
    rsrcmgr = PDFResourceManager()
    codec = 'utf-8'
    laparams = LAParams()

    with io.StringIO() as retstr:
        with TextConverter(rsrcmgr, retstr, codec=codec,
                           laparams=laparams) as device:
            with open(path, 'rb') as fp:
                interpreter = PDFPageInterpreter(rsrcmgr, device)
                password = ""
                maxpages = 0
                caching = True
                pagenos = set()

                for page in PDFPage.get_pages(fp,
                                              pagenos,
                                              maxpages=maxpages,
                                              password=password,
                                              caching=caching,
                                              check_extractable=True):
                    interpreter.process_page(page)

                return retstr.getvalue()


if __name__ == "__main__":
    print(convert_pdf_to_txt('C:\\Path\\To\\Test_PDF.pdf')) 

Answer 9 (score: 4)

I have a better workaround than OCR, and it maintains page alignment while extracting text from a PDF. It should be helpful:

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO

def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = open(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos=set()


    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,caching=caching, check_extractable=True):
        interpreter.process_page(page)


    text = retstr.getvalue()

    fp.close()
    device.close()
    retstr.close()
    return text

text= convert_pdf_to_txt('test.pdf')
print(text)

Answer 10 (score: 3)

You can use pdftotext: https://github.com/jalan/pdftotext

pdftotext preserves the text's formatting and indentation; it doesn't matter whether you have tables.

Answer 11 (score: 3)

In some cases PyPDF2 ignores whitespace and turns the resulting text into a mess, but I use PyMuPDF and I'm really satisfied with it. You can use this link for more information.

Answer 12 (score: 2)

pdftotext is the best and simplest one! pdftotext also preserves the structure.

I tried PyPDF2, PDFMiner and a few others, but none of them gave satisfactory results.

Answer 13 (score: 2)

Use pdfminer.six. Here is the documentation: https://pdfminersix.readthedocs.io/en/latest/index.html

To convert a PDF to text:

    def pdf_to_text():
        from pdfminer.high_level import extract_text

        text = extract_text('test.pdf')
        print(text)

Answer 14 (score: 2)

If you want to extract text from tables, I've found tabula easy to implement, accurate, and fast:

To get a pandas DataFrame:

import tabula

df = tabula.read_pdf('your.pdf')

df

By default, it ignores page content outside of the table. So far I've only tested it on single-page, single-table files, but there are kwargs to accommodate multiple pages and/or multiple tables.

Install it via:

pip install tabula-py
# or
conda install -c conda-forge tabula-py 

Regarding straight text extraction, see: https://stackoverflow.com/a/63190886/9249533

Answer 15 (score: 2)

I found a solution at PDFLayoutTextStripper.

It's good because it can keep the layout of the original PDF.

It's written in Java, but I have added a gateway to support Python.

Sample code:

from py4j.java_gateway import JavaGateway

gw = JavaGateway()
result = gw.entry_point.strip('samples/bus.pdf')

# result is a dict of {
#   'success': 'true' or 'false',
#   'payload': pdf file content if 'success' is 'true'
#   'error': error message if 'success' is 'false'
# }

print result['payload']

Sample output from PDFLayoutTextStripper: (screenshot omitted)

You can see more details here: Stripper with Python

Answer 16 (score: 2)

Here is the simplest code for extracting text.

Code:

# importing required modules
import PyPDF2

# creating a pdf file object
pdfFileObj = open('filename.pdf', 'rb')

# creating a pdf reader object
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

# printing number of pages in pdf file
print(pdfReader.numPages)

# creating a page object
pageObj = pdfReader.getPage(5)

# extracting text from page
print(pageObj.extractText())

# closing the pdf file object
pdfFileObj.close()

Answer 17 (score: 1)

To extract text from a PDF, use the following code:

import PyPDF2
pdfFileObj = open('mypdf.pdf', 'rb')

pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

print(pdfReader.numPages)

pageObj = pdfReader.getPage(0)

a = pageObj.extractText()

print(a)

Answer 18 (score: 1)

I have tried many Python PDF converters, and Tika is the best.

from tika import parser

raw = parser.from_file("///Users/Documents/Textos/Texto1.pdf")
raw = str(raw)

safe_text = raw.encode('utf-8', errors='ignore')

safe_text = str(safe_text).replace("\n", "").replace("\\", "")
print('--- safe text ---' )
print( safe_text )

Answer 19 (score: 1)

A more robust way, which works whether you have many PDFs or just one!

import os
from PyPDF2 import PdfFileReader

mydir = # specify path to your directory where PDF or PDF's are

for arch in os.listdir(mydir):
    archpath = os.path.join(mydir, arch)
    # open each PDF in binary mode
    with open(archpath, 'rb') as pdfFileObj:
        pdfReader = PdfFileReader(pdfFileObj)
        pageObj = pdfReader.getPage(0)
        ley = pageObj.extractText()
        # append so text from earlier files is not overwritten
        with open("myfile.txt", "a") as file1:
            file1.writelines(ley)

Answer 20 (score: 1)

You can download tika-app-xxx.jar (latest) from here.

Then put this .jar file in the same folder as your Python script file.

Then insert the following code into the script:

import os
import os.path

tika_dir=os.path.join(os.path.dirname(__file__),'<tika-app-xxx>.jar')

def extract_pdf(source_pdf:str,target_txt:str):
    os.system('java -jar '+tika_dir+' -t {} > {}'.format(source_pdf,target_txt))

Advantages of this method:

Fewer dependencies: a single .jar file is easier to manage than a Python package.

Multi-format support: source_pdf can be the path of any type of document (.doc, .html, .odt, etc.).

Up to date: tika-app.jar is always released earlier than the corresponding version of the tika Python package.

Stable: it is far more stable and better maintained than PyPDF (it is powered by Apache).

Disadvantages:

A headless JRE is required.

Answer 21 (score: 1)

I am adding code to accomplish this task; it works fine for me:

# This works in python 3
# required python packages
# tabula-py==1.0.0
# PyPDF2==1.26.0
# Pillow==4.0.0
# pdfminer.six==20170720

import os
import shutil
import struct
import warnings
from io import StringIO

import requests
import tabula
from PIL import Image
from PyPDF2 import PdfFileWriter, PdfFileReader
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage

warnings.filterwarnings("ignore")


def download_file(url):
    local_filename = url.split('/')[-1]
    local_filename = local_filename.replace("%20", "_")
    r = requests.get(url, stream=True)
    print(r)
    with open(local_filename, 'wb') as f:
        shutil.copyfileobj(r.raw, f)

    return local_filename


class PDFExtractor():
    def __init__(self, url):
        self.url = url

    # Downloading File in local
    def break_pdf(self, filename, start_page=-1, end_page=-1):
        pdf_reader = PdfFileReader(open(filename, "rb"))
        # Reading each pdf one by one
        total_pages = pdf_reader.numPages
        if start_page == -1:
            start_page = 0
        elif start_page < 1 or start_page > total_pages:
            return "Start Page Selection Is Wrong"
        else:
            start_page = start_page - 1

        if end_page == -1:
            end_page = total_pages
        elif end_page < 1 or end_page > total_pages - 1:
            return "End Page Selection Is Wrong"
        else:
            end_page = end_page

        for i in range(start_page, end_page):
            output = PdfFileWriter()
            output.addPage(pdf_reader.getPage(i))
            with open(str(i + 1) + "_" + filename, "wb") as outputStream:
                output.write(outputStream)

    def extract_text_algo_1(self, file):
        pdf_reader = PdfFileReader(open(file, 'rb'))
        # creating a page object
        pageObj = pdf_reader.getPage(0)

        # extracting extract_text from page
        text = pageObj.extractText()
        text = text.replace("\n", "").replace("\t", "")
        return text

    def extract_text_algo_2(self, file):
        pdfResourceManager = PDFResourceManager()
        retstr = StringIO()
        la_params = LAParams()
        device = TextConverter(pdfResourceManager, retstr, codec='utf-8', laparams=la_params)
        fp = open(file, 'rb')
        interpreter = PDFPageInterpreter(pdfResourceManager, device)
        password = ""
        max_pages = 0
        caching = True
        page_num = set()

        for page in PDFPage.get_pages(fp, page_num, maxpages=max_pages, password=password, caching=caching,
                                      check_extractable=True):
            interpreter.process_page(page)

        text = retstr.getvalue()
        text = text.replace("\t", "").replace("\n", "")

        fp.close()
        device.close()
        retstr.close()
        return text

    def extract_text(self, file):
        text1 = self.extract_text_algo_1(file)
        text2 = self.extract_text_algo_2(file)

        if len(text2) > len(str(text1)):
            return text2
        else:
            return text1

    def extarct_table(self, file):

        # Read pdf into DataFrame
        try:
            df = tabula.read_pdf(file, output_format="csv")
        except:
            print("Error Reading Table")
            return

        print("\nPrinting Table Content: \n", df)
        print("\nDone Printing Table Content\n")

    def tiff_header_for_CCITT(self, width, height, img_size, CCITT_group=4):
        tiff_header_struct = '<' + '2s' + 'h' + 'l' + 'h' + 'hhll' * 8 + 'h'
        return struct.pack(tiff_header_struct,
                           b'II',  # Byte order indication: Little endian
                           42,  # Version number (always 42)
                           8,  # Offset to first IFD
                           8,  # Number of tags in IFD
                           256, 4, 1, width,  # ImageWidth, LONG, 1, width
                           257, 4, 1, height,  # ImageLength, LONG, 1, length
                           258, 3, 1, 1,  # BitsPerSample, SHORT, 1, 1
                           259, 3, 1, CCITT_group,  # Compression, SHORT, 1, 4 = CCITT Group 4 fax encoding
                           262, 3, 1, 0,  # Threshholding, SHORT, 1, 0 = WhiteIsZero
                           273, 4, 1, struct.calcsize(tiff_header_struct),  # StripOffsets, LONG, 1, len of header
                           278, 4, 1, height,  # RowsPerStrip, LONG, 1, length
                           279, 4, 1, img_size,  # StripByteCounts, LONG, 1, size of extract_image
                           0  # last IFD
                           )

    def extract_image(self, filename):
        number = 1
        pdf_reader = PdfFileReader(open(filename, 'rb'))

        for i in range(0, pdf_reader.numPages):

            page = pdf_reader.getPage(i)

            try:
                xObject = page['/Resources']['/XObject'].getObject()
            except:
                print("No XObject Found")
                return

            for obj in xObject:

                try:

                    if xObject[obj]['/Subtype'] == '/Image':
                        size = (xObject[obj]['/Width'], xObject[obj]['/Height'])
                        data = xObject[obj]._data
                        if xObject[obj]['/ColorSpace'] == '/DeviceRGB':
                            mode = "RGB"
                        else:
                            mode = "P"

                        image_name = filename.split(".")[0] + str(number)

                        print(xObject[obj]['/Filter'])

                        if xObject[obj]['/Filter'] == '/FlateDecode':
                            data = xObject[obj].getData()
                            img = Image.frombytes(mode, size, data)
                            img.save(image_name + "_Flate.png")
                            # save_to_s3(imagename + "_Flate.png")
                            print("Image_Saved")

                            number += 1
                        elif xObject[obj]['/Filter'] == '/DCTDecode':
                            img = open(image_name + "_DCT.jpg", "wb")
                            img.write(data)
                            # save_to_s3(imagename + "_DCT.jpg")
                            img.close()
                            number += 1
                        elif xObject[obj]['/Filter'] == '/JPXDecode':
                            img = open(image_name + "_JPX.jp2", "wb")
                            img.write(data)
                            # save_to_s3(imagename + "_JPX.jp2")
                            img.close()
                            number += 1
                        elif xObject[obj]['/Filter'] == '/CCITTFaxDecode':
                            if xObject[obj]['/DecodeParms']['/K'] == -1:
                                CCITT_group = 4
                            else:
                                CCITT_group = 3
                            width = xObject[obj]['/Width']
                            height = xObject[obj]['/Height']
                            data = xObject[obj]._data  # sorry, getData() does not work for CCITTFaxDecode
                            img_size = len(data)
                            tiff_header = self.tiff_header_for_CCITT(width, height, img_size, CCITT_group)
                            img_name = image_name + '_CCITT.tiff'
                            with open(img_name, 'wb') as img_file:
                                img_file.write(tiff_header + data)

                            # save_to_s3(img_name)
                            number += 1
                except:
                    continue

        return number

    def read_pages(self, start_page=-1, end_page=-1):

        # Downloading file locally
        downloaded_file = download_file(self.url)
        print(downloaded_file)

        # breaking PDF into number of pages in diff pdf files
        self.break_pdf(downloaded_file, start_page, end_page)

        # creating a pdf reader object
        pdf_reader = PdfFileReader(open(downloaded_file, 'rb'))

        # Reading each pdf one by one
        total_pages = pdf_reader.numPages

        if start_page == -1:
            start_page = 0
        elif start_page < 1 or start_page > total_pages:
            return "Start Page Selection Is Wrong"
        else:
            start_page = start_page - 1

        if end_page == -1:
            end_page = total_pages
        elif end_page < 1 or end_page > total_pages - 1:
            return "End Page Selection Is Wrong"
        else:
            end_page = end_page

        for i in range(start_page, end_page):
            # creating a page based filename
            file = str(i + 1) + "_" + downloaded_file

            print("\nStarting to Read Page: ", i + 1, "\n -----------===-------------")

            file_text = self.extract_text(file)
            print(file_text)
            self.extract_image(file)

            self.extarct_table(file)
            os.remove(file)
            print("Stopped Reading Page: ", i + 1, "\n -----------===-------------")

        os.remove(downloaded_file)


# I have tested on these 3 pdf files
# url = "http://s3.amazonaws.com/NLP_Project/Original_Documents/Healthcare-January-2017.pdf"
url = "http://s3.amazonaws.com/NLP_Project/Original_Documents/Sample_Test.pdf"
# url = "http://s3.amazonaws.com/NLP_Project/Original_Documents/Sazerac_FS_2017_06_30%20Annual.pdf"
# creating the instance of class
pdf_extractor = PDFExtractor(url)

# Getting desired data out
pdf_extractor.read_pages(15, 23)

Answer 22 (score: 0)

If you try it in Anaconda on Windows, PyPDF2 may fail to handle some PDFs that have non-standard structures or Unicode characters. I recommend the following code if you need to open and read lots of PDF files; the text of all PDF files in the folder with relative path .//pdfs// will be stored in the list pdf_text_list.

from tika import parser
import glob

def read_pdf(filename):
    text = parser.from_file(filename)
    return(text)


all_files = glob.glob(".\\pdfs\\*.pdf")
pdf_text_list=[]
for i,file in enumerate(all_files):
    text=read_pdf(file)
    pdf_text_list.append(text['content'])

print(pdf_text_list)

Answer 23 (score: 0)

PyPDF2 does work, but results may vary. I found its extraction results inconsistent.

reader=PyPDF2.pdf.PdfFileReader(self._path)
eachPageText=[]
for i in range(0,reader.getNumPages()):
    pageText=reader.getPage(i).extractText()
    print(pageText)
    eachPageText.append(pageText)

Answer 24 (score: 0)

Camelot seems a fairly powerful solution to extract tables from PDFs in Python.

At first glance, it seems to achieve extraction almost as accurate as the tabula-py package suggested by CreekGeek, and in terms of reliability it already surpasses any other solution published today. Moreover, it has its own accuracy indicator (results.parsing_report) and great debugging features.

Both Camelot and Tabula deliver the results as pandas DataFrames, so it is easy to adjust the tables afterwards.

pip install camelot-py

(Not to be confused with the camelot package.)

import camelot

df_list = []
results = camelot.read_pdf("file.pdf", ...)
for table in results:
    print(table.parsing_report)
    df_list.append(table.df)

It can also output the results as CSV, JSON, HTML or Excel.

Camelot comes at the expense of a number of dependencies.

NB: Since my input was very complex, with many different tables, I ended up using both Camelot and Tabula, depending on the table, to achieve the best results.

Answer 25 (score: 0)

Try borb, a pure-Python PDF library:

import typing  
from borb.pdf.document import Document  
from borb.pdf.pdf import PDF  
from borb.toolkit.text.simple_text_extraction import SimpleTextExtraction  


def main():

    # variable to hold Document instance
    doc: typing.Optional[Document] = None  

    # this implementation of EventListener handles text-rendering instructions
    l: SimpleTextExtraction = SimpleTextExtraction()  

    # open the document, passing along the array of listeners
    with open("input.pdf", "rb") as in_file_handle:  
        doc = PDF.loads(in_file_handle, [l])  
  
    # were we able to read the document?
    assert doc is not None  

    # print the text on page 0
    print(l.get_text(0))  

if __name__ == "__main__":
    main()

Answer 26 (score: 0)

You can do this simply using pytesseract and OpenCV. Refer to the following code; you can get more details from this article.

import os
from PIL import Image
from pdf2image import convert_from_path
import pytesseract

filePath = '021-DO-YOU-WONDER-ABOUT-RAIN-SNOW-SLEET-AND-HAIL-Free-Childrens-Book-By-Monkey-Pen.pdf'
doc = convert_from_path(filePath)

path, fileName = os.path.split(filePath)
fileBaseName, fileExtension = os.path.splitext(fileName)

for page_number, page_data in enumerate(doc):
    txt = pytesseract.image_to_string(page_data).encode("utf-8")
    print("Page # {} — {}".format(str(page_number), txt))

Answer 27 (score: -1)

How to extract text from a PDF file?

The first thing to understand is the PDF format. It has a public specification written in English, see ISO 32000-2:2017, and the 700+ page PDF 1.7 specification. You certainly need to at least read the Wikipedia page about PDF.

Once you understand the details of the PDF format, extracting the text is more or less easy (but what about text that appears in figures or images? See its figure 1!). Don't expect to write a perfect software text extractor on your own within a few weeks...
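To make the point concrete, here is a hand-written, hypothetical sketch of the text-drawing commands inside an uncompressed PDF content stream, and how a naive extractor might pull strings out of them. Real PDFs usually Flate-compress these streams and may remap glyph codes through subset fonts, which is exactly why this naive approach fails for many generators:

```python
import re

# A hand-written, uncompressed PDF content stream (hypothetical example).
content_stream = b"""
BT
/F1 12 Tf
72 712 Td
(Hello) Tj
0 -14 Td
(World) Tj
ET
"""

# Naive extraction: collect every literal string drawn with the Tj operator.
# This only works when the stream is uncompressed and the font uses a
# standard encoding; compressed streams or remapped glyphs defeat it.
texts = re.findall(rb"\((.*?)\)\s*Tj", content_stream)
print(b" ".join(texts).decode("latin-1"))  # -> Hello World
```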

On Linux, you could also use pdf2text, which you could popen from your Python code.

In general, extracting text from a PDF file is an ill-defined problem: to a human reader, some "text" could be made (as a picture) of different dots, or of a photo, etc.

The Google search engine is able to extract text from PDFs, but it is rumored to require more than a billion lines of source code. Do you have the necessary resources (in manpower and budget) to develop a competitor?

One possibility is to print the PDF to some virtual printer (e.g. using GhostScript or Firefox), then use OCR techniques to extract the text.

I would instead recommend working on the data representation that generated the PDF file, e.g. the original LaTeX code (or Lout code) or OOXML code.

In all cases, you need to budget at least several person-years of software development.