Question

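I have a directory containing a large number of PDF files with text. My idea is to read them all at once and save the contents in a dictionary. Right now I can only do this one file at a time, using the textract library: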

import textract

text = textract.process('/Users/user/Documents/Data/CLAR.pdf', 
                        method='tesseract', 
                        language='eng')

How can I read them all at once? Do I need a for loop to search the directory, or is there some other way?

Answer 1

One solution might be to use the os library together with a for loop:

import os
import textract

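# Absolute paths of all entries in the current working directory
files_path = [os.path.abspath(x) for x in os.listdir()]

# Keep only the .pdf files (match the extension, not any '.pdf' substring)
files_path = [pdf for pdf in files_path if pdf.lower().endswith('.pdf')]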

pdfs = []
for file in files_path:
    text = textract.process(file,
                            method='tesseract',
                            language='eng')

    pdfs.append(text)

Get all files in the current directory
Exclude files that are not .pdf files
Save the text to a list (a different data structure could be used)
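
Since the question asks for a dictionary, the same steps can be sketched with the file name as the key. This is only a minimal sketch, not a definitive implementation: the helper name read_pdfs and its extract parameter are hypothetical, and decoding as UTF-8 assumes textract.process returns bytes in its default output encoding.

```python
from pathlib import Path


def read_pdfs(directory, extract=None):
    """Return {file name: extracted text} for each PDF directly inside `directory`."""
    if extract is None:
        # Imported lazily so the helper can also be tried with a stand-in extractor.
        import textract

        def extract(path):
            # Same call as in the question.
            return textract.process(path, method='tesseract', language='eng')

    texts = {}
    for pdf in sorted(Path(directory).iterdir()):
        # Skip subdirectories and anything without a .pdf extension.
        if pdf.is_file() and pdf.suffix.lower() == '.pdf':
            raw = extract(str(pdf))
            # Assumption: the extractor returns UTF-8 bytes; str results pass through.
            texts[pdf.name] = raw.decode('utf-8') if isinstance(raw, bytes) else raw
    return texts
```

Calling read_pdfs('/Users/user/Documents/Data') would then give a dictionary in which the text of CLAR.pdf is stored under the key 'CLAR.pdf'. Keying by file name only works while the names are unique, which holds here because the sketch does not descend into subdirectories.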
