Question

I am trying to extract text from images in PySpark (using pytesser -> Tesseract OCR engine). I have uploaded my image file to HDFS and am trying to read it from Spark.

Here is my code:

>>> import sys
>>> sys.path.append("/home/dsuser/Downloads/pytesser")
>>> from PIL import Image
>>> from pytesser import *
>>> image_file = 'hdfs://localhost:9000/image/image6.jpg'
>>> rdd = sc.binaryFiles(image_file)
>>> img = Image.open(rdd.first())
>>> text = image_to_string(img)
>>> print("=====output=======\n")
>>> print(text)

When I run it, the image file loads from HDFS, but I get an error when calling the line below:

>>> im = Image.open(rdd.first())

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/dist-packages/PIL/Image.py", line 2000, in open
    prefix = fp.read(16)
AttributeError: 'tuple' object has no attribute 'read'

I am not sure what causes this error. I need help getting the image into a form that I can pass to the OCR engine.
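For context on the traceback: `sc.binaryFiles` yields `(path, content)` pairs, so `rdd.first()` is a tuple, while `Image.open` expects a filename or a file-like object with a `read` method. A minimal sketch of one possible workaround is to wrap the byte content in `io.BytesIO`. The helper name `to_file_like` and the sample bytes are hypothetical and used only for illustration; this has not been run against the setup in the question.

```python
import io

def to_file_like(pair):
    """Turn one (path, content) element from sc.binaryFiles into a file-like object.

    Hypothetical helper, not part of PySpark, PIL, or pytesser.
    """
    _path, content = pair
    return io.BytesIO(bytes(content))

# In the pyspark shell from the question this might be used as (not run here):
#   img = Image.open(to_file_like(rdd.first()))
#   text = image_to_string(img)

# Stand-in for rdd.first(): made-up bytes, not a real JPEG
sample = ("hdfs://localhost:9000/image/image6.jpg", b"\xff\xd8\xff\xe0fake-jpeg-bytes")
fp = to_file_like(sample)
prefix = fp.read(16)  # the same call that failed on the tuple
```

The point of the sketch is only that the object handed to `Image.open` has a `read` method; whether pytesser then handles the resulting in-memory image correctly is a separate assumption.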

Image processing for OCR using PySpark

0 answers: