Is the text in an image bold?

Time: 2019-01-02 15:40:03

Tags: python ocr tesseract

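I have been experimenting with Tesseract OCR lately. I can find the characters in an image, but I am having trouble finding only the bold characters (that is, telling whether a character in a document image is bold). I saw the function WordFontAttributes() of the Tesseract API mentioned in another question (Can I use OCR to detect font style (bold, italic)?), but I could not get it to work in Python.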

1 answer:

Answer 0: (score: 0)

First, install Tesseract 3.05 (version 4 does not support WordFontAttributes):

from tesserocr import PyTessBaseAPI, RIL, iterate_level


def get_words_info(image_path, tessdata_path):
    """
    Run OCR on the image at image_path, using the tessdata directory at
    tessdata_path, and return a list of dicts with information about each word.
    """
    with PyTessBaseAPI(path=tessdata_path) as api:
        api.SetImageFile(image_path)
        api.Recognize()
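        word_iter = api.GetIterator()  # renamed so it does not shadow the built-in iter()
        level = RIL.WORD

        result = []

        for r in iterate_level(word_iter, level):
            element = r.GetUTF8Text(level)
            # WordFontAttributes() returns a dict with keys such as 'bold' and
            # 'italic', or None when no font information is available.
            word_attributes = r.WordFontAttributes()
            bounding_box = r.BoundingBox(level)

            # Keep only words that have both text and font information.
            if element and word_attributes:
                word_attributes['word'] = element
                word_attributes['position'] = bounding_box
                result.append(word_attributes)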

        return result
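
To answer the original question (which words are bold), the list returned by get_words_info() can be filtered on the 'bold' attribute. The following is only a minimal sketch: bold_words is a hypothetical helper name, the sample data is made up for illustration, and it assumes each entry carries a boolean 'bold' key as reported by WordFontAttributes().

```python
def bold_words(words_info):
    """Return the text of every word whose font attributes mark it as bold."""
    # Assumption: each entry is either None or a dict that may contain a
    # boolean 'bold' key and a 'word' key, as built by get_words_info().
    return [
        info['word']
        for info in words_info
        if info and info.get('bold') and info.get('word')
    ]


# Hypothetical sample data, shaped like the output of get_words_info().
sample = [
    {'word': 'Invoice', 'bold': True, 'italic': False, 'position': (10, 10, 90, 30)},
    {'word': 'total', 'bold': False, 'italic': False, 'position': (100, 10, 150, 30)},
]
print(bold_words(sample))
```

With real input, the call would look like bold_words(get_words_info(image_path, tessdata_path)). How reliable the 'bold' flag is depends on the Tesseract engine and the trained data, so it is worth checking the result against a few known images.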