如何调用pypdfocr函数在python脚本中使用它们?

时间:2016-10-11 23:38:29

标签: python python-2.7 python-3.x pdfbox

最近我下载了pypdfocr,但是在文档中没有关于如何将pypdfocr作为库调用的示例,有人可以帮我调用它来转换单个文件吗?我刚刚找到了一个终端命令:

$ pypdfocr filename.pdf

1 个答案:

答案 0 :(得分:1)

如果您正在寻找源代码,它通常位于您的python安装目录site-package下。更重要的是,如果您使用的是IDE(即Pycharm),它将帮助您找到目录和文件。这对于查找类非常有用,并向您展示如何实例化它,例如: https://github.com/virantha/pypdfocr/blob/master/pypdfocr/pypdfocr.py 这个文件有一个pypdfocr类类型,你可以重用它,并且可能做一个命令行会做的事情。

在该类中,开发人员已经提出了很多要解析的论据:

def get_options(self, argv):
    """
        Parse the command-line options and set the following object properties:
        :param argv: usually just sys.argv[1:]
        :returns: Nothing
        :ivar debug: Enable logging debug statements
        :ivar verbose: Enable verbose logging
        :ivar enable_filing: Whether to enable post-OCR filing of PDFs
        :ivar pdf_filename: Filename for single conversion mode
        :ivar watch_dir: Directory to watch for files to convert
        :ivar config: Dict of the config file
        :ivar watch: Whether folder watching mode is turned on
        :ivar enable_evernote: Enable filing to evernote
    """
    p = argparse.ArgumentParser(description = "Convert scanned PDFs into their OCR equivalent.  Depends on GhostScript and Tesseract-OCR being installed.",
            epilog = "PyPDFOCR version %s (Copyright 2013 Virantha Ekanayake)" % __version__,
            )

    p.add_argument('-d', '--debug', action='store_true',
        default=False, dest='debug', help='Turn on debugging')

    p.add_argument('-v', '--verbose', action='store_true',
        default=False, dest='verbose', help='Turn on verbose mode')

    p.add_argument('-m', '--mail', action='store_true',
        default=False, dest='mail', help='Send email after conversion')

    p.add_argument('-l', '--lang',
        default='eng', dest='lang', help='Language(default eng)')


    p.add_argument('--preprocess', action='store_true',
            default=False, dest='preprocess', help='Enable preprocessing.  Not really useful now with improved Tesseract 3.04+')

    p.add_argument('--skip-preprocess', action='store_true',
            default=False, dest='skip_preprocess', help='DEPRECATED: always skips now.')

    #---------
    # Single or watch mode
    #--------
    single_or_watch_group = p.add_mutually_exclusive_group(required=True)
    # Positional argument for single file conversion
    single_or_watch_group.add_argument("pdf_filename", nargs="?", help="Scanned pdf file to OCR")
    # Watch directory for watch mode
    single_or_watch_group.add_argument('-w', '--watch', 
         dest='watch_dir', help='Watch given directory and run ocr automatically until terminated')

    #-----------
    # Filing options
    #----------
    filing_group = p.add_argument_group(title="Filing optinos")
    filing_group.add_argument('-f', '--file', action='store_true',
        default=False, dest='enable_filing', help='Enable filing of converted PDFs')
    #filing_group.add_argument('-c', '--config', type = argparse.FileType('r'),
    filing_group.add_argument('-c', '--config', type = lambda x: open_file_with_timeout(p,x),
         dest='configfile', help='Configuration file for defaults and PDF filing')
    filing_group.add_argument('-e', '--evernote', action='store_true',
        default=False, dest='enable_evernote', help='Enable filing to Evernote')
    filing_group.add_argument('-n', action='store_true',
        default=False, dest='match_using_filename', help='Use filename to match if contents did not match anything, before filing to default folder')


    # Add flow option to single mode extract_images,preprocess,ocr,write

    args = p.parse_args(argv)

您可以将任何这些参数传递给它的解析器,如下所示:

import pypdfocr

obj = pypdfocr.pypdfocr.pypdfocr()
obj.get_options([]) # this makes it takes default, but you could add CLI option to it.  Other option might be [-v] or [-d,-v]

我希望这有助于您理解:)