Question

Apache Tika 1.6 can extract inline images from PDF documents. However, I've been struggling to get it to work.

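My use case is that I want some code that extracts the content and, separately, the images from any document (not necessarily a PDF). The results are then passed into an Apache UIMA pipeline.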

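I've been able to extract images from other document types by using a custom parser (built on AutoDetectParser) to convert the document to HTML and then saving the images out separately. When I try this with PDFs, the image tags don't even appear in the HTML, let alone give me access to the files.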

Could someone suggest how I might achieve the above, preferably with some code examples showing how to extract inline images from PDFs with Tika 1.6?

Answer 1

Try the code below; the returned ContentHandler holds your XML content (call toString() on it to get the XHTML).

public ContentHandler convertPdf(byte[] content, String path, String filename) throws IOException, SAXException, TikaException {

    Metadata metadata = new Metadata();
    ParseContext context = new ParseContext();
    ContentHandler handler = new ToXMLContentHandler();
    PDFParser parser = new PDFParser();

    // Inline image extraction is off by default, so enable it explicitly.
    PDFParserConfig config = new PDFParserConfig();
    config.setExtractInlineImages(true);
    // Extract an image that is reused across pages only once.
    config.setExtractUniqueInlineImagesOnly(true);

    parser.setPDFParserConfig(config);

    EmbeddedDocumentExtractor embeddedDocumentExtractor = 
            new EmbeddedDocumentExtractor() {
        @Override
        public boolean shouldParseEmbedded(Metadata metadata) {
            return true;
        }
        @Override
        public void parseEmbedded(InputStream stream, ContentHandler handler, Metadata metadata, boolean outputHtml)
                throws SAXException, IOException {
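            // path is used as a plain prefix, so it must end with a file separator.
            Path outputFile = new File(path + metadata.get(Metadata.RESOURCE_NAME_KEY)).toPath();
            // Files.copy throws FileAlreadyExistsException if the target already exists.
            Files.copy(stream, outputFile);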
        }
    };

    context.set(PDFParser.class, parser);
    context.set(EmbeddedDocumentExtractor.class, embeddedDocumentExtractor);

    try (InputStream stream = new ByteArrayInputStream(content)) {
        parser.parse(stream, handler, metadata, context);
    }

    return handler;
}
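
The extractor above builds the output file from path plus whatever resource name the metadata reports. As a minimal sketch, assuming you would rather not trust that name blindly, the helper below (JDK only) reduces it to a bare file name and falls back to a generated one when it is missing. EmbeddedNames, safeName, resolve and the "embedded-<index>.bin" fallback are hypothetical names chosen for illustration; none of them are part of Tika.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper (not part of Tika): builds a safe output path for an embedded resource. */
final class EmbeddedNames {

    private EmbeddedNames() {
    }

    /**
     * Returns the last path segment of resourceName, or the generated name
     * "embedded-<index>.bin" when the name is null, blank, ".", ".." or ends in a separator.
     */
    static String safeName(String resourceName, int index) {
        String fallback = "embedded-" + index + ".bin";
        if (resourceName == null || resourceName.trim().isEmpty()) {
            return fallback;
        }
        // Treat Windows separators like '/', then keep only the last segment.
        String normalised = resourceName.replace('\\', '/');
        String last = normalised.substring(normalised.lastIndexOf('/') + 1);
        if (last.isEmpty() || last.equals(".") || last.equals("..")) {
            return fallback;
        }
        return last;
    }

    /** Resolves the sanitised name against the output directory. */
    static Path resolve(String outputDir, String resourceName, int index) {
        return Paths.get(outputDir).resolve(safeName(resourceName, index));
    }
}
```

Inside parseEmbedded you could then keep a counter and call EmbeddedNames.resolve(path, metadata.get(Metadata.RESOURCE_NAME_KEY), counter++) instead of concatenating the strings directly.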

Answer 2

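You can use AutoDetectParser to extract the images without depending on PDFParser directly. The same code also works for extracting images from docx, pptx, and so on.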

Here I have a parseDocument() and a setPdfConfig() function that make use of an AutoDetectParser.

1. Create an AutoDetectParser.
2. Attach the EmbeddedDocumentExtractor to the ParseContext.
3. Attach the AutoDetectParser to the same ParseContext.
4. Attach a PDFParserConfig to the same ParseContext.
5. Pass the ParseContext to AutoDetectParser.parse().

The images are saved to a folder in the same location as the source file, named <sourceFile>_/.

private static void setPdfConfig(ParseContext context) {
    PDFParserConfig pdfConfig = new PDFParserConfig();
    pdfConfig.setExtractInlineImages(true);
    pdfConfig.setExtractUniqueInlineImagesOnly(true);

    context.set(PDFParserConfig.class, pdfConfig);
}

private static String parseDocument(String path) {
    String xhtmlContents = "";

    AutoDetectParser parser = new AutoDetectParser();
    ContentHandler handler = new ToXMLContentHandler();
    Metadata metadata = new Metadata();
    ParseContext context = new ParseContext();
    EmbeddedDocumentExtractor embeddedDocumentExtractor = 
            new EmbeddedDocumentExtractor() {
        @Override
        public boolean shouldParseEmbedded(Metadata metadata) {
            return true;
        }
        @Override
        public void parseEmbedded(InputStream stream, ContentHandler handler, Metadata metadata, boolean outputHtml)
                throws SAXException, IOException {
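            // Images go into a sibling folder named <sourceFile>_
            Path outputDir = new File(path + "_").toPath();
            Files.createDirectories(outputDir);

            // Delete any image left over from a previous run, since Files.copy will not overwrite it
            Path outputPath = new File(outputDir.toString() + "/" + metadata.get(Metadata.RESOURCE_NAME_KEY)).toPath();
            Files.deleteIfExists(outputPath);
            Files.copy(stream, outputPath);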
        }
    };

    context.set(EmbeddedDocumentExtractor.class, embeddedDocumentExtractor);
    // Register the parser under Parser.class; there is no AutoParser class in Tika.
    context.set(Parser.class, parser);

    setPdfConfig(context);

    try (InputStream stream = new FileInputStream(path)) {
        parser.parse(stream, handler, metadata, context);
        xhtmlContents = handler.toString();
    } catch (IOException | SAXException | TikaException e) {
        e.printStackTrace();
    }

    return xhtmlContents;
}
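
Since the goal is to hand both the text and the images to a downstream pipeline, it can help to know which saved files the XHTML refers to. The sketch below is JDK only and rests on an assumption you should check against your own output: that Tika marks each extracted image in the XHTML with an img element whose src attribute starts with "embedded:". EmbeddedImageRefs and find are hypothetical names, not Tika API, and a regular expression is used here only for brevity; a real XML parser would be more robust.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of Tika): lists embedded image names referenced in XHTML. */
final class EmbeddedImageRefs {

    // Assumption: extracted images appear as <img src="embedded:NAME" .../> in the output.
    private static final Pattern IMG_SRC =
            Pattern.compile("<img\\b[^>]*\\bsrc=\"embedded:([^\"]+)\"");

    private EmbeddedImageRefs() {
    }

    /** Returns the NAME part of every src="embedded:NAME" found on an img tag, in document order. */
    static List<String> find(String xhtml) {
        List<String> names = new ArrayList<>();
        Matcher matcher = IMG_SRC.matcher(xhtml);
        while (matcher.find()) {
            names.add(matcher.group(1));
        }
        return names;
    }
}
```

Calling EmbeddedImageRefs.find(parseDocument(path)) would then give the names to look up in the <sourceFile>_ folder, provided the assumption about the src prefix holds for your documents.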

Extracting images from a PDF with Apache Tika
