Question

I am trying to get the text from a PDF using Square annotations. I use the code below to extract the text from the PDF with PDFBox.

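import java.awt.geom.Rectangle2D;
import java.io.File;
import java.io.IOException;
import java.util.List;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.common.PDRectangle;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotation;
import org.apache.pdfbox.util.PDFTextStripperByArea;

// PDFBox 1.8.x API. The variables usr and testTxt (both String) are declared elsewhere.
PDDocument document = null;
try {
    document = PDDocument.load(new File("//Users//" + usr + "//Desktop//BoldTest2 2.pdf"));
    List allPages = document.getDocumentCatalog().getAllPages();
    for (int i = 0; i < allPages.size(); i++) {
        PDPage page = (PDPage) allPages.get(i);
        List<PDAnnotation> la = page.getAnnotations();
        for (int f = 0; f < la.size(); f++) {
            PDAnnotation pdfAnnot = la.get(f);
            // Only Square annotations mark text that should be extracted.
            if (!"Square".equals(pdfAnnot.getSubtype())) {
                continue;
            }
            PDFTextStripperByArea stripper = new PDFTextStripperByArea();
            stripper.setSortByPosition(true);
            PDRectangle rect = pdfAnnot.getRectangle();

            float x = 0;
            float y = 0;
            float width = 0;
            float height = 0;
            int rotation = page.findRotation();

            // Only unrotated pages are handled; otherwise the region stays empty.
            if (rotation == 0) {
                x = rect.getLowerLeftX();
                y = rect.getUpperRightY() - 2;
                width = rect.getWidth();
                height = rect.getHeight();
                // The annotation rectangle is measured from the bottom of the page;
                // flip y so the region is measured from the top.
                PDRectangle pageSize = page.findMediaBox();
                y = pageSize.getHeight() - y;
            }
            Rectangle2D.Float awtRect = new Rectangle2D.Float(x, y, width, height);
            stripper.addRegion(Integer.toString(f), awtRect);
            stripper.extractRegions(page);
            testTxt = testTxt + "\n " + stripper.getTextForRegion(Integer.toString(f));
        }
    }
} catch (Exception ex) {
    ex.printStackTrace(); // report failures instead of swallowing them
} finally {
    if (document != null) {
        try {
            document.close();
        } catch (IOException ex) {
            ex.printStackTrace();
        }
    }
}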

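With this code I only get the plain text of the PDF. How can I also get font information such as BOLD and ITALIC along with the text? Any suggestions or references are highly appreciated.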

Answer 1

PDFTextStripperByArea, which extends PDFTextStripper, normalizes the text, that is, it strips the formatting (see the JavaDoc comment):

* This class will take a pdf document and strip out all of the text and ignore the
* formatting and such.

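If you look at the source code, you will see that the font information is available in this class, but the text is normalized before it is written out: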

protected void writePage() throws IOException
{
    [...]
        List<TextPosition> line = new ArrayList<TextPosition>();
        [...]
            if(!overlap(positionY, positionHeight, maxYForLine, maxHeightForLine))
            {
                writeLine(normalize(line,isRtlDominant,hasRtl),isRtlDominant);
                line.clear();
                [...]
            }
............

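The TextPosition instances in the ArrayList carry all the formatting information. A solution can therefore focus on redefining the existing methods as required. Here are a few options:

private List normalize(List line, boolean isRtlDominant, boolean hasRtl)

If you need your own normalize method, you can copy the entire PDFTextStripper class into your project and change the code of the copy (the method is private, so a subclass cannot override it). Let us call the new class MyPDFTextStripper and define the new method as required. Likewise, copy PDFTextStripperByArea as MyPDFTextStripperByArea, which extends MyPDFTextStripper.

protected void writePage()

If you only need a new writePage method, simply extend PDFTextStripper and override this method, then create MyPDFTextStripperByArea as described above.

writeLine(normalize(line, isRtlDominant, hasRtl), isRtlDominant)

Another solution might be to override the writeLine method: store the pre-normalization information in a variable beforehand and then use it there.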
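Whichever option you pick, the remaining step is to turn the per-glyph font information into style markers in the output. Below is a minimal sketch of that step only, written without PDFBox so it can be run on its own. The Glyph class is a hypothetical stand-in for TextPosition (in real code the font name would come from the font attached to each TextPosition), and styleOf is a heuristic I made up for illustration: it assumes the font name contains "Bold", "Italic" or "Oblique", which is common but not guaranteed for every PDF.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

final class StyledRuns {

    /** Hypothetical stand-in for a PDFBox TextPosition: a piece of text plus its font name. */
    static final class Glyph {
        final String text;
        final String fontName;

        Glyph(String text, String fontName) {
            this.text = text;
            this.fontName = fontName;
        }
    }

    /** Heuristic (an assumption, not a PDFBox API): guess the style from the font name. */
    static String styleOf(String fontName) {
        if (fontName == null) {
            return "REGULAR";
        }
        // Embedded subset fonts carry a prefix of six uppercase letters and '+', e.g. "ABCDEF+".
        String name = fontName.replaceFirst("^[A-Z]{6}\\+", "").toLowerCase(Locale.ROOT);
        boolean bold = name.contains("bold");
        boolean italic = name.contains("italic") || name.contains("oblique");
        if (bold && italic) {
            return "BOLD_ITALIC";
        }
        if (bold) {
            return "BOLD";
        }
        return italic ? "ITALIC" : "REGULAR";
    }

    /** Emits the line's text and inserts a [STYLE] marker whenever the style changes. */
    static String tagLine(List<Glyph> line) {
        StringBuilder out = new StringBuilder();
        String current = null;
        for (Glyph g : line) {
            String style = styleOf(g.fontName);
            if (!style.equals(current)) {
                out.append('[').append(style).append(']');
                current = style;
            }
            out.append(g.text);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Glyph> line = Arrays.asList(
                new Glyph("Bold ", "ABCDEF+Arial-BoldMT"),
                new Glyph("plain ", "ArialMT"),
                new Glyph("slanted", "Helvetica-Oblique"));
        // prints: [BOLD]Bold [REGULAR]plain [ITALIC]slanted
        System.out.println(tagLine(line));
    }
}
```

In a real stripper subclass, the same loop would run over the List<TextPosition> for a line before it is normalized. If the font names in your document do not encode the style, the font descriptor would be the next place to look.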

Hope this helps.
