How do I find blank pages in a PDF using PDFBox?

Time: 2014-05-19 13:10:13

Tags: java pdf

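Here is the challenge I am currently facing: I have a lot of PDFs, and I need to remove the blank pages from them and keep only the pages that have content (text or images).
The problem is that these PDFs are scanned documents, so the scanner leaves some dirt on the blank pages.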

3 answers:

Answer 0 (score: 4)

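I did some research and ended up with this code, which treats a page as blank when at least 99% of its pixels are white or light gray. I needed the gray tolerance because scanned documents are sometimes not pure white.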

// Renders the page with PDPage.convertToImage() and counts its near-white pixels.
private static boolean isBlank(PDPage pdfPage) throws IOException {
    BufferedImage bufferedImage = pdfPage.convertToImage();
    long count = 0;
    int height = bufferedImage.getHeight();
    int width = bufferedImage.getWidth();
    // The page counts as blank when at least 99% of its pixels pass the test below.
    double areaFactor = (width * height) * 0.99;

    for (int x = 0; x < width; x++) {
        for (int y = 0; y < height; y++) {
            Color c = new Color(bufferedImage.getRGB(x, y));
            // White or light gray: R == G == B and the value is at least 248.
            if (c.getRed() == c.getGreen() && c.getRed() == c.getBlue()
                    && c.getRed() >= 248) {
                count++;
            }
        }
    }

    return count >= areaFactor;
}
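
The check above requires R, G and B to be exactly equal, so a scan with a slight color cast (say a yellowish paper tone) would not count as blank. Below is a minimal sketch of a more tolerant variant that works on any BufferedImage and needs only the JDK. The class name BlankImageCheck, the method names, and the parameters minBrightness, maxChannelSpread and minBlankFraction are my own assumptions for illustration, not part of PDFBox, and the values 248, 6 and 0.99 are just starting points to tune against your own scans.

```java
import java.awt.image.BufferedImage;

class BlankImageCheck {

    /**
     * Returns the fraction (0.0 to 1.0) of pixels that look like blank paper:
     * every channel is at least minBrightness, and the channels differ from
     * each other by at most maxChannelSpread (to allow a slight color cast).
     */
    static double blankFraction(BufferedImage image, int minBrightness, int maxChannelSpread) {
        int width = image.getWidth();
        int height = image.getHeight();
        long blank = 0;
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                int rgb = image.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                int min = Math.min(r, Math.min(g, b));
                int max = Math.max(r, Math.max(g, b));
                if (min >= minBrightness && max - min <= maxChannelSpread) {
                    blank++;
                }
            }
        }
        // Cast to long before multiplying so very large pages cannot overflow int.
        return (double) blank / ((long) width * height);
    }

    /** A page image is treated as blank when enough of its pixels look like paper. */
    static boolean isBlank(BufferedImage image, int minBrightness,
                           int maxChannelSpread, double minBlankFraction) {
        return blankFraction(image, minBrightness, maxChannelSpread) >= minBlankFraction;
    }

    public static void main(String[] args) {
        // Synthetic 200 x 100 "scan": white paper with a 10 x 10 black speck (0.5% dirt).
        BufferedImage page = new BufferedImage(200, 100, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < 100; y++) {
            for (int x = 0; x < 200; x++) {
                page.setRGB(x, y, (x < 10 && y < 10) ? 0x000000 : 0xFFFFFF);
            }
        }
        System.out.println(isBlank(page, 248, 6, 0.99)); // prints true
    }
}
```

Returning the fraction separately makes it easy to log the value for a few known blank and non-blank pages before settling on a cutoff.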

Answer 1 (score: 0)

http://www.rgagnon.com/javadetails/java-detect-and-remove-blank-page-in-pdf.html

import java.io.ByteArrayOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;

import com.itextpdf.text.Document;
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.io.RandomAccessSourceFactory;
import com.itextpdf.text.pdf.PdfCopy;
import com.itextpdf.text.pdf.PdfDictionary;
import com.itextpdf.text.pdf.PdfImportedPage;
import com.itextpdf.text.pdf.PdfName;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.RandomAccessFileOrArray;

public class RemoveBlankPageFromPDF {

    // content-stream size (in bytes) at or below which a page is considered blank;
    // can be much higher or lower depending on what is considered a blank page
    public static final int BLANK_THRESHOLD = 160;

    public static void removeBlankPdfPages(String source, String destination)
        throws IOException, DocumentException
    {
        PdfReader r = null;
        RandomAccessSourceFactory rasf = null;
        RandomAccessFileOrArray raf = null;
        Document document = null;
        PdfCopy writer = null;

        try {
            r = new PdfReader(source);
            // deprecated
            //    RandomAccessFileOrArray raf
            //           = new RandomAccessFileOrArray(pdfSourceFile);
            // itext 5.4.1
            rasf = new RandomAccessSourceFactory();
            raf = new RandomAccessFileOrArray(rasf.createBestSource(source));
            document = new Document(r.getPageSizeWithRotation(1));
            writer = new PdfCopy(document, new FileOutputStream(destination));
            document.open();
            PdfImportedPage page = null;

            for (int i=1; i<=r.getNumberOfPages(); i++) {
                // first check, examine the resource dictionary for /Font or
                // /XObject keys.  If either are present -> not blank.
                PdfDictionary pageDict = r.getPageN(i);
                PdfDictionary resDict = (PdfDictionary) pageDict.get( PdfName.RESOURCES );
                boolean noFontsOrImages = true;
                if (resDict != null) {
                  noFontsOrImages = resDict.get( PdfName.FONT ) == null &&
                                    resDict.get( PdfName.XOBJECT ) == null;
                }
                System.out.println(i + " noFontsOrImages " + noFontsOrImages);

                if (!noFontsOrImages) {
                    byte bContent [] = r.getPageContent(i,raf);
                    ByteArrayOutputStream bs = new ByteArrayOutputStream();
                    bs.write(bContent);
                    // the " " separator keeps i and bs.size() from being added together
                    System.out.println
                      (i + " " + bs.size() + " > BLANK_THRESHOLD " +  (bs.size() > BLANK_THRESHOLD));
                    if (bs.size() > BLANK_THRESHOLD) {
                        page = writer.getImportedPage(r, i);
                        writer.addPage(page);
                    }
                }
            }
        }
        finally {
            if (document != null) document.close();
            if (writer != null) writer.close();
            if (raf != null) raf.close();
            if (r != null) r.close();
        }
    }

    public static void main (String ... args) throws Exception {
        removeBlankPdfPages
            ("C://temp//documentwithblank.pdf", "C://temp//documentwithnoblank.pdf");
    }
}

Answer 2 (score: 0)

@Shoyo's code works for PDFBox versions < 2.0. Not much has changed, but for future readers, just in case, here is the code for PDFBox 2.0+ to make your life easier.

In your main method (by main I mean wherever you load the PDF into a PDDocument):

// try-with-resources closes the document when done
try (PDDocument document = PDDocument.load(new File("/home/codemantra/Downloads/tetml_ct_access/C.pdf"))) {
    PDFRenderer renderedDoc = new PDFRenderer(document);
    for (int pageNumber = 0; pageNumber < document.getNumberOfPages(); pageNumber++) {
        if (isBlank(renderedDoc.renderImage(pageNumber))) {
            // Parentheses are required: without them "+ 1" is appended as text (page 0 prints "01").
            System.out.println("Blank Page Number : " + (pageNumber + 1));
        }
    }
} catch (Exception e) {
    e.printStackTrace();
}

The isBlank method now simply takes a BufferedImage:

private static Boolean isBlank(BufferedImage pageImage) throws IOException {
    BufferedImage bufferedImage = pageImage;
    long count = 0;
    int height = bufferedImage.getHeight();
    int width = bufferedImage.getWidth();
    Double areaFactor = (width * height) * 0.99;

    for (int x = 0; x < width; x++) {
        for (int y = 0; y < height; y++) {
            Color c = new Color(bufferedImage.getRGB(x, y));
            if (c.getRed() == c.getGreen() && c.getRed() == c.getBlue() && c.getRed() >= 248) {
                count++;
            }
        }
    }
    if (count >= areaFactor) {
        return true;
    }
    return false;
}
  

All credit goes to @Shoyo.


Update

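Some PDFs contain "This page intentionally left blank" pages, and the code above treats those as blank. If that is what you need, feel free to use the code above. My requirement, however, was only to filter out pages that are completely blank (no images and no fonts at all). So I ended up using this code (plus, it runs faster :P):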

public static void main(String[] args) {
    // try-with-resources closes the document when done
    try (PDDocument document = PDDocument.load(new File("/home/codemantra/Downloads/CTP2040.pdf"))) {
        int pageNumber = 1;
        for (PDPage page : document.getPages()) {
            PDResources resources = page.getResources();
            // A page without resources has no images or fonts; otherwise check that
            // both name lists are empty (iterator().hasNext() works for any Iterable).
            boolean noImages = resources == null || !resources.getXObjectNames().iterator().hasNext();
            boolean noFonts = resources == null || !resources.getFontNames().iterator().hasNext();
            if (noImages && noFonts) {
                System.out.println(pageNumber);
            }
            pageNumber++;
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
}

This returns the page numbers of the pages that are completely blank.
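
Since the original question is about removing the blank pages rather than just listing them, one thing to watch out for is that deleting a page shifts the index of every page after it. A simple way around that is to delete from the highest index down. The sketch below shows the idea on a plain List standing in for the document's pages; the class BlankPageRemoval and the method removeAtIndices are made-up names for illustration. With PDFBox 2.x I believe the equivalent call on the real document is PDDocument.removePage(int) with a 0-based index, but that part is an assumption and is not exercised here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

class BlankPageRemoval {

    /**
     * Removes the given 0-based indices from pages, highest index first,
     * so that earlier removals do not shift the positions of later ones.
     * Duplicate indices are ignored.
     */
    static <T> void removeAtIndices(List<T> pages, List<Integer> blankIndices) {
        for (int index : new TreeSet<>(blankIndices).descendingSet()) {
            pages.remove(index); // index is an int, so this removes by position
        }
    }

    public static void main(String[] args) {
        // Stand-in for a five-page document whose 2nd and 4th pages are blank.
        List<String> pages = new ArrayList<>(Arrays.asList("p1", "blank", "p3", "blank", "p5"));
        removeAtIndices(pages, Arrays.asList(1, 3));
        System.out.println(pages); // prints [p1, p3, p5]
    }
}
```

Note that the detection code above prints 1-based page numbers, so subtract 1 before using them as indices.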

Hope this helps someone! :)