Extracting text from a PDF file with PDFBox

Time: 2019-02-14 11:11:34

Tags: java pdfbox

I am having a problem reading a PDF.

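import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// PDFBox 2.x package names (assumed; in 1.8.x PDFTextStripper and
// TextPosition live in org.apache.pdfbox.util)
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;

public class GetLinesFromPDF extends PDFTextStripper {

    // Lines of text collected from the PDF
    static List<String> lines = new ArrayList<String>();
    Map<String, String> auMap = new HashMap<>();
    boolean objFlag = false;

    public GetLinesFromPDF() throws IOException {
    }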

    /**
     * @throws IOException If there is an error parsing the document.
     */
    public static void main(String[] args) throws IOException {
        PDDocument document = null;
        String fileName = "E:\\sample.pdf";
        try {
            document = PDDocument.load(new File(fileName));
            PDFTextStripper stripper = new GetLinesFromPDF();
            stripper.setSortByPosition(true);
            // Page numbers in PDFTextStripper are 1-based
            stripper.setStartPage(1);
            stripper.setEndPage(document.getNumberOfPages());

            Writer dummy = new OutputStreamWriter(new ByteArrayOutputStream());
            stripper.writeText(document, dummy);

            // print the first line that mentions "Objection", then hand all lines over
            for (String line : lines) {
                if (line.contains("Objection")) {
                    System.out.println(line);
                    // withObjection(...) is the poster's own helper; its body is not included in the post
                    withObjection(lines);
                    break;
                }
            }
        } finally {
            if (document != null) {
                document.close();
            }
        }
    }

    /**
     * Placeholder so the class compiles: the post calls withObjection(lines)
     * but does not include the body of that helper.
     */
    private static void withObjection(List<String> lines) {
        // process the collected lines here
    }

    /**
     * Overrides PDFTextStripper.writeString() to collect each chunk of text
     * instead of writing it to the output.
     */
    @Override
    protected void writeString(String string, List<TextPosition> textPositions) throws IOException {
        System.out.println("textPositions = " + string);
        // Store the text so main() can iterate over it after writeText() returns
        lines.add(string);
        // you may also process the line right here, as and when it is obtained
    }
}

I need output like the one shown below. My PDF has some headers that repeat, and the repeated headers need to be skipped.
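To make the header requirement concrete, here is a minimal sketch of one way to drop repeated headers once the lines have been collected. It is not taken from the post: the class name `HeaderFilter`, the method `skipRepeatedHeaders`, and the sample strings are all hypothetical, and it assumes the header texts are known in advance and each header occupies a whole line.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of PDFBox or the original post.
public class HeaderFilter {

    /**
     * Returns the lines with repeated headers removed: the first occurrence
     * of each header is kept, later occurrences are dropped.
     * Assumption: a header matches a whole line after trimming whitespace.
     */
    public static List<String> skipRepeatedHeaders(List<String> lines, Set<String> headers) {
        Set<String> seen = new HashSet<>();
        List<String> result = new ArrayList<>();
        for (String line : lines) {
            String key = line.trim();
            // seen.add() returns false when this header was already emitted
            if (headers.contains(key) && !seen.add(key)) {
                continue;
            }
            result.add(line);
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample data standing in for the text collected from the PDF
        List<String> lines = Arrays.asList(
                "Case Report", "first row", "Case Report", "second row");
        Set<String> headers = new HashSet<>(Arrays.asList("Case Report"));
        System.out.println(skipRepeatedHeaders(lines, headers));
    }
}
```

If the header texts are not known up front, the same loop could instead treat any line that recurs on every page as a header, but that needs per-page line lists and is not shown here.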

The content of the PDF file is:

[Image: PDF output and PDF content]

How can I extract the text in the separate format shown above?

Thanks in advance.

0 answers:

No answers