iText - Unable to read a PDF generated by PD4ML

Asked: 2015-11-27 07:17:31

Tags: itext pd4ml

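I'm having a problem reading PDF content with iText. I have tested all the different techniques. They all work with standard PDF documents, but I have one PDF document that I need to modify, and I can't get at its content.

The document was generated by PD4ML. It can be read in Acrobat Reader, but not in Open Office.

For example, this code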

  PdfReader reader = new PdfReader(src);
  // Write the raw content stream of page 1 to a file
  try (FileOutputStream out = new FileOutputStream(result)) {
      out.write(reader.getPageContent(1));
  }
  reader.close();

produces this output:

    q Q q 29.18088 102.1433 536.9282 675.0511 re W n /Cs1 cs 1 1 1 sc
    29.18088 775.5042 m 574.5602 775.5042 l 574.5602 -2599.312 l 29.18088 -2599.312 l h f Q
    q 43.26609 761.4189 m 560.475 761.4189 l 560.475 -2572.832 l 43.26609 -2572.832 l h W n
    29.18088 102.1433 536.9282 675.0511 re W n
    q 24.78997 0 0 22.53634 51.71722 733.2485 cm /Im1 Do Q
    /Cs1 cs 0.2 0.2 0.2 sc /Cs1 CS 0.2 0.2 0.2 SC 0.5 w 2 J 2 Tr
    q 0.5634084 0 0 0.5634084 29.18088 711.2756 cm
    BT 20 0 0 20 40 0 Tm /G1 1 Tf
    [<0033> 1 <004800550049> 1 <00520055005000440051004600480003> 1 <0044005100470003>

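But when I try to extract the text content, the text items are there, yet nothing is displayed. It is as if the text were in a different format.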
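One way to inspect those text items is to split the hex strings from the array in the content stream into individual codes. This is only a minimal sketch, assuming every code is two bytes wide, as is the case for Type0 fonts with /Identity-H encoding. The class and method names are made up for illustration, and no iText API is involved.

```java
import java.util.ArrayList;
import java.util.List;

class HexStringCodes {

    // Splits the body of a PDF hex string (without the < > delimiters)
    // into 2-byte codes. Assumes two bytes per code (Identity-H).
    static List<Integer> toCodes(String hex) {
        String cleaned = hex.replaceAll("\\s+", "");
        if (cleaned.length() % 4 != 0) {
            throw new IllegalArgumentException("Expected a multiple of 4 hex digits: " + hex);
        }
        List<Integer> codes = new ArrayList<>();
        for (int i = 0; i < cleaned.length(); i += 4) {
            codes.add(Integer.parseInt(cleaned.substring(i, i + 4), 16));
        }
        return codes;
    }

    public static void main(String[] args) {
        // Hex strings copied from the content stream shown above
        String[] items = {
            "0033",
            "004800550049",
            "00520055005000440051004600480003",
            "0044005100470003"
        };
        for (String item : items) {
            StringBuilder line = new StringBuilder();
            for (int code : toCodes(item)) {
                line.append(String.format("0x%04X ", code));
            }
            System.out.println(line.toString().trim());
        }
    }
}
```

With /Identity-H these codes are CIDs rather than Unicode values, so an extractor has to translate them through the font's /ToUnicode stream before any readable text comes out.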

This code:

    PdfReader reader = new PdfReader(src);
    PdfReaderContentParser parser = new PdfReaderContentParser(reader);
    PrintWriter out = new PrintWriter(new FileOutputStream(result));
    TextExtractionStrategy strategy;
    for (int i = 1; i <= reader.getNumberOfPages(); i++) {
        // Run a fresh extraction strategy over each page's content stream
        strategy = parser.processContent(i, new SimpleTextExtractionStrategy());
        out.println(strategy.getResultantText());
    }
    out.close();
    reader.close();

produces nothing but spaces. The same happens with TextLocationStrategy.

The call

    PdfContentReaderTool.listContentStream(new File(src), out);

produces:

    ==============Page 1====================
    - - - - - Dictionary - - - - - -
    (/Parent=Dictionary of type: /Pages, /Contents=Stream, /Type=/Page, /Resources=Dictionary, /MediaBox=[0, 0, 595.29, 841.89])
        Subdictionary /Parent = (/Type=/Pages, /MediaBox=[0, 0, 595.29, 841.89], /Count=6, /Kids=[2 0 R, 14 0 R, 26 0 R, 30 0 R, 34 0 R, 38 0 R])
        Subdictionary /Resources = (/XObject=Dictionary, /ProcSet=[/PDF, /Text, /ImageB, /ImageC, /ImageI], /ColorSpace=Dictionary, /Font=Dictionary)
            Subdictionary /XObject = (/Im1=Stream of type: /XObject)
            Subdictionary /ColorSpace = (/Cs1=[/ICCBased, 12 0 R])
            Subdictionary /Font = (/G2=Dictionary of type: /Font, /G1=Dictionary of type: /Font)
                Subdictionary /G2 = (/BaseFont=/HCNQGU+font000000001c036002, /DescendantFonts=[50 0 R], /Type=/Font, /Encoding=/Identity-H, /Subtype=/Type0, /ToUnicode=Stream)
                Subdictionary /G1 = (/BaseFont=/HCZCBJ+font000000001c036002, /DescendantFonts=[43 0 R], /Type=/Font, /Encoding=/Identity-H, /Subtype=/Type0, /ToUnicode=Stream)
    - - - - - XObject Summary - - - - - -
    ------ /Im1 - subtype = /Image = 9148 bytes ------
    - - - - - Content Stream - - - - - -
    q Q q 29.18088 102.1433 536.9282 675.0511 re W n /Cs1 cs 1 1 1 sc
    29.18088 775.5042 m 574.5602 775.5042 l 574.5602 -2599.312 l 29.18088 -2599.312 l h f Q
    q 43.26609 761.4189 m 560.475 761.4189 l 560.475 -2572.832 l 43.26609 -2572.832 l h W n
    29.18088 102.1433 536.9282 675.0511 re W n
    q 24.78997 0 0 22.53634 51.71722 733.2485 cm /Im1 Do Q
    /Cs1 cs 0.2 0.2 0.2 sc /Cs1 CS 0.2 0.2 0.2 SC 0.5 w 2 J 2 Tr
    q 0.5634084 0 0 0.5634084 29.18088 711.2756 cm
    BT 20 0 0 20 40 0 Tm /G1 1

But the text extraction part is empty.
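Since both fonts in the listing declare a /ToUnicode stream, one thing worth testing is whether that stream actually maps the codes used on the page. The sketch below assumes the stream has already been extracted and decompressed into a string by some other means, and that it uses bfchar sections (bfrange sections are ignored for brevity). The class name and the sample CMap fragment are invented purely to exercise the parser; they are not taken from the problem file.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ToUnicodeBfchar {

    private static final Pattern SECTION =
            Pattern.compile("beginbfchar(.*?)endbfchar", Pattern.DOTALL);
    private static final Pattern ENTRY =
            Pattern.compile("<([0-9A-Fa-f]+)>\\s*<([0-9A-Fa-f]+)>");

    // Collects code -> Unicode text pairs from all bfchar sections.
    static Map<Integer, String> parse(String cmap) {
        Map<Integer, String> map = new LinkedHashMap<>();
        Matcher section = SECTION.matcher(cmap);
        while (section.find()) {
            Matcher entry = ENTRY.matcher(section.group(1));
            while (entry.find()) {
                int code = Integer.parseInt(entry.group(1), 16);
                map.put(code, decodeUtf16(entry.group(2)));
            }
        }
        return map;
    }

    // Destination values are UTF-16BE code units written as hex digits.
    static String decodeUtf16(String hex) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i + 4 <= hex.length(); i += 4) {
            sb.append((char) Integer.parseInt(hex.substring(i, i + 4), 16));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical CMap fragment, NOT from the PD4ML document
        String sample = "2 beginbfchar\n<0001> <0041>\n<0002> <00660069>\nendbfchar\n";
        Map<Integer, String> map = parse(sample);
        // 0x0033 is the first code in the page's content stream
        System.out.println("0x0033 mapped: " + map.containsKey(0x0033));
    }
}
```

If the codes from the content stream turned out to be missing from the real map, or mapped to spaces, that would be consistent with the extraction strategies returning only spaces.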

Any idea why I can't read the text? Is there anything else I can do or test before extracting the text?

Any pointers are welcome.

Gilles

0 answers:
