Question

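I am having trouble reading the content of a PDF with iText. I have tested all the different techniques. They all work with standard PDF documents, but I have one PDF document that I need to modify, and I cannot get its content.

The document is generated by PD4ML. It can be read in Acrobat Reader but not in Open Office.

For example, using the commands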

  PdfReader reader = new PdfReader(src);
  FileOutputStream out = new FileOutputStream(result);
  out.write(reader.getPageContent(1));

produces this output:

    q Q q 29.18088 102.1433 536.9282 675.0511 re W n /Cs1 cs 1 1 1 sc
    29.18088 775.5042 m 574.5602 775.5042 l 574.5602 -2599.312 l 29.18088 -2599.312 l h f Q
    q 43.26609 761.4189 m 560.475 761.4189 l 560.475 -2572.832 l 43.26609 -2572.832 l h W n
    29.18088 102.1433 536.9282 675.0511 re W n
    q 24.78997 0 0 22.53634 51.71722 733.2485 cm /Im1 Do Q
    /Cs1 cs 0.2 0.2 0.2 sc /Cs1 CS 0.2 0.2 0.2 SC 0.5 w 2 J 2 Tr
    q 0.5634084 0 0 0.5634084 29.18088 711.2756 cm
    BT 20 0 0 20 40 0 Tm /G1 1 Tf
    [<0033> 1 <004800550049> 1 <00520055005000440051004600480003> 1 <0044005100470003>

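But when I try to get the text content, the text items are there, yet nothing is displayed. It is as if the text were in a different format.

This code: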

    PdfReader reader = new PdfReader(src);
    PdfReaderContentParser parser = new PdfReaderContentParser(reader);
    PrintWriter out = new PrintWriter(new FileOutputStream(result));
    TextExtractionStrategy strategy;
    for (int i = 1; i <= reader.getNumberOfPages(); i++) {
      strategy = parser.processContent(i, new SimpleTextExtractionStrategy());
      out.println(strategy.getResultantText());
    }

produces only spaces. The same happens with the TextLocationStrategy.

The command

    PdfContentReaderTool.listContentStream(new File(src), out);

produces

    ============== Page 1 ====================
    - - - - - Dictionary - - - - - -
    (/Parent=Dictionary of type: /Pages, /Contents=Stream, /Type=/Page, /Resources=Dictionary, /MediaBox=[0,0,595.29,841.89])
    Subdictionary /Parent = (/Type=/Pages, /MediaBox=[0,0,595.29,841.89], /Count=6, /Kids=[2 0 R, 14 0 R, 26 0 R, 30 0 R, 34 0 R, 38 0 R])
    Subdictionary /Resources = (/XObject=Dictionary, /ProcSet=[/PDF, /Text, /ImageB, /ImageC, /ImageI], /ColorSpace=Dictionary, /Font=Dictionary)
    Subdictionary /XObject = (/Im1=Stream of type: /XObject)
    Subdictionary /ColorSpace = (/Cs1=[/ICCBased, 12 0 R])
    Subdictionary /Font = (/G2=Dictionary of type: /Font, /G1=Dictionary of type: /Font)
    Subdictionary /G2 = (/BaseFont=/HCNQGU+font000000001c036002, /DescendantFonts=[50 0 R], /Type=/Font, /Encoding=/Identity-H, /Subtype=/Type0, /ToUnicode=Stream)
    Subdictionary /G1 = (/BaseFont=/HCZCBJ+font000000001c036002, /DescendantFonts=[43 0 R], /Type=/Font, /Encoding=/Identity-H, /Subtype=/Type0, /ToUnicode=Stream)
    - - - - - XObject Summary - - - - - -
    ------ /Im1 - subtype = /Image = 9148 bytes ------
    - - - - Content Stream - - - - - -
    q Q q 29.18088 102.1433 536.9282 675.0511 re W n /Cs1 cs 1 1 1 sc
    29.18088 775.5042 m 574.5602 775.5042 l 574.5602 -2599.312 l 29.18088 -2599.312 l h f Q
    q 43.26609 761.4189 m 560.475 761.4189 l 560.475 -2572.832 l 43.26609 -2572.832 l h W n
    29.18088 102.1433 536.9282 675.0511 re W n
    q 24.78997 0 0 22.53634 51.71722 733.2485 cm /Im1 Do Q
    /Cs1 cs 0.2 0.2 0.2 sc /Cs1 CS 0.2 0.2 0.2 SC 0.5 w 2 J 2 Tr
    q 0.5634084 0 0 0.5634084 29.18088 711.2756 cm
    BT 20 0 0 20 40 0 Tm /G1 1

But the text extraction section is empty.

Any idea why I cannot read the text? Is there anything else I can do or test before extracting the text?
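One check that might narrow this down, sketched under assumptions: both fonts on page 1 use /Encoding /Identity-H, so each hex string in the content stream should be a sequence of two-byte codes, and a text extractor generally needs the font's /ToUnicode stream to turn those codes into characters. The plain-Java sketch below makes no iText calls; the class name `TjCodes` and the method `split` are hypothetical names chosen for illustration. It only splits the first hex strings from page 1 into their two-byte codes, so they can be compared by hand against the entries of the /ToUnicode stream.

```java
import java.util.ArrayList;
import java.util.List;

public class TjCodes {

    /**
     * Splits the body of a PDF hex string (without the angle brackets)
     * into two-byte codes. Assumes the font uses a two-byte encoding
     * such as Identity-H and that the string has a multiple of 4 hex digits.
     */
    static List<Integer> split(String hex) {
        if (hex.length() % 4 != 0) {
            throw new IllegalArgumentException(
                "expected a multiple of 4 hex digits: " + hex);
        }
        List<Integer> codes = new ArrayList<>();
        for (int i = 0; i < hex.length(); i += 4) {
            codes.add(Integer.parseInt(hex.substring(i, i + 4), 16));
        }
        return codes;
    }

    public static void main(String[] args) {
        // Hex strings copied from the start of the text array on page 1.
        String[] strings = {
            "0033",
            "004800550049",
            "00520055005000440051004600480003",
            "0044005100470003"
        };
        for (String s : strings) {
            StringBuilder line = new StringBuilder("<" + s + "> ->");
            for (int code : split(s)) {
                line.append(String.format(" %04X", code));
            }
            System.out.println(line);
        }
    }
}
```

If the codes listed this way have no matching entries in the /ToUnicode stream of /G1, or map only to spaces there, that would be consistent with the extraction strategies returning nothing but spaces.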

Any pointers are welcome.

Gilles

iText - Not able to read PDF generated by PD4ML

0 Answers: