Question

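I have some PDF files in which every page is laid out in two columns, and I want to extract the text from them programmatically. The content of the PDF files is Chinese. I tried the pdfminer3k library for Python 3 and also Ghostscript, but neither gave good results.

In the end I used textract, an open-source project on GitHub (deanmalmgren/textract).

However, textract does not detect that each page contains two columns. I use the following code: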

import textract

# textract.process returns the extracted text as a byte string
text = textract.process("/home/name/Downloads/textract-master/test.pdf")
print(text.decode("utf-8"))

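The PDF file is available at https://pan.baidu.com/s/1nvLQnLf. The output shows that the extractor treats the two columns as a single column, so lines from the left and right columns are mixed together. How can I extract the text of a two-column PDF correctly?

This is the output of the extraction program.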
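To make the reading order I am after concrete, here is a minimal sketch in plain Python. It assumes each text fragment on a page is available together with its position; the `(x0, y0, text)` tuples, the `two_column_order` function, and the page width of 600 are all hypothetical and do not come from textract or pdfminer3k. It also assumes the columns split exactly at the middle of the page, which may not hold for every layout.

```python
# Hypothetical input: (x0, y0, text) fragments for one page, where x0 is the
# left edge of the fragment and y0 is its distance from the top of the page.
def two_column_order(fragments, page_width):
    """Return fragment texts in reading order: left column first, then right."""
    middle = page_width / 2
    left = [f for f in fragments if f[0] < middle]
    right = [f for f in fragments if f[0] >= middle]
    # Within each column, read from top to bottom.
    ordered = sorted(left, key=lambda f: f[1]) + sorted(right, key=lambda f: f[1])
    return [text for _, _, text in ordered]


# Made-up fragments listed row by row, which is how the output currently comes out.
fragments = [
    (50, 100, "左一"),
    (320, 100, "右一"),
    (50, 130, "左二"),
    (320, 130, "右二"),
]
print(two_column_order(fragments, page_width=600))
```

With the sample fragments above, the whole left column is read before the right column instead of the rows being interleaved.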