How do I get the page count from the Tika server?

Time: 2014-10-30 08:40:40

Tags: apache-tika

I want to find out the page count of a .doc file using the Tika server. I start the Tika server:

java -jar tika-server-1.6.jar

and fetch the metadata with curl:

curl -X PUT -T /tmp/test.doc http://localhost:9998/meta

The output is:

"Revision-Number","0"
"Last-Printed","1601-01-01T00:00:00Z"
"cp:revision","0"
"meta:print-date","1601-01-01T00:00:00Z"
"meta:creation-date","2014-10-30T06:04:11Z"
"dcterms:modified","1601-01-01T00:00:00Z"
"meta:save-date","1601-01-01T00:00:00Z"
"dc:creator","ndemir "
"Last-Modified","1601-01-01T00:00:00Z"
"Author","ndemir "
"dcterms:created","2014-10-30T06:04:11Z"
"date","1601-01-01T00:00:00Z"
"X-Parsed-By","org.apache.tika.parser.ParserDecorator$1","org.apache.tika.parser.microsoft.OfficeParser"
"modified","1601-01-01T00:00:00Z"
"creator","ndemir "
"Creation-Date","2014-10-30T06:04:11Z"
"meta:author","ndemir "
"Content-Type","application/msword"
"Last-Save-Date","1601-01-01T00:00:00Z"

As you can see, there is no page count in the output. How can I get the page count from the Tika server?

1 answer:

Answer 0 (score: 1)

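Tika only reports the page count if it is stored in the file. Most Microsoft Office documents contain it, but some do not. For those, you need to open them in Office, tell Office to recalculate the statistics / page count, and then save. Once the value is present in the file, Tika will be able to find it.

If we try one of the test Word documents that ship with Tika, we see it: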

$ curl -q -X PUT -T tika-parsers/src/test/resources/test-documents/testWORD.doc http://localhost:9998/meta | grep xmpTPg:NPages
"xmpTPg:NPages","2"

For the page count, you want xmpTPg:NPages, which is based on the XMP Paged-Text schema.
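
If you want to script this, here is a minimal sketch of a shell helper that pulls the value out of the /meta output and signals when the field is missing. It assumes the quoted "key","value" format shown in the question; the function name page_count is made up for illustration and is not part of Tika.

```shell
#!/bin/sh
# page_count (hypothetical helper): read Tika /meta output on stdin
# and print the value of the xmpTPg:NPages field.
# Assumes each line looks like "key","value" as in the output above.
# Returns non-zero when the field is absent from the metadata.
page_count() {
    awk -F'","' '
        $1 == "\"xmpTPg:NPages" {
            gsub(/["\r]/, "", $2)   # drop the closing quote and any CR
            print $2
            found = 1
        }
        END { exit found ? 0 : 1 }
    '
}
```

It could then be used along these lines, with the fallback message firing for files like the one in the question:

    curl -s -X PUT -T /tmp/test.doc http://localhost:9998/meta | page_count || echo "no page count stored in this file" >&2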