Editing PDF metadata fields with Python3 and pdfrw

Asked: 2019-05-21 07:39:06

Tags: python-3.x pdf metadata pdfrw

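I am trying to edit the Title metadata field of a PDF so that it contains ASCII equivalents wherever possible. I am using Python3 and the pdfrw module.

How can I perform string operations that replace the metadata field?

My test code is here: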

from pdfrw import PdfReader, PdfWriter, PdfString
import unicodedata

def edit_title_metadata(inpdf):

    trailer = PdfReader(inpdf)

    # this statement is breaking pdfrw
    trailer.Info.Title = unicode_normalize(trailer.Info.Title)

    # also have tried:
    #trailer.Info.Title = PdfString(unicode_normalize(trailer.Info.Title))

    PdfWriter("test.pdf", trailer=trailer).write()
    return

def unicode_normalize(s):
    return unicodedata.normalize('NFKD', s).encode('ascii', 'ignore')

if __name__ == "__main__":

    edit_title_metadata('Anadon-2011-Scientific Opinion on the safety e.pdf')

The traceback is:

Traceback (most recent call last):
  File "get_metadata.py", line 68, in <module>
    main()
  File "get_metadata.py", line 54, in main
    edit_title_metadata(pdf)
  File "get_metadata.py", line 11, in edit_title_metadata
    trailer.Info.Title = PdfString(unicode_normalize(trailer.Info.Title))
  File "get_metadata.py", line 18, in unicode_normalize
    return unicodedata.normalize('NFKD', s).encode('ascii', 'ignore')
  File "/path_to_python/python3.7/site-packages/pdfrw/objects/pdfstring.py", line 550, in encode
    if isinstance(source, uni_type):
TypeError: isinstance() arg 2 must be a type or tuple of types

Notes:

  • This issue on GitHub may be related.

  • FWIW, Python3.6 gives the same error.

  • I have shared the PDF file (it contains a non-ASCII hyphen, the Unicode character \u2010):

 wget https://gist.github.com/philshem/71507d4e8ecfabad252fbdf4d9f8bdd2/raw/cce346ab39dd6ecb3a718ad3f92c9f546761e87b/Anadon-2011-Scientific%2520Opinion%2520on%2520the%2520safety%2520e.pdf

1 answer:

Answer 0: (score: 0)

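You must call the .decode() method on the metadata field first. trailer.Info.Title is a pdfrw string object rather than a plain str, and, as the traceback shows, calling .encode('ascii', 'ignore') on it ends up in pdfrw's own encode() in pdfstring.py instead of str.encode(). Decoding it first gives an ordinary Python string: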

trailer.Info.Title = unicode_normalize(trailer.Info.Title.decode())
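Why the extra .decode() helps can be illustrated without pdfrw. The sketch below uses a hypothetical str subclass, ShadowingString, that redefines encode() with a different signature. It is only a stand-in for the behaviour the traceback suggests, not pdfrw's actual implementation.

```python
class ShadowingString(str):
    """Hypothetical str subclass whose encode() is not str.encode()."""

    @classmethod
    def encode(cls, source, flag=False):
        # The caller's codec name arrives here as 'source'.
        return ("custom encode", source, flag)


title = ShadowingString("Scientific Opinion")

# Calling encode() on the subclass instance reaches the override.
via_subclass = title.encode("ascii", "ignore")

# Converting to a plain str first restores the built-in behaviour.
via_plain = str(title).encode("ascii", "ignore")
```

In the answer's code, .decode() plays a comparable role to str(title) here: it hands unicode_normalize a plain string, so the later .encode() call is the built-in one.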

And the complete working code:

from pdfrw import PdfReader, PdfWriter
import unicodedata

def edit_title_metadata(inpdf):

    trailer = PdfReader(inpdf)
    # .decode() turns the pdfrw string object into a plain Python str
    trailer.Info.Title = unicode_normalize(trailer.Info.Title.decode())
    PdfWriter("test.pdf", trailer=trailer).write()
    return

def unicode_normalize(s):
    # NFKD-decompose, then drop anything that is not ASCII (returns bytes in Python 3)
    return unicodedata.normalize('NFKD', s).encode('ascii', 'ignore')

if __name__ == "__main__":

    edit_title_metadata('Anadon-2011-Scientific Opinion on the safety e.pdf')
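
One caveat on the normalization step: U+2010, the hyphen in the shared file, has no compatibility decomposition, so NFKD followed by encode('ascii', 'ignore') deletes it rather than turning it into "-". A minimal standard-library sketch that maps a few dash characters explicitly and returns a str instead of bytes might look like this. The names DASH_MAP and to_ascii are hypothetical, and the set of mapped characters is an assumption, not an exhaustive list.

```python
import unicodedata

# Hypothetical mapping: dash-like characters that NFKD does not reduce to "-".
DASH_MAP = str.maketrans({
    "\u2010": "-",  # hyphen
    "\u2011": "-",  # non-breaking hyphen
    "\u2012": "-",  # figure dash
    "\u2013": "-",  # en dash
    "\u2014": "-",  # em dash
    "\u2212": "-",  # minus sign
})


def to_ascii(text):
    """Return a best-effort ASCII str version of text."""
    # Replace the dashes first; otherwise the 'ignore' step deletes them.
    text = text.translate(DASH_MAP)
    # NFKD splits accented letters and ligatures into base characters plus marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop whatever is still non-ASCII, then go back from bytes to str.
    return decomposed.encode("ascii", "ignore").decode("ascii")
```

With such a helper, the assignment could be written as trailer.Info.Title = to_ascii(trailer.Info.Title.decode()), assuming pdfrw accepts a plain str for the field.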