How can I extract the TOC with PyPDF2?

Asked: 2018-01-08 19:53:42

Tags: pdf pypdf2

Take this PDF as an example. I can extract the table of contents (TOC) with dumppdf.py -T 1707.09725.pdf:
<outlines>
    <outline level="1" title="1 Introduction">
        <dest>
            <list size="5">
                <ref id="513"/>
                <literal>XYZ</literal>
                <number>99.213</number>
                <number>742.911</number>
                <null/>
            </list>
        </dest>
        <pageno>14</pageno>
    </outline>
    <outline level="1" title="2 Convolutional Neural Networks">
        <dest>
            <list size="5">
                <ref id="554"/>
                <literal>XYZ</literal>
                <number>99.213</number>
                <number>742.911</number>
                <null/>
            </list>
        </dest>
        <pageno>16</pageno>
    </outline>
...

Can I do something similar with PyPDF2?

2 answers:

Answer 0 (score: 1)

Found it:

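from PyPDF2 import PdfFileReader

# Note: PyPDF2 3.x (and its successor pypdf) renamed these to
# PdfReader and reader.outline.
with open("1707.09725.pdf", "rb") as f:
    reader = PdfFileReader(f)
    # Nested list: entries are dict-like, sub-lists hold child entries
    print(reader.outlines)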

This gives:

[{'/Title': '1 Introduction', '/Left': 99.213, '/Type': '/XYZ', '/Top': 742.911, '/Zoom': ..., '/Page': IndirectObject(513, 0)},
 {'/Title': '2 Convolutional Neural Networks', '/Left': 99.213, '/Type': '/XYZ', '/Top': 742.911, '/Zoom': ..., '/Page': IndirectObject(554, 0)}, [{'/Title': '2.1 Linear Image Filters', '/Left': 99.213, '/Type': '/XYZ', '/Top': 486.791, '/Zoom': ..., '/Page': IndirectObject(554, 0)},
 {'/Title': '2.2 CNN Layer Types', '/Left': 70.866, '/Type': '/XYZ', '/Top': 316.852, '/Zoom': ..., '/Page': IndirectObject(580, 0)},
[{'/Title': '2.2.1 Convolutional Layers', '/Left': 99.213, '/Type': '/XYZ', '/Top': 562.722, '/Zoom': ..., '/Page': IndirectObject(608, 0)},
 {'/Title': '2.2.2 Pooling Layers', '/Left': 99.213, '/Type': '/XYZ', '/Top': 299.817, '/Zoom': ..., '/Page': IndirectObject(654, 0)},
 {'/Title': '2.2.3 Dropout', '/Left': 99.213, '/Type': '/XYZ', '/Top': 742.911, '/Zoom': ..., '/Page': IndirectObject(689, 0)},
 {'/Title': '2.2.4 Normalization Layers', '/Left': 99.213, '/Type': '/XYZ', '/Top': 193.779, '/Zoom': <PyPDF2.generic.NullObject object at 0x7fbe49d14350>, '/Page': IndirectObject(689, 0)}]
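
The result is a nested list: each entry is dict-like, and a sub-list holds the children of the entry before it. Here is a minimal sketch of flattening that structure into (level, title) pairs. The helper name flatten_outline and the sample data are my own, not part of PyPDF2; the sample uses plain dicts as stand-ins, on the assumption that real entries can be indexed with "/Title" the same way, as the output above suggests.

```python
def flatten_outline(outlines, level=1):
    """Yield (level, title) pairs from a nested outline list."""
    for item in outlines:
        if isinstance(item, list):
            # A nested list holds the children of the preceding entry.
            yield from flatten_outline(item, level + 1)
        else:
            yield level, item["/Title"]


# Stand-in data shaped like the output above (plain dicts, not real
# PyPDF2 objects), so the sketch runs without a PDF file.
sample = [
    {"/Title": "1 Introduction"},
    {"/Title": "2 Convolutional Neural Networks"},
    [
        {"/Title": "2.1 Linear Image Filters"},
        {"/Title": "2.2 CNN Layer Types"},
        [{"/Title": "2.2.1 Convolutional Layers"}],
    ],
]

for level, title in flatten_outline(sample):
    print("    " * (level - 1) + title)
```

With a real file, you would pass reader.outlines instead of sample.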

Answer 1 (score: 1)

Alternatively, as suggested in this answer, you can use pikepdf:

from pikepdf import Pdf

path = "path/to/file.pdf"

with Pdf.open(path) as pdf:
    outline = pdf.open_outline()
    for title in outline.root:
        print(title)
        for subtitle in title.children:
            print('\t', subtitle)
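
The loop above only reaches two levels, while the outline in the question goes at least three deep (for example 2.2.1). A recursive walk handles any depth. This is a rough sketch: walk_outline is a hypothetical helper, and the demo objects are stand-ins built with SimpleNamespace so it runs without a PDF. It relies only on each item having title and children attributes, which is what the pikepdf snippet above already uses.

```python
from types import SimpleNamespace


def walk_outline(items, depth=0):
    """Yield (depth, title) for every outline item, at any nesting depth."""
    for item in items:
        yield depth, item.title
        yield from walk_outline(item.children, depth + 1)


# Stand-in objects (not pikepdf types) mimicking .title and .children.
demo = [
    SimpleNamespace(title="2 Convolutional Neural Networks", children=[
        SimpleNamespace(title="2.2 CNN Layer Types", children=[
            SimpleNamespace(title="2.2.1 Convolutional Layers", children=[]),
        ]),
    ]),
]

for depth, title in walk_outline(demo):
    print("\t" * depth + title)
```

With pikepdf, you would call walk_outline(outline.root) inside the with block instead of passing demo.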