
Time: 2015-01-09 23:57:51

Tags: c# .net pdf itextsharp itext



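I have a few hundred PDF documents that contain this "Introduction" page, usually on the fifth or sixth page of the document. The paragraph always starts with a drop cap, such as the large P in "Physical" in the example.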


private static string GetIntroductionText( string filePath )
{
    using ( var reader = new PdfReader( filePath ) )
    {
        var appending = false;
        var introText = new StringBuilder();

        for ( var i = 1; i <= reader.NumberOfPages; i++ )
        {
            var pageText = PdfTextExtractor.GetTextFromPage( reader, i );

            if ( pageText.Trim().StartsWith( "Introduction" ) )
            {
                appending = true;
            }

            if ( pageText.Trim().StartsWith( "Chapter" ) )
            {
                // the first chapter page ends the introduction
                break;
            }

            if ( appending )
            {
                introText.Append( pageText );
            }
        }

        return introText.ToString();
    }
}


hysical reality is consistent with universal laws. Where the laws do not operate, there is no reality. All of this...is unreal.



var pageText = PdfTextExtractor.GetTextFromPage( reader, i, new LocationTextExtractionStrategy() );


1 answer:

Answer 0 (score: 0)

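For the record, this is how I solved the problem after looking at the iText source code (specifically the LocationTextExtractionStrategy class). Keep in mind that the (0,0) coordinate is at the bottom left of the page, not the top left.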
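To illustrate the coordinate note: because y grows upward from the bottom-left origin, reading order means sorting by y descending and then by x ascending. The following is only a minimal, self-contained sketch of that ordering. The `Piece` class, the `InReadingOrder` method, and the coordinates are hypothetical stand-ins made up for illustration; they use plain floats instead of iText's `Vector`.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ReadingOrderDemo
{
    // Hypothetical stand-in for a text chunk: its top-left corner in PDF
    // user space (origin at the bottom left of the page) plus its text.
    public sealed class Piece
    {
        public float X { get; set; }
        public float Y { get; set; }
        public string Text { get; set; }
    }

    // Sorts pieces into reading order: larger y first (top of the page),
    // then smaller x first (left to right). Coordinates are rounded to
    // whole points before comparing.
    public static List<Piece> InReadingOrder( IEnumerable<Piece> pieces )
    {
        return pieces
            .OrderByDescending( p => (int)Math.Round( p.Y ) )
            .ThenBy( p => (int)Math.Round( p.X ) )
            .ToList();
    }

    public static void Main()
    {
        // Illustrative coordinates only, not taken from a real document.
        var pieces = new List<Piece>
        {
            new Piece { X = 72f,  Y = 650f, Text = "second line" },
            new Piece { X = 100f, Y = 690f, Text = "hysical reality" },
            new Piece { X = 72f,  Y = 700f, Text = "P" }, // drop cap: its top sits higher
        };

        Console.WriteLine( string.Join( " | ", InReadingOrder( pieces ).Select( p => p.Text ) ) );
    }
}
```

With these made-up coordinates the drop cap sorts ahead of the rest of its line, because the top of the tall glyph has the largest y value.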

using System;
using System.Collections.Generic;
using System.Text;
using iTextSharp.text.pdf.parser;

public class ChunkExtractionStrategy : ITextExtractionStrategy
{
    public List<Chunk> Chunks = new List<Chunk>();

    public void BeginTextBlock()
    {
    }

    public void EndTextBlock()
    {
    }

    public string GetResultantText()
    {
        var text = new StringBuilder();

        // order chunks top to bottom, then left to right (see Chunk.CompareTo)
        Chunks.Sort();

        Chunk prevChunk = null;

        foreach ( var chunk in Chunks )
        {
            if ( prevChunk == null && string.IsNullOrWhiteSpace( chunk.Text ) )
            {
                // blank space at beginning of page
                continue;
            }

            if ( prevChunk != null && !chunk.SameLine( prevChunk, 20 ) )
            {
                text.Append( "\n\n" );
            }

            text.Append( chunk.Text );

            prevChunk = chunk;
        }

        return text.ToString();
    }

    public void RenderImage( ImageRenderInfo renderInfo )
    {
    }

    public void RenderText( TextRenderInfo renderInfo )
    {
        Chunks.Add( new Chunk
                        {
                            TopLeft = renderInfo.GetAscentLine().GetStartPoint(),
                            BottomRight = renderInfo.GetDescentLine().GetEndPoint(),
                            Text = renderInfo.GetText(),
                        } );
    }

    public class Chunk : IComparable<Chunk>
    {
        public Vector TopLeft { get; set; }

        public Vector BottomRight { get; set; }

        public string Text { get; set; }

        public int CompareTo( Chunk other )
        {
            // larger y means higher on the page, so it sorts first
            var y1 = (int)Math.Round( TopLeft[1] );
            var y2 = (int)Math.Round( other.TopLeft[1] );

            if ( y1 < y2 )
            {
                return 1;
            }

            if ( y1 > y2 )
            {
                return -1;
            }

            // same rounded y: order left to right
            var x1 = (int)Math.Round( TopLeft[0] );
            var x2 = (int)Math.Round( other.TopLeft[0] );

            if ( x1 < x2 )
            {
                return -1;
            }

            if ( x1 > x2 )
            {
                return 1;
            }

            return 0;
        }

        public bool SameLine( Chunk other, int maxDiff = 0 )
        {
            // chunks whose tops differ by at most maxDiff count as the same line
            var diff = Math.Abs( TopLeft[1] - other.TopLeft[1] );

            return diff <= maxDiff;
        }
    }
}

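At first I tried something similar to this answer. But then I found myself overriding everything in the class, so it made more sense to create a new implementation.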
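A hedged usage sketch, assuming iTextSharp 5.x and the `ChunkExtractionStrategy` class from this answer: the strategy can be passed to the `PdfTextExtractor.GetTextFromPage` overload that accepts an `ITextExtractionStrategy`, the same overload the question uses with `LocationTextExtractionStrategy`. The `ChunkExtractionUsage` class and `GetPageText` method are hypothetical names introduced only for illustration.

```csharp
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public static class ChunkExtractionUsage
{
    // Hypothetical helper: extracts the text of one page (1-based page number)
    // using the custom strategy from the answer above.
    public static string GetPageText( string filePath, int pageNumber )
    {
        using ( var reader = new PdfReader( filePath ) )
        {
            // Create a fresh strategy for each page: its Chunks list is
            // never cleared, so reusing one instance would mix pages together.
            var strategy = new ChunkExtractionStrategy();

            return PdfTextExtractor.GetTextFromPage( reader, pageNumber, strategy );
        }
    }
}
```

Creating one strategy instance per page matters here because the strategy keeps every rendered chunk in its `Chunks` list and `GetResultantText` builds its output from the whole list.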
