Question

I'm trying to extract text from a PDF document using iTextSharp. The text I'm interested in appears under the "Introduction" heading in the example below:

[Image: sample page with the "Introduction" heading and a paragraph that begins with a drop cap]

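I have a few hundred PDF documents that contain this "Introduction" page, usually as the fifth or sixth page of the document. The paragraph always begins with a drop cap, such as the large P in "Physical" in the example.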

In the code below, I scan the document for a page that starts with the text "Introduction". I then extract text until the next heading ("Chapter 1"):

private static string GetIntroductionText( string filePath )
{
    using ( var reader = new PdfReader( filePath ) )
    {
        var appending = false;
        var introText = new StringBuilder();

        for ( var i = 1; i <= reader.NumberOfPages; i++ )
        {
            var pageText = PdfTextExtractor.GetTextFromPage( reader, i );

            if ( pageText.Trim().StartsWith( "Introduction" ) )
            {
                appending = true;
            }

            if ( pageText.Trim().StartsWith( "Chapter" ) )
            {
                break;
            }

            if ( appending )
            {
                introText.Append( pageText );
            }
        }

        return introText.ToString();
    }
}

The problem is that it doesn't extract the drop cap, i.e. the P in "Physical". So the text comes out as:

hysical reality is consistent with universal laws. Where the laws do not operate, there is no reality. All of this...is unreal.

How can I get the drop cap at the beginning of the text?

I thought it might involve using LocationTextExtractionStrategy, so:

var pageText = PdfTextExtractor.GetTextFromPage( reader, i, new LocationTextExtractionStrategy() );

Unfortunately, that produced the same result.

Answer 1

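For the record, this is how I solved the problem after looking at the iText source code (specifically the LocationTextExtractionStrategy class). Keep in mind that the (0,0) coordinate is at the bottom-left corner of the page, not the top-left.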

public class ChunkExtractionStrategy : ITextExtractionStrategy
{
    public List<Chunk> Chunks = new List<Chunk>();

    public void BeginTextBlock()
    {}

    public void EndTextBlock()
    {}

    public string GetResultantText()
    {
        var text = new StringBuilder();

        Chunks.Sort();

        Chunk prevChunk = null;

        foreach ( var chunk in Chunks )
        {
            if ( prevChunk == null && string.IsNullOrWhiteSpace( chunk.Text ) )
            {
                // blank space at beginning of page
                continue;
            }

            if ( prevChunk != null && !chunk.SameLine( prevChunk, 20 ) )
            {
                text.Append( "\n\n" );
            }

            text.Append( chunk.Text );

            prevChunk = chunk;
        }

        return text.ToString();
    }

    public void RenderImage( ImageRenderInfo renderInfo )
    {}

    public void RenderText( TextRenderInfo renderInfo )
    {
        Chunks.Add( new Chunk
                        {
                            TopLeft = renderInfo.GetAscentLine().GetStartPoint(),
                            BottomRight = renderInfo.GetDescentLine().GetEndPoint(),
                            Text = renderInfo.GetText(),
                        } );
    }

    public class Chunk : IComparable<Chunk>
    {
        public Vector TopLeft { get; set; }

        public Vector BottomRight { get; set; }

        public string Text { get; set; }

        public int CompareTo( Chunk other )
        {
            var y1 = (int)Math.Round( TopLeft[1] );
            var y2 = (int)Math.Round( other.TopLeft[1] );

            if ( y1 < y2 )
            {
                return 1;
            }

            if ( y1 > y2 )
            {
                return -1;
            }

            var x1 = (int)Math.Round( TopLeft[0] );
            var x2 = (int)Math.Round( other.TopLeft[0] );

            if ( x1 < x2 )
            {
                return -1;
            }

            if ( x1 > x2 )
            {
                return 1;
            }

            return 0;
        }

        public bool SameLine( Chunk other, int maxDiff = 0 )
        {
            var diff = Math.Abs( TopLeft[1] - other.TopLeft[1] );

            return diff <= maxDiff;
        }
    }
}
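The ordering that CompareTo and SameLine implement above can be tried without a PDF at hand. The following is only a minimal sketch: DemoChunk, ChunkOrderingDemo and the coordinates are hypothetical stand-ins that use plain floats instead of iTextSharp's Vector, and the check for blank chunks at the start of the page is left out for brevity.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical stand-in for the answer's Chunk (floats instead of Vector).
public sealed class DemoChunk : IComparable<DemoChunk>
{
    public float Left { get; set; }
    public float Top { get; set; }
    public string Text { get; set; }

    // PDF y grows upward, so a larger Top means higher on the page.
    // Sort top to bottom first, then left to right.
    public int CompareTo( DemoChunk other )
    {
        var byTop = Math.Round( other.Top ).CompareTo( Math.Round( Top ) );

        return byTop != 0
            ? byTop
            : Math.Round( Left ).CompareTo( Math.Round( other.Left ) );
    }
}

public static class ChunkOrderingDemo
{
    // Sorts the chunks and joins them, inserting a blank line whenever
    // the top edge moves by more than maxDiff units.
    public static string Join( List<DemoChunk> chunks, float maxDiff )
    {
        chunks.Sort();

        var text = new StringBuilder();
        DemoChunk prev = null;

        foreach ( var chunk in chunks )
        {
            if ( prev != null && Math.Abs( chunk.Top - prev.Top ) > maxDiff )
            {
                text.Append( "\n\n" );
            }

            text.Append( chunk.Text );
            prev = chunk;
        }

        return text.ToString();
    }

    public static void Main()
    {
        // Made-up coordinates: the drop cap's top edge sits slightly
        // above the top edge of the first line of body text.
        var chunks = new List<DemoChunk>
        {
            new DemoChunk { Left = 95, Top = 600, Text = "hysical reality" },
            new DemoChunk { Left = 72, Top = 606, Text = "P" },
            new DemoChunk { Left = 72, Top = 700, Text = "Introduction" },
        };

        // Prints "Introduction", a blank line, then "Physical reality".
        Console.WriteLine( ChunkOrderingDemo.Join( chunks, 20 ) );
    }
}
```

With the real class, the strategy would presumably be passed to the same three-argument overload used in the question: PdfTextExtractor.GetTextFromPage( reader, i, new ChunkExtractionStrategy() ). Note that the sort compares rounded top edges exactly, while the tolerance of 20 only decides where blank lines go, so a drop cap whose top edge is below that of the first text line would still sort after that line.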

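At first I tried something similar to this answer. But then I found myself overriding everything in the class, so it made more sense to create a new implementation.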
