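I'm trying to extract text from PDF documents using iTextSharp. The text I'm interested in appears under the "Introduction" heading in the example below:
I have a few hundred PDF documents that contain this "Introduction" page, usually as the fifth or sixth page of the document. The paragraph always begins with a drop cap, such as the large P in "Physical" in the example.
In the code below, I scan the document for a page that starts with the text "Introduction". Then I extract text until the next heading ("Chapter 1"):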
    private static string GetIntroductionText( string filePath )
    {
        using ( var reader = new PdfReader( filePath ) )
        {
            var appending = false;
            var introText = new StringBuilder();
            for ( var i = 1; i <= reader.NumberOfPages; i++ )
            {
                var pageText = PdfTextExtractor.GetTextFromPage( reader, i );
                if ( pageText.Trim().StartsWith( "Introduction" ) )
                {
                    appending = true;
                }
                if ( pageText.Trim().StartsWith( "Chapter" ) )
                {
                    break;
                }
                if ( appending )
                {
                    introText.Append( pageText );
                }
            }
            return introText.ToString();
        }
    }
The problem is that it does not extract the drop cap, i.e. the P in "Physical". So the text comes out as:
hysical reality is consistent with universal laws. Where the laws do not operate, there is no reality. All of this...is unreal.
How can I get the drop cap at the beginning of the text?
I thought it might involve using LocationTextExtractionStrategy, so:
    var pageText = PdfTextExtractor.GetTextFromPage( reader, i, new LocationTextExtractionStrategy() );
Unfortunately, this produced the same result.
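Answer 0 (score: 0)
For the record, this is how I solved the problem after looking at the iText source code (in particular the LocationTextExtractionStrategy class). Keep in mind that the (0,0) coordinate is at the bottom-left corner of the page, not the top-left.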
    using System;
    using System.Collections.Generic;
    using System.Text;
    using iTextSharp.text.pdf.parser;

    public class ChunkExtractionStrategy : ITextExtractionStrategy
    {
        // Every piece of text the parser reports for the current page.
        public List<Chunk> Chunks = new List<Chunk>();

        public void BeginTextBlock()
        {}

        public void EndTextBlock()
        {}

        public string GetResultantText()
        {
            var text = new StringBuilder();
            // Put the chunks in reading order (see Chunk.CompareTo).
            Chunks.Sort();
            Chunk prevChunk = null;
            foreach ( var chunk in Chunks )
            {
                if ( prevChunk == null && string.IsNullOrWhiteSpace( chunk.Text ) )
                {
                    // blank space at beginning of page
                    continue;
                }
                // More than 20 units of vertical distance starts a new block.
                if ( prevChunk != null && !chunk.SameLine( prevChunk, 20 ) )
                {
                    text.Append( "\n\n" );
                }
                text.Append( chunk.Text );
                prevChunk = chunk;
            }
            return text.ToString();
        }

        public void RenderImage( ImageRenderInfo renderInfo )
        {}

        public void RenderText( TextRenderInfo renderInfo )
        {
            // Record the bounding corners from the ascent and descent lines.
            Chunks.Add( new Chunk
            {
                TopLeft = renderInfo.GetAscentLine().GetStartPoint(),
                BottomRight = renderInfo.GetDescentLine().GetEndPoint(),
                Text = renderInfo.GetText(),
            } );
        }

        public class Chunk : IComparable<Chunk>
        {
            public Vector TopLeft { get; set; }
            public Vector BottomRight { get; set; }
            public string Text { get; set; }

            public int CompareTo( Chunk other )
            {
                // y grows upwards, so a larger y sorts first (top of the page).
                var y1 = (int)Math.Round( TopLeft[1] );
                var y2 = (int)Math.Round( other.TopLeft[1] );
                if ( y1 < y2 )
                {
                    return 1;
                }
                if ( y1 > y2 )
                {
                    return -1;
                }
                // Same rounded y: order left to right.
                var x1 = (int)Math.Round( TopLeft[0] );
                var x2 = (int)Math.Round( other.TopLeft[0] );
                if ( x1 < x2 )
                {
                    return -1;
                }
                if ( x1 > x2 )
                {
                    return 1;
                }
                return 0;
            }

            public bool SameLine( Chunk other, int maxDiff = 0 )
            {
                var diff = Math.Abs( TopLeft[1] - other.TopLeft[1] );
                return diff <= maxDiff;
            }
        }
    }
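At first, I tried something similar to this answer. But then I found myself overriding everything in the class, so creating a new implementation made more sense.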
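For anyone who wants to try the ordering idea without a PDF at hand, here is a minimal sketch of the same sort-and-join logic on plain coordinates. It assumes nothing about iTextSharp: `ReadingOrderDemo`, `Piece`, `Join` and `Demo` are hypothetical names made up for illustration, and the coordinates are invented. Pieces are ordered top-down (larger y first, since y grows upwards from the bottom-left corner) and then left to right, and a blank line is inserted when the vertical gap exceeds a tolerance.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

public static class ReadingOrderDemo
{
    // Hypothetical stand-in for a text chunk: left edge, top edge, text.
    public sealed class Piece
    {
        public float Left;
        public float Top;
        public string Text;

        public Piece( float left, float top, string text )
        {
            Left = left;
            Top = top;
            Text = text;
        }
    }

    // Sorts pieces top-down, then left-to-right, and joins their text.
    // A vertical gap larger than maxDiff starts a new block.
    public static string Join( IEnumerable<Piece> pieces, float maxDiff )
    {
        var ordered = pieces
            .OrderByDescending( p => Math.Round( p.Top ) )
            .ThenBy( p => Math.Round( p.Left ) );

        var text = new StringBuilder();
        Piece prev = null;
        foreach ( var piece in ordered )
        {
            if ( prev != null && Math.Abs( piece.Top - prev.Top ) > maxDiff )
            {
                text.Append( "\n\n" );
            }
            text.Append( piece.Text );
            prev = piece;
        }
        return text.ToString();
    }

    // Invented coordinates: a heading, a drop cap, and the rest of the line,
    // deliberately listed out of reading order.
    public static string Demo()
    {
        var pieces = new List<Piece>
        {
            new Piece( 80f, 698f, "hysical reality is consistent with universal laws." ),
            new Piece( 50f, 700f, "P" ),
            new Piece( 50f, 760f, "Introduction" ),
        };
        return Join( pieces, 20f );
    }
}
```

With these made-up coordinates, `Demo()` returns "Introduction", a blank line, and then "Physical reality is consistent with universal laws.", because the drop cap's top edge sits slightly above the rest of the line and so sorts first. To use the real strategy, it should be possible to pass it as the third argument to `PdfTextExtractor.GetTextFromPage`, the same way the question passes `LocationTextExtractionStrategy`. Since `Chunks` keeps accumulating, creating a fresh `ChunkExtractionStrategy` for each page is probably the safest choice.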