How to ignore hidden text during PDF text extraction

Asked: 2015-09-23 20:43:28

Tags: c# pdf itextsharp

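The linked PDF contains text that is hidden in some way, but when I try to extract a specific region I get all of the text in it, visible and invisible alike. I only want the visible text. Any suggestions?

I am using iTextSharp v5.5.7.

Code sample: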

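using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

static void Main(string[] args)
{
    string pdffn = args[0];
    // Hard-coded coordinates specific to the linked PDF.
    // The y values are flipped from a top-left origin to PDF's bottom-left
    // origin; 1512 is presumably the page height.
    float llx = 664.0f;
    float lly = 1512.0f - 1472.0f;
    float urx = 890.0f;
    float ury = 1512.0f - 1277.0f;
    iTextSharp.text.Rectangle r = new iTextSharp.text.Rectangle(llx, lly, urx, ury);
    string TextInRect = GetParagraphByRectangle(pdffn, 1, r);
}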

/// <summary>
/// Extracts the text inside a given rectangle of a PDF page. Could be handy to extract text in an event.
/// The rectangle is expected to be given as llx, lly, urx, ury.
/// </summary>
/// <param name="pdffn">Path to the PDF file.</param>
/// <param name="pageno">1-based page number.</param>
/// <param name="rect">Region to extract text from.</param>
/// <returns>The text found inside the rectangle.</returns>
public static string GetParagraphByRectangle(string pdffn, int pageno, iTextSharp.text.Rectangle rect)
{
    // Static so that it can be called from the static Main method.
    PdfReader reader = new PdfReader(pdffn);
    try
    {
        // Only text chunks that fall inside rect reach the extraction strategy.
        RenderFilter[] renderFilter = new RenderFilter[1];
        renderFilter[0] = new RegionTextRenderFilter(rect);
        ITextExtractionStrategy textExtractionStrategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), renderFilter);
        return PdfTextExtractor.GetTextFromPage(reader, pageno, textExtractionStrategy);
    }
    finally
    {
        reader.Close();
    }
}

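I have stepped through the code and inspected the TextRenderInfo object each time RenderText() is called, but I could not find anything that indicates whether the text is masked in some way that makes it invisible.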
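One mechanism that can be checked from TextRenderInfo is the text rendering mode: in PDF, mode 3 means the glyphs are neither filled nor stroked, so the text is invisible. The following is only a minimal sketch under that assumption. VisibleRenderModeFilter is a hypothetical helper name, not an iTextSharp class, and it cannot detect text hidden by other means such as clipping, being painted over, or being drawn in the background color, any of which may be what the linked PDF does.

```csharp
using iTextSharp.text.pdf.parser;

// Hypothetical helper (not part of iTextSharp): rejects text chunks drawn
// with text rendering mode 3, which PDF defines as "neither fill nor stroke".
public class VisibleRenderModeFilter : RenderFilter
{
    private const int InvisibleRenderMode = 3;

    public override bool AllowText(TextRenderInfo renderInfo)
    {
        // Let the chunk through unless its render mode marks it invisible.
        return renderInfo.GetTextRenderMode() != InvisibleRenderMode;
    }
}
```

It could be combined with the region filter from the code sample by passing both to the listener, for example `new RenderFilter[] { new RegionTextRenderFilter(rect), new VisibleRenderModeFilter() }`. If the extracted text still contains the hidden parts, the PDF is hiding them some other way.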

0 answers:

No answers