Reading image-based PDFs from scanned documents

Time: 2017-03-17 07:53:27

Tags: c# pdf itext

I am using iTextSharp with C# to extract content from a PDF, as shown below:

        public static string GetTextFromAllPages(String pdfPath)
        {
            PdfReader reader = new PdfReader(pdfPath);
            try
            {
                StringWriter output = new StringWriter();

                // iTextSharp page numbers are 1-based.
                for (int i = 1; i <= reader.NumberOfPages; i++)
                    output.WriteLine(PdfTextExtractor.GetTextFromPage(reader, i, new SimpleTextExtractionStrategy()));

                return output.ToString();
            }
            finally
            {
                // Release the underlying file even if extraction throws.
                reader.Close();
            }
        }

Now I want to change this code so that wherever the PDF contains an image, the extracted content includes an image tag (<img>) at that position.

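I tried extracting the images separately and was able to do that, but I am not sure how to merge the two pieces of code so that the extracted content also contains the img tags.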

The image extraction code is as follows:

        private static List<System.Drawing.Image> ExtractImages(String PDFSourcePath)
        {

            //string res = GetTextFromAllPages(PDFSourcePath);
            //File.WriteAllText(@"d:\blobfile\blobfileresult.txt", res);
            List<System.Drawing.Image> ImgList = new List<System.Drawing.Image>();

            iTextSharp.text.pdf.RandomAccessFileOrArray RAFObj = null;
            iTextSharp.text.pdf.PdfReader PDFReaderObj = null;
            iTextSharp.text.pdf.PdfObject PDFObj = null;
            iTextSharp.text.pdf.PdfStream PDFStremObj = null;

            try
            {
                RAFObj = new iTextSharp.text.pdf.RandomAccessFileOrArray(PDFSourcePath);
                PDFReaderObj = new iTextSharp.text.pdf.PdfReader(RAFObj, null);
                if (PDFReaderObj.IsOpenedWithFullPermissions)
                {
                    Console.WriteLine("this is a test");
                }

                for (int i = 0; i <= PDFReaderObj.XrefSize - 1; i++)
                {
                    PDFObj = PDFReaderObj.GetPdfObject(i);

                    if ((PDFObj != null) && PDFObj.IsStream())
                    {
                        PDFStremObj = (iTextSharp.text.pdf.PdfStream)PDFObj;
                        iTextSharp.text.pdf.PdfObject subtype = PDFStremObj.Get(iTextSharp.text.pdf.PdfName.SUBTYPE);

                        if ((subtype != null) && subtype.ToString() == iTextSharp.text.pdf.PdfName.IMAGE.ToString())
                       // if ((subtype != null) && subtype.ToString() == iTextSharp.text.pdf.PdfName.CCITTFAXDECODE.ToString())
                        {
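                            // GetStreamBytesRaw returns the stream as stored (still filter-encoded), so
                            // Image.FromStream below only succeeds for formats GDI+ can read directly,
                            // such as DCTDecode (JPEG) streams.
                            byte[] bytes = iTextSharp.text.pdf.PdfReader.GetStreamBytesRaw((iTextSharp.text.pdf.PRStream)PDFStremObj);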

                            if ((bytes != null))
                            {
                                try
                                {
                                    System.IO.MemoryStream MS = new System.IO.MemoryStream(bytes);

                                    MS.Position = 0;
                                    System.Drawing.Image ImgPDF = System.Drawing.Image.FromStream(MS);

                                    ImgList.Add(ImgPDF);

                                }
                                catch (Exception e)
                                {
                                    Console.WriteLine("Exception in extract: " + e);
                                }
                            }
                        }
                    }
                }
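            }
            finally
            {
                // Close the reader on both success and failure. Exceptions propagate
                // unchanged, which keeps the original type and stack trace.
                if (PDFReaderObj != null)
                    PDFReaderObj.Close();
            }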
            return ImgList;
        }
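
One possible way to merge the two pieces of code is to replace SimpleTextExtractionStrategy with a custom strategy, because iTextSharp 5.x reports text and images to the same listener in content-stream order. The following is only a minimal sketch under that assumption: the names TextWithImageTagsStrategy, PdfTextWithImages and GetTextWithImageTags and the image{N}.{ext} naming scheme are hypothetical and not part of iTextSharp. It also skips the line and spacing detection that SimpleTextExtractionStrategy performs, and it does not handle images whose filters iTextSharp cannot decode (GetImage may throw for those).

```csharp
using System.Collections.Generic;
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Hypothetical strategy: collects text chunks and writes an <img> placeholder
// at the point where the page content draws an image.
public class TextWithImageTagsStrategy : ITextExtractionStrategy
{
    private readonly StringBuilder _result = new StringBuilder();

    // Images in the order they were encountered, so the caller can save them
    // under the same names that appear in the <img> tags.
    public readonly List<PdfImageObject> Images = new List<PdfImageObject>();

    public void BeginTextBlock() { }

    // Simplification: treat the end of every text block as a line break.
    public void EndTextBlock() { _result.AppendLine(); }

    public void RenderText(TextRenderInfo renderInfo)
    {
        _result.Append(renderInfo.GetText());
    }

    public void RenderImage(ImageRenderInfo renderInfo)
    {
        PdfImageObject image = renderInfo.GetImage();
        if (image == null) return;

        Images.Add(image);
        // Assumed naming scheme: image1.jpg, image2.png, ...
        _result.AppendLine()
               .AppendFormat("<img src=\"image{0}.{1}\" />", Images.Count, image.GetFileType())
               .AppendLine();
    }

    public string GetResultantText() { return _result.ToString(); }
}

public static class PdfTextWithImages
{
    public static string GetTextWithImageTags(string pdfPath, out List<PdfImageObject> images)
    {
        PdfReader reader = new PdfReader(pdfPath);
        try
        {
            // One strategy instance for all pages keeps the image numbering unique.
            TextWithImageTagsStrategy strategy = new TextWithImageTagsStrategy();
            PdfReaderContentParser parser = new PdfReaderContentParser(reader);

            for (int i = 1; i <= reader.NumberOfPages; i++)
                parser.ProcessContent(i, strategy);

            images = strategy.Images;
            return strategy.GetResultantText();
        }
        finally
        {
            reader.Close();
        }
    }
}
```

Each entry in the returned list could then be written to disk with GetImageAsBytes() or converted with GetDrawingImage(), using the same index as in the corresponding tag. Note that for a purely scanned PDF the pages usually contain only images and no text operators, so the result would consist of img tags alone; recovering the text in that case needs OCR, which iTextSharp does not provide.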

0 answers:

No answers yet