使用iTextSharp提取图像

时间:2009-04-29 13:02:24

标签: itextsharp

我一直在使用此代码取得巨大成功,以提取PDF的每个页面中找到的第一个图像。但是,由于未知原因,它现在无法使用某些新PDF。我已经使用了其他工具(Datalogics等),这些工具可以很好地利用这些新的PDF来提取图像。但是,如果我可以使用iTextSharp,我不想购买Datalogics或任何工具。任何人都可以告诉我为什么这段代码没有在PDF中找到图像?

Knowns:我的PDF每页只有1张图片,没有别的。

using iTextSharp.text;
using iTextSharp.text.pdf;
...
public static void ExtractImagesFromPDF(string sourcePdf, string outputPath)
{
    // NOTE:  This will only get the first image it finds per page.
    PdfReader pdf = new PdfReader(sourcePdf);
    RandomAccessFileOrArray raf = new iTextSharp.text.pdf.RandomAccessFileOrArray(sourcePdf);

    try
    {
        for (int pageNumber = 1; pageNumber <= pdf.NumberOfPages; pageNumber++)
        {
            PdfDictionary pg = pdf.GetPageN(pageNumber);
            PdfDictionary res = (PdfDictionary)PdfReader.GetPdfObject(pg.Get(PdfName.RESOURCES));

            PdfDictionary xobj = (PdfDictionary)PdfReader.GetPdfObject(res.Get(PdfName.XOBJECT));
            if (xobj != null)
            {
                foreach (PdfName name in xobj.Keys)
                {
                    PdfObject obj = xobj.Get(name);
                    if (obj.IsIndirect())
                    {
                        PdfDictionary tg = (PdfDictionary)PdfReader.GetPdfObject(obj);
                        PdfName type = (PdfName)PdfReader.GetPdfObject(tg.Get(PdfName.SUBTYPE));
                        if (PdfName.IMAGE.Equals(type))
                        {
                            int XrefIndex = Convert.ToInt32(((PRIndirectReference)obj).Number.ToString(System.Globalization.CultureInfo.InvariantCulture));
                            PdfObject pdfObj = pdf.GetPdfObject(XrefIndex);
                            PdfStream pdfStrem = (PdfStream)pdfObj;
                            byte[] bytes = PdfReader.GetStreamBytesRaw((PRStream)pdfStrem);
                            if ((bytes != null))
                            {
                                using (System.IO.MemoryStream memStream = new System.IO.MemoryStream(bytes))
                                {
                                    memStream.Position = 0;
                                    System.Drawing.Image img = System.Drawing.Image.FromStream(memStream);
                                    // must save the file while stream is open.
                                    if (!Directory.Exists(outputPath))
                                        Directory.CreateDirectory(outputPath);

                                    string path = Path.Combine(outputPath, String.Format(@"{0}.jpg", pageNumber));
                                    System.Drawing.Imaging.EncoderParameters parms = new System.Drawing.Imaging.EncoderParameters(1);
                                    parms.Param[0] = new System.Drawing.Imaging.EncoderParameter(System.Drawing.Imaging.Encoder.Compression, 0);
                                    System.Drawing.Imaging.ImageCodecInfo jpegEncoder = Utilities.GetImageEncoder("JPEG");
                                    img.Save(path, jpegEncoder, parms);
                                    break;
                                }
                            }
                        }
                    }
                }
            }
        }
    }

    catch
    {
        throw;
    }
    finally
    {
        pdf.Close();
        raf.Close();
    }
}

6 个答案:

答案 0 :(得分:20)

我发现我的问题是我没有递归地搜索表单和组中的图像。基本上,原始代码只能找到嵌入在pdf文档根目录下的图像。以下是修订后的方法以及递归搜索页面中图像的新方法(FindImageInPDFDictionary)。注意:仅支持JPEG和非压缩图像的缺陷仍然适用。有关修复这些缺陷的选项,请参阅R Ubben的代码。 HTH某人。

    public static void ExtractImagesFromPDF(string sourcePdf, string outputPath)
    {
        // NOTE:  This will only get the first image it finds per page.
        PdfReader pdf = new PdfReader(sourcePdf);
        RandomAccessFileOrArray raf = new iTextSharp.text.pdf.RandomAccessFileOrArray(sourcePdf);

        try
        {
            for (int pageNumber = 1; pageNumber <= pdf.NumberOfPages; pageNumber++)
            {
                PdfDictionary pg = pdf.GetPageN(pageNumber);

                // recursively search pages, forms and groups for images.
                PdfObject obj = FindImageInPDFDictionary(pg);
                if (obj != null)
                {

                    int XrefIndex = Convert.ToInt32(((PRIndirectReference)obj).Number.ToString(System.Globalization.CultureInfo.InvariantCulture));
                    PdfObject pdfObj = pdf.GetPdfObject(XrefIndex);
                    PdfStream pdfStrem = (PdfStream)pdfObj;
                    byte[] bytes = PdfReader.GetStreamBytesRaw((PRStream)pdfStrem);
                    if ((bytes != null))
                    {
                        using (System.IO.MemoryStream memStream = new System.IO.MemoryStream(bytes))
                        {
                            memStream.Position = 0;
                            System.Drawing.Image img = System.Drawing.Image.FromStream(memStream);
                            // must save the file while stream is open.
                            if (!Directory.Exists(outputPath))
                                Directory.CreateDirectory(outputPath);

                            string path = Path.Combine(outputPath, String.Format(@"{0}.jpg", pageNumber));
                            System.Drawing.Imaging.EncoderParameters parms = new System.Drawing.Imaging.EncoderParameters(1);
                            parms.Param[0] = new System.Drawing.Imaging.EncoderParameter(System.Drawing.Imaging.Encoder.Compression, 0);
                            System.Drawing.Imaging.ImageCodecInfo jpegEncoder = Utilities.GetImageEncoder("JPEG");
                            img.Save(path, jpegEncoder, parms);
                        }
                    }
                }
            }
        }
        catch
        {
            throw;
        }
        finally
        {
            pdf.Close();
            raf.Close();
        }


    }

     private static PdfObject FindImageInPDFDictionary(PdfDictionary pg)
    {
        PdfDictionary res =
            (PdfDictionary)PdfReader.GetPdfObject(pg.Get(PdfName.RESOURCES));


        PdfDictionary xobj =
          (PdfDictionary)PdfReader.GetPdfObject(res.Get(PdfName.XOBJECT));
        if (xobj != null)
        {
            foreach (PdfName name in xobj.Keys)
            {

                PdfObject obj = xobj.Get(name);
                if (obj.IsIndirect())
                {
                    PdfDictionary tg = (PdfDictionary)PdfReader.GetPdfObject(obj);

                    PdfName type =
                      (PdfName)PdfReader.GetPdfObject(tg.Get(PdfName.SUBTYPE));

                    //image at the root of the pdf
                    if (PdfName.IMAGE.Equals(type))
                    {
                        return obj;
                    }// image inside a form
                    else if (PdfName.FORM.Equals(type))
                    {
                        return FindImageInPDFDictionary(tg);
                    } //image inside a group
                    else if (PdfName.GROUP.Equals(type))
                    {
                        return FindImageInPDFDictionary(tg);
                    }

                }
            }
        }

        return null;

    }

答案 1 :(得分:8)

这是一个更简单的解决方案:

iTextSharp.text.pdf.parser.PdfImageObject pdfImage = 
                            new iTextSharp.text.pdf.parser.PdfImageObject(imgPRStream);
                        System.Drawing.Image img = pdfImage.GetDrawingImage();

答案 2 :(得分:6)

以下代码包含了上述所有Dave和R Ubben的想法,并且它返回了所有图像的完整列表,并且还处理了多个位深度。我不得不将它转换为VB用于我正在进行的项目,抱歉...

Private Sub getAllImages(ByVal dict As pdf.PdfDictionary, ByVal images As List(Of Byte()), ByVal doc As pdf.PdfReader)
    Dim res As pdf.PdfDictionary = CType(pdf.PdfReader.GetPdfObject(dict.Get(pdf.PdfName.RESOURCES)), pdf.PdfDictionary)
    Dim xobj As pdf.PdfDictionary = CType(pdf.PdfReader.GetPdfObject(res.Get(pdf.PdfName.XOBJECT)), pdf.PdfDictionary)

    If xobj IsNot Nothing Then
        For Each name As pdf.PdfName In xobj.Keys
            Dim obj As pdf.PdfObject = xobj.Get(name)
            If (obj.IsIndirect) Then
                Dim tg As pdf.PdfDictionary = CType(pdf.PdfReader.GetPdfObject(obj), pdf.PdfDictionary)
                Dim subtype As pdf.PdfName = CType(pdf.PdfReader.GetPdfObject(tg.Get(pdf.PdfName.SUBTYPE)), pdf.PdfName)
                If pdf.PdfName.IMAGE.Equals(subtype) Then
                    Dim xrefIdx As Integer = CType(obj, pdf.PRIndirectReference).Number
                    Dim pdfObj As pdf.PdfObject = doc.GetPdfObject(xrefIdx)
                    Dim str As pdf.PdfStream = CType(pdfObj, pdf.PdfStream)
                    Dim bytes As Byte() = pdf.PdfReader.GetStreamBytesRaw(CType(str, pdf.PRStream))

                    Dim filter As String = tg.Get(pdf.PdfName.FILTER).ToString
                    Dim width As String = tg.Get(pdf.PdfName.WIDTH).ToString
                    Dim height As String = tg.Get(pdf.PdfName.HEIGHT).ToString
                    Dim bpp As String = tg.Get(pdf.PdfName.BITSPERCOMPONENT).ToString

                    If filter = "/FlateDecode" Then
                        bytes = pdf.PdfReader.FlateDecode(bytes, True)
                        Dim pixelFormat As System.Drawing.Imaging.PixelFormat
                        Select Case Integer.Parse(bpp)
                            Case 1
                                pixelFormat = Drawing.Imaging.PixelFormat.Format1bppIndexed
                            Case 24
                                pixelFormat = Drawing.Imaging.PixelFormat.Format24bppRgb
                            Case Else
                                Throw New Exception("Unknown pixel format " + bpp)
                        End Select
                        Dim bmp As New System.Drawing.Bitmap(Int32.Parse(width), Int32.Parse(height), pixelFormat)
                        Dim bmd As System.Drawing.Imaging.BitmapData = bmp.LockBits(New System.Drawing.Rectangle(0, 0, Int32.Parse(width), Int32.Parse(height)), System.Drawing.Imaging.ImageLockMode.WriteOnly, pixelFormat)
                        Marshal.Copy(bytes, 0, bmd.Scan0, bytes.Length)
                        bmp.UnlockBits(bmd)
                        Using ms As New MemoryStream
                            bmp.Save(ms, System.Drawing.Imaging.ImageFormat.Png)
                            bytes = ms.GetBuffer
                        End Using
                    End If
                    images.Add(bytes)
                ElseIf pdf.PdfName.FORM.Equals(subtype) Or pdf.PdfName.GROUP.Equals(subtype) Then
                    getAllImages(tg, images, doc)
                End If
            End If
        Next
    End If
End Sub

答案 3 :(得分:4)

De c#version:

private IList<System.Drawing.Image> GetImagesFromPdfDict(PdfDictionary dict, PdfReader doc){
        List<System.Drawing.Image> images = new List<System.Drawing.Image>();
        PdfDictionary res = (PdfDictionary)(PdfReader.GetPdfObject(dict.Get(PdfName.RESOURCES)));
        PdfDictionary xobj = (PdfDictionary)(PdfReader.GetPdfObject(res.Get(PdfName.XOBJECT)));

        if (xobj != null)
        {
            foreach (PdfName name in xobj.Keys)
            {
                PdfObject obj = xobj.Get(name);
                if (obj.IsIndirect())
                {
                    PdfDictionary tg = (PdfDictionary)(PdfReader.GetPdfObject(obj));
                    pdf.PdfName subtype = (pdf.PdfName)(pdf.PdfReader.GetPdfObject(tg.Get(pdf.PdfName.SUBTYPE)));
                    if (pdf.PdfName.IMAGE.Equals(subtype))
                    {
                        int xrefIdx = ((pdf.PRIndirectReference)obj).Number;
                        pdf.PdfObject pdfObj = doc.GetPdfObject(xrefIdx);
                        pdf.PdfStream str = (pdf.PdfStream)(pdfObj);
                        byte[] bytes = pdf.PdfReader.GetStreamBytesRaw((pdf.PRStream)str);

                        string filter = tg.Get(pdf.PdfName.FILTER).ToString();
                        string width = tg.Get(pdf.PdfName.WIDTH).ToString();
                        string height = tg.Get(pdf.PdfName.HEIGHT).ToString();
                        string bpp = tg.Get(pdf.PdfName.BITSPERCOMPONENT).ToString();

                        if (filter == "/FlateDecode")
                        {
                            bytes = pdf.PdfReader.FlateDecode(bytes, true);
                            System.Drawing.Imaging.PixelFormat pixelFormat;
                            switch (int.Parse(bpp))
                            {
                                case 1:
                                    pixelFormat = System.Drawing.Imaging.PixelFormat.Format1bppIndexed;
                                    break;
                                case 24:
                                    pixelFormat = System.Drawing.Imaging.PixelFormat.Format24bppRgb;
                                    break;
                                default:
                                    throw new Exception("Unknown pixel format " + bpp);
                            }
                            var bmp = new System.Drawing.Bitmap(Int32.Parse(width), Int32.Parse(height), pixelFormat);
                            System.Drawing.Imaging.BitmapData bmd = bmp.LockBits(new System.Drawing.Rectangle(0, 0, Int32.Parse(width),
                                Int32.Parse(height)), System.Drawing.Imaging.ImageLockMode.WriteOnly, pixelFormat);
                            Marshal.Copy(bytes, 0, bmd.Scan0, bytes.Length);
                            bmp.UnlockBits(bmd);
                            using (var ms = new MemoryStream())
                            {
                                bmp.Save(ms, System.Drawing.Imaging.ImageFormat.Png);
                                bytes = ms.GetBuffer();
                            }
                        }
                        images.Add(System.Drawing.Image.FromStream(new MemoryStream(bytes)));
                    }
                    else if (pdf.PdfName.FORM.Equals(subtype) || pdf.PdfName.GROUP.Equals(subtype))
                    {
                        images.AddRange(GetImagesFromPdfDict(tg, doc));
                    }
                }
            }
        }
        return images;
     }

答案 4 :(得分:4)

这只是对他人思想的又一次重述,但却是对我有用的。在这里,我使用@Malco的图像抓取片段与R Ubben的循环:

private IList<System.Drawing.Image> GetImagesFromPdfDict(PdfDictionary dict, PdfReader doc)
{
    List<System.Drawing.Image> images = new List<System.Drawing.Image>();
    PdfDictionary res = (PdfDictionary)(PdfReader.GetPdfObject(dict.Get(PdfName.RESOURCES)));
    PdfDictionary xobj = (PdfDictionary)(PdfReader.GetPdfObject(res.Get(PdfName.XOBJECT)));

    if (xobj != null)
    {
        foreach (PdfName name in xobj.Keys)
        {
            PdfObject obj = xobj.Get(name);
            if (obj.IsIndirect())
            {
                PdfDictionary tg = (PdfDictionary)(PdfReader.GetPdfObject(obj));
                PdfName subtype = (PdfName)(PdfReader.GetPdfObject(tg.Get(PdfName.SUBTYPE)));
                if (PdfName.IMAGE.Equals(subtype))
                {
                    int xrefIdx = ((PRIndirectReference)obj).Number;
                    PdfObject pdfObj = doc.GetPdfObject(xrefIdx);
                    PdfStream str = (PdfStream)(pdfObj);

                    iTextSharp.text.pdf.parser.PdfImageObject pdfImage =
                        new iTextSharp.text.pdf.parser.PdfImageObject((PRStream)str);
                    System.Drawing.Image img = pdfImage.GetDrawingImage();

                    images.Add(img);
                }
                else if (PdfName.FORM.Equals(subtype) || PdfName.GROUP.Equals(subtype))
                {
                    images.AddRange(GetImagesFromPdfDict(tg, doc));
                }
            }
        }
    }

    return images;
}

答案 5 :(得分:3)

以上内容仅适用于JPEG。排除内嵌图像和嵌入文件,您需要浏览子类型IMAGE的对象,然后查看过滤器并采取适当的操作。这是一个例子,假设我们有一个子类型为IMAGE的PdfObject:

            PdfReader pdf = new PdfReader("c:\\temp\\exp0.pdf");
        int xo=pdf.XrefSize;
        for (int i=0;i<xo;i++)
        {
            PdfObject obj=pdf.GetPdfObject(i);
            if (obj!=null && obj.IsStream())
            {
                PdfDictionary pd=(PdfDictionary)obj;
                if (pd.Contains(PdfName.SUBTYPE) && pd.Get(PdfName.SUBTYPE).ToString()=="/Image")
                {
                    string filter=pd.Get(PdfName.FILTER).ToString();
                    string width=pd.Get(PdfName.WIDTH).ToString();
                    string height=pd.Get(PdfName.HEIGHT).ToString();
                    string bpp=pd.Get(PdfName.BITSPERCOMPONENT).ToString();
                    string extent=".";
                    byte [] img=null;
                    switch (filter)
                    {
                        case "/FlateDecode":
                            byte[] arr=PdfReader.FlateDecode(PdfReader.GetStreamBytesRaw((PRStream)obj),true);
                            Bitmap bmp=new Bitmap(Int32.Parse(width),Int32.Parse(height),PixelFormat.Format24bppRgb);
                            BitmapData bmd=bmp.LockBits(new Rectangle(0,0,Int32.Parse(width),Int32.Parse(height)),ImageLockMode.WriteOnly,
                                PixelFormat.Format24bppRgb);
                            Marshal.Copy(arr,0,bmd.Scan0,arr.Length);
                            bmp.UnlockBits(bmd);
                            bmp.Save("c:\\temp\\bmp1.png",ImageFormat.Png);
                            break;
                        default:
                            break;
                    }
                }
            }
        }

当然,由于Microsoft BGR,这会弄乱颜色,但我想保持简短。为“/ CCITTFaxDecode”等做类似的事情。

相关问题