Using iTextSharp (or any C# PDF library), how do I open a PDF, replace some text, and save it again?

Time: 2010-11-19 01:29:58

Tags: c# pdf itextsharp acrobat

Using iTextSharp (or any C# PDF library), I need to open a PDF, replace some placeholder text with actual values, and return the result as a byte[].

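Can anyone suggest how to do this? I've looked through the iText documentation and can't work out where to start. So far I'm stuck on how to get the source PDF from the PdfReader into a Document object, and I suspect I'm approaching this the wrong way.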

Many thanks.

3 answers:

Answer 0 (score: 5)

In the end, I used PDFescape to open my existing PDF file, placed form fields where I needed the values to go, and saved it again to create my PDF template.

http://www.pdfescape.com

Then I found a blog post on how to replace the form fields:

http://www.johnnycode.com/blog/2010/03/05/using-a-template-to-programmatically-create-pdfs-with-c-and-itextsharp/

It all works well! Here is the code:

public static byte[] Generate()
{
  var templatePath = HttpContext.Current.Server.MapPath("~/my_template.pdf");

  // Based on:
  // http://www.johnnycode.com/blog/2010/03/05/using-a-template-to-programmatically-create-pdfs-with-c-and-itextsharp/
  var reader = new PdfReader(templatePath);
  var outStream = new MemoryStream();
  var stamper = new PdfStamper(reader, outStream);

  var form = stamper.AcroFields;
  var fieldKeys = form.Fields.Keys;

  foreach (string fieldKey in fieldKeys)
  {
    if (form.GetField(fieldKey) == "MyTemplatesOriginalTextFieldA")
      form.SetField(fieldKey, "1234");
    if (form.GetField(fieldKey) == "MyTemplatesOriginalTextFieldB")
      form.SetField(fieldKey, "5678");
  }

  // "Flatten" the form so it won't be editable/usable anymore
  stamper.FormFlattening = true;

  stamper.Close();
  reader.Close();

  return outStream.ToArray();
}

Answer 1 (score: 1)

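Unfortunately, I was looking for something similar and couldn't figure it out. Below is roughly how far I got; maybe you can use it as a starting point. The problem is that a PDF doesn't actually store its text as plain text; it uses lookup tables and other arcane magic. This method reads the byte values of a page and tries to convert them to a string, but as far as I can tell it only handles English and misses some special characters, so I abandoned my project and moved on.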

string contents = string.Empty;
Document doc = new Document();
PdfReader reader = new PdfReader("pathToPdf.pdf");
using (MemoryStream memoryStream = new MemoryStream())
{

    PdfWriter writer = PdfWriter.GetInstance(doc, memoryStream);
    doc.Open();
    PdfContentByte cb = writer.DirectContent;
    for (int p = 1; p <= reader.NumberOfPages; p++)
    {
        // add page from reader
        doc.SetPageSize(reader.GetPageSize(p));
        doc.NewPage();

        // pickup here something like this:
        byte[] bt = reader.GetPageContent(p);
        contents = ExtractTextFromPDFBytes(bt);

        if (contents.IndexOf("something")!=-1)
        {
            // make your own pdf page and add to cb (contentbyte)

        }
        else
        {
            PdfImportedPage page = writer.GetImportedPage(reader, p);
            int rot = reader.GetPageRotation(p);
            if (rot == 90 || rot == 270)
                cb.AddTemplate(page, 0, -1.0F, 1.0F, 0, 0, reader.GetPageSizeWithRotation(p).Height);
            else
                cb.AddTemplate(page, 1.0F, 0, 0, 1.0F, 0, 0);
        }
    }
    reader.Close();
    doc.Close();
    File.WriteAllBytes("pathToOutputOrSamePathToOverwrite.pdf", memoryStream.ToArray());
}

This is taken from this site:

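// Number of trailing characters kept for operator matching.
// The snippet uses this field without declaring it; the value 15 is an assumption.
private static int _numberOfCharsToKeep = 15;

private string ExtractTextFromPDFBytes(byte[] input)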
{ 
    if (input == null || input.Length == 0) return ""; 

     try 
     { 
         string resultString = ""; 

         // Flag showing if we are we currently inside a text object 
         bool inTextObject = false; 

         // Flag showing if the next character is literal  
         // e.g. '\\' to get a '\' character or '\(' to get '(' 
         bool nextLiteral = false; 

         // () Bracket nesting level. Text appears inside () 
         int bracketDepth = 0; 

         // Keep previous chars to get extract numbers etc.: 
         char[] previousCharacters = new char[_numberOfCharsToKeep]; 
         for (int j = 0; j < _numberOfCharsToKeep; j++) previousCharacters[j] = ' '; 


          for (int i = 0; i < input.Length; i++) 
          { 
              char c = (char)input[i]; 

              if (inTextObject) 
              { 
                  // Position the text 
                  if (bracketDepth == 0) 
                  { 
                      if (CheckToken(new string[] { "TD", "Td" }, previousCharacters)) 
                      { 
                          resultString += "\n\r"; 
                      } 
                      else 
                      { 
                          if (CheckToken(new string[] { "'", "T*", "\"" }, previousCharacters)) 
                          { 
                               resultString += "\n"; 
                           } 
                           else 
                           { 
                               if (CheckToken(new string[] { "Tj" }, previousCharacters)) 
                                { 
                                    resultString += " "; 
                                } 
                            } 
                        } 
                    }

                    // End of a text object, also go to a new line. 
                    if (bracketDepth == 0 && 
                        CheckToken(new string[] { "ET" }, previousCharacters)) 
                    { 

                        inTextObject = false; 
                        resultString += " "; 
                   } 
                   else 
                   { 
                        // Start outputting text 
                        if ((c == '(') && (bracketDepth == 0) && (!nextLiteral)) 
                        { 
                            bracketDepth = 1; 
                        } 
                        else 
                        { 
                            // Stop outputting text 
                            if ((c == ')') && (bracketDepth == 1) && (!nextLiteral)) 
                            { 
                                 bracketDepth = 0; 
                            } 
                            else 
                            { 
                                // Just a normal text character: 
                                if (bracketDepth == 1) 
                                { 
                                    // Only print out next character no matter what.  
                                    // Do not interpret. 
                                    if (c == '\\' && !nextLiteral) 
                                    { 
                                        nextLiteral = true; 
                                    } 
                                    else 
                                    { 
                                        if (((c >= ' ') && (c <= '~')) || 
                                            ((c >= 128) && (c < 255))) 
                                        { 
                                            resultString += c.ToString(); 
                                        } 

                                        nextLiteral = false; 
                                    } 
                                } 
                            } 
                        } 
                    } 
                } 

                // Store the recent characters for  
                // when we have to go back for a checking 
                for (int j = 0; j < _numberOfCharsToKeep - 1; j++) 
                { 
                    previousCharacters[j] = previousCharacters[j + 1]; 
                } 
                previousCharacters[_numberOfCharsToKeep - 1] = c; 

                // Start of a text object 
                if (!inTextObject && CheckToken(new string[] { "BT" }, previousCharacters)) 
                { 
                    inTextObject = true; 
                } 
            } 
        return resultString; 
    } 
    catch 
    { 
        return ""; 
     } 
} 

private bool CheckToken(string[] tokens, char[] recent)
{
    int last = _numberOfCharsToKeep - 1;
    foreach (string token in tokens)
    {
        // The operator must be followed and preceded by whitespace.
        // Using token.Length avoids indexing past the end of
        // one-character operators such as ' and ".
        int start = last - token.Length;
        if (start < 1 || !IsPdfWhitespace(recent[last]) || !IsPdfWhitespace(recent[start - 1]))
            continue;

        if (new string(recent, start, token.Length) == token)
            return true;
    }
    return false;
}

private static bool IsPdfWhitespace(char c)
{
    return c == ' ' || c == 0x0d || c == 0x0a;
}
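
The core idea of the scanner above (text sits in parentheses in front of a text-showing operator such as Tj) can be sketched more compactly. The following is only an illustration in Python under simplifying assumptions: `SAMPLE_STREAM` and `extract_literal_text` are hypothetical names, the sample stream is made up, and only literal strings shown with `Tj` are handled. Hex strings (`<...>`), `TJ` arrays, and fonts with custom encodings are not recovered, which is one reason this kind of extraction misses characters.

```python
import re
import zlib

# Hypothetical sample: a tiny page content stream, compressed the way
# FlateDecode streams are stored inside a PDF file.
SAMPLE_STREAM = zlib.compress(
    b"BT /F1 12 Tf 72 720 Td (Hello, {{name}}) Tj T* (Total: \\(net\\)) Tj ET"
)

# A literal string is "(...)"; parentheses inside it are backslash-escaped here.
# Only strings followed by the Tj operator are matched.
LITERAL_SHOWN = re.compile(br"\(((?:\\.|[^\\()])*)\)\s*Tj")


def extract_literal_text(compressed):
    """Return the literal strings shown with Tj in a Flate-compressed stream."""
    content = zlib.decompress(compressed)
    found = []
    for match in LITERAL_SHOWN.finditer(content):
        raw = match.group(1)
        # Undo the backslash escapes for parentheses and the backslash itself.
        text = re.sub(br"\\([()\\])", br"\1", raw)
        found.append(text.decode("latin-1"))
    return found


print(extract_literal_text(SAMPLE_STREAM))
```

Decoding as latin-1 is a shortcut: the bytes inside the parentheses are character codes in the font's encoding, so the result is only meaningful for simple Western fonts.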

Answer 2 (score: 0)

I have a Python script here that replaces some text in a PDF:

import re
import sys
import zlib

# Module to find and replace text in PDF files
#
# Usage:
#   python pdf_replace.py <input_filename> <text_to_find> <text_to_replace> <output_filename>
#
# @author Ionox0

input_filename = sys.argv[1]
# The PDF is read as bytes, so the search and replacement text must be bytes too
text_to_find = sys.argv[2].encode("latin-1")
text_to_replace = sys.argv[3].encode("latin-1")
output_filename = sys.argv[4]

with open(input_filename, "rb") as f:
    pdf = f.read()

# Edits are collected under a separate name so the original bytes stay searchable
pdf_copy = pdf

# Search for stream objects with text to replace
stream = re.compile(br'.*?FlateDecode.*?stream(.*?)endstream', re.S)

for s in stream.findall(pdf):
    s = s.strip(b'\r\n')

    try:
        text = zlib.decompress(s)
    except zlib.error:
        # Not plain zlib data (for example, another filter is involved); skip it
        continue

    if text_to_find in text:
        print('Found match:')
        print(text)

        text = text.replace(text_to_find, text_to_replace)
        pdf_copy = pdf_copy.replace(s, zlib.compress(text))

with open(output_filename, 'wb') as out:
    out.write(pdf_copy)
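
Since the question asks for the result as a byte array rather than a file, the same idea as the script above can be wrapped in a function that takes and returns bytes. This is a minimal sketch, not a robust PDF editor: `inflate`, `replace_in_streams`, and the in-memory `sample` are hypothetical names introduced for illustration. It also does not update the stream's /Length entry or the cross-reference table, so the output is not strictly valid once the compressed size changes; some viewers may repair this on open, but that should not be relied on.

```python
import re
import zlib

# Matches the raw body between the "stream" and "endstream" keywords.
STREAM = re.compile(br"stream\r?\n(.*?)endstream", re.S)


def inflate(body):
    """Decompress a stream body; return None if it is not complete zlib data."""
    decompressor = zlib.decompressobj()
    try:
        data = decompressor.decompress(body)
    except zlib.error:
        return None
    # End-of-line bytes before "endstream" end up in unused_data and are ignored.
    return data if decompressor.eof else None


def replace_in_streams(pdf, find, replace):
    """Return a copy of pdf (bytes) with find replaced inside Flate streams."""
    def rewrite(match):
        text = inflate(match.group(1))
        if text is None or find not in text:
            return match.group(0)  # leave this stream byte-for-byte unchanged
        new_body = zlib.compress(text.replace(find, replace))
        return b"stream\n" + new_body + b"\nendstream"

    return STREAM.sub(rewrite, pdf)


# Hypothetical in-memory sample standing in for a real file
sample = (b"1 0 obj\n<< /Filter /FlateDecode >>\nstream\n"
          + zlib.compress(b"BT (Hello {{name}}) Tj ET")
          + b"\nendstream\nendobj\n")
result = replace_in_streams(sample, b"{{name}}", b"Ada")
print(inflate(STREAM.search(result).group(1)))
```

Like the script, this only finds text that appears as contiguous bytes in a decompressed stream, so placeholders split across several text operators will not match. For templates you control, the form-field approach from the accepted answer avoids that problem.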