Question

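I am trying to extract text from the PDF linked below using the code below, but some of the text is not extracted accurately. For example, the PDF contains the word buffen, but in the extracted text it appears as bufen. This seems to be because, when I select the text in Adobe, the ff in buffen is selected as a single character.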

Is there another DLL or SDK, in iTextSharp or elsewhere, for extracting text from a PDF?

Also, any ideas on how to get the coordinates of each word? I need to write the coordinates to an XML file.

Source PDF: https://docs.google.com/file/d/0B3ZAyYMW9DEMSmlCcEVVT0ZsLWc/edit?usp=sharing
Extracted text: https://docs.google.com/file/d/0B3ZAyYMW9DEMTFhmaEdVNlRabkk/edit?usp=sharing

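Imports System.IO
Imports iTextSharp.text.pdf
Imports iTextSharp.text.pdf.parser

Public Sub GetPDFText(ByVal pdfpath As String)
    Dim output As New StringWriter()
    Dim reader As New PdfReader(pdfpath)
    Try
        ' Page numbers are 1-based; a fresh strategy is used for each page
        For i As Integer = 1 To reader.NumberOfPages
            output.WriteLine(PdfTextExtractor.GetTextFromPage(reader, i, New SimpleTextExtractionStrategy()))
        Next
    Finally
        reader.Close()
    End Try
    pdftext.Text = output.ToString()
    ' Save the extracted text to disk; Using closes the file even on error
    Dim filename As String = "D:\Temp\itext\test.txt"
    Using writer As New StreamWriter(filename)
        writer.Write(pdftext.Text)
    End Using
End Sub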

How to resolve text extraction problems in iTextSharp

0 answers: