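I need to use Azure Data Lake Analytics to extract data from PDF files and store the values in a table. Can anyone point me to some examples or a process for implementing this scenario?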
Answer 0 (score: 1)
Here are some resources for getting started with U-SQL in Azure Data Lake Analytics:
https://docs.microsoft.com/en-us/u-sql/
https://www.purplefrogsystems.com/paul/category/u-sql/
https://www.mssqltips.com/sqlservertip/5890/azure-data-lake-analytics-using-usql-queries/
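For the scenario you describe, you will have to write a custom extractor to read the PDF. Here is a C# example that does this using the iTextSharp library: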
using System.Collections.Generic;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;
using Microsoft.Analytics.Interfaces;

namespace PDFExtractor
{
    // AtomicFileProcessing = true: each PDF is processed as a whole,
    // because a PDF cannot be split at arbitrary byte offsets.
    [SqlUserDefinedExtractor(AtomicFileProcessing = true)]
    public class PDFExtractor : IExtractor
    {
        // Emits one row per page: column 0 = page number, column 1 = page text.
        public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow output)
        {
            var reader = new PdfReader(input.BaseStream);
            try
            {
                for (var page = 1; page <= reader.NumberOfPages; page++)
                {
                    output.Set(0, page);
                    output.Set(1, ExtractText(reader, page));
                    yield return output.AsReadOnly();
                }
            }
            finally
            {
                // Release the reader once all pages have been emitted.
                reader.Close();
            }
        }

        public string ExtractText(PdfReader pdfReader, int pageNum)
        {
            var text = PdfTextExtractor.GetTextFromPage(pdfReader, pageNum, new LocationTextExtractionStrategy());
            // Escape line breaks so that each page stays on a single line
            // in the output file.
            return text.Replace("\r", "\\r").Replace("\n", "\\n");
        }
    }
}
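To connect the extractor to the "store the values in a table" part of the question, a U-SQL script along the following lines could call it and load the rows into a managed table. This is only a sketch under several assumptions: the compiled extractor and iTextSharp are already registered in the current database as assemblies named [PDFExtractor] and [itextsharp], and the input path /input/sample.pdf and the table name dbo.PdfPages are hypothetical placeholders.

```
// Assumption: both assemblies were registered beforehand with CREATE ASSEMBLY
// under these (hypothetical) names.
REFERENCE ASSEMBLY [PDFExtractor];
REFERENCE ASSEMBLY [itextsharp];

// One row per page, matching the two columns the extractor sets (0 and 1).
@pages =
    EXTRACT page int,
            text string
    FROM "/input/sample.pdf"   // hypothetical input path
    USING new PDFExtractor.PDFExtractor();

// Hypothetical target table; U-SQL managed tables need a clustered index
// and a distribution scheme.
CREATE TABLE IF NOT EXISTS dbo.PdfPages
(
    page int,
    text string,
    INDEX idx_page CLUSTERED (page ASC)
    DISTRIBUTED BY HASH (page)
);

INSERT INTO dbo.PdfPages
SELECT page, text
FROM @pages;
```

If you only need a file rather than a table, replacing the CREATE TABLE and INSERT statements with OUTPUT @pages TO "/output/sample.tsv" USING Outputters.Tsv(); should work in the same way (the output path is again a placeholder).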
You could write something similar in Python.
Ref: https://devblog.xyz/simple-pdf-text-extractor-adla/
Hope this helps.