C# PDF OCR Library
How to read and extract text from a scanned PDF file using C# .NET
How to Extract Text from Adobe PDF Document Using .NET OCR Library in Visual C#
In this tutorial, you learn how to use the XDoc.PDF C# library and OCR SDK to read and extract text content from a scanned PDF document or from images inside a PDF file.
How to extract text content from a scanned PDF file using C#
- Best OCR SDK for Visual Studio .NET
- Scan text content from Adobe PDF documents in .NET WinForms
- Specify any area of a PDF page to perform OCR
- .NET library for batch OCR of PDF text content
- .NET DLLs can be easily integrated into an ASP.NET project
- Supports .NET WinForms, ASP.NET MVC in IIS, ASP.NET Ajax, Azure cloud service, DNN (DotNetNuke), SharePoint
- Recognize the whole PDF document and get all text content
- Recognize a page of PDF document and extract its text content
- Recognize scanned PDF file and output OCR result to Adobe PDF file
- Recognize scanned PDF document and output OCR result to MS Word file
- Online C# class source code for OCR text extraction in .NET
- Free components and controls to download and use in the .NET Framework
C# extract text content from PDF document
Add the following C# OCR PDF text demo code to your project.
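// Folder that contains the OCR training data; adjust it to your installation.
String ocrSource = @"D:\Alice\DLL\Source\";
OCRHandler.SetTrainResourcePath(ocrSource);
// Open the scanned PDF and take its first page (page indexes start at 0).
PDFDocument pdf = new PDFDocument(@"C:\input.pdf");
BasePage page = pdf.GetPage(0);
// Render the page to a bitmap so that the OCR engine can read it.
Bitmap bmp = page.ConvertToImage();
OCRPage ocrPage = OCRHandler.Import(bmp);
// Run recognition and write the recognized text to a plain-text file.
ocrPage.Recognize();
ocrPage.SaveTo(MIMEType.TXT, @"C:\output.txt");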
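The demo above reads only the first page. To cover the "recognize the whole PDF document" case from the feature list, the same calls can be repeated for every page. The following is a minimal sketch, not a definitive implementation: it assumes the classes and methods already used on this page (PDFDocument, BasePage, OCRHandler, OCRPage, MIMEType, GetPageCount) and that the text file written by SaveTo can be read back as ordinary text. The output folder, the page_N.txt naming and the WholeDocumentOcr class are hypothetical.

```csharp
using System;
using System.Drawing;
using System.IO;
using System.Text;
// Also add the using directives for the XDoc.PDF and OCR assemblies
// referenced by your project (not shown in the original demo).

class WholeDocumentOcr
{
    static void Main()
    {
        // Folder that contains the OCR training data, as in the demo above.
        OCRHandler.SetTrainResourcePath(@"D:\Alice\DLL\Source\");

        PDFDocument pdf = new PDFDocument(@"C:\input.pdf");

        // Hypothetical output folder for the per-page text files.
        string outputDir = @"C:\ocr_output";
        Directory.CreateDirectory(outputDir);

        StringBuilder allText = new StringBuilder();
        int pageCount = pdf.GetPageCount();
        for (int i = 0; i < pageCount; i++)
        {
            // Same steps as the single-page demo, applied to page i.
            BasePage page = pdf.GetPage(i);
            Bitmap bmp = page.ConvertToImage();
            OCRPage ocrPage = OCRHandler.Import(bmp);
            ocrPage.Recognize();

            // Save this page's text, then append it to the combined text.
            string pagePath = Path.Combine(outputDir, "page_" + (i + 1) + ".txt");
            ocrPage.SaveTo(MIMEType.TXT, pagePath);
            allText.AppendLine(File.ReadAllText(pagePath));
        }

        // One file with the text of all pages, in page order.
        File.WriteAllText(Path.Combine(outputDir, "all_pages.txt"), allText.ToString());
    }
}
```

Going through a file per page keeps the sketch to the SaveTo(MIMEType.TXT, path) call shown in the demo instead of assuming other overloads.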
Convert scanned PDF file to editable PDF document using C#
The following C# example source code shows how to convert a scanned PDF document into an editable PDF file.
String inputFilePath = @"C:\demo_1.pdf";
String outputFilePath = @"C:\output.pdf";
// The folder that contains '.traineddata' files.
OCRHandler.SetTrainResourcePath(@"C:\Source");
PDFDocument doc = new PDFDocument(inputFilePath);
int pageCount = doc.GetPageCount();
// One in-memory PDF stream per recognized page.
MemoryStream[] streams = new MemoryStream[pageCount];
for (int i = 0; i < pageCount; i++)
{
    streams[i] = new MemoryStream();
    // Recognize page i and save the OCR result as PDF into its stream.
    OCRPage page = OCRHandler.Import(doc.GetPage(i));
    page.Recognize();
    page.SaveTo(MIMEType.PDF, streams[i]);
}
// Merge the per-page results into the final output file.
PDFDocument.CombineDocument(streams, outputFilePath);
Convert scanned PDF file to Word document (.docx) using C#
The following C# example source code shows how to convert a scanned PDF document into a Microsoft Word document (.docx).
String inputFilePath = @"C:\demo_1.pdf";
String tempFilePath = @"C:\output.pdf";
String outputFilePath = @"C:\output.docx";
// The folder that contains '.traineddata' files.
OCRHandler.SetTrainResourcePath(@"C:\Source");
PDFDocument doc = new PDFDocument(inputFilePath);
int pageCount = doc.GetPageCount();
// One in-memory PDF stream per recognized page.
MemoryStream[] streams = new MemoryStream[pageCount];
for (int i = 0; i < pageCount; i++)
{
    streams[i] = new MemoryStream();
    // Recognize page i and save the OCR result as PDF into its stream.
    OCRPage page = OCRHandler.Import(doc.GetPage(i));
    page.Recognize();
    page.SaveTo(MIMEType.PDF, streams[i]);
}
// Step 1: merge the per-page results into a temporary editable PDF.
PDFDocument.CombineDocument(streams, tempFilePath);
// Step 2: reopen the temporary PDF and convert it to .docx.
PDFDocument doc1 = new PDFDocument(tempFilePath);
doc1.ConvertToDocument(DocumentType.DOCX, outputFilePath);
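For the batch scenario mentioned in the feature list, the per-file conversion can be wrapped in a method and driven by a small loop over a folder. The following is a minimal sketch that uses only System.IO and does not call the OCR library itself: BatchRunner, RunBatch and the convertOne delegate are hypothetical names, and the delegate is the place where conversion code such as the examples on this page would run. A failing file is recorded and skipped so that one bad scan does not stop the whole batch.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class BatchRunner
{
    // Calls convertOne(inputPath, outputPath) for every *.pdf file in inputDir.
    // Returns the input files that failed, so they can be retried or inspected.
    public static List<string> RunBatch(string inputDir, string outputDir,
        string outputExtension, Action<string, string> convertOne)
    {
        Directory.CreateDirectory(outputDir);
        List<string> failed = new List<string>();

        foreach (string inputPath in Directory.GetFiles(inputDir, "*.pdf"))
        {
            // Keep the base file name and swap in the target extension.
            string name = Path.GetFileNameWithoutExtension(inputPath) + outputExtension;
            string outputPath = Path.Combine(outputDir, name);
            try
            {
                convertOne(inputPath, outputPath);
            }
            catch (Exception ex)
            {
                // Report the problem and continue with the next file.
                Console.Error.WriteLine("Skipped " + inputPath + ": " + ex.Message);
                failed.Add(inputPath);
            }
        }
        return failed;
    }
}
```

A call such as BatchRunner.RunBatch(@"C:\scans", @"C:\out", ".docx", ConvertScannedPdfToWord) would then process a whole folder, where ConvertScannedPdfToWord is a hypothetical method that wraps the Word conversion code above and takes the input and output paths as parameters.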