
C# PDF OCR Library
How to read and extract text from a scanned PDF file using C# .NET


How to Extract Text from Adobe PDF Document Using .NET OCR Library in Visual C#

In this tutorial, you will learn how to use the XDoc.PDF C# library and OCR SDK to read and extract text content from a scanned PDF document or from images inside a PDF file.

How to extract text content from a scanned PDF file using C#

  1. Download the XDoc.PDF C# library and OCR SDK
  2. Install the C# library to OCR PDF documents
  3. Follow the step-by-step tutorial


  • Best OCR SDK for Visual Studio .NET
  • Scan text content from Adobe PDF documents in .NET WinForms
  • Specify any area of a PDF page to perform OCR on
  • .NET library for batch OCR of PDF text content
  • .NET DLLs can be easily integrated into ASP.NET projects
  • Supports .NET WinForms, ASP.NET MVC in IIS, ASP.NET Ajax, Azure cloud service, DNN (DotNetNuke), SharePoint
  • Recognize the whole PDF document and get all text content
  • Recognize a single page of a PDF document and extract its text content
  • Recognize a scanned PDF file and output the OCR result to an Adobe PDF file
  • Recognize a scanned PDF document and output the OCR result to an MS Word file
  • Online C# class source code for OCR text extraction in .NET
  • Free components and controls to download and use with the .NET Framework




C# extract text content from PDF document


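Add the following C# OCR demo code to your project. It renders the first page of the PDF to a bitmap, runs OCR on it, and saves the recognized text to a .txt file.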



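using System.Drawing; // Bitmap

// Folder that contains the OCR '.traineddata' files.
String ocrSource = @"D:\Alice\DLL\Source\";
OCRHandler.SetTrainResourcePath(ocrSource);

// Open the scanned PDF and render its first page (index 0) to a bitmap.
PDFDocument pdf = new PDFDocument(@"C:\input.pdf");
BasePage page = pdf.GetPage(0);
Bitmap bmp = page.ConvertToImage();

// Run OCR on the bitmap and save the recognized text.
OCRPage ocrPage = OCRHandler.Import(bmp);
ocrPage.Recognize();
ocrPage.SaveTo(MIMEType.TXT, @"C:\output.txt");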
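The demo above handles only the first page. To get text from the whole document, the same calls can be repeated for each page. The snippet below is a minimal sketch, not the vendor's reference implementation. It assumes the XDoc.PDF and OCR SDK namespaces are already imported in your project, that OCRHandler.Import accepts a page object as it does in the other examples on this page, and that writing one text file per page is acceptable. The output folder and file names are hypothetical.

```csharp
using System;
using System.IO;
// Assumption: the XDoc.PDF and OCR SDK namespaces are already imported.

String inputFilePath = @"C:\input.pdf";
String outputFolder = @"C:\ocr_output"; // hypothetical output folder

// The folder that contains '.traineddata' files.
OCRHandler.SetTrainResourcePath(@"C:\Source");
Directory.CreateDirectory(outputFolder);

PDFDocument doc = new PDFDocument(inputFilePath);
int pageCount = doc.GetPageCount();

for (int i = 0; i < pageCount; i++)
{
    // Recognize page i (zero-based) and save its text to its own file.
    OCRPage ocrPage = OCRHandler.Import(doc.GetPage(i));
    ocrPage.Recognize();

    // Hypothetical naming scheme: page_1.txt, page_2.txt, ...
    String pagePath = Path.Combine(outputFolder, "page_" + (i + 1) + ".txt");
    ocrPage.SaveTo(MIMEType.TXT, pagePath);
}
```

If a single text file is preferred, the per-page files can be concatenated afterwards with standard System.IO calls.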




Convert a scanned PDF file to an editable PDF document using C#


The following C# example source code shows how to convert a scanned PDF document into an editable PDF file.



String inputFilePath = @"C:\demo_1.pdf";
String outputFilePath = @"C:\output.pdf";

// The folder that contains '.traineddata' files.
OCRHandler.SetTrainResourcePath(@"C:\Source");

PDFDocument doc = new PDFDocument(inputFilePath);
int pageCount = doc.GetPageCount();

// OCR each page and keep the result as a one-page PDF in memory.
MemoryStream[] streams = new MemoryStream[pageCount];
for (int i = 0; i < pageCount; i++)
{
    streams[i] = new MemoryStream();
    OCRPage page = OCRHandler.Import(doc.GetPage(i));
    page.Recognize();
    page.SaveTo(MIMEType.PDF, streams[i]);
}

// Merge the per-page PDFs into one output file, then release the buffers.
PDFDocument.CombineDocument(streams, outputFilePath);
foreach (MemoryStream stream in streams)
{
    stream.Dispose();
}
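
For batch processing, the same per-page loop can be applied to every PDF in a folder. The following is a rough sketch under stated assumptions: the class name BatchOcr, the helper OcrToEditablePdf, and both folder paths are hypothetical, and only the library calls already shown on this page are used. The XDoc.PDF and OCR SDK namespaces are assumed to be imported in your project.

```csharp
using System;
using System.IO;
// Assumption: the XDoc.PDF and OCR SDK namespaces are already imported.

class BatchOcr
{
    // Hypothetical helper: repeats the per-page OCR loop for one file.
    static void OcrToEditablePdf(String inputFilePath, String outputFilePath)
    {
        PDFDocument doc = new PDFDocument(inputFilePath);
        int pageCount = doc.GetPageCount();

        MemoryStream[] streams = new MemoryStream[pageCount];
        for (int i = 0; i < pageCount; i++)
        {
            streams[i] = new MemoryStream();
            OCRPage page = OCRHandler.Import(doc.GetPage(i));
            page.Recognize();
            page.SaveTo(MIMEType.PDF, streams[i]);
        }

        PDFDocument.CombineDocument(streams, outputFilePath);
        foreach (MemoryStream stream in streams)
        {
            stream.Dispose();
        }
    }

    static void Main()
    {
        // The folder that contains '.traineddata' files.
        OCRHandler.SetTrainResourcePath(@"C:\Source");

        String inputFolder = @"C:\scans";      // hypothetical input folder
        String outputFolder = @"C:\scans_ocr"; // hypothetical output folder
        Directory.CreateDirectory(outputFolder);

        // Process every .pdf file directly inside the input folder.
        foreach (String inputFilePath in Directory.EnumerateFiles(inputFolder, "*.pdf"))
        {
            String outputFilePath = Path.Combine(outputFolder, Path.GetFileName(inputFilePath));
            OcrToEditablePdf(inputFilePath, outputFilePath);
        }
    }
}
```

Writing results to a separate output folder keeps the original scans untouched, so a failed run can simply be repeated.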




Convert a scanned PDF file to a Word document (.docx) using C#


The following C# example source code shows how to convert a scanned PDF document into a Microsoft Word document (.docx). It first writes an OCR-processed PDF to a temporary file, then converts that file to .docx.



String inputFilePath = @"C:\demo_1.pdf";
String tempFilePath = @"C:\output.pdf";
String outputFilePath = @"C:\output.docx";

// The folder that contains '.traineddata' files.
OCRHandler.SetTrainResourcePath(@"C:\Source");

PDFDocument doc = new PDFDocument(inputFilePath);
int pageCount = doc.GetPageCount();

// OCR each page and keep the result as a one-page PDF in memory.
MemoryStream[] streams = new MemoryStream[pageCount];
for (int i = 0; i < pageCount; i++)
{
    streams[i] = new MemoryStream();
    OCRPage page = OCRHandler.Import(doc.GetPage(i));
    page.Recognize();
    page.SaveTo(MIMEType.PDF, streams[i]);
}

// Merge the per-page PDFs into the temporary file, then release the buffers.
PDFDocument.CombineDocument(streams, tempFilePath);
foreach (MemoryStream stream in streams)
{
    stream.Dispose();
}

// Convert the OCR-processed PDF to a Word document.
PDFDocument doc1 = new PDFDocument(tempFilePath);
doc1.ConvertToDocument(DocumentType.DOCX, outputFilePath);