PDF OCR VB.NET Library
How to read and extract text from a scanned PDF file using the OCR VB.NET library
VB.NET Tutorial for Using OCR Library to Extract Text from Adobe PDF Document in Visual Basic Class
In this VB.NET tutorial, you will learn how to use the XDoc.PDF and OCR VB.NET libraries to read and extract text content from
a scanned PDF file or from images inside a PDF document.
- Extract text from a scanned PDF file
- Use OCR to convert a scanned PDF file to an editable PDF document or a Microsoft Word document
- Quickly enable PDF OCR features in VB.NET Windows Forms, WPF, and ASP.NET applications
How to OCR PDF using Visual Basic .NET
- Best VB.NET OCR SDK for Visual Studio .NET
- Scan text content from an Adobe PDF document in a Visual Basic .NET application
- Specify any area of a PDF page on which to perform OCR in .NET WinForms and ASP.NET web pages
- .NET library for batch OCR of PDF text content in VB.NET
- Support .NET WinForms, ASP.NET MVC in IIS, ASP.NET Ajax, Azure cloud service, DNN (DotNetNuke), SharePoint
- Recognize the whole PDF document and get all text content in VB.NET
- Recognize a page of PDF document and extract its text content in Visual Basic .NET class
- Recognize scanned PDF file and output OCR result to Adobe PDF file
- Recognize scanned PDF document and output OCR result to MS Word file
- Online VB.NET class source code for evaluation
- Free VB.NET components and controls for downloading and using in .NET framework
How to OCR and read text from a scanned PDF file using VB.NET?
The steps and sample VB.NET code below show how to read text content from a PDF file using Visual Basic.
- Set the OCR resource files path through the OCRHandler.SetTrainResourcePath() method.
- Create a new PDFDocument object with a scanned PDF file loaded.
- Get the first page of the PDF document and convert it to a Bitmap image object.
- Create a new OCRPage object with the PDF page image loaded.
- Call the OCRPage.Recognize() method to scan and extract text from the PDF page.
- Save the extracted text content to a TXT file.
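' Folder that contains the OCR resource files.
Dim ocrSource As String = "D:\Alice\DLL\Source\"
OCRHandler.SetTrainResourcePath(ocrSource)
' Load the scanned PDF and render its first page (index 0) to a bitmap.
Dim pdf As New PDFDocument("C:\input.pdf")
Dim page As BasePage = pdf.GetPage(0)
Dim bmp As Bitmap = page.ConvertToImage()
' Run OCR on the page image and save the recognized text to a TXT file.
Dim ocrPage As OCRPage = OCRHandler.Import(bmp)
ocrPage.Recognize()
ocrPage.SaveTo(MIMEType.TXT, "C:\output.txt")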
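The single-page steps above extend naturally to every page of a document. The following is a minimal sketch, assuming only the XDoc.PDF and OCR calls that already appear on this page (GetPageCount(), GetPage(), ConvertToImage(), OCRHandler.Import(), Recognize(), SaveTo()). The OcrAllPages helper, its parameters, and the page_N.txt file naming are hypothetical and introduced for illustration. This tutorial does not list the Imports statements for the SDK's own namespaces, so they are left as a comment to fill in from your installed DLLs.

```vbnet
Imports System.Drawing
Imports System.IO
' Also add the Imports for the XDoc.PDF and OCR namespaces of your installed SDK
' (not named in this tutorial, so not guessed here).

Module OcrAllPagesSketch
    ' Hypothetical helper: runs OCR on every page of a scanned PDF and
    ' writes one TXT file per page into outputFolder.
    Sub OcrAllPages(inputPdf As String, outputFolder As String, trainDataFolder As String)
        OCRHandler.SetTrainResourcePath(trainDataFolder)
        Directory.CreateDirectory(outputFolder) ' no-op if the folder already exists

        Dim pdf As New PDFDocument(inputPdf)
        For i As Integer = 0 To pdf.GetPageCount() - 1
            ' Render the page to a bitmap, then recognize its text.
            Dim page As BasePage = pdf.GetPage(i)
            Dim bmp As Bitmap = page.ConvertToImage()
            Dim ocrPage As OCRPage = OCRHandler.Import(bmp)
            ocrPage.Recognize()

            ' Page files are numbered from 1 for readability: page_1.txt, page_2.txt, ...
            Dim target As String = Path.Combine(outputFolder, "page_" & (i + 1).ToString() & ".txt")
            ocrPage.SaveTo(MIMEType.TXT, target)
        Next
    End Sub
End Module
```

A call such as OcrAllPages("C:\input.pdf", "C:\ocr_output", "D:\Alice\DLL\Source\") would then produce one text file per page. Writing per-page files keeps the sketch limited to the SaveTo() overload shown above, rather than assuming a way to read the recognized text back as a string.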
Convert scanned PDF file to editable PDF document using VB.NET
The following VB.NET example source code shows how to convert a scanned PDF document into an editable PDF file.
Dim inputFilePath As String = "C:\demo_1.pdf"
Dim outputFilePath As String = "C:\output.pdf"
' The folder that contains '.traineddata' files.
OCRHandler.SetTrainResourcePath("C:\Source")
Dim doc As New PDFDocument(inputFilePath)
Dim pageCount As Integer = doc.GetPageCount()
' One in-memory PDF per recognized page.
Dim streams(pageCount - 1) As MemoryStream
For i As Integer = 0 To pageCount - 1
    streams(i) = New MemoryStream()
    Dim page As OCRPage = OCRHandler.Import(doc.GetPage(i))
    page.Recognize()
    page.SaveTo(MIMEType.PDF, streams(i))
Next
' Merge the per-page PDFs into a single output file.
PDFDocument.CombineDocument(streams, outputFilePath)
Convert scanned PDF file to Word document (.docx) using VB.NET
The following VB.NET example source code shows how to convert a scanned PDF document into a Microsoft Word document (.docx).
Dim inputFilePath As String = "C:\demo_1.pdf"
Dim tempFilePath As String = "C:\output.pdf"
Dim outputFilePath As String = "C:\output.docx"
' The folder that contains '.traineddata' files.
OCRHandler.SetTrainResourcePath("C:\Source")
Dim doc As New PDFDocument(inputFilePath)
Dim pageCount As Integer = doc.GetPageCount()
' One in-memory PDF per recognized page.
Dim streams(pageCount - 1) As MemoryStream
For i As Integer = 0 To pageCount - 1
    streams(i) = New MemoryStream()
    Dim page As OCRPage = OCRHandler.Import(doc.GetPage(i))
    page.Recognize()
    page.SaveTo(MIMEType.PDF, streams(i))
Next
' Merge the per-page PDFs into an intermediate PDF file.
PDFDocument.CombineDocument(streams, tempFilePath)
' Reload the intermediate PDF and convert it to a Word document.
Dim doc1 As New PDFDocument(tempFilePath)
doc1.ConvertToDocument(DocumentType.DOCX, outputFilePath)