
C# PDF Text Reader Library
How to read and extract text from a PDF file using C#


C# demo code to read and extract text from an Adobe PDF document





In this C# tutorial, you will learn how to read and extract text from a PDF file in ASP.NET MVC web and Windows applications.

  • Read text from all pages, from specified pages, or from a region of a PDF page
  • Extract text line by line
  • Read and extract specially formatted text, such as highlighted text content in a PDF





Read text content from a PDF page region using C#


The C# source code below shows how to use the PDFTextMgr class to read text from a region of a PDF page in ASP.NET MVC web and Windows applications.

  • Get a PDFTextMgr object from the method PDFTextHandler.ExportPDFTextManager() for a loaded PDF document
  • Use the method PDFTextMgr.SelectChar() with a PointF to get the text character at a specified position on the first PDF page
  • Also use the method PDFTextMgr.SelectChar() with a RectangleF to get all text characters inside a specified region of the first PDF page



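//  namespaces used by this snippet; also import the SDK namespace
//  that defines PDFDocument, PDFTextMgr and PDFTextHandler
using System;
using System.Collections.Generic;
using System.Drawing;

//  open a document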
String inputFilePath = Program.RootPath + "\\" + "2.pdf";
PDFDocument doc = new PDFDocument(inputFilePath);
//  get a text manager from the document object
PDFTextMgr textMgr = PDFTextHandler.ExportPDFTextManager(doc);

//  get the first page from the document
int pageIndex = 0;
PDFPage page = (PDFPage)doc.GetPage(pageIndex);


//  select char at position (245F, 155F)
PointF cursor = new PointF(245F, 155F);
PDFTextCharacter aChar = textMgr.SelectChar(page, cursor);
if (aChar == null)
{
    Console.WriteLine("No character has been found.");
}
else
{
    Console.WriteLine("Value: " + aChar.GetChar() + "; Boundary: " + aChar.GetBoundary().ToString());
}

//  select chars in the region (250F, 150F, 100F, 100F)
RectangleF region = new RectangleF(250F, 150F, 100F, 100F);
List<PDFTextCharacter> chars = textMgr.SelectChar(page, region);
foreach (PDFTextCharacter obj in chars)
{
    Console.WriteLine("Value: " + obj.GetChar() + "; Boundary: " + obj.GetBoundary().ToString());
}
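
The region query above prints characters one by one. To turn such a list into readable text, the characters usually need to be grouped into lines and ordered left to right. The following is a minimal sketch of that step, not part of the SDK: the tuple (Value, X, Y) is a hypothetical stand-in for the value and boundary position of each PDFTextCharacter, the helper name JoinInReadingOrder and the lineTolerance parameter are made up for illustration, and Y is assumed to grow downward from the top of the page.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Joins characters into text lines, top to bottom and then left to right.
// Each tuple is a hypothetical stand-in for one PDFTextCharacter:
// its value plus the X and Y of its boundary (assumption: Y grows downward).
string JoinInReadingOrder(IEnumerable<(char Value, float X, float Y)> chars, float lineTolerance = 2f)
{
    var lines = new List<List<(char Value, float X, float Y)>>();
    foreach (var ch in chars.OrderBy(c => c.Y))
    {
        // Start a new line when the character sits clearly below the first
        // character of the current line.
        if (lines.Count == 0 || Math.Abs(ch.Y - lines[lines.Count - 1][0].Y) > lineTolerance)
        {
            lines.Add(new List<(char Value, float X, float Y)>());
        }
        lines[lines.Count - 1].Add(ch);
    }

    // Within each line, order characters by X and concatenate their values.
    return string.Join("\n", lines.Select(
        line => new string(line.OrderBy(c => c.X).Select(c => c.Value).ToArray())));
}

// Made-up sample data: two lines whose characters arrive out of order.
var sample = new List<(char Value, float X, float Y)>
{
    ('i', 260f, 150f), ('o', 262f, 170.5f), ('H', 250f, 150.5f), ('N', 250f, 170f),
};
Console.WriteLine(JoinInReadingOrder(sample));
```

With the SDK, the tuples would be filled from obj.GetChar() and obj.GetBoundary() for each selected character. The tolerance of 2 units is an arbitrary choice and would need tuning for the font sizes in a real document.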




Read line text from a PDF page region in C# code


//  reuses textMgr and page from the previous example
//  select a line at 150F from the top of the page
PDFTextLine aLine = textMgr.SelectLine(page, 150F);
if (aLine == null)
{
    Console.WriteLine("No line has been found.");
}
else
{
    Console.WriteLine("Line: " + aLine.GetContent());
}




How to read, extract highlighted text from PDF using C#


The code below applies only to text markup annotations:


  • PDFAnnotHighlight
  • PDFAnnotUnderLine
  • PDFAnnotDeleteLine
  • PDFAnnotTextReplace


String inputFilePath = Program.RootPath + "\\" + "1.pdf";

//  Open the PDF file.
PDFDocument doc = new PDFDocument(inputFilePath);
//  Retrieve all annotations in the document.
List<IPDFAnnot> annots = PDFAnnotHandler.GetAllAnnotations(doc);
foreach (IPDFAnnot annot in annots)
{
    //  For PDFAnnotHighlight, PDFAnnotUnderLine, PDFAnnotDeleteLine and PDFAnnotTextReplace.
    if (annot is IPDFMarkupAnnot)
    {
        //  Get the parent page of the annotation.
        PDFPage page = (PDFPage)doc.GetPage(annot.PageIndex);

        //  Extract text from the target text markup annotation.
        String[] text = PDFAnnotHandler.ExtractText(page, (IPDFMarkupAnnot)annot);
        //  Show the markup text related to the annotation.
        Console.WriteLine("Content: ");
        foreach (String line in text)
        {
            Console.WriteLine(line);
        }
    }
}
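
The extraction above yields one String[] per annotation, with one entry per marked-up line. When a highlight spans several lines, it can be convenient to merge the entries into a single string. Below is a small sketch of that post-processing step; the helper name JoinMarkupText and the sample values are hypothetical, and the code touches only plain strings, not the SDK.

```csharp
using System;
using System.Linq;

// Merges the lines extracted for one markup annotation into a single string.
// Blank entries are dropped and surrounding whitespace is trimmed.
string JoinMarkupText(string[] lines)
{
    return string.Join(" ", lines
        .Where(l => !string.IsNullOrWhiteSpace(l))
        .Select(l => l.Trim()));
}

// Made-up sample mirroring the String[] shape returned per annotation.
string[] sampleLines = { "  highlighted text that", "", "wraps onto a second line " };
Console.WriteLine(JoinMarkupText(sampleLines));
```

Joining with a single space is a simplification: words hyphenated across a line break would need extra handling.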