C# PDF Text Search Library
How to search text and get its coordinates from a PDF file using C# .NET
A C# guide to searching text with regular expressions in a PDF document and obtaining search results with coordinates in C# ASP.NET and Windows applications
In this C# tutorial, you will learn how to search text in a PDF file in C# ASP.NET Core, MVC, web, and Windows applications.
- Search specified text in PDF document, pages, page regions
- Search horizontal or vertical text using regular expressions
- Search and get coordinates of text search results in pdf document
- Search, find, and replace or mark up text within PDF document using C# .NET API for .NET Core and framework.
How to search, get coordinates of text in PDFs programmatically using C#
- Best Visual Studio .NET PDF document SDK, built on .NET Framework 2.0 and compatible with the Windows operating system
- C# PDF text library: C# PDF extract text, replace text in PDF using C#, C# remove text from PDF, C# remove images from PDF, extract image from PDF in C#, how to add image in PDF in C#.
- Free components and libraries are easy to integrate into .NET WinForms and ASP.NET applications for searching Adobe PDF text from a C# class
- Support .NET Core, ASP.NET Core MVC, .NET WinForms, ASP.NET MVC in IIS, ASP.NET Ajax, Azure cloud service, DNN (DotNetNuke), SharePoint
- C# class sample code for searching text from specified PDF pages in .NET console application
- Able to find and get PDF text position details in a C#.NET application
- Search a specified PDF page or the whole document
- Supports searching a PDF file with various search options, such as whole word, ignore case, and match string
- Ability to search and replace PDF text in ASP.NET programmatically
About text search on PDF
Using the XDoc.PDF for .NET SDK, you can easily do text search on a PDF document. You can find and locate text through the following methods:
- Search for and find text
- Use a regular expression to search for and find text
- Find all text inside a page region
- Find the text character at a given page position
Text search options
Using C#, you can run searches to find specific text items in a PDF file. You can run a simple text search, looking for a search term within a list of PDF pages or within a page region.
Or you can use advanced search options when searching a PDF document. The search options, followed by example C# source code:
- WholeWord: When true, finds only occurrences of the complete word. For example, if you search for the word side, the words inside and sides aren't found.
- IgnoreCase: When true, finds occurrences regardless of capitalization. When false, finds only occurrences that match the capitalization you provide. For example, with IgnoreCase set to false, if you search for the word White, the words white and WHITE aren't found.
- ContextExpansion: The number of surrounding characters returned together with the matched text
RESearchOption searchOps = new RESearchOption();
searchOps.MatchString = "RasterEdge";
searchOps.IgnoreCase = true;
searchOps.WholeWord = false;
searchOps.ContextExpansion = 0;
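To make the effect of the WholeWord and IgnoreCase flags concrete, here is a minimal sketch that reproduces the same semantics on a plain string with the standard .NET Regex class. It does not use XDoc.PDF and is not how the library implements its search; the class name SearchOptionSketch and the method Count are hypothetical names chosen for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SearchOptionSketch
{
    // Counts occurrences of a literal term in plain text, mimicking the two flags.
    // Assumption: "whole word" is modelled with regex word boundaries (\b).
    public static int Count(string text, string term, bool ignoreCase, bool wholeWord)
    {
        // Escape the term so it is treated as literal text, not as a pattern.
        string pattern = Regex.Escape(term);
        if (wholeWord)
            pattern = @"\b" + pattern + @"\b";

        RegexOptions options = ignoreCase ? RegexOptions.IgnoreCase : RegexOptions.None;
        return Regex.Matches(text, pattern, options).Count;
    }

    public static void Main()
    {
        string text = "White paint, white paper, WHITE noise, whitewash.";

        Console.WriteLine(Count(text, "White", false, false)); // 1: exact capitalization only
        Console.WriteLine(Count(text, "White", true, false));  // 4: includes the start of "whitewash"
        Console.WriteLine(Count(text, "White", true, true));   // 3: "whitewash" is excluded
    }
}
```

With IgnoreCase on and WholeWord off, the term also matches inside longer words, which is why the second count is higher than the third.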
Text search with regular expression
In C#, you can do advanced text search with a regular expression. The following C# example source code defines a search pattern that matches URLs.
// Search pattern for URL
String pattern = @"\b(\S+)://(\S+)\b";
RegexOptions regexOps = RegexOptions.IgnoreCase;
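Before running the pattern against a PDF, it can help to check what it matches on an ordinary string. The sketch below applies the same pattern and option with the standard .NET Regex class only; it does not call XDoc.PDF. The class name UrlPatternSketch, the method FindUrls, and the sample sentence (including the host files.example.org) are made up for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class UrlPatternSketch
{
    // Returns every substring of the text that matches the URL pattern above.
    public static List<string> FindUrls(string text)
    {
        string pattern = @"\b(\S+)://(\S+)\b";
        var found = new List<string>();
        foreach (Match m in Regex.Matches(text, pattern, RegexOptions.IgnoreCase))
            found.Add(m.Value);
        return found;
    }

    public static void Main()
    {
        string sample = "See https://www.rasteredge.com and FTP://files.example.org/docs today.";
        foreach (string url in FindUrls(sample))
            Console.WriteLine(url);
        // Prints:
        // https://www.rasteredge.com
        // FTP://files.example.org/docs
    }
}
```

Note that the pattern accepts any scheme followed by ://, so it is a loose URL detector rather than a validator.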
Get text search results coordinates
After you do a text search on a PDF file using C#, you will get a list of SearchResultItem objects. Each SearchResultItem object has
a property CombinedResultArea, which contains the text coordinate information. Each entry in it exposes an Area with:
- Area.X: the X value of the top-left point of the text area on the PDF page
- Area.Y: the Y value of the top-left point of the text area on the PDF page
- Area.Width: the width of the text area
- Area.Height: the height of the text area
// Apply searching
SearchResult result = doc.Search(matchString, searchOps, pageOffset, pageCount);
// Show result
if (result.HaveMatched)
{
foreach (SearchResultItem item in result.Result)
{
Console.WriteLine("Matched String: '{0}'", item.MatchedString);
Console.WriteLine("Context String: '{0}'", item.ContextString);
Console.WriteLine("Result Area(s):");
foreach (SearchResultLocation area in item.CombinedResultArea)
Console.WriteLine(" {0}: {1},{2}; W={3}; H={4}", area.PageIndex,
area.Area.X.ToPixel(), area.Area.Y.ToPixel(),
area.Area.Width.ToPixel(), area.Area.Height.ToPixel());
}
}
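The region examples on this page state their units as pixels at 96 dpi, while the PDF format's default user-space unit is the point (1/72 inch). The following sketch shows only the arithmetic for converting between the two. It is not part of XDoc.PDF, and whether the library stores coordinates in points internally is an assumption here; the class name UnitSketch and its methods are hypothetical.

```csharp
using System;

public static class UnitSketch
{
    public const float PointsPerInch = 72f; // default PDF user-space unit
    public const float PixelsPerInch = 96f; // resolution assumed in this tutorial

    // Converts a length in PDF points to pixels at 96 dpi.
    public static float PointsToPixels(float points) => points * PixelsPerInch / PointsPerInch;

    // Converts a length in pixels at 96 dpi back to PDF points.
    public static float PixelsToPoints(float pixels) => pixels * PointsPerInch / PixelsPerInch;

    public static void Main()
    {
        // A US Letter page is 612 x 792 points.
        Console.WriteLine(PointsToPixels(612f)); // 816
        Console.WriteLine(PointsToPixels(792f)); // 1056
        // A 500-pixel-wide region corresponds to 375 points.
        Console.WriteLine(PixelsToPoints(500f)); // 375
    }
}
```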
Using C#, do text search on PDF document
This section explains how to do text search on a whole PDF document, a specified page, a range of pages, or a page region.
C# search text from whole pdf document
The C# code below shows how to do a text search on a pdf document.
#region search text from pdf document
internal static void searchTextFromDocument()
{
String inputFilePath = @"C:\demo.pdf";
// Open a document.
PDFDocument doc = new PDFDocument(inputFilePath);
// Set the search options
RESearchOption option = new RESearchOption();
option.IgnoreCase = true;
option.WholeWord = true;
option.ContextExpansion = 10;
// Search text and save it to SearchResult.
SearchResult results = doc.Search("RasterEdge", option);
}
#endregion
C# search text from specified pdf page
The C# code below shows how to do a text search on a pdf page.
String inputFilePath = @"C:\1.pdf";
// Open file
PDFDocument doc = new PDFDocument(inputFilePath);
// Search text "RasterEdge"
String matchString = "RasterEdge";
// Set search option
RESearchOption searchOps = new RESearchOption();
searchOps.MatchString = matchString;
searchOps.IgnoreCase = true;
searchOps.WholeWord = false;
searchOps.ContextExpansion = 10;
// Set search range (on the first page)
int pageOffset = 0;
int pageCount = 1;
// Apply searching
SearchResult result = doc.Search(matchString, searchOps, pageOffset, pageCount);
// Show result
if (result.HaveMatched)
{
foreach (SearchResultItem item in result.Result)
{
Console.WriteLine("Matched String: '{0}'", item.MatchedString);
Console.WriteLine("Context String: '{0}'", item.ContextString);
Console.WriteLine("Result Area(s):");
foreach (SearchResultLocation area in item.CombinedResultArea)
Console.WriteLine(" {0}: {1},{2}; W={3}; H={4}", area.PageIndex,
area.Area.X.ToPixel(), area.Area.Y.ToPixel(),
area.Area.Width.ToPixel(), area.Area.Height.ToPixel());
}
}
C# search text from consecutive pdf pages
The C# code below shows how to do a text search on a range of consecutive PDF pages.
String inputFilePath = @"C:\1.pdf";
// Open file
PDFDocument doc = new PDFDocument(inputFilePath);
// Search text "RasterEdge"
String matchString = "RasterEdge";
// Set search option
RESearchOption searchOps = new RESearchOption();
searchOps.MatchString = matchString;
searchOps.IgnoreCase = true;
searchOps.WholeWord = false;
searchOps.ContextExpansion = 10;
// Set search page range (from page 1 to 3)
int pageOffset = 0;
int pageCount = 3;
// Apply searching
SearchResult result = doc.Search(matchString, searchOps, pageOffset, pageCount);
// Show result
if (result.HaveMatched)
{
foreach (SearchResultItem item in result.Result)
{
Console.WriteLine("Matched String: '{0}'", item.MatchedString);
Console.WriteLine("Context String: '{0}'", item.ContextString);
Console.WriteLine("Result Area(s):");
foreach (SearchResultLocation area in item.CombinedResultArea)
Console.WriteLine(" {0}: {1},{2}; W={3}; H={4}", area.PageIndex,
area.Area.X.ToPixel(), area.Area.Y.ToPixel(),
area.Area.Width.ToPixel(), area.Area.Height.ToPixel());
}
}
Search text from the specified page region
The C# code below shows how to do a text search on a PDF page region.
String inputFilePath = @"C:\1.pdf";
// Open file
PDFDocument doc = new PDFDocument(inputFilePath);
// Search text "RasterEdge"
String matchString = "RasterEdge";
// Set search option
RESearchOption searchOps = new RESearchOption();
searchOps.MatchString = matchString;
searchOps.IgnoreCase = true;
searchOps.WholeWord = false;
searchOps.ContextExpansion = 10;
// Set target page region in the 1st page.
int pageIndex = 0;
// Region: start point (0,0), width = 500, height = 300. Unit: pixel (at 96 dpi).
RectangleF pageRegion = new RectangleF(0, 0, 500, 300);
// Apply searching
SearchResult result = doc.Search(matchString, searchOps, pageIndex, pageRegion);
// Show result
if (result.HaveMatched)
{
foreach (SearchResultItem item in result.Result)
{
Console.WriteLine("Matched String: '{0}'", item.MatchedString);
Console.WriteLine("Context String: '{0}'", item.ContextString);
Console.WriteLine("Result Area(s):");
foreach (SearchResultLocation area in item.CombinedResultArea)
Console.WriteLine(" {0}: {1},{2}; W={3}; H={4}", area.PageIndex,
area.Area.X.ToPixel(), area.Area.Y.ToPixel(),
area.Area.Width.ToPixel(), area.Area.Height.ToPixel());
}
}
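When working with page regions, it is useful to be clear about whether a text area lies fully inside a region or merely overlaps it. The sketch below shows both tests using the standard System.Drawing.RectangleF type. It does not call XDoc.PDF, and which of the two rules the library applies in a region search is not stated on this page; the class name RegionSketch and the sample rectangles are hypothetical.

```csharp
using System;
using System.Drawing;

public static class RegionSketch
{
    // True when the text area lies completely within the region.
    public static bool FullyInside(RectangleF region, RectangleF textArea) => region.Contains(textArea);

    // True when the text area and the region share any area at all.
    public static bool Overlaps(RectangleF region, RectangleF textArea) => region.IntersectsWith(textArea);

    public static void Main()
    {
        // Same region as in the example above: origin (0,0), width 500, height 300.
        RectangleF region = new RectangleF(0, 0, 500, 300);

        RectangleF inside = new RectangleF(100, 50, 80, 12);     // well within the region
        RectangleF straddling = new RectangleF(480, 290, 60, 12); // crosses the right and bottom edges
        RectangleF outside = new RectangleF(520, 400, 60, 12);   // entirely outside

        Console.WriteLine("{0} {1}", FullyInside(region, inside), Overlaps(region, inside));         // True True
        Console.WriteLine("{0} {1}", FullyInside(region, straddling), Overlaps(region, straddling)); // False True
        Console.WriteLine("{0} {1}", FullyInside(region, outside), Overlaps(region, outside));       // False False
    }
}
```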
Using C#, do text search with regular expression on PDF document
This section explains how to do text search with a regular expression on specified PDF pages or on a page region.
Search text with regular expression from the specified page(s)
The C# code below shows how to do a text search with regular expression on pdf pages.
String inputFilePath = @"C:\1.pdf";
// Open file
PDFDocument doc = new PDFDocument(inputFilePath);
// Search pattern for URL
String pattern = @"\b(\S+)://(\S+)\b";
RegexOptions regexOps = RegexOptions.IgnoreCase;
// Set search range (from page 1 to 3)
int pageOffset = 0;
int pageCount = 3;
// Apply searching
MatchResult result = doc.Search(pattern, regexOps, pageOffset, pageCount);
// Show result
if (result.HaveMatched)
{
foreach (SearchResultItem item in result.GetResult())
{
Console.WriteLine("Matched String: '{0}'", item.MatchedString);
Console.WriteLine("Context String: '{0}'", item.ContextString);
Console.WriteLine("Result Area(s):");
foreach (SearchResultLocation area in item.CombinedResultArea)
Console.WriteLine(" {0}: {1},{2}; W={3}; H={4}", area.PageIndex,
area.Area.X.ToPixel(), area.Area.Y.ToPixel(),
area.Area.Width.ToPixel(), area.Area.Height.ToPixel());
}
}
else
Console.WriteLine("No Matched Item");
Search text with regular expression from specified page region
The C# code below shows how to do a text search with regular expression on a pdf page region.
String inputFilePath = @"C:\1.pdf";
// Open file
PDFDocument doc = new PDFDocument(inputFilePath);
// Search pattern for URL
String pattern = @"\b(\S+)://(\S+)\b";
RegexOptions regexOps = RegexOptions.IgnoreCase;
// Set target page region in the 1st page.
int pageIndex = 0;
// Region: start point (0,0), width = 500, height = 300. Unit: pixel (at 96 dpi).
RectangleF pageRegion = new RectangleF(0, 0, 500, 300);
// Apply searching
MatchResult result = doc.Search(pattern, regexOps, pageIndex, pageRegion);
// Show result
if (result.HaveMatched)
{
foreach (SearchResultItem item in result.GetResult())
{
Console.WriteLine("Matched String: '{0}'", item.MatchedString);
Console.WriteLine("Context String: '{0}'", item.ContextString);
Console.WriteLine("Result Area(s):");
foreach (SearchResultLocation area in item.CombinedResultArea)
Console.WriteLine(" {0}: {1},{2}; W={3}; H={4}", area.PageIndex,
area.Area.X.ToPixel(), area.Area.Y.ToPixel(),
area.Area.Width.ToPixel(), area.Area.Height.ToPixel());
}
}
else
Console.WriteLine("No Matched Item");
C# search and replace text from pdf document
The C# code below shows how to do a text search and replace on a PDF. If you need to know more about
the text replace function in PDF, please go to
https://www.rasteredge.com/how-to/csharp-imaging/pdf-text-edit-replace/.
#region search and replace text from pdf document
internal static void searchAndReplaceTextFromDocument()
{
String inputFilePath = @"C:\demo.pdf";
// Open a document.
PDFDocument doc = new PDFDocument(inputFilePath);
// Set the search options.
RESearchOption option = new RESearchOption();
option.IgnoreCase = true;
option.WholeWord = true;
option.ContextExpansion = 10;
// Replace "RasterEdge" with "Image".
doc.Replace("RasterEdge", "Image", option);
doc.Save(@"C:\output.pdf");
}
#endregion
Frequently Asked Questions
How do you search text in a PDF file?
You can do text search in a PDF file in a PDF reader or editor program. Press "Ctrl + F" to start a text search.
Using the RasterEdge C# PDF library, you can search for target text in a PDF document with one line of C# code in your ASP.NET web application.
How to search and extract specific text from a PDF?
Once you have found the specific text in a PDF file, you can copy and paste the search result into other document applications, such as Microsoft Word.
Using the RasterEdge EdgePDF ASP.NET PDF Editor web app, you can do text search, then copy and paste the found text with its formatting into Microsoft Word.
The EdgePDF ASP.NET PDF web control is based on the XDoc.PDF C# library.
Why is my PDF not letting me search text?
If the PDF document owner has restricted the document to prevent text search, or the PDF document does not include text content and all text is rendered in one or several images,
you cannot search text directly in the PDF document.
Using the C# PDF library, you can do text search on a scanned PDF with the XImage.OCR SDK.
What's the difference between a PDF and a searchable PDF?
A searchable PDF contains a text layer that can be selected and searched. A PDF that is not searchable is usually a scanned PDF file which contains images only. To do text search on a scanned PDF file, you need
to convert the scanned PDF file to a text PDF file using an OCR engine.
Using the XDoc.PDF and XImage.OCR C# libraries, you can easily do text search on a scanned PDF file in your ASP.NET, MVC, and Blazor web applications.
How do I know if a PDF is text searchable?
You can try to select text or add highlight annotations to it. If you cannot select or highlight any text in the PDF document, then the PDF file is not
searchable, and it likely contains images only.
Using the C# PDF library, you can search for target text in both searchable PDF files and scanned PDF files in your ASP.NET and WinForms applications.
How to search for highlighted text in PDF?
Open a PDF file with highlighted text in the EdgePDF web demo application in a browser. Open the right panel to view all annotations in the PDF file.
Once you choose and click a text annotation, the highlighted text will be shown in the PDF document.
EdgePDF is an ASP.NET PDF viewer and editor web control, developed using the RasterEdge C# PDF library.
Is it possible to search for metadata in a PDF?
You can export the PDF metadata into an XMP XML file. Then you can search the XMP file in an XML text editor.
Using the C# PDF library, you can read the PDF metadata into a PDFMetadata or XmlDocument object and explore the data inside the metadata object
in your C# ASP.NET, MVC, web, and Windows applications.