Using OcrFormatProvider
| Minimum version | Q1 2025 |
|---|---|
Use OcrFormatProvider to recognize text in scanned images and turn that image content into searchable PDF content. The provider imports an image, runs an Optical Character Recognition (OCR) engine over it, and returns a RadFixedPage that you can place in a PDF document.
By default, OcrFormatProvider works with TesseractOcrProvider, which uses the third-party Tesseract OCR engine. If the default provider does not match your platform or deployment model, you can plug in a custom OCR provider implementation instead.
See the PdfProcessing Optical Character Recognition demo for a working example of the feature.
Review the OCR prerequisites article before you start. That article contains the required packages, tessdata setup, native Tesseract dependencies, and Linux-specific installation steps.
Use this article when you need to:
- Convert a scanned image into a searchable PDF page.
- Configure TesseractOcrProvider for one or more recognition languages.
- Understand which platforms support the default Tesseract-based provider.
- Decide when to use a custom OCR provider.
- Diagnose common OCR setup and recognition issues.
Supported Platforms
The default TesseractOcrProvider implementation is supported on Windows and Linux. If your application runs on a platform where the default Tesseract integration is unavailable, use a custom OCR provider.
For Blazor WebAssembly, do not assume that the default Tesseract-based provider is available. Review the package guidance in OCR prerequisites and use a custom provider when your deployment model cannot use the default native Tesseract integration.
If your application uses cross-platform image processing dependencies, also review Cross-Platform Support - Images.
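The platform rule above can be checked at startup before any OCR code runs. The following is a minimal sketch that uses only the standard .NET RuntimeInformation API. The OcrPlatformCheck class and its IsDefaultProviderPlatform methods are hypothetical helpers introduced for illustration and are not part of the Telerik API.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical helper (not part of the Telerik API).
public static class OcrPlatformCheck
{
    // Pure decision logic: the article lists Windows and Linux as the
    // platforms supported by the default Tesseract-based provider.
    public static bool IsDefaultProviderPlatform(bool isWindows, bool isLinux)
    {
        return isWindows || isLinux;
    }

    // Applies the same rule to the operating system the process is running on.
    public static bool IsDefaultProviderPlatform()
    {
        return IsDefaultProviderPlatform(
            RuntimeInformation.IsOSPlatform(OSPlatform.Windows),
            RuntimeInformation.IsOSPlatform(OSPlatform.Linux));
    }
}
```

When OcrPlatformCheck.IsDefaultProviderPlatform() returns false, the application would typically fall back to a custom OCR provider. A true result only confirms the operating system. The native Tesseract dependencies from OCR prerequisites must still be installed.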
End-to-End OCR Workflow
In a typical OCR workflow, the input is a scanned image file and the output is a searchable PDF page inside a RadFixedDocument:
- Set up the required OCR packages and native dependencies.
- Place the tessdata folder in a location that your application can access.
- Create a TesseractOcrProvider and point it to the parent folder that contains tessdata.
- Configure the OCR options, such as language codes and parse level.
- Create an OcrFormatProvider with that OCR provider.
- Import the image and get a RadFixedPage result.
- Add the returned page to a RadFixedDocument and export the document to PDF if needed.
This flow gives you a PDF page that contains the recognized text in a searchable form instead of a plain image-only scan.
TesseractOcrProvider Public API
The following example shows the usual setup for OcrFormatProvider and TesseractOcrProvider.
// Requirement for Images in .NET Standard - https://docs.telerik.com/devtools/document-processing/libraries/radpdfprocessing/cross-platform/images
//FixedExtensibilityManager.ImagePropertiesResolver = new ImagePropertiesResolver();
TesseractOcrProvider tesseractOcrProvider = new TesseractOcrProvider(".");
tesseractOcrProvider.LanguageCodes = new List<string>() { "eng" };
//tesseractOcrProvider.CorrectVerticalPosition = false; // Available in .NET Standard
tesseractOcrProvider.DataPath = @"..\..\..\";
tesseractOcrProvider.ParseLevel = OcrParseLevel.Line;
string imagePath = @"..\..\..\images\image.png";
string imageText = tesseractOcrProvider.GetAllTextFromImage(File.ReadAllBytes(imagePath));
Dictionary<Rectangle, string> imageTextAndTextDimensions = tesseractOcrProvider.GetTextFromImage(File.ReadAllBytes(imagePath));
OcrFormatProvider ocrFormatProvider = new OcrFormatProvider(tesseractOcrProvider);
RadFixedDocument document = new RadFixedDocument();
RadFixedPage page;
// Dispose the image stream as soon as the import completes.
using (FileStream imageStream = File.OpenRead(imagePath))
{
    page = ocrFormatProvider.Import(imageStream, null);
}
document.Pages.Add(page);
string outputPath = "output.pdf";
PdfFormatProvider pdfFormatProvider = new PdfFormatProvider();
// File.Create truncates an existing file, so no stale bytes remain after the new PDF content.
using (Stream output = File.Create(outputPath))
{
    pdfFormatProvider.Export(document, output, TimeSpan.FromSeconds(10));
}
After the import completes, the provider returns a RadFixedPage that you can add to a document, inspect, or export.
TesseractOcrProvider Settings
Use these members to control how the default OCR engine behaves:
| Method or property | Description |
|---|---|
| TesseractOcrProvider(string dataPath) | Initializes the provider with the path to the parent directory that contains the tessdata folder. |
| LanguageCodes | Sets the language codes for the Tesseract OCR engine. The default value is eng. Download the required trained data files from the Tesseract tessdata repository. |
| CorrectVerticalPosition | Tries to correct the vertical position of the recognized text. This option is not available in .NET Framework. |
| DataPath | Gets or sets the path to the parent directory that contains the tessdata folder. |
| ParseLevel | Controls whether OCR parsing returns text by line or by word through OcrParseLevel.Line or OcrParseLevel.Word. |
| GetAllTextFromImage | Extracts all recognized text from an image and returns it as a single string. |
| GetTextFromImage | Extracts recognized text and returns words together with their bounding rectangles. |
Setup Tips for Better OCR Results
Check these setup points before you troubleshoot recognition quality:
- Use images with 300 DPI when possible.
- Make sure the tessdata folder contains eng.traineddata and any additional languages that you configure in LanguageCodes.
- Pass the parent folder of tessdata to TesseractOcrProvider, not the tessdata folder itself.
- Verify that the required native Tesseract files are available on the target machine.
- On Linux, complete the extra installation steps from OCR prerequisites.
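The tessdata checks in the list above can be automated before the provider is created. The following is a minimal sketch that uses only System.IO. The TessdataCheck class and its FindMissingLanguages method are hypothetical helpers introduced for illustration and are not part of the Telerik API. The sketch assumes the layout described in this article: a parent folder that contains a tessdata folder with one .traineddata file per language code.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical helper (not part of the Telerik API).
public static class TessdataCheck
{
    // Returns the language codes whose "<code>.traineddata" file is missing
    // from "<parentFolder>/tessdata". An empty list means that every
    // configured language has its trained data file in place.
    public static List<string> FindMissingLanguages(
        string parentFolder, IEnumerable<string> languageCodes)
    {
        // The provider expects the parent folder, so tessdata is appended here.
        string tessdataFolder = Path.Combine(parentFolder, "tessdata");

        return languageCodes
            .Where(code => !File.Exists(
                Path.Combine(tessdataFolder, code + ".traineddata")))
            .ToList();
    }
}
```

Call the helper with the same parent folder and language codes that you pass to TesseractOcrProvider, for example TessdataCheck.FindMissingLanguages(".", new[] { "eng" }), and log or throw when the returned list is not empty. Because the helper appends tessdata itself, it also catches the mistake of passing the tessdata folder instead of its parent.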
Troubleshooting
Use these checks when OCR does not behave as expected:
- OCR initialization fails: Confirm that the required packages and native dependencies are installed.
- No text is recognized: Verify image quality, DPI, and whether the correct trained language files are present in tessdata.
- Wrong language is recognized: Check the LanguageCodes setting and confirm that the matching .traineddata files are deployed.
- Text positions look off: Try the CorrectVerticalPosition option when the runtime supports it.
- The feature does not run on the target platform: Confirm whether the application uses the default Windows and Linux Tesseract integration or whether it needs a custom OCR provider.
Next Steps
Continue with the article that matches your next task:
- Use OCR prerequisites to complete package and native dependency setup.
- Review Implementing a Custom OCR Provider if you need a different OCR engine or unsupported platform coverage.
- Use Extracting Text from PDF Documents when you need to process searchable PDF content after OCR.
A complete example that implements OcrFormatProvider is available in the Document Processing SDK repository.