Here’s what you need to get started with Telerik Document Processing Libraries to work with PDF, Word and Excel files (and, like any good suite, make all those document types look very much alike).
Progress Telerik Document Processing Libraries, in addition to letting you work with a variety of document formats (PDF, DOCX, RTF, HTML, XLSX and more), are an example of the reason you buy into a suite of tools: All the tools bear a “family resemblance.”
The ideal scenario, of course, would be a single tool that made all these document formats look the same. Given the differences in format and functionality between, for example, a PDF, a Microsoft Word (or RTF or HTML) document and an Excel spreadsheet, that’s not reasonable (though Progress Telerik has achieved that with DOCX, RTF and HTML documents).
The good news here is that, with Telerik Document Processing Libraries (DPL), the family resemblance is strong enough that, for all of the document types the library supports, I can show you how to load documents, start the editing process, convert between various document types and save a document in this one post.
The sample code in this post was all written using the DPL for Windows libraries, even though I was working in ASP.NET Core (the more technical name for the version I used is “.NET (Target OS: Windows)”). The suite is also available for the .NET Framework.
To create an ASP.NET Core project that would work with “all the documents,” I added these NuGet packages to my project:
For the Excel spreadsheets, I’m only going to work with XLSX files, so I added the Telerik.Windows.Documents.Spreadsheet.FormatProviders.OpenXml package to my project. (If I were going to work with, for example, XLS spreadsheets, then I would have added the Telerik.Documents.Spreadsheet.FormatProviders.Xls package.)
The code to load a document from a file into any of these libraries is very similar:
Use the File object’s OpenRead method to create a Stream that points to the file.
Use the provider’s Import method to load the stream into the document object.
Here, for example, is the code to load a PDF file into a RadFixedDocument object (I used this code in an ASP.NET Core application with documents in my project’s wwwroot folder):
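using Telerik.Windows.Documents.Fixed.FormatProviders.Pdf;
using Telerik.Windows.Documents.Fixed.Model;

RadFixedDocument doc;
PdfFormatProvider prov = new();
// Open the file as a read-only stream; the using block closes it after the import
using (Stream str = File.OpenRead(@"wwwroot/documents/Priorities.pdf"))
{
    // The TimeSpan argument is the timeout allowed for the import
    doc = prov.Import(str, TimeSpan.FromSeconds(10));
}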
And here’s the code to load a DOCX file into a RadFlowDocument:
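using Telerik.Windows.Documents.Flow.FormatProviders.Docx;
using Telerik.Windows.Documents.Flow.Model;

RadFlowDocument doc;
DocxFormatProvider prov = new();
using (Stream str = File.OpenRead(@"wwwroot/documents/Priorities.docx"))
{
    doc = prov.Import(str, TimeSpan.FromSeconds(10));
}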
As you can see, the code is identical except for the provider (PdfFormatProvider vs. DocxFormatProvider) and document objects (RadFixedDocument vs. RadFlowDocument).
Because both RTF and HTML documents load into the same RadFlowDocument object as a DOCX document, only the provider object changes (HtmlFormatProvider or RtfFormatProvider instead of DocxFormatProvider) when working with those file formats. Here’s the code to load an RTF document:
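using Telerik.Windows.Documents.Flow.FormatProviders.Rtf;
using Telerik.Windows.Documents.Flow.Model;

RadFlowDocument doc;
RtfFormatProvider prov = new();
using (Stream str = File.OpenRead(@"wwwroot/documents/Priorities.rtf"))
{
    doc = prov.Import(str, TimeSpan.FromSeconds(10));
}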
And here’s the almost identical code to load an HTML file:
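using Telerik.Windows.Documents.Flow.FormatProviders.Html;
using Telerik.Windows.Documents.Flow.Model;

RadFlowDocument doc;
HtmlFormatProvider prov = new();
using (Stream str = File.OpenRead(@"wwwroot/documents/Priorities.html"))
{
    doc = prov.Import(str, TimeSpan.FromSeconds(10));
}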
The code to load an Excel workbook is also similar to what you’ve seen before, just swapping in a new document object (Workbook) and provider (XlsxFormatProvider):
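using Telerik.Windows.Documents.Spreadsheet.FormatProviders.OpenXml.Xlsx;
using Telerik.Windows.Documents.Spreadsheet.Model;

Workbook doc;
XlsxFormatProvider prov = new();
using (Stream str = File.OpenRead(@"wwwroot/documents/priority.xlsx"))
{
    doc = prov.Import(str, TimeSpan.FromSeconds(10));
}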
A note: The Workbook object assumes that you’re going to load the whole Excel workbook into memory. For very large workbooks, that may not make sense. For that scenario, you should look at the SpreadStreamProcessing Library.
Once you’ve loaded the documents, you can start working with them. You can often simplify your code by using the RadFixedDocumentEditor with PDF documents or the RadFlowDocumentEditor with Word/RTF/HTML documents. Not surprisingly, the code for creating an editor is almost identical for these two document types: create an editor object and pass the document you want loaded into the editor.
The code to create an editor for a PDF document looks like this:
RadFixedDocumentEditor editor = new RadFixedDocumentEditor(doc);
The code for Word/RTF/HTML documents looks like this:
RadFlowDocumentEditor editor = new RadFlowDocumentEditor(doc);
That’s not to say that, as you start working with those documents, there aren’t going to be differences. These are, after all, very different kinds of documents. Having said that, some functionality does work in a similar way across all the document types.
If, for example, I want to search a PDF document for the text “ASP.NET,” I create a TextSearch instance from my document object. I then use the TextSearch object’s FindAll method to search for text in my PDF document, passing two things: my search text and a TextSearchOptions object that specifies how I want my search conducted. That FindAll method returns a collection of SearchResult objects that I can loop through.
Typical code, then, looks like this:
using System.Diagnostics;
using Telerik.Windows.Documents.Fixed.Search;

TextSearch search = new TextSearch(doc);
TextSearchOptions opts = new()
{
    CaseSensitive = false,
    WholeWordsOnly = true,
    // Literal search: as a regular expression, the "." in "ASP.NET" would match any character
    UseRegularExpression = false
};
IEnumerable<SearchResult> items = search.FindAll("ASP.NET", opts);
Debug.Print($"Found {items.Count()} items.");
foreach (SearchResult item in items)
{
    Debug.Print($"Found at {item.Range.StartPosition} ");
    Debug.Print($"Found: {item.Result}");
}
The process is almost identical for the RadFlowDocument object, except:
You call the FindAll method directly from the RadFlowDocumentEditor.
You don’t create a separate options object (though the options from the TextSearch object are still available).
The FindAll method on the editor returns FindResult objects instead of SearchResult objects.
As a result, the equivalent search code for a DOCX, RTF or HTML document looks like this:
RadFlowDocumentEditor editor = new RadFlowDocumentEditor(doc);
IEnumerable<FindResult> items = editor.FindAll("ASP.NET");
Debug.Print($"Found {items.Count()} items.");
foreach (FindResult item in items)
{
    Debug.Print($"Found at {item.RelativeStartIndex} ");
    Debug.Print($"Found: {item.FullMatchText}");
}
Searching a spreadsheet works similarly. The differences:
The FindAll method is built right into the Workbook object.
My find code with a workbook would look like this:
FindOptions opts = new()
{
    FindWhat = "ASP.NET",
    MatchCase = false,
    MatchEntireCellContents = true,
};
IEnumerable<FindResult> items = doc.FindAll(opts);
Debug.Print($"Found {items.Count()} items.");
foreach (FindResult item in items)
{
    Debug.Print($"Found at {item.FoundCell.CellIndex} ");
    Debug.Print($"Found: {item.ResultValue}");
}
One note: It is certainly convenient when objects from the different libraries share the same name (like the FindResult object that’s defined in both the RadFlowDocument and Workbook libraries). However, if you try to use both FindResult objects in the same code file, the compiler will report an ambiguous reference, because two classes with the same name are in scope from different namespaces. In the unlikely case that you’re working with both Excel and Word documents in the same code file, you’ll have to fully qualify the object names, which is something I haven’t done in this post.
To save your modified documents back to disk, you just need to use the provider’s Export method. The code for all the document types is identical:
using (Stream str = File.Create(@"wwwroot/documents/Prioritiesnew.<filetype>"))
{
    prov.Export(doc, str, TimeSpan.FromSeconds(10));
}
You can convert from one type in the suite to other types and, not surprisingly, the conversion processes look very much alike. As you’ve seen before, it often comes down to using the right provider.
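As a minimal sketch of that idea, here is one way a DOCX file might be converted to RTF using only the Import and Export calls shown earlier: load the file with one provider, then export the same RadFlowDocument with another. The output file name is a made-up example, and the using directives assume the Flow library’s usual namespaces.

```csharp
using System;
using System.IO;
using Telerik.Windows.Documents.Flow.FormatProviders.Docx;
using Telerik.Windows.Documents.Flow.FormatProviders.Rtf;
using Telerik.Windows.Documents.Flow.Model;

// Load the DOCX file into a RadFlowDocument, as in the loading examples
RadFlowDocument doc;
DocxFormatProvider docxProv = new();
using (Stream input = File.OpenRead(@"wwwroot/documents/Priorities.docx"))
{
    doc = docxProv.Import(input, TimeSpan.FromSeconds(10));
}

// Export the same document object through a different provider to convert it
// (the output file name is hypothetical)
RtfFormatProvider rtfProv = new();
using (Stream output = File.Create(@"wwwroot/documents/Priorities-converted.rtf"))
{
    rtfProv.Export(doc, output, TimeSpan.FromSeconds(10));
}
```

Swapping RtfFormatProvider for HtmlFormatProvider (and changing the file extension) should give an HTML version in the same way, since all three formats share the RadFlowDocument object.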
If, for example, you want a plain text version of your PDF file, you use the TextFormatProvider object’s Export method:
TextFormatProvider prov = new();
string text = prov.Export(doc, TimeSpan.FromSeconds(10));
For Word/HTML/RTF document types, the code is almost identical except it uses the TxtFormatProvider object:
TxtFormatProvider txProv = new();
string text = txProv.Export(doc, TimeSpan.FromSeconds(10));
One note: There is a downside to giving the same name to classes that do similar things to different document types. If you are mixing document types and, as a result, using the Fixed, Flow and spreadsheet libraries in the same application, the compiler can get confused about which class from which library you’re using. If so, you’ll have to fully qualify your class names by including their namespaces in the class names. That makes for hard-to-read code, so I haven’t done that here.
But, as an example of how different documents require different functionality, you probably wouldn’t ever want to convert an Excel workbook to a string … but you might want to save your workbook as a CSV file. As you might expect by now, the code to save your imported workbook into a CSV file just means using the Export method on the appropriate provider—the CsvFormatProvider object in this case.
Typical code would look like this:
CsvFormatProvider prov = new();
using (Stream str = File.Create("Priority.csv"))
{
    prov.Export(doc, str, TimeSpan.FromSeconds(10));
}
Converting any of these document types (Excel, Word, HTML, etc.) to PDF is equally straightforward. Because all the libraries look very much alike, it’s really just a matter of adding the library with the provider you need.
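A rough sketch of what that might look like for a Word document follows. It assumes the Telerik.Windows.Documents.Flow.FormatProviders.Pdf package is installed and that its PdfFormatProvider accepts the same Export arguments used in the save example above. The output file name is hypothetical.

```csharp
using System;
using System.IO;
using Telerik.Windows.Documents.Flow.FormatProviders.Docx;
using Telerik.Windows.Documents.Flow.FormatProviders.Pdf;
using Telerik.Windows.Documents.Flow.Model;

// Load the Word document exactly as in the loading examples
RadFlowDocument doc;
DocxFormatProvider docxProv = new();
using (Stream input = File.OpenRead(@"wwwroot/documents/Priorities.docx"))
{
    doc = docxProv.Import(input, TimeSpan.FromSeconds(10));
}

// This PdfFormatProvider is the Flow library's provider (assumed from the
// using directive above), so it takes a RadFlowDocument and writes a PDF
PdfFormatProvider pdfProv = new();
using (Stream output = File.Create(@"wwwroot/documents/Priorities-converted.pdf"))
{
    pdfProv.Export(doc, output, TimeSpan.FromSeconds(10));
}
```

Only the provider and its package change here; the import and export calls keep the same shape as in the earlier examples.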
But you can also convert your Workbook object into a RadFixedDocument if you want to manipulate your spreadsheet as a PDF object. That conversion is handled by using the PdfFormatProvider from the Telerik.Windows.Documents.Spreadsheet.FormatProviders.Pdf package and its ExportToFixedDocument method:
PdfFormatProvider prov = new();
RadFixedDocument fixedDoc =
    prov.ExportToFixedDocument(doc, TimeSpan.FromSeconds(10));
The code is identical if you want to convert a Word/RTF/HTML document to a RadFixedDocument to work with it as a PDF file. That conversion also uses a PdfFormatProvider object but, this time, from the Telerik.Windows.Documents.Flow.FormatProviders.Pdf namespace.
But, because the providers from the two libraries have the same name, if you’re using both libraries in the same code file, you will need to fully qualify your provider names to make sure you’re getting the appropriate PdfFormatProvider.
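One way to keep that readable is a C# using alias, which gives each same-named provider its own short name instead of repeating the full namespace at every use. The sketch below is only an illustration: the alias names and the PdfConversions class are hypothetical, and the namespaces are assumed to match the package names mentioned above.

```csharp
using System;
using Telerik.Windows.Documents.Fixed.Model;
using Telerik.Windows.Documents.Flow.Model;
using Telerik.Windows.Documents.Spreadsheet.Model;
// Aliases (hypothetical names): one unambiguous name per PdfFormatProvider class
using FlowPdfProvider = Telerik.Windows.Documents.Flow.FormatProviders.Pdf.PdfFormatProvider;
using SheetPdfProvider = Telerik.Windows.Documents.Spreadsheet.FormatProviders.Pdf.PdfFormatProvider;

public static class PdfConversions
{
    // Converts a workbook using the spreadsheet library's provider
    public static RadFixedDocument FromWorkbook(Workbook workbook)
    {
        SheetPdfProvider prov = new();
        return prov.ExportToFixedDocument(workbook, TimeSpan.FromSeconds(10));
    }

    // Converts a Word/RTF/HTML document using the Flow library's provider
    public static RadFixedDocument FromFlowDocument(RadFlowDocument flowDoc)
    {
        FlowPdfProvider prov = new();
        return prov.ExportToFixedDocument(flowDoc, TimeSpan.FromSeconds(10));
    }
}
```

The same aliasing approach should work for any other class name that the libraries share, such as the two FindResult classes.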
Of course, once you start using these tools to create or modify the documents you’ve loaded, you’ll find more differences: the functionality in a spreadsheet is very different from the functionality in an HTML document. But, while the family resemblance across this suite won’t eliminate those differences, it does cut them down to what matters: how those documents differ in their functionality. Which is, after all, what you want.
Peter Vogel is both the author of the Coding Azure series and the instructor for Coding Azure in the Classroom. Peter’s company provides full-stack development from UX design through object modeling to database design. Peter holds multiple certifications in Azure administration, architecture, development and security and is a Microsoft Certified Trainer.