PDF Stream Processing—Reliable and Efficient Processing of PDF Files

by Deyan Yosifov

Published: February 01, 2017. Last Updated: May 27, 2021. Categories: Productivity, Document Processing.


An innovative approach for fast, memory-efficient and reliable processing of PDF files, part of the latest Document Processing Library in R1 2017.

Have you ever had to work with a large number of PDF files? Have you had to process the pages of these files by merging, splitting or adding new pages, or by adding content to existing pages? Last but not least, since these files may contain sensitive client data, have you needed to guarantee the reliability of your application by ensuring that all page content is preserved unmodified? If so, you should definitely look into our R1 2017 release, as we've just unveiled our best tool ever for handling these document processing scenarios!

The new PDF Stream Writer functionality (available for UI for ASP.NET AJAX and MVC, as well as UI for WPF, WinForms and Silverlight) is an addition to the RadPdfProcessing library and provides a completely new approach to dealing with PDF files. While the previous approach relied on building a PDF document model in memory through the RadFixedDocument class, the new API allows you to read from and write to PDF file streams directly, without keeping unnecessary data in memory. This stream processing approach is the key to the results the new API achieves.

These results can be summarized as three benefits: great performance, a minimized memory footprint and guaranteed reliability.

How to Use PdfStreamWriter

So, how do you use the new API? It provides four main new classes: PdfStreamWriter, PdfPageStreamWriter, PdfFileSource and PdfPageSource. The first two are responsible for writing the new PDF file and the pages in that file, respectively. The other two are responsible for reading existing PDF files and existing PDF pages.

For example, if you have several PDF files and need to merge their pages into a newly created PDF file, you can use code similar to the following:

    using System.IO;
    using Telerik.Windows.Documents.Fixed.FormatProviders.Pdf.Streaming;

    // Create a PdfStreamWriter instance, responsible for writing the document into the specified file.
    // Note: File.OpenWrite does not truncate an existing file; use File.Create if the file may already exist.
    using (PdfStreamWriter fileWriter = new PdfStreamWriter(File.OpenWrite(resultFileName)))
    {
        // Iterate through the files you would like to merge
        foreach (string documentName in documentsToMerge)
        {
            // Open each of the files
            using (PdfFileSource fileToMerge = new PdfFileSource(File.OpenRead(documentName)))
            {
                // Iterate through the pages of the current document
                foreach (PdfPageSource pageToMerge in fileToMerge.Pages)
                {
                    // Append the current page to the fileWriter, which holds the result FileStream
                    fileWriter.WritePage(pageToMerge);
                }
            }
        }
    }
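Splitting works the same way in reverse. The following is a minimal sketch, not taken from the SDK examples, that writes each page of a source file into its own single-page PDF. The PdfSplitter class, the GetPageFileName helper and the "page-N.pdf" naming scheme are hypothetical names introduced for illustration; only the PdfStreamWriter, PdfFileSource and PdfPageSource calls come from the API described above.

```csharp
using System.IO;
using Telerik.Windows.Documents.Fixed.FormatProviders.Pdf.Streaming;

static class PdfSplitter
{
    // Hypothetical naming scheme: page-1.pdf, page-2.pdf, ...
    public static string GetPageFileName(string outputDirectory, int pageIndex)
    {
        return Path.Combine(outputDirectory, "page-" + (pageIndex + 1) + ".pdf");
    }

    // Writes every page of sourcePath into its own single-page PDF file.
    public static void Split(string sourcePath, string outputDirectory)
    {
        Directory.CreateDirectory(outputDirectory);

        using (PdfFileSource source = new PdfFileSource(File.OpenRead(sourcePath)))
        {
            int pageIndex = 0;
            foreach (PdfPageSource page in source.Pages)
            {
                string pagePath = GetPageFileName(outputDirectory, pageIndex++);

                // File.Create truncates any existing file with the same name.
                using (PdfStreamWriter pageFileWriter = new PdfStreamWriter(File.Create(pagePath)))
                {
                    pageFileWriter.WritePage(page);
                }
            }
        }
    }
}
```

Because each output writer is disposed before the next page is read, only one output file is open at any time.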

If you have a more complex scenario, in which you need to position the content of multiple pages on a single page, then instead of calling the WritePage method you can use the BeginPage method as shown below:

    // fileWriter is the PdfStreamWriter instance from the previous example
    Size size = new Size(700, 1200);
    Rotation rotation = Rotation.Rotate270;

    using (PdfPageStreamWriter pageWriter = fileWriter.BeginPage(size, rotation))
    {
        // Use the pageWriter object to fill the content of the page.
    }

This allows you to combine and position both existing page content, through the PdfPageSource class, and newly generated RadFixedPage content created with the existing RadPdfProcessing editing API. More examples of merging, splitting and combining page content can be seen in the ManipulatePages SDK example.
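As a rough illustration of filling a page through the pageWriter, the sketch below places the first two pages of a file next to each other on one wider page. It assumes the WPF build of the library (so Size comes from System.Windows), a source file with at least two unrotated pages, and that the ContentPosition property and the WriteContent method of PdfPageStreamWriter behave as in the ManipulatePages SDK example. The PageCombiner class and its method name are hypothetical.

```csharp
using System;
using System.IO;
using System.Windows; // Size; assumption: WPF build of the library
using Telerik.Windows.Documents.Fixed.FormatProviders.Pdf.Streaming;

static class PageCombiner
{
    // Places the first two pages of sourcePath side by side on a single page.
    public static void PlaceSideBySide(string sourcePath, string resultPath)
    {
        using (PdfStreamWriter fileWriter = new PdfStreamWriter(File.Create(resultPath)))
        using (PdfFileSource source = new PdfFileSource(File.OpenRead(sourcePath)))
        {
            PdfPageSource left = source.Pages[0];
            PdfPageSource right = source.Pages[1];

            // The new page is as wide as both pages together and as tall as the taller one.
            Size size = new Size(
                left.Size.Width + right.Size.Width,
                Math.Max(left.Size.Height, right.Size.Height));

            using (PdfPageStreamWriter pageWriter = fileWriter.BeginPage(size))
            {
                // The first page is written at the default position.
                pageWriter.WriteContent(left);

                // Shift the position by the width of the first page before writing the second one.
                pageWriter.ContentPosition.Translate(left.Size.Width, 0);
                pageWriter.WriteContent(right);
            }
        }
    }
}
```

The position is set before each WriteContent call, so the same pattern extends to scaling or rotating content before it is written.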

Performance Efficiency

The PdfStreamWriter and PdfFileSource classes write and read PDF objects directly to and from PDF file streams. These PDF objects are simply copied without decompressing the PDF data, and they are reused when possible. Together, these two facts account for the high performance of the PdfStreamWriter class. As an example, take a look at the PdfStreamWriterPerformance SDK example, which merges a single-page PDF file 10,000 times and generates the resulting PDF file in less than a second! Impressive, right?
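If you would like to measure this on your own files, a sketch along the following lines may help. It is not the code of the SDK example: the MergeTiming class and its method are hypothetical, and the timings you get will depend on your machine and on the source file.

```csharp
using System;
using System.Diagnostics;
using System.IO;
using Telerik.Windows.Documents.Fixed.FormatProviders.Pdf.Streaming;

static class MergeTiming
{
    // Writes the first page of sourcePath into resultPath `copies` times
    // and returns how long the whole operation took.
    public static TimeSpan MergeRepeatedly(string sourcePath, string resultPath, int copies)
    {
        Stopwatch stopwatch = Stopwatch.StartNew();

        using (PdfStreamWriter fileWriter = new PdfStreamWriter(File.Create(resultPath)))
        using (PdfFileSource source = new PdfFileSource(File.OpenRead(sourcePath)))
        {
            PdfPageSource page = source.Pages[0];
            for (int i = 0; i < copies; i++)
            {
                fileWriter.WritePage(page);
            }
        }

        // Stop only after both objects are disposed, so finalizing the output file is included.
        stopwatch.Stop();
        return stopwatch.Elapsed;
    }
}
```

A call such as MergeTiming.MergeRepeatedly("input.pdf", "result.pdf", 10000) mirrors the scenario described above, with "input.pdf" standing in for any single-page file.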

Memory Footprint

Reading from and writing to FileStream instances directly is the key to the low memory usage. The only memory used is for copying objects from one PDF file to the resulting one. In addition, the PdfStreamWriter, PdfPageStreamWriter and PdfFileSource classes implement the IDisposable interface, so all resources are released as soon as you no longer need them, which keeps the memory footprint to a minimum. As a result, writing a single-page PDF file and writing a multi-page PDF file consume practically the same amount of memory, because each page is written directly to the resulting stream as soon as it is ready.

Reliability and PDF Features Support

PdfStreamWriter simply copies page content and the related resources from one file to another. This means that the new API does not depend on understanding any complex PDF features and it supports practically all page-related PDF features. This guarantees reliability in preserving the existing content, without modifying it or losing any data.

As an example, take a look at the PDF file from the picture below. This file contains sound, video and 3D interactive content, none of which is supported by the previous RadPdfProcessing model. However, as PdfStreamWriter is independent of that model, it successfully preserves all page content after processing it.

Merging this file with pages from other PDF files is shown in the ManipulatePages SDK example.

Learn More

Now that you know about the performance, memory efficiency and reliability of the PDF Stream Processing functionality, you might be interested in learning more about how to use it. In that case, take a look at our documentation, where you can find a detailed description of what can and cannot be achieved with this new API, as well as the settings you can use to customize the produced PDF files. Additionally, make sure to check out our GitHub SDK examples.

Let us know in the comments below whether you find this API useful and whether you think it needs additional features or improvements in the future. If you are interested in learning more about our Document Processing Libraries, be sure to take a look at the related blog posts below.


About the Author

Deyan Yosifov

Deyan is an architect, principal software developer and mathematics enthusiast. He joined the Telerik team in 2013 and has since participated in the development of several different projects—Document Processing Libraries, RadPdfViewer and RadSpreadProcessing WPF controls, and most recently in Telerik AR/VR. He is passionate about 3D technologies and loves solving challenging problems.
