Over the past few days I worked on an image caching mechanism in the PDF Rendering Extension, and I am happy to announce that the reduction in output size turned out to be substantial.
When a report contains the same binary image many times, the PDF Rendering Extension has no way of knowing that it is actually one and the same image, so it renders it many times and wastes disk space. To illustrate, imagine that you have your company's logo in the page header and your report is 200 pages long.
I've implemented a central storage for all images that need to be rendered in PDF. It is a simple generic Dictionary. For the key I used a class that stores a reference to the original System.Drawing.Image and the source rectangle from that image that needs to be drawn into the PDF document. Together, these two things uniquely identify a PDF image. In order to use that class as a key for my dictionary, I overrode its GetHashCode and Equals methods. The hash code for the object is easy to calculate: it relies on a CRC32 checksum.
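As a rough illustration of such a key, here is a minimal sketch in Python rather than .NET, where `__hash__` and `__eq__` play the roles of GetHashCode and Equals. The `ImageKey` class, the use of raw bytes as a stand-in for System.Drawing.Image, and the way the CRC32 is combined with the rectangle are all assumptions made for the example, not the actual implementation.

```python
import zlib


class ImageKey:
    """Hypothetical cache key: image data plus the source rectangle drawn from it."""

    def __init__(self, image_bytes, source_rect):
        self.image_bytes = image_bytes         # stand-in for System.Drawing.Image
        self.source_rect = tuple(source_rect)  # (x, y, width, height)

    def __eq__(self, other):
        if not isinstance(other, ImageKey):
            return NotImplemented
        return (self.source_rect == other.source_rect
                and self.image_bytes == other.image_bytes)

    def __hash__(self):
        # Assumption: a CRC32 of the image data combined with the rectangle.
        return hash((zlib.crc32(self.image_bytes), self.source_rect))


logo = bytes([137, 80, 78, 71] * 64)  # fake image data for the example
pdf_images = {}                       # central storage: key -> PDF image object

# The same logo arrives once per page of a 200-page report.
for page in range(200):
    key = ImageKey(bytes(logo), (0, 0, 120, 40))
    if key not in pdf_images:
        pdf_images[key] = f"pdf-image-{len(pdf_images)}"
```

After the loop the dictionary holds a single entry, because every key built from the same bytes and the same rectangle hashes and compares equal.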
Now CRC32 might sound intimidating, but in fact it is a well-known algorithm that can be implemented in about 20 to 30 lines of code. Each time a new image arrives, its hash is computed and the dictionary is searched to see whether we already have it in there; if we do, we simply use the existing image. Of course, .NET does all of the above for us, since we've done our job by overriding the Equals and GetHashCode methods of the key class.
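To give a feel for how small CRC32 really is, here is a sketch of the common table-driven variant (the reflected polynomial 0xEDB88320, as used by zlib and PNG), written in Python. Which variant the extension actually uses is an assumption on my part; the function names are made up for the example.

```python
import zlib


def make_crc32_table():
    """Precompute the 256-entry lookup table for the reflected polynomial 0xEDB88320."""
    table = []
    for byte in range(256):
        crc = byte
        for _ in range(8):
            # Shift right one bit; XOR in the polynomial when the low bit was set.
            crc = (crc >> 1) ^ 0xEDB88320 if crc & 1 else crc >> 1
        table.append(crc)
    return table


_CRC_TABLE = make_crc32_table()


def crc32(data):
    """Return the CRC32 checksum of a bytes-like object as an unsigned 32-bit integer."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc = (crc >> 8) ^ _CRC_TABLE[(crc ^ byte) & 0xFF]
    return crc ^ 0xFFFFFFFF


# Sanity check against the standard library implementation.
sample = b"company-logo-bytes"
matches_zlib = crc32(sample) == zlib.crc32(sample)
```

The table is built once, so each input byte costs one lookup, one shift and two XORs.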
Out of curiosity, I decided to export all of our sample reports to PDF and compare the results with and without caching. With image caching the output size fell by 45%, which is sweet, but there is more. Often there is a tradeoff between speed and size. I expected the rendering speed to decrease, since now I am checking the image cache each time a new image arrives. But the speed increased by 4%. How can that be?
The explanation is simple. The cache checks do slow the whole thing down, and there is no doubt about that. However, each time an image is streamed to PDF, some metadata has to be read from it in order to determine color spaces, palettes, etc. Reading this metadata takes some amount of time. Now that we have an image cache, metadata extraction happens only once for each image in the cache, unlike before, when it occurred for every image, no matter whether it was redundant.
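The effect is easy to see in a toy model. The sketch below, in Python, counts how often the slow path runs; `ImageCache`, `_extract_metadata` and the returned dictionary are hypothetical placeholders for the example and do not mirror the real extension code.

```python
class ImageCache:
    """Hypothetical cache: metadata is extracted only on a cache miss."""

    def __init__(self):
        self._entries = {}
        self.extractions = 0  # counts how often the slow path runs

    def _extract_metadata(self, image_bytes):
        # Placeholder for the expensive step (color space, palette, ...).
        self.extractions += 1
        return {"size": len(image_bytes)}

    def get(self, image_bytes, source_rect):
        key = (image_bytes, source_rect)  # bytes and tuples are hashable
        entry = self._entries.get(key)
        if entry is None:
            entry = self._extract_metadata(image_bytes)
            self._entries[key] = entry
        return entry


cache = ImageCache()
logo = b"\x89PNG-logo-bytes"
chart = b"\x89PNG-chart-bytes"

for page in range(200):
    cache.get(logo, (0, 0, 120, 40))  # same header logo on every page
cache.get(chart, (0, 0, 400, 300))    # one unique image
```

Here 201 requests trigger only two extractions, one per distinct image, which is where the time saved on redundant images comes from.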
So in the end we have achieved a considerable size reduction topped with a slight speed increase. I have already checked in these changes, so they should be available in the next release. Enjoy!