This is a migrated thread and some comments may be shown as answers.

Weird characters converted from PDF

PdfProcessing

4 Answers, 1 is accepted

answered on 25 Oct 2017, 11:01 AM

PdfProcessingImportTest.zip

Hello,

Thank you for sharing the document.

I tested it but couldn't reproduce the behavior you are observing: the exported document looks exactly like the imported one. I tested with the latest version of the library. Which version are you using? I am attaching a sample that demonstrates the test. Could you please check it and let me know whether the issue is reproducible with this demo, or whether I am missing something?

Regards,
Tanya
Progress Telerik

Accepted

answered on 07 Nov 2017, 08:30 AM

Hi,

Thank you for the additional information.

The issue you are observing is caused by an encoding mapping that is not supported yet: the ToUnicode CMap. We have already logged a request to implement it, and you can vote for it using this public item.
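For context, a ToUnicode CMap is the table in a PDF font that maps the font's own character codes to Unicode values. The following is a minimal sketch of that idea only, not the PdfProcessing or RadPdfViewer API: the module name, the function names, and the three-entry table are all hypothetical and made up for illustration.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text

Module ToUnicodeSketch
    ' Hypothetical stand-in for the entries of a ToUnicode CMap:
    ' each font-specific character code maps to a Unicode string.
    Function BuildSampleMap() As Dictionary(Of Integer, String)
        Dim map As New Dictionary(Of Integer, String)()
        map(&H1) = "H"
        map(&H2) = "i"
        map(&H3) = "!"
        Return map
    End Function

    ' Decodes character codes through the map.
    ' Codes without an entry become U+FFFD (the replacement character).
    Function DecodeWithMap(codes As Integer(), map As Dictionary(Of Integer, String)) As String
        Dim result As New StringBuilder()
        For Each code As Integer In codes
            Dim mapped As String = Nothing
            If map.TryGetValue(code, mapped) Then
                result.Append(mapped)
            Else
                result.Append(ChrW(&HFFFD))
            End If
        Next
        Return result.ToString()
    End Function

    ' Decodes without the map by treating each code as a Unicode code point.
    ' For fonts with a custom encoding this yields unrelated characters.
    Function DecodeNaively(codes As Integer()) As String
        Dim result As New StringBuilder()
        For Each code As Integer In codes
            result.Append(ChrW(code))
        Next
        Return result.ToString()
    End Function

    Sub Main()
        Dim codes As Integer() = {&H1, &H2, &H3}
        ' Prints "Hi!" because every code has an entry in the sample map.
        Console.WriteLine(DecodeWithMap(codes, BuildSampleMap()))
    End Sub
End Module
```

In this sketch, DecodeNaively returns control characters for the same codes, which is the kind of output to expect when the mapping is ignored during import.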

From the code, it seems that what you need is to extract the text from the document. If so, you can replace the PdfProcessing provider with the one used in RadPdfViewer for WPF and use TextFormatProvider directly:

' Import the document with the PdfFormatProvider used by RadPdfViewer for WPF.
Dim provider As New PdfFormatProvider(File.OpenRead("../../sample_pdf.pdf"), FormatProviderSettings.ReadAllAtOnce)
Dim document As RadFixedDocument = provider.Import()

' Export the text content of the imported document.
Dim txtProvider As New TextFormatProvider()
Dim documentTextContent = txtProvider.Export(document)

Please keep in mind that the RadFixedDocument object returned by this format provider is different from the one used in RadPdfProcessing. Although we are actively working on unifying the two models (the viewer's and the library's), they are still different at this point.

Hope this helps.

Regards,
Tanya
Progress Telerik