We are using PdfProcessing to read a PDF file.
The code used:
However, some of the text is converted to weird characters.
Here is the document used: http://upload.ximnet.com.my/huisheng/sample_pdf.pdf
The text is converted from "DATO’ SERI AHMAD HUSNI MOHAMAD HANADZLAH" to "'A72’ 6(5, A+0A' +861, 02+A0A' +A1A'ZLAH"
Is there any setting that we need to set in the code?
Thank you for sharing the document.
I tested it but couldn't reproduce the behavior you are observing - the exported document looks exactly like the imported one. I am testing with the latest version of the library. Which version are you using? I am attaching a sample demonstrating the test - can you please check it and let me know whether the issue is reproducible with this demo, or whether I am missing something?
We are trying to extract the text to be indexed by our search engine.
The code that we used is:
We get the weird characters in the string. Is this the correct way to extract text from a PDF?
Thank you for the additional information.
The issue you are observing is related to an unsupported encoding mapping - ToUnicode CMap. We have already logged a request to implement it and you can vote for it using this public item.
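To illustrate why a missing ToUnicode CMap produces this kind of output, here is a minimal Python sketch (not Telerik code). It assumes a hypothetical four-entry mapping, reconstructed from the garbled sample in the question rather than read from the actual font in the PDF; the function names are made up for illustration.

```python
# Hypothetical ToUnicode-style table: font character code -> Unicode text.
# ASSUMPTION: these four entries are inferred from the garbled sample
# ("DATO" showing up as "'A72"), not extracted from the real document.
TO_UNICODE = {0x27: "D", 0x41: "A", 0x37: "T", 0x32: "O"}


def decode_naive(codes: bytes) -> str:
    """Treat each character code as if it were already a Latin-1 character.

    This is roughly what happens when the mapping is ignored."""
    return codes.decode("latin-1")


def decode_with_cmap(codes: bytes, cmap: dict[int, str]) -> str:
    """Translate each character code through the mapping table.

    Codes without an entry become U+FFFD (replacement character)."""
    return "".join(cmap.get(code, "\ufffd") for code in codes)


# Character codes as they might appear in the page content stream.
codes = bytes([0x27, 0x41, 0x37, 0x32])

print(decode_naive(codes))                  # 'A72
print(decode_with_cmap(codes, TO_UNICODE))  # DATO
```

The page still renders correctly in a viewer because the font maps each code to the right glyph shape; only text extraction depends on the code-to-Unicode table.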
From the code, it seems that what you need is to extract the text from the document. If so, you can replace the PdfProcessing provider with the one used in RadPdfViewer for WPF and use TextFormatProvider directly:
Please keep in mind that the RadFixedDocument object returned by this format provider is different from the one used in RadPdfProcessing. Although we are actively working on unifying both models (of the viewer and of the library), they are still different at this point.
Hope this helps.
We will use the TextFormatProvider for now.