XiMnet Malaysia
asked on 24 Oct 2017, 02:32 AM
Hi,
We are using PdfProcessing to read a PDF file.
The code used:
Dim document As RadFixedDocument = provider.Import(stream)
However, some of the text is converted to weird characters.
Here is the document used: http://upload.ximnet.com.my/huisheng/sample_pdf.pdf
The text is converted from "DATO’ SERI AHMAD HUSNI MOHAMAD HANADZLAH" to "'A72’ 6(5, A+0A' +861, 02+A0A' +A1A'ZLAH"
Is there any setting that we need to set in the code?
Thanks.
4 Answers, 1 is accepted
Hello,
Thank you for sharing the document.
I tested it but couldn't reproduce the behavior you are observing: the exported document looks exactly like the imported one. I am testing with the latest version of the library. Which version are you using? I am attaching a sample demonstrating the test. Can you please check it and let me know whether the issue is reproducible with this demo, or whether I am missing something?
Regards,
Tanya
Progress Telerik
XiMnet Malaysia
answered on 02 Nov 2017, 07:58 AM
Hi Tanya,
We are trying to extract the text to be indexed by our search engine.
The code that we used is:
For Each page As RadFixedPage In document.Pages
    For Each elem As ContentElementBase In page.Content
        textDocument = textDocument + elem.ToString
    Next
Next
We get the weird characters in the resulting string. Is this the correct way to extract text from a PDF?
Thanks.
Accepted
Hi,
Thank you for the additional information.
The issue you are observing is related to an unsupported encoding mapping - ToUnicode CMap. We have already logged a request to implement it and you can vote for it using this public item.
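To illustrate what a ToUnicode CMap does, here is a minimal VB.NET sketch. It is not PdfProcessing code: the module and function names are hypothetical, and the mapping is an assumption inferred only from the garbled sample in the question, where most letters appear shifted down by 29 code points (for example "SERI" came out as "6(5,"). The letter "A" and the trailing "ZLAH" in the posted sample do not follow that pattern, so they are left out. A real CMap is a lookup table embedded in the PDF, not a fixed offset.

```vb
Imports System
Imports System.Collections.Generic
Imports System.Text

Module ToUnicodeSketch
    ' Hypothetical stand-in for a font's ToUnicode CMap: character code -> Unicode character.
    ' Assumption: the codes seen in the sample sit 29 below the Unicode value.
    Function BuildToUnicodeMap() As Dictionary(Of Integer, Char)
        Dim map As New Dictionary(Of Integer, Char)()
        For Each ch As Char In "DEHIMNORSTU"
            map(AscW(ch) - 29) = ch
        Next
        Return map
    End Function

    ' Translates raw character codes through the map.
    ' Codes without an entry are passed through unchanged.
    Function Decode(codes As String, map As Dictionary(Of Integer, Char)) As String
        Dim sb As New StringBuilder()
        For Each c As Char In codes
            Dim mapped As Char
            If map.TryGetValue(AscW(c), mapped) Then
                sb.Append(mapped)
            Else
                sb.Append(c)
            End If
        Next
        Return sb.ToString()
    End Function

    Sub Main()
        Dim map = BuildToUnicodeMap()
        Console.WriteLine(Decode("6(5,", map))   ' prints SERI
        Console.WriteLine(Decode("+861,", map))  ' prints HUSNI
    End Sub
End Module
```

When an importer skips this lookup and treats the raw codes as text, the result is the kind of output shown in the question.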
From the code, it seems that what you need is to extract the text from the document. If so, you can replace the PdfProcessing provider with the one used in RadPdfViewer for WPF and use TextFormatProvider directly:
Dim provider As New PdfFormatProvider(File.OpenRead("../../sample_pdf.pdf"), FormatProviderSettings.ReadAllAtOnce)
Dim document As RadFixedDocument = provider.Import()
Dim txtProvider As New TextFormatProvider()
Dim documentTextContent = txtProvider.Export(document)
Please keep in mind that the RadFixedDocument object returned by this format provider is different from the one used in RadPdfProcessing. Although we are actively working on unifying both models (of the viewer and of the library), they are still different at this point.
Hope this helps.
Regards,
Tanya
Progress Telerik
XiMnet Malaysia
answered on 07 Nov 2017, 09:14 AM
Thanks, Tanya.
We will use the TextFormatProvider for now.