This is a migrated thread and some comments may be shown as answers.

Weird characters converted from PDF

XiMnet Malaysia asked on 24 Oct 2017, 02:32 AM

 

Hi,

We are using PdfProcessing to read a PDF file.

The code used:

Dim document As RadFixedDocument = provider.Import(stream)

However, some of the text is converted to weird characters.

Here is the document used: http://upload.ximnet.com.my/huisheng/sample_pdf.pdf

For example, the text "DATO’ SERI AHMAD HUSNI MOHAMAD HANADZLAH" is converted to "'A72’ 6(5, A+0A' +861, 02+A0A' +A1A'ZLAH".

Is there a setting that we need to configure in the code?

Thanks.

 

Tanya (Telerik team) answered on 25 Oct 2017, 11:01 AM
Hello,

Thank you for sharing the document.

I tested it but could not reproduce the behavior you describe: the exported document looks exactly like the imported one. I tested with the latest version of the library; which version are you using? I have attached a sample that demonstrates the test. Could you check whether the issue is reproducible with it, or let me know if I am missing something?

Regards,
Tanya
Progress Telerik

XiMnet Malaysia answered on 02 Nov 2017, 07:58 AM

Hi Tanya,

We are trying to extract the text to be indexed by our search engine.
The code we use is:

For Each page As RadFixedPage In document.Pages
    For Each elem As ContentElementBase In page.Content
        textDocument &= elem.ToString()
    Next
Next

 

The resulting string contains the weird characters. Is this the correct way to extract text from a PDF?

Thanks.

 

Tanya (Telerik team) answered on 07 Nov 2017, 08:30 AM (accepted answer)
Hi,

Thank you for the additional information.

The issue you are observing is caused by an encoding mapping that PdfProcessing does not currently support: the ToUnicode CMap. We have already logged a request to implement it, and you can vote for it using this public item.
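
To illustrate what a ToUnicode CMap provides, here is a minimal sketch that uses no Telerik APIs. A PDF content stream stores character codes that select glyphs in an embedded font, and the ToUnicode CMap is the table that maps those codes back to Unicode. The table and codes below are assumptions: they were chosen so that the output matches the "SERI" / "6(5," pair from the question, and were not read from the sample document. The module and function names are hypothetical as well.

```vb
Imports System
Imports System.Collections.Generic
Imports System.Text

Module ToUnicodeSketch
    ' Hypothetical ToUnicode table: character code -> Unicode character.
    ' Assumed values for illustration; not read from sample_pdf.pdf.
    Private ReadOnly ToUnicode As New Dictionary(Of Integer, Char) From {
        {&H36, "S"c}, {&H28, "E"c}, {&H35, "R"c}, {&H2C, "I"c}
    }

    ' Decode with the table; codes without an entry become U+FFFD.
    Function DecodeWithCMap(codes As Integer()) As String
        Dim sb As New StringBuilder()
        For Each code As Integer In codes
            Dim ch As Char
            If ToUnicode.TryGetValue(code, ch) Then
                sb.Append(ch)
            Else
                sb.Append(ChrW(&HFFFD))
            End If
        Next
        Return sb.ToString()
    End Function

    ' Decode without the table: each code is treated as a character directly.
    Function DecodeWithoutCMap(codes As Integer()) As String
        Dim sb As New StringBuilder()
        For Each code As Integer In codes
            sb.Append(ChrW(code))
        Next
        Return sb.ToString()
    End Function

    Sub Main()
        Dim codes As Integer() = New Integer() {&H36, &H28, &H35, &H2C}
        Console.WriteLine(DecodeWithCMap(codes))    ' SERI
        Console.WriteLine(DecodeWithoutCMap(codes)) ' 6(5,
    End Sub
End Module
```

With the table applied, the four codes decode to SERI; without it, the same codes read as 6(5, which is the kind of output an importer produces when it ignores the mapping.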

From the code, it seems that what you need is to extract the text from the document. If so, you can replace the PdfProcessing provider with the one used in RadPdfViewer for WPF and use TextFormatProvider directly:

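' Import with the PdfFormatProvider used by RadPdfViewer for WPF, reading the whole file at once
Dim provider As New PdfFormatProvider(File.OpenRead("../../sample_pdf.pdf"), FormatProviderSettings.ReadAllAtOnce)
Dim document As RadFixedDocument = provider.Import()

' Export the imported document as plain text
Dim txtProvider As New TextFormatProvider()
Dim documentTextContent = txtProvider.Export(document)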

Please keep in mind that the RadFixedDocument object returned by this format provider is different from the one used in RadPdfProcessing. Although we are actively working on unifying both models (the viewer's and the library's), they are still different at this point.

Hope this helps.

Regards,
Tanya
Progress Telerik

XiMnet Malaysia answered on 07 Nov 2017, 09:14 AM

Thanks, Tanya.

We will use the TextFormatProvider for now.

 
