This is a migrated thread and some comments may be shown as answers.

Walk through the page to find a certain text string

PdfProcessing
W asked on 01 May 2015, 12:19 PM

Hi,

I am trying to use PdfProcessing to find a certain text string on the first page.

Is there a way to walk through the content of the page?

 

Dim P As RadFixedPage = Doc.Pages.Item(0)

P. ???

13 Answers, 1 is accepted

Shaun answered on 05 May 2015, 02:28 PM

Hopefully something like this should work for you.

This is a copy/paste with slight alterations in a text editor, so might not be fully accurate but should give you something to go on.

 

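Dim PDFText As String = GetPDFText()
If PDFText <> "" Then
    'Split on the whole line-break sequence; CChar(Environment.NewLine) would keep only its first character
    Dim PDFLines() As String = PDFText.Split(New String() {Environment.NewLine}, StringSplitOptions.None)
    'Do something
Else
    'Failed to open pdf and retrieve text content
End If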

 

 

 

'Requires Imports System.IO. Filename is assumed to be a field holding the path of the PDF file.
Private Function GetPDFText() As String
    Dim Result As String = ""

    Dim br As BinaryReader = Nothing
    Dim MS As MemoryStream = Nothing
    Dim bData As Byte()
    Dim RFD As RadFixedDocument
    Try
        br = New BinaryReader(File.OpenRead(Filename))
        bData = br.ReadBytes(CInt(br.BaseStream.Length))
        'The constructor already wraps the bytes; writing them again would leave the position at the end of the stream
        MS = New MemoryStream(bData, 0, bData.Length)

        RFD = New PdfFormatProvider(MS, FormatProviderSettings.ReadOnDemand).Import()

        'Document loaded
        Try
            'Processing document for text
            Dim TFP As New TextFormatProvider()
            Result = TFP.Export(RFD)
        Catch exText As Exception
            'Incompatible PDF
        End Try
    Catch exMain As Exception
        'Error processing PDF
    Finally
        'Either object may still be Nothing if opening the file failed
        If MS IsNot Nothing Then MS.Dispose()
        If br IsNot Nothing Then br.Dispose()
    End Try
    Return Result
End Function

 

Petya (Telerik team) answered on 05 May 2015, 03:58 PM
Hello,

Shaun's suggestion could help you extract the textual content of a PDF file. Note, however, that the TextFormatProvider and PdfFormatProvider classes (the latter when initialized with a settings parameter) belong to the RadPdfViewer control, not to PdfProcessing.

A similar mechanism is not yet implemented for PdfProcessing, although the feature is in our backlog. Still, if you want to get all the text on a page, you could use an approach similar to this one:
foreach (var page in document.Pages)
{
    foreach (var contentElement in page.Content)
    {
        if (contentElement is TextFragment)
        {
            string text = (contentElement as TextFragment).Text;
        }
    }
}

Appending the strings together should result in the text within the document. Since PDF is a fixed flat document format, it does not preserve the text in it in a way a text-based format would. The search engines in PDF viewers use algorithms that try to obtain the original text, but this is not always possible and some loss of content might occur with both approaches.
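
Applied to the original question (searching only the first page), the loop above could look roughly like the sketch below. It assumes the document is imported through the parameterless PdfFormatProvider constructor and Import(Stream); FirstPageSearch, FirstPageContains, path and searchText are hypothetical names used only for illustration. The fragments are appended without separators, so a match can still be missed if the PDF stores its text in an unexpected order.

```csharp
using System;
using System.IO;
using System.Text;
using Telerik.Windows.Documents.Fixed.FormatProviders.Pdf;
using Telerik.Windows.Documents.Fixed.Model;
using Telerik.Windows.Documents.Fixed.Model.Text;

public static class FirstPageSearch
{
    // Hypothetical helper: returns true if the text of the first page contains searchText.
    public static bool FirstPageContains(string path, string searchText)
    {
        // The stream is kept open until all fragments have been read.
        using (Stream input = File.OpenRead(path))
        {
            RadFixedDocument document = new PdfFormatProvider().Import(input);
            if (document.Pages.Count == 0)
            {
                return false;
            }

            // Append the text of every TextFragment on the first page.
            var pageText = new StringBuilder();
            foreach (var contentElement in document.Pages[0].Content)
            {
                var fragment = contentElement as TextFragment;
                if (fragment != null)
                {
                    pageText.Append(fragment.Text);
                }
            }

            // Ordinal comparison: case-sensitive, culture-independent.
            return pageText.ToString().IndexOf(searchText, StringComparison.Ordinal) >= 0;
        }
    }
}
```

A call such as FirstPageSearch.FirstPageContains("input.pdf", "Invoice") would then report whether the string occurs on the first page; both arguments here are placeholder values.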

I hope this helps.

Regards,
Petya
Telerik
 


 
W answered on 06 May 2015, 05:41 AM

Hi

Thanks for your help.

Is it better to use RadPdfViewer instead?

I need to find a certain piece of text, adjust it, and save the document again.

Deyan (Telerik team) answered on 08 May 2015, 08:41 AM
Hello,

Using RadPdfViewer, you can only extract the text content from a PDF file. If you need to modify this text content, the only possible approach is to use RadPdfProcessing, where you can change the text value of the imported text fragments and export the modified content.
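
A minimal sketch of that import, modify and export round trip follows, assuming the RadPdfProcessing PdfFormatProvider with its Import(Stream) and Export(RadFixedDocument, Stream) methods. FragmentEditor, ReplaceInFragments and all parameter names are hypothetical and only illustrate the idea. The text is replaced only when it lies entirely within a single TextFragment; text split across several fragments is not handled.

```csharp
using System.IO;
using Telerik.Windows.Documents.Fixed.FormatProviders.Pdf;
using Telerik.Windows.Documents.Fixed.Model;
using Telerik.Windows.Documents.Fixed.Model.Text;

public static class FragmentEditor
{
    // Hypothetical helper: replaces oldText with newText in every text fragment
    // that contains it, writes the result to outputPath, and returns the number
    // of fragments that were changed.
    public static int ReplaceInFragments(string inputPath, string outputPath,
                                         string oldText, string newText)
    {
        var provider = new PdfFormatProvider();
        int changed = 0;

        // Keep the input stream open until the modified document has been exported.
        using (Stream input = File.OpenRead(inputPath))
        {
            RadFixedDocument document = provider.Import(input);

            foreach (var page in document.Pages)
            {
                foreach (var contentElement in page.Content)
                {
                    var fragment = contentElement as TextFragment;
                    if (fragment != null && fragment.Text != null && fragment.Text.Contains(oldText))
                    {
                        fragment.Text = fragment.Text.Replace(oldText, newText);
                        changed++;
                    }
                }
            }

            using (Stream output = File.Create(outputPath))
            {
                provider.Export(document, output);
            }
        }

        return changed;
    }
}
```

Writing to a separate output file keeps the original PDF untouched, which makes it easier to compare the two if the replaced text does not render as expected.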

Regards,
Deyan
Telerik
 


 
Jason answered on 04 May 2018, 12:53 AM

Is this code still valid in recent versions? I tried to use this method of looping through, but found that the page.Content collection was always empty for every page (it did loop through the right number of pages, at least).

Maybe there is a new way of accomplishing this? This is the only method I could find in the documentation.

 

Thanks!

Jason

Boby (Telerik team) answered on 04 May 2018, 07:52 AM
Hello Jason,

The code snippet is still valid and should enumerate all of the content elements on a page. Could you open a support ticket and send us the problematic document there, along with sample code (or project) that reproduces the issue, so that we can investigate on our side?

Regards,
Boby
Progress Telerik

Jason answered on 10 May 2018, 02:17 AM

A colleague helped me troubleshoot, and it turns out that the way the RadFixedDocument was instantiated matters:

// Doesn't work: Content is always empty (doc.Pages[0].Content.Count == 0)
RadFixedDocument doc = new PdfFormatProvider(myFileStream).Import();
 
// Works: I can enumerate doc.Pages[0].Content
RadFixedDocument doc = new PdfFormatProvider().Import(myFileStream);

 

I'm finding, however, that for some documents - notably, one produced by Telerik Reporting - the TextFragment.Text property contains gobbledegook text. I'm guessing that it is compressed, or has a different encoding or something. (The RenderingResult.Encoding property is null, however.)

Does anyone know how to retrieve the text as it appears in a PDF reader?

Boby (Telerik team) answered on 14 May 2018, 07:57 AM
Hi Jason,

Rendering and encoding are different when it comes to PDF documents, so could you open a separate support ticket and send us the problematic document for further investigation on our side? We will treat it in a strictly confidential manner.

Regards,
Boby
Progress Telerik

pravin answered on 06 Jul 2018, 08:54 AM

I get the error below when I run the code above:

Unable to cast object of type 'Telerik.Windows.Documents.Fixed.FormatProviders.Pdf.Model.Types.PdfLiteralString' to type 'Telerik.Windows.Documents.Fixed.FormatProviders.Pdf.Model.Elements.Destinations.DestinationObject'.

pravin answered on 06 Jul 2018, 08:57 AM

Just to be clear, this code:

RadFixedDocument doc = new PdfFormatProvider().Import(myFileStream);

gives the error "Unable to cast object of type 'Telerik.Windows.Documents.Fixed.FormatProviders.Pdf.Model.Types.PdfLiteralString' to type 'Telerik.Windows.Documents.Fixed.FormatProviders.Pdf.Model.Elements.Destinations.DestinationObject'."

The version of the Telerik.Windows.Documents.Fixed DLL is 2018.2.511.40.

Any ideas on how to solve this?

Tanya (Telerik team) answered on 11 Jul 2018, 07:58 AM
Hello Pravin,

The error seems related to something specific in the document content that is causing issues while parsing the elements. Would it be possible to share a document that we can use to test and find out what might be causing the issue? Please note that the public forums allow attaching only images. If you would like to keep the document private, you can open a support ticket with us, where you can attach archived files.

Regards,
Tanya
Progress Telerik

pravin answered on 11 Jul 2018, 08:14 AM

Hi Tanya,

Thanks for the reply, but I can't share the document.

However, the code below works without error, but "HasContent" is always "false" and the "Content" count is always "0":

RadFixedDocument doc = new PdfFormatProvider(myFileStream,provider.ReadAllAtOnce).Import();

or 

RadFixedDocument doc = new PdfFormatProvider(myFileStream,provider.ReadOnDemand).Import();

Hence, I tried the other (earlier mentioned) approach.

FYI, the version of Telerik.Windows.Documents.Fixed.dll is 2018.2.511.40, and it's the one that the organisation I am working for has bought.

Anna (Telerik team) answered on 13 Jul 2018, 12:48 PM
Hi Pravin,

I am going to reiterate what we've already discussed in your support ticket, in case someone else finds the information useful. The most likely reason for the exception you mention is that the document contains Named Destinations, which are currently not supported. The content of the RadFixedDocument is empty because, when the parameterless Import() overload is used (with the stream and settings passed to the PdfFormatProvider constructor), it is the RadPdfViewer control that creates the object and, unlike RadPdfProcessing, it does not populate the Content collection.

Regards,
Anna
Progress Telerik