Walk through the page to find a certain text string

5 posts, 0 answers
  1. W
    17 posts
    Member since:
    Aug 2014

    Posted 01 May 2015


    I am trying to use PdfProcessing to find a certain text string on the first page.

    Is there a way to walk through the page?


    Dim P As RadFixedPage = Doc.Pages.Item(0)

    P. ???

  2. Shaun
    31 posts
    Member since:
    Jul 2011

    Posted 05 May 2015 in reply to W

    Hopefully something like this will work for you.

    This is a copy/paste with slight alterations made in a text editor, so it might not be fully accurate, but it should give you something to go on.


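    Dim PDFText As String = GetPDFText()
    If PDFText <> "" Then
        'Split on whole line breaks (CRLF, CR or LF), not on a single character
        Dim PDFLines() As String = PDFText.Split(New String() {vbCrLf, vbCr, vbLf}, StringSplitOptions.None)
        'Do something
    Else
        'Failed to open pdf and retrieve text content
    End If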




    Private Function GetPDFText() As String
        Dim Result As String = ""
        Dim br As BinaryReader = Nothing
        Dim MS As MemoryStream = Nothing
        Dim bData As Byte()
        Dim RFD As RadFixedDocument
        Try
            'Filename is assumed to be a field holding the path of the PDF
            br = New BinaryReader(File.OpenRead(Filename))
            bData = br.ReadBytes(CInt(br.BaseStream.Length))
            'The stream already wraps bData, so no extra Write call is needed
            MS = New MemoryStream(bData, 0, bData.Length)
            RFD = New PdfFormatProvider(MS, FormatProviderSettings.ReadOnDemand).Import()
            'Document Loaded
            Try
                'Processing Document for text
                Dim TFP As New TextFormatProvider()
                Result = TFP.Export(RFD)
            Catch exText As Exception
                'Incompatible PDF
            End Try
        Catch exMain As Exception
            'Error processing PDF
        Finally
            'Release the file handle; the data is already in memory
            If br IsNot Nothing Then br.Close()
        End Try
        Return Result
    End Function


  3. Petya
    983 posts

    Posted 05 May 2015


    Shaun's suggestion could help you extract the textual content of a PDF file. Note, however, that the TextFormatProvider and PdfFormatProvider classes (the latter when initialized with parameter settings) are related to the RadPdfViewer control.

    A similar mechanism is not implemented in PdfProcessing so far, although the feature is in our backlog. Still, if you want to get all the text in a page, you could use an approach similar to this one:
    foreach (var page in document.Pages)
        foreach (var contentElement in page.Content)
            if (contentElement is TextFragment)
            {
                // Text of a single fragment; append it to your result
                string text = (contentElement as TextFragment).Text;
            }

    Appending the strings together should result in the text within the document. Since PDF is a fixed flat document format, it does not preserve the text in it in a way a text-based format would. The search engines in PDF viewers use algorithms that try to obtain the original text, but this is not always possible and some loss of content might occur with both approaches.
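
    As a minimal sketch of the "append the strings together" idea, the helper below joins the fragment texts and reports which fragment a match starts in, so that a search string split across several fragments can still be found. The class and method names are hypothetical and not part of any Telerik API; the helper works on plain strings only.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper, not part of the Telerik API.
public static class FragmentSearch
{
    // Returns the index of the fragment in which the first match starts, or -1.
    public static int FindFragmentIndex(IList<string> fragmentTexts, string searchText)
    {
        if (string.IsNullOrEmpty(searchText)) return -1;

        var joined = new StringBuilder();
        var starts = new List<int>(); // offset of each fragment within the joined text
        foreach (string text in fragmentTexts)
        {
            starts.Add(joined.Length);
            joined.Append(text ?? string.Empty);
        }

        int hit = joined.ToString().IndexOf(searchText, StringComparison.Ordinal);
        if (hit < 0) return -1;

        // Pick the last fragment whose start offset is not past the match position.
        int index = -1;
        for (int i = 0; i < starts.Count; i++)
        {
            if (starts[i] <= hit) index = i; else break;
        }
        return index;
    }
}
```

    The fragment texts could be collected with the loop above, or (assuming the page's Content collection supports LINQ) with something like page.Content.OfType<TextFragment>().Select(f => f.Text).ToList(). Fragments are joined without separators, so text from neighbouring lines may run together.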

    I hope this helps.



  4. W
    17 posts
    Member since:
    Aug 2014

    Posted 06 May 2015 in reply to Petya


    Thanks for your help.

    Is it better to use RadPdfViewer instead?

    I need to find a certain text, change it, and save the document again.

  5. Deyan
    158 posts

    Posted 08 May 2015


    With RadPdfViewer, you can only extract the text content from a PDF file. If you need to modify this text, the only possible approach is to use RadPdfProcessing, where you can change the text value of the imported text fragments and export the modified content.
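
    A rough sketch of that import, modify, export flow might look like the following. It assumes the PdfProcessing PdfFormatProvider exposes Import(Stream) and Export(document, Stream), that TextFragment.Text is settable, and that the namespaces are as shown; the class and method names are hypothetical. Check all of this against the documentation for your version.

```csharp
using System;
using System.IO;
using Telerik.Windows.Documents.Fixed.FormatProviders.Pdf; // PdfFormatProvider of PdfProcessing (assumed namespace)
using Telerik.Windows.Documents.Fixed.Model;               // RadFixedDocument (assumed namespace)
using Telerik.Windows.Documents.Fixed.Model.Text;          // TextFragment (assumed namespace)

// Hypothetical helper, not part of the Telerik API.
public static class PdfTextEditor
{
    // Replaces oldText inside each text fragment of the first page and saves a copy.
    // Returns the number of fragments that were changed.
    public static int ReplaceOnFirstPage(string inputPath, string outputPath, string oldText, string newText)
    {
        if (string.IsNullOrEmpty(oldText)) throw new ArgumentException("oldText must not be empty.");

        var provider = new PdfFormatProvider();
        RadFixedDocument document;
        using (Stream input = File.OpenRead(inputPath))
        {
            document = provider.Import(input); // assumed signature: Import(Stream)
        }

        int changed = 0;
        foreach (var contentElement in document.Pages[0].Content) // throws if the document has no pages
        {
            var fragment = contentElement as TextFragment;
            if (fragment != null && fragment.Text != null && fragment.Text.Contains(oldText))
            {
                fragment.Text = fragment.Text.Replace(oldText, newText);
                changed++;
            }
        }

        using (Stream output = File.Create(outputPath))
        {
            provider.Export(document, output); // assumed signature: Export(document, Stream)
        }
        return changed;
    }
}
```

    Only matches that lie entirely within a single fragment are replaced. If the text you are looking for is split across several fragments, it will not be found this way.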


