This is a migrated thread and some comments may be shown as answers.

Walk through the page to find a certain text string

PdfProcessing

13 Answers, 1 is accepted

answered on 05 May 2015, 02:28 PM

Hopefully something like this should work for you.

This is a copy/paste with slight alterations made in a text editor, so it might not be fully accurate, but it should give you something to go on.

' Requires Imports System.IO, plus the Telerik namespaces that contain
' RadFixedDocument, PdfFormatProvider, FormatProviderSettings and TextFormatProvider.

Dim PDFText As String = GetPDFText()
If PDFText <> "" Then
    ' Split on the full line-break sequence (Environment.NewLine can be two characters).
    Dim PDFLines() As String = PDFText.Split({Environment.NewLine}, StringSplitOptions.None)
    'Do something
Else
    'Failed to open pdf and retrieve text content
End If

' Filename is assumed to be a field or variable holding the path of the PDF.
Private Function GetPDFText() As String
    Dim Result As String = ""
    Dim MS As MemoryStream = Nothing
    Dim RFD As RadFixedDocument
    Try
        ' Read the whole file into memory; the stream starts at position 0, ready for the import.
        MS = New MemoryStream(File.ReadAllBytes(Filename))
        RFD = New PdfFormatProvider(MS, FormatProviderSettings.ReadOnDemand).Import()
        'Document Loaded
        Try
            'Processing Document for text
            Dim TFP As New TextFormatProvider()
            Result = TFP.Export(RFD)
        Catch exText As Exception
            'Incompatible PDF
        End Try
    Catch exMain As Exception
        'Error processing PDF
    Finally
        ' MS is still Nothing if reading the file failed.
        If MS IsNot Nothing Then MS.Dispose()
    End Try
    Return Result
End Function
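
Once the text has been extracted, finding a certain string comes down to a line-by-line search. Here is a minimal C# sketch of that step. LineSearch and FindLineNumbers are hypothetical names for illustration, not part of any Telerik API, and the sketch assumes the exported text separates lines with \r\n, \n or \r.

```csharp
using System;
using System.Collections.Generic;

public static class LineSearch
{
    // Returns the 1-based numbers of the lines that contain searchText (case-insensitive).
    public static List<int> FindLineNumbers(string text, string searchText)
    {
        // "\r\n" is listed first so a Windows line break is not split into two breaks.
        string[] lines = text.Split(new[] { "\r\n", "\n", "\r" }, StringSplitOptions.None);

        var hits = new List<int>();
        for (int i = 0; i < lines.Length; i++)
        {
            if (lines[i].IndexOf(searchText, StringComparison.OrdinalIgnoreCase) >= 0)
            {
                hits.Add(i + 1);
            }
        }
        return hits;
    }
}
```

The string returned by GetPDFText() could be passed in as the text argument. A match that spans a line break will not be found this way.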

answered on 05 May 2015, 03:58 PM

Hello,

Shaun's suggestion could help you extract the textual content of a PDF file. Note, however, that the TextFormatProvider and PdfFormatProvider classes (the latter when initialized with parameter settings) are related to the RadPdfViewer control.

A similar mechanism is not yet implemented in PdfProcessing, although the feature is in our backlog. Still, if you want to get all the text in a page, you could use an approach similar to this one:

foreach (var page in document.Pages)
{
    foreach (var contentElement in page.Content)
    {
        // Only TextFragment elements carry text; other content (images, paths) is skipped.
        var fragment = contentElement as TextFragment;
        if (fragment != null)
        {
            string text = fragment.Text;
        }
    }
}

Appending the strings together should result in the text within the document. Since PDF is a fixed, flat document format, it does not preserve text the way a text-based format would. The search engines in PDF viewers use algorithms that try to reconstruct the original text, but this is not always possible, and some loss of content might occur with both approaches.
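
As a rough illustration of the appending step, the sketch below joins the fragment strings of each page and reports which pages contain a given string. PageTextSearch and FindPagesContaining are hypothetical names, not part of the Telerik API. The input is assumed to be the Text values of each page's TextFragment elements, in content order, collected with a loop like the one above.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class PageTextSearch
{
    // Returns the zero-based indexes of the pages whose joined fragment text
    // contains searchText (case-insensitive).
    public static List<int> FindPagesContaining(
        IEnumerable<IEnumerable<string>> fragmentTextsByPage, string searchText)
    {
        var matches = new List<int>();
        int pageIndex = 0;

        foreach (IEnumerable<string> fragments in fragmentTextsByPage)
        {
            // Append the fragments in order so a phrase split across fragments can still match.
            var pageText = new StringBuilder();
            foreach (string fragment in fragments)
            {
                pageText.Append(fragment);
            }

            if (pageText.ToString().IndexOf(searchText, StringComparison.OrdinalIgnoreCase) >= 0)
            {
                matches.Add(pageIndex);
            }
            pageIndex++;
        }

        return matches;
    }
}
```

Joining without a separator finds a phrase that is split in the middle of a word, but it can also miss a phrase if the PDF stores no space character between two fragments, so treat the result as a best effort.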

I hope this helps.

Regards,
Petya
Telerik

answered on 10 May 2018, 02:17 AM

A colleague helped me troubleshoot, and it turns out that the way the RadFixedDocument was instantiated matters:

// Doesn't work: Content is always empty (doc.Pages[0].Content.Count == 0)

RadFixedDocument doc = new PdfFormatProvider(myFileStream).Import();

// Works: I can enumerate doc.Pages[0].Content

RadFixedDocument doc = new PdfFormatProvider().Import(myFileStream);

I'm finding, however, that for some documents (notably, one produced by Telerik Reporting) the TextFragment.Text property contains gobbledegook. I'm guessing that the text is compressed, or uses a different encoding or something. (The RenderingResult.Encoding property is null, however.)

Does anyone know how to retrieve the text as it appears in a PDF reader?

answered on 11 Jul 2018, 07:58 AM

Hello Pravin,

The error seems related to something specific in the document content that is causing issues while parsing the elements. Would it be possible to share a document that we can use to test and find out what might be causing the issue? Please note that the public forums only allow you to attach images. If you would like to keep the document private, you can open a support ticket with us, where you can attach archived files.

Regards,
Tanya
Progress Telerik

answered on 13 Jul 2018, 12:48 PM

Hi Pravin,

I am going to reiterate what we've already discussed in your support ticket, in case someone else finds the information useful. The most likely reason for the exception you mention is that the document contains Named Destinations, which are currently not supported. The content of the RadFixedDocument is empty because, when the parameterless Import() overload is used (with the stream passed to the PdfFormatProvider constructor), the RadPdfViewer control creates the object and, unlike RadPdfProcessing, it does not populate the Content collection.

Regards,
Anna
Progress Telerik