Hopefully something like this should work for you.
This is a copy/paste with slight alterations made in a text editor, so it might not be fully accurate, but it should give you something to go on.
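Dim PDFText As String = GetPDFText()
If PDFText <> "" Then
    ' Split on whole line breaks; CChar(Environment.NewLine) keeps only the first character (CR on Windows)
    Dim PDFLines() As String = PDFText.Split({vbCrLf, vbLf}, StringSplitOptions.None)
    'Do something
Else
    'Failed to open pdf and retrieve text content
End If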
Private Function GetPDFText() As String
    ' Needs Imports System.IO; Filename is assumed to hold the path to the PDF
    Dim Result As String = ""
    Dim br As BinaryReader = Nothing
    Dim MS As MemoryStream = Nothing
    Dim bData As Byte()
    Dim RFD As RadFixedDocument
    Try
        br = New BinaryReader(File.OpenRead(Filename))
        bData = br.ReadBytes(CInt(br.BaseStream.Length))
        ' The constructor already wraps bData; writing it again would only move the position to the end
        MS = New MemoryStream(bData, 0, bData.Length)
        RFD = New PdfFormatProvider(MS, FormatProviderSettings.ReadOnDemand).Import()
        'Document Loaded
        Try
            'Processing Document for text
            Dim TFP As New TextFormatProvider()
            Result = TFP.Export(RFD)
        Catch exText As Exception
            'Incompatible PDF
        End Try
    Catch exMain As Exception
        'Error processing PDF
    Finally
        ' Either object may still be Nothing if an earlier step failed
        If MS IsNot Nothing Then MS.Dispose()
        If br IsNot Nothing Then br.Dispose()
    End Try
    Return Result
End Function
Shaun's suggestion could help you extract the textual content of a PDF file. Note, however, that the TextFormatProvider and PdfFormatProvider classes (the latter when initialized with parameter settings) are related to the RadPdfViewer control.
A similar mechanism is not implemented for PdfProcessing so far, although the feature is in our backlog. Still, if you want to get all the text in a page, you could use an approach similar to this one:
foreach (var page in document.Pages)
{
    foreach (var contentElement in page.Content)
    {
        if (contentElement is TextFragment)
        {
            string text = (contentElement as TextFragment).Text;
        }
    }
}
Appending the strings together should give you the text within the document. Since PDF is a fixed, flat document format, it does not preserve text the way a text-based format would. The search engines in PDF viewers use algorithms that try to reconstruct the original text, but this is not always possible, and some loss of content might occur with both approaches.
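As a minimal sketch of the appending step described above: the loop from the snippet collects every TextFragment.Text into a StringBuilder. The class and method names are hypothetical, the using directives assume the usual PdfProcessing namespaces, and the line break added after each page is an arbitrary choice rather than something the library prescribes.

```csharp
using System.Text;
using Telerik.Windows.Documents.Fixed.Model;       // RadFixedDocument (assumed namespace)
using Telerik.Windows.Documents.Fixed.Model.Text;  // TextFragment (assumed namespace)

public static class PdfTextSketch
{
    // Hypothetical helper: concatenates the text of all top-level text fragments.
    public static string CollectText(RadFixedDocument document)
    {
        var builder = new StringBuilder();

        foreach (var page in document.Pages)
        {
            foreach (var contentElement in page.Content)
            {
                // Skip paths, images and other non-text content elements.
                var fragment = contentElement as TextFragment;
                if (fragment != null)
                {
                    builder.Append(fragment.Text);
                }
            }

            // Separate pages with a line break (a choice made for this sketch).
            builder.AppendLine();
        }

        return builder.ToString();
    }
}
```

Fragments are appended in the order the collection yields them, which is not guaranteed to match reading order, so the result may differ from what a PDF viewer shows.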
I hope this helps.
Regards,
Petya
Telerik
Hi
Thanks for your help.
Would it be better to use RadPdfViewer instead?
I need to find a certain piece of text, change it, and save the document again.
With RadPdfViewer you can only extract the text content from a PDF file. If you need to modify the text, the only possible approach is to use RadPdfProcessing, where you can change the text value of the imported text fragments and export the modified content.
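A rough sketch of that find, change and export flow with RadPdfProcessing is shown here. The class and method names are hypothetical, and the sketch assumes the searched text sits entirely inside a single top-level TextFragment; a phrase that the PDF stores across several fragments would not be matched.

```csharp
using System.IO;
using Telerik.Windows.Documents.Fixed.FormatProviders.Pdf;  // PdfFormatProvider
using Telerik.Windows.Documents.Fixed.Model;                // RadFixedDocument (assumed namespace)
using Telerik.Windows.Documents.Fixed.Model.Text;           // TextFragment (assumed namespace)

public static class PdfReplaceSketch
{
    // Hypothetical helper: replaces oldValue inside each text fragment and
    // returns how many fragments were changed.
    public static int ReplaceInFragments(RadFixedDocument document, string oldValue, string newValue)
    {
        int changed = 0;

        foreach (var page in document.Pages)
        {
            foreach (var contentElement in page.Content)
            {
                var fragment = contentElement as TextFragment;
                if (fragment != null && fragment.Text != null && fragment.Text.Contains(oldValue))
                {
                    fragment.Text = fragment.Text.Replace(oldValue, newValue);
                    changed++;
                }
            }
        }

        return changed;
    }

    // Hypothetical helper: writes the modified document to a new PDF file.
    public static void Save(RadFixedDocument document, string outputPath)
    {
        using (FileStream output = File.Create(outputPath))
        {
            new PdfFormatProvider().Export(document, output);
        }
    }
}
```

Changing a fragment's text does not reposition the content around it, so a replacement much longer than the original may overlap neighbouring elements.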
Regards,
Deyan
Telerik
Is this code still valid in recent versions? I tried to use this method of looping through, but found that the page.Content collection was always empty for every page (it did loop through the right number of pages, at least).
Maybe there is a new way of accomplishing this? This is the only method I could find in the documentation.
Thanks!
Jason
The code snippet is still valid and should enumerate all of the content elements on a page. Could you open a support ticket and send us the problematic document there, along with sample code (or project) that reproduces the issue, so that we can investigate on our side?
Regards,
Boby
Progress Telerik
A colleague helped me troubleshoot, and it turns out that the way the RadFixedDocument was instantiated matters:
// Doesn't work: Content is always empty (doc.Pages[0].Content.Count == 0)
RadFixedDocument doc = new PdfFormatProvider(myFileStream).Import();
// Works: I can enumerate doc.Pages[0].Content
RadFixedDocument doc = new PdfFormatProvider().Import(myFileStream);
I'm finding, however, that for some documents (notably, one produced by Telerik Reporting) the TextFragment.Text property contains gibberish. I'm guessing that the text is compressed or uses a different encoding. (The RenderingResult.Encoding property is null, however.)
Does anyone know how to retrieve the text as it appears in a PDF reader?
Rendering and encoding are different things when it comes to PDF documents, so could you open a separate support ticket and send us the problematic document for further investigation on our side? We will treat it in a strictly confidential manner.
Regards,
Boby
Progress Telerik
I get the error below when I run the code above:
Unable to cast object of type 'Telerik.Windows.Documents.Fixed.FormatProviders.Pdf.Model.Types.PdfLiteralString' to type 'Telerik.Windows.Documents.Fixed.FormatProviders.Pdf.Model.Elements.Destinations.DestinationObject'.
Just to be clear, this code:
RadFixedDocument doc = new PdfFormatProvider().Import(myFileStream);
throws the error "Unable to cast object of type 'Telerik.Windows.Documents.Fixed.FormatProviders.Pdf.Model.Types.PdfLiteralString' to type 'Telerik.Windows.Documents.Fixed.FormatProviders.Pdf.Model.Elements.Destinations.DestinationObject'."
The version of Telerik.Windows.Documents.Fixed.dll is 2018.2.511.40.
Any ideas how to solve this?
The error seems to be related to something specific in the document content that causes problems while the elements are parsed. Would it be possible to share a document that we can use to test and find out what is causing the issue? Please note that the public forums only allow attaching images. If you would like to keep the document private, you can open a support ticket with us and attach archived files there.
Regards,
Tanya
Progress Telerik
Hi Tanya,
Thanks for the reply. Unfortunately, I can't share the document.
However, the code below works without error, but "HasContent" is always "false" and the "Content" count is always 0:
RadFixedDocument doc = new PdfFormatProvider(myFileStream,provider.ReadAllAtOnce).Import();
or
RadFixedDocument doc = new PdfFormatProvider(myFileStream,provider.ReadOnDemand).Import();
Hence, I tried the other approach mentioned earlier.
FYI, the version of Telerik.Windows.Documents.Fixed.dll is 2018.2.511.40,
which is the one the organisation I work for has bought.
I am going to reiterate what we've already discussed in your support ticket, in case someone else finds the information useful. The most likely reason for the exception you mention is that the document contains Named Destinations, which are currently not supported. The content of the RadFixedDocument is empty because when the parameterless Import() overload is used (with the stream passed to the PdfFormatProvider constructor), the RadPdfViewer control creates the object and, unlike RadPdfProcessing, it does not populate the Content collection.
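To illustrate the difference, here is a minimal sketch of the RadPdfProcessing style of import, which earlier posts in this thread reported as populating Content. The class name, method names and the path parameter are hypothetical, and the sketch assumes the whole document is read before the stream is closed.

```csharp
using System.IO;
using Telerik.Windows.Documents.Fixed.FormatProviders.Pdf;  // PdfFormatProvider
using Telerik.Windows.Documents.Fixed.Model;                // RadFixedDocument (assumed namespace)

public static class PdfImportSketch
{
    // Hypothetical helper: imports through the parameterless constructor and
    // Import(stream), rather than passing the stream to the constructor.
    public static RadFixedDocument LoadForProcessing(string path)
    {
        using (FileStream stream = File.OpenRead(path))
        {
            return new PdfFormatProvider().Import(stream);
        }
    }

    // Quick check that the import produced enumerable content on some page.
    public static bool HasAnyContent(RadFixedDocument document)
    {
        foreach (var page in document.Pages)
        {
            if (page.Content.Count > 0)
            {
                return true;
            }
        }

        return false;
    }
}
```

For a document that contains Named Destinations, this import can still throw the cast exception quoted above, so callers may want to wrap it in a try/catch.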
Regards,
Anna
Progress Telerik