This is a migrated thread and some comments may be shown as answers.

Extract text

9 Answers 1589 Views
PdfProcessing
Guido
Top achievements
Rank 1
Guido asked on 05 Oct 2015, 10:31 AM

I need to load a PDF document and extract text.

Is this possible with PdfProcessing?

Thank you,

Guido

9 Answers, 1 is accepted

Todor
Telerik team
answered on 05 Oct 2015, 02:50 PM
Hello Guido,

With our new 2015 Q3 release, we introduced the ability to export the content of a RadFixedDocument to plain text. You can read more about the TextFormatProvider in the attached document. The related help article is expected to go live later this week.

I hope this is helpful.
If you have further questions, please get back to us again.

Regards,
Todor
Telerik
Bhavya
Top achievements
Rank 1
commented on 07 Jun 2022, 10:38 AM

Hi Todor,

Is there any way to retrieve specific values from a PDF and validate them?

Thanks and Regards,

Bhavya.

Joachim
Top achievements
Rank 1
answered on 26 Oct 2019, 08:16 AM

I still can't extract the plain text. Here is my example:

public void createPdf()
{
    RadFixedDocument document = new RadFixedDocument();
    RadFixedPage page = document.Pages.AddPage();

    FixedContentEditor editor = new FixedContentEditor(page);
    editor.DrawText("Hello RadPdfProcessing!");

    PdfFormatProvider provider = new PdfFormatProvider();
    using (Stream output = File.OpenWrite(@"C:\Temp\Hello.pdf"))
    {
        provider.Export(document, output);
    }
}

public void import()
{
    TxtFormatProvider provider = new TxtFormatProvider();
    using (Stream input = File.OpenRead(@"C:\Temp\Hello.pdf"))
    {
        RadFlowDocument document = provider.Import(input);
        RadFlowDocumentEditor editor = new RadFlowDocumentEditor(document);
        string documentContent = provider.Export(document);
    }
}
In documentContent I expected "Hello RadPdfProcessing!", but got:
This document was generated by a trial version of Telerik Document Processing.
%PDF-1.7
%����
2 0 obj
<</Type /Catalog /Pages 3 0 R /Metadata 4 0 R /Names 5 0 R >>
endobj
3 0 obj
<</Type /Pages /Kids [6 0 R] /Count 1 >>
endobj
4 0 obj

Why?

Martin
Telerik team
answered on 30 Oct 2019, 03:36 PM

Hello Joachim,

In your example, TxtFormatProvider reads the PDF file as if it were plain text, which is why the raw PDF syntax appears in the output. There are two options. The first is to import the exported PDF file with the PdfFormatProvider class, which parses the content of the document:

public RadFixedDocument ImportFromPdf()
{
	RadFixedDocument document = new RadFixedDocument();

	PdfFormatProvider provider = new PdfFormatProvider();
	using (Stream input = File.OpenRead("Hello.pdf"))
	{
		document = provider.Import(input);
	}

	return document;
}

After that, export the parsed content to plain text using TextFormatProvider:

public void ExportPdfAsTxt(RadFixedDocument document)
{
	TextFormatProvider provider = new TextFormatProvider();
	string documentContent = provider.Export(document);

	File.WriteAllText("Sample.txt", documentContent);
}

The other option is to skip exporting the RadFixedDocument to a PDF file and importing it back, and instead export the RadFixedDocument instance (which you create in the createPdf method) directly to plain text.
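
A minimal sketch of this second option, reusing only the calls that already appear in this thread. The class and method names (DirectTextExport, BuildAndExport) are hypothetical, and the using directives are the namespaces these types are assumed to live in; check them against your referenced assemblies.

```csharp
using Telerik.Windows.Documents.Fixed.FormatProviders.Text;
using Telerik.Windows.Documents.Fixed.Model;
using Telerik.Windows.Documents.Fixed.Model.Editing;

// Hypothetical helper class, for illustration only.
public static class DirectTextExport
{
    public static string BuildAndExport()
    {
        // Build the same document as in createPdf, without writing a PDF file.
        RadFixedDocument document = new RadFixedDocument();
        RadFixedPage page = document.Pages.AddPage();

        FixedContentEditor editor = new FixedContentEditor(page);
        editor.DrawText("Hello RadPdfProcessing!");

        // Export the in-memory document straight to plain text.
        TextFormatProvider provider = new TextFormatProvider();
        return provider.Export(document);
    }
}
```

Because the document never leaves memory, no PDF import step is needed before the text export.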

I hope this information is helpful.

Regards,
Martin
Progress Telerik

Mikko
Top achievements
Rank 1
Veteran
answered on 10 Apr 2021, 07:01 PM

Hi,

Do you have a VB.NET sample showing how to convert a PDF to a text file?

I might be missing something, as I tried to use your Code Converter.

Best regards, Mikko

Martin
Telerik team
answered on 12 Apr 2021, 10:33 AM

Hello Mikko,

I updated the code snippet to use VB instead of C#:

Dim document As RadFixedDocument

Using stream As Stream = File.OpenRead("SampleDocument.pdf")
	Dim pdfFormatProvider As PdfFormatProvider = New PdfFormatProvider()
	document = pdfFormatProvider.Import(stream)
End Using

Dim textFormatProvider As TextFormatProvider = New TextFormatProvider()
Dim documentContent = textFormatProvider.Export(document)
File.WriteAllText("TextFile.txt", documentContent)

Regards,
Martin
Progress Telerik

Mikko
Top achievements
Rank 1
Veteran
answered on 12 Apr 2021, 10:42 AM

Hello Martin,

Thank you for your quick reply.

I get an error: "Value of type 'Telerik.Windows.Documents.Flow.Model.RadFlowDocument' cannot be converted to 'Telerik.Windows.Documents.Fixed.Model.RadFixedDocument'."

I might be missing something?

Best regards, Mikko

Martin
Telerik team
answered on 12 Apr 2021, 11:15 AM

Hello Mikko,

It seems your project references WordsProcessing's PdfFormatProvider instead of PdfProcessing's PdfFormatProvider. In this case, you need to use PdfProcessing's PdfFormatProvider, which is in the Telerik.Windows.Documents.Fixed.FormatProviders.Pdf namespace:

Dim pdfFormatProvider As Telerik.Windows.Documents.Fixed.FormatProviders.Pdf.PdfFormatProvider = 
	New Telerik.Windows.Documents.Fixed.FormatProviders.Pdf.PdfFormatProvider()
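
For C# projects that reference both libraries, the same disambiguation can be sketched with a using alias. The alias FixedPdfFormatProvider, the class PdfToText, and the inputPath parameter are hypothetical names chosen for illustration; the namespace of TextFormatProvider is an assumption to verify against your assemblies.

```csharp
using System.IO;
using Telerik.Windows.Documents.Fixed.FormatProviders.Text;
using Telerik.Windows.Documents.Fixed.Model;
// Alias the PdfProcessing provider so it cannot be confused with the
// WordsProcessing class of the same name.
using FixedPdfFormatProvider = Telerik.Windows.Documents.Fixed.FormatProviders.Pdf.PdfFormatProvider;

// Hypothetical helper class, for illustration only.
public static class PdfToText
{
    public static string Extract(string inputPath)
    {
        FixedPdfFormatProvider pdfProvider = new FixedPdfFormatProvider();
        RadFixedDocument document;

        // Parse the PDF into a RadFixedDocument.
        using (Stream input = File.OpenRead(inputPath))
        {
            document = pdfProvider.Import(input);
        }

        // Export the parsed content as plain text.
        TextFormatProvider textProvider = new TextFormatProvider();
        return textProvider.Export(document);
    }
}
```

With the alias in place, the compiler resolves the provider to the PdfProcessing type even if the WordsProcessing namespace is also imported elsewhere in the file.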

Regards,
Martin
Progress Telerik

Mikko
Top achievements
Rank 1
Veteran
answered on 12 Apr 2021, 11:20 AM

Hi Martin,

That was it! Thanks for the super fast and excellent support!

Best regards, Mikko

Karol
Top achievements
Rank 1
Iron
answered on 05 Oct 2022, 12:11 PM

We created some simple code for conversion:


RadFixedDocument pdf_document = new RadFixedDocument();

PdfFormatProvider pdf_provider = new PdfFormatProvider();
using (Stream input = File.OpenRead("c:\\temp\\aa.pdf"))
{
    pdf_document = pdf_provider.Import(input);
    TextFormatProvider text_provider = new TextFormatProvider();
    string file_content = text_provider.Export(pdf_document);
    File.WriteAllText("c:\\temp\\Sample.txt", file_content);
}

But there are differences:

Our PDF file has been through an OCR process, so it has a searchable text layer. When I press CTRL+A and CTRL+C in the PDF document and then CTRL+V in Notepad, the result is something like this:

ponieważ zobowiązany nie figuruje w naszej bazie danych.

And the result from the code above is:

poniewa ż zobowiązan y nie figuruje w naszej bazie danych.

As you can see, some letters are separated from their words by extra spaces. This is happening on the .NET Standard version.

I tested the same PDF file with the Telerik library for WPF, version 2016.2.606.45, and the result of CTRL+A, CTRL+C was identical to the Telerik export to text.

What could be the problem? I've tested Telerik on .NET 5 and .NET Framework, with the same result.

Karol Dobek

Martin
Telerik team
commented on 06 Oct 2022, 08:54 AM

Hello Karol,

The code snippet looks fine.

This behavior could be related to PdfProcessing's internal text recognition logic. To investigate the case further, I would like to ask you to open a support ticket and share the document with us. I assure you that we treat all client files as strictly confidential and use them for testing purposes only.

Tags
PdfProcessing