This is a migrated thread and some comments may be shown as answers.
Extract text
8 Answers 99 Views
Guido
Top achievements
Rank 1
Guido asked on 05 Oct 2015, 10:31 AM

I need to load a PDF document and extract text.

Is this possible with PDFProcessing?

Thank you,

Guido

8 Answers, 1 is accepted

0
Todor
Telerik team
answered on 05 Oct 2015, 02:50 PM
Hello Guido,

With our new 2015 Q3 release, we introduced the ability to export the content of a RadFixedDocument to plain text. You can read more about the TextFormatProvider in the attached document. The related help article is expected to go live later this week.

I hope this is helpful.
If you have further questions, please get back to us again.

Regards,
Todor
Telerik
0
Joachim
Top achievements
Rank 1
answered on 26 Oct 2019, 08:16 AM

I still can't extract the plain text. Here is my example:

public void createPdf()
{
	RadFixedDocument document = new RadFixedDocument();
	RadFixedPage page = document.Pages.AddPage();

	FixedContentEditor editor = new FixedContentEditor(page);
	editor.DrawText("Hello RadPdfProcessing!");

	PdfFormatProvider provider = new PdfFormatProvider();
	using (Stream output = File.OpenWrite(@"C:\Temp\Hello.pdf"))
	{
		provider.Export(document, output);
	}
}

public void import()
{
	TxtFormatProvider provider = new TxtFormatProvider();
	using (Stream input = File.OpenRead(@"C:\Temp\Hello.pdf"))
	{
		RadFlowDocument document = provider.Import(input);
		RadFlowDocumentEditor editor = new RadFlowDocumentEditor(document);
		string documentContent = provider.Export(document);
	}
}
In documentContent I expected "Hello RadPdfProcessing!", but got:
This document was generated by a trial version of Telerik Document Processing.
%PDF-1.7
%����
2 0 obj
<</Type /Catalog /Pages 3 0 R /Metadata 4 0 R /Names 5 0 R >>
endobj
3 0 obj
<</Type /Pages /Kids [6 0 R] /Count 1 >>
endobj
4 0 obj

Why?

0
Martin
Telerik team
answered on 30 Oct 2019, 03:36 PM

Hello Joachim,

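TxtFormatProvider is part of WordsProcessing and imports the file as plain text, which is why documentContent holds the raw PDF markup instead of the page text. Following the provided example, there are two options. The first one is to import the already exported PDF file using the PdfFormatProvider class of PdfProcessing, so that it can parse the content of the document: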

public RadFixedDocument ImportFromPdf()
{
	RadFixedDocument document = new RadFixedDocument();

	PdfFormatProvider provider = new PdfFormatProvider();
	using (Stream input = File.OpenRead("Hello.pdf"))
	{
		document = provider.Import(input);
	}

	return document;
}

After that, export the parsed content to plain text using TextFormatProvider:

public void ExportPdfAsTxt(RadFixedDocument document)
{
	TextFormatProvider provider = new TextFormatProvider();
	string documentContent = provider.Export(document);

	File.WriteAllText("Sample.txt", documentContent);
}

The other option is to skip the PDF round trip: instead of exporting the RadFixedDocument to a PDF file and importing it back, pass the RadFixedDocument instance that you create in the createPdf method directly to TextFormatProvider.Export.
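A minimal sketch of this second option, reusing the calls already shown in this thread. The class and method names (DirectTextExport, BuildAndExtract) are made up for illustration, and the using directives reflect the PdfProcessing namespaces as I understand them, so please verify them against your referenced assemblies:

```csharp
using Telerik.Windows.Documents.Fixed.FormatProviders.Text;
using Telerik.Windows.Documents.Fixed.Model;
using Telerik.Windows.Documents.Fixed.Model.Editing;

// Hypothetical helper class, not part of the Telerik API.
public static class DirectTextExport
{
	public static string BuildAndExtract()
	{
		// Build the document in memory, as in createPdf.
		RadFixedDocument document = new RadFixedDocument();
		RadFixedPage page = document.Pages.AddPage();

		FixedContentEditor editor = new FixedContentEditor(page);
		editor.DrawText("Hello RadPdfProcessing!");

		// Export the same instance to plain text; no PDF file is written or read.
		TextFormatProvider provider = new TextFormatProvider();
		return provider.Export(document);
	}
}
```

With a trial license, the returned string may also contain the trial notice in addition to the drawn text.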

I hope this information is helpful.

Regards,
Martin
Progress Telerik

0
Mikko
Top achievements
Rank 1
answered on 10 Apr 2021, 07:01 PM

Hi,

Do you have a VB.NET sample showing how to convert a PDF to a text file?

I might be missing something, as I already tried your Code Converter.

Best regards, Mikko

0
Martin
Telerik team
answered on 12 Apr 2021, 10:33 AM

Hello Mikko,

I updated the code snippet in order to use VB instead of C#:

Dim document As RadFixedDocument

Using stream As Stream = File.OpenRead("SampleDocument.pdf")
	Dim pdfFormatProvider As PdfFormatProvider = New PdfFormatProvider()
	document = pdfFormatProvider.Import(stream)
End Using

Dim textFormatProvider As TextFormatProvider = New TextFormatProvider()
Dim documentContent = textFormatProvider.Export(document)
File.WriteAllText("TextFile.txt", documentContent)

Regards,
Martin
Progress Telerik

0
Mikko
Top achievements
Rank 1
answered on 12 Apr 2021, 10:42 AM

Hello Martin,

Thank you for your quick reply.

I get an error: "Value of type 'Telerik.Windows.Documents.Flow.Model.RadFlowDocument' cannot be converted to 'Telerik.Windows.Documents.Fixed.Model.RadFixedDocument'."

I might be missing something?

Best regards, Mikko

0
Martin
Telerik team
answered on 12 Apr 2021, 11:15 AM

Hello Mikko,

It seems your project resolves PdfFormatProvider to the WordsProcessing class instead of the PdfProcessing one. In this case, you need to use the PdfProcessing PdfFormatProvider, which is part of the Telerik.Windows.Documents.Fixed.FormatProviders.Pdf namespace:

Dim pdfFormatProvider As Telerik.Windows.Documents.Fixed.FormatProviders.Pdf.PdfFormatProvider = 
	New Telerik.Windows.Documents.Fixed.FormatProviders.Pdf.PdfFormatProvider()
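For C# readers, the same disambiguation can be sketched with a using alias. This is only an illustration under assumptions: the alias FixedPdfFormatProvider and the PdfTextExtractor.ExtractText helper are hypothetical names, while the Import and Export calls are the ones used earlier in this thread:

```csharp
using System.IO;
using Telerik.Windows.Documents.Fixed.FormatProviders.Text;
using Telerik.Windows.Documents.Fixed.Model;
// Hypothetical alias: pins the name to the PdfProcessing provider so it cannot
// be confused with the WordsProcessing class of the same name.
using FixedPdfFormatProvider = Telerik.Windows.Documents.Fixed.FormatProviders.Pdf.PdfFormatProvider;

// Hypothetical helper class, not part of the Telerik API.
public static class PdfTextExtractor
{
	public static string ExtractText(string pdfPath)
	{
		RadFixedDocument document;

		// Parse the PDF into a RadFixedDocument (not a RadFlowDocument).
		FixedPdfFormatProvider pdfProvider = new FixedPdfFormatProvider();
		using (Stream input = File.OpenRead(pdfPath))
		{
			document = pdfProvider.Import(input);
		}

		// Export the parsed content as plain text.
		TextFormatProvider textProvider = new TextFormatProvider();
		return textProvider.Export(document);
	}
}
```

The alias keeps the rest of the file unchanged even when both the Flow and Fixed namespaces are imported.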

Regards,
Martin
Progress Telerik

0
Mikko
Top achievements
Rank 1
answered on 12 Apr 2021, 11:20 AM

Hi Martin,

That was it! Thanks for the super fast and excellent support!

Best regards, Mikko
