This is a migrated thread and some comments may be shown as answers.

Import PDF data


PdfProcessing


17 Answers, 1 is accepted

answered on 17 Feb 2015, 03:18 PM

Hello Jesse,

Thank you for contacting us.

To export the text content of a PDF file, you can use the TextFormatProvider class. The following code snippet shows how to import a RadFixedDocument from a file stream and then export the document to text:

RadFixedDocument document = new PdfFormatProvider(fileStream, FormatProviderSettings.ReadOnDemand).Import();

string textDocument = new TextFormatProvider().Export(document);

Please be aware that PDF is a fixed document format, so its text is preserved neither as flowing content nor as tabular content. To export the file as a string, we internally use TextRecognizer, which groups the text fragments into text lines, so the resulting text may not always be as accurate as you expect.

I hope this is helpful. If you have any other questions or concerns please do not hesitate to contact us again.

Regards,
Deyan
the Telerik team


answered on 17 Feb 2015, 03:48 PM

Thanks for the reply Deyan.

Reading the text looks like a good start, but it does not seem to preserve which column the text came from.
For instance, both of these lines (| indicates a column break)
2/1/2005 | 24 | 88 | | 380 | 100 |
2/1/2005 | 24 | | 88 | 380 | 100 |

result in
2/1/2005 24 88 380 100
with a single space between strings and no way to tell that there were empty columns.

Is there a way to determine which column the text came from, or maybe the fixed X-Y coordinates of each string?
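
To illustrate what the X coordinates would make possible, here is a minimal sketch of rebuilding a row with its empty columns. It assumes the left edge of every fragment and of every column is already known. Fragment, ColumnMapper, and the column positions are hypothetical names and values made up for this example; none of them are part of PdfProcessing.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical fragment: a piece of text plus the X coordinate of its left edge.
public sealed class Fragment
{
    public Fragment(string text, double x) { Text = text; X = x; }
    public string Text { get; }
    public double X { get; }
}

public static class ColumnMapper
{
    // columnStarts: left edge of each column in ascending order (assumed known).
    // Returns one cell per column; columns without a fragment stay empty.
    public static string[] ToRow(IEnumerable<Fragment> fragments, double[] columnStarts)
    {
        string[] cells = Enumerable.Repeat(string.Empty, columnStarts.Length).ToArray();
        foreach (Fragment fragment in fragments)
        {
            // Pick the last column whose left edge is at or before the fragment.
            int column = 0;
            for (int i = 0; i < columnStarts.Length; i++)
            {
                if (fragment.X >= columnStarts[i]) column = i;
            }

            cells[column] = cells[column].Length == 0
                ? fragment.Text
                : cells[column] + " " + fragment.Text;
        }
        return cells;
    }
}
```

With column edges at 0, 100, 200, 300, 400, and 500 and fragments at X = 10, 110, 310, 410, and 510, the third cell comes back empty, so the second sample line can be told apart from the first.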

answered on 18 Mar 2019, 02:57 PM

Hi,

Sorry to dig this up, but I receive the error below when executing the following code:

using (var fileStream = File.OpenRead(path + "DemoPdfTelerik.pdf"))
{
    RadFixedDocument document = new PdfFormatProvider(fileStream, FormatProviderSettings.ReadOnDemand).Import();
    string textDocument = new TextFormatProvider().Export(document);

    // Save the exported text first; otherwise the .txt file opened below is never created.
    File.WriteAllText(path + "DemoPdfTelerik.txt", textDocument);
    System.Diagnostics.Process.Start(path + "DemoPdfTelerik.txt");
}

'Telerik.Windows.Documents.Fixed.FormatProviders.Old.Pdf.DocumentModel.Data.PdfNameOld' to type 'Telerik.Windows.Documents.Fixed.FormatProviders.Old.Pdf.DocumentModel.Data.PdfDataStream'

Thanks

answered on 21 Mar 2019, 11:15 AM

I don't know if this is the right approach to extracting form field names and values.

I can pull out the FormField name, but FormField doesn't seem to have a Value property.

PS: val contains placeholder text because there is no Value property on FormField.

using (var fileStream = File.OpenRead(path + "DemoPdfTelerik.pdf"))
{
    RadFixedDocument document = new PdfFormatProvider(fileStream, FormatProviderSettings.ReadOnDemand).Import();

    Dictionary<string, string> dictionary = new Dictionary<string, string>();
    foreach (var field in document.AcroForm.FormFields)
    {
        var name = field.Name.ToString();
        // Placeholder: FormField exposes no Value property, so the value is not read yet.
        string val = "Need to find my value";

        dictionary.Add(name, val);
    }

    using (TextWriter tw = File.CreateText(path + "Output New.txt"))
    {
        foreach (var entry in dictionary)
        {
            // Write the field name followed by its value.
            tw.WriteLine("[{0} {1}]", entry.Key, entry.Value);
        }
    }
}

answered on 25 Mar 2019, 10:37 AM

Hi Marco,

Thank you for sharing the document.

I tested it and it seems like everything is working as expected. I also noticed that the document is created with PdfProcessing - is this the file causing you issues? If so, can you please confirm the version you use?

As for the values of the form fields: all fields derive from the FormField base class but expose different properties depending on their specific type. You can find an example showing how to work with the values of the form fields in a PDF document in our SDK repository: Modify Forms.
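
As a rough sketch of the idea (this is not the SDK example itself), the values could be read by checking the concrete type of each field. The type and property names used here (TextBoxField.Value, CheckBoxField.IsChecked, RadioButtonField.Value, ComboBoxField.Value) and the namespaces are assumptions based on the PdfProcessing documentation and should be verified against the version you use. FormFieldReader is a hypothetical helper name.

```csharp
using System.Collections.Generic;
using Telerik.Windows.Documents.Fixed.Model;
using Telerik.Windows.Documents.Fixed.Model.InteractiveForms;

// Hypothetical helper; member names are assumed from the documentation.
public static class FormFieldReader
{
    // Maps each field name to a string representation of its value.
    public static Dictionary<string, string> ReadValues(RadFixedDocument document)
    {
        var values = new Dictionary<string, string>();
        foreach (FormField field in document.AcroForm.FormFields)
        {
            values[field.Name] = Describe(field);
        }
        return values;
    }

    // Returns null for field types this sketch does not handle.
    private static string Describe(FormField field)
    {
        if (field is TextBoxField textBox)
        {
            return textBox.Value;
        }
        if (field is CheckBoxField checkBox)
        {
            return checkBox.IsChecked.ToString();
        }
        if (field is RadioButtonField radio)
        {
            // Value is assumed to be the selected option, or null when nothing is selected.
            return radio.Value == null ? null : radio.Value.Value;
        }
        if (field is ComboBoxField combo)
        {
            return combo.Value == null ? null : combo.Value.Value;
        }
        return null;
    }
}
```

The resulting dictionary could replace the placeholder value in the loop from the earlier post, so that each entry holds the real field value where one is available.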

Regards,
Tanya
Progress Telerik

Get quickly onboarded and successful with your Telerik and/or Kendo UI products with the Virtual Classroom free technical training, available to all active customers. Learn More.

answered on 26 Mar 2019, 12:26 PM

ModifyForms.zip

Hi Marco,

I tested the provided file with the specific version in the SDK example I linked in my previous reply but am still unable to find an issue in the functionality. I am attaching the test project - could you please test it on your end and let me know if you are still receiving an error? Is there something I am missing?

Regards,
Tanya
Progress Telerik


answered on 16 Apr 2019, 08:10 PM

Hello Marco,

Thank you for the upload.

I was now able to download the document and test it. It turned out that the offset of the cross-reference table start has an invalid value. We have logged a task to handle similar cases: PdfProcessing: Handle import of documents with invalid cross-reference table offsets. A workaround is attached to the public item, but please note that because this document is more complex (it contains incremental updates), the workaround is not applicable to it. Re-saving the document repairs it and may also remove the incremental updates, which is why re-saving could make it possible to import the document with PdfProcessing.
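
For anyone who wants to check a file for this kind of problem, here is a minimal sketch of a sanity check on the last cross-reference offset. It relies only on the general PDF file layout: the number after the final startxref keyword should point either at an xref table or at an object header of a cross-reference stream. XrefCheck is a hypothetical helper, not part of PdfProcessing and not the workaround from the public item. It inspects only the last section, so it does not follow the chain of earlier sections in a document with incremental updates.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class XrefCheck
{
    // Returns true when the offset after the last "startxref" keyword points at
    // an "xref" table or at an "N G obj" header (a cross-reference stream).
    public static bool HasPlausibleXrefOffset(byte[] pdfBytes)
    {
        // Latin-1 maps every byte to exactly one char, so string indexes equal byte offsets.
        string text = Encoding.GetEncoding("ISO-8859-1").GetString(pdfBytes);

        int keyword = text.LastIndexOf("startxref", StringComparison.Ordinal);
        if (keyword < 0) return false;

        // The offset is the first number after the keyword.
        Match number = Regex.Match(text.Substring(keyword + "startxref".Length), @"^\s*(\d+)");
        if (!number.Success) return false;

        long offset;
        if (!long.TryParse(number.Groups[1].Value, out offset) || offset >= text.Length) return false;

        int start = (int)offset;
        string target = text.Substring(start, Math.Min(32, text.Length - start));
        return target.StartsWith("xref", StringComparison.Ordinal)
            || Regex.IsMatch(target, @"^\d+\s+\d+\s+obj");
    }
}
```

It can be called as XrefCheck.HasPlausibleXrefOffset(File.ReadAllBytes(filePath)), where filePath is the document to inspect. A result of false suggests the offset is wrong, but true does not guarantee that the whole cross-reference data is valid.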

Hope this is helpful.

Regards,
Tanya
Progress Telerik
