This is a migrated thread and some comments may be shown as answers.

Import PDF data

Jesse
Top achievements
Rank 1
Jesse asked on 13 Feb 2015, 05:50 PM

I would like to be able to read a PDF document and import the contents of tables within the PDF document. Do you have an example that shows how I would:

  • Open an existing PDF document
  • Read the document a line at a time from top to bottom
  • Find tables in the document
  • Read the cell contents of those tables

 

17 Answers, 1 is accepted

Deyan
Telerik team
answered on 17 Feb 2015, 03:18 PM
Hello Jesse,

Thank you for contacting us.

In order to export the text content of a PDF file you can use the TextFormatProvider class. The following code snippet shows how to import RadFixedDocument from file stream and after that export the document to text:
RadFixedDocument document = new PdfFormatProvider(fileStream, FormatProviderSettings.ReadOnDemand).Import();
string textDocument = new TextFormatProvider().Export(document);

You should be aware that PDF is a fixed document format, so its text is preserved neither as flowing nor as tabular content. To export the file as a string, we internally use TextRecognizer, which groups the text fragments into text lines, so the resulting text may not always be as accurate as you expect.

I hope this is helpful. If you have any other questions or concerns please do not hesitate to contact us again.

Regards,
Deyan
the Telerik team
 

Check out the Telerik Platform - the only platform that combines a rich set of UI tools with powerful cloud services to develop web, hybrid and native mobile apps.

 
Jesse
Top achievements
Rank 1
answered on 17 Feb 2015, 03:48 PM
Thanks for the reply Deyan.

Reading the text looks to be a good start, but it does not seem to preserve which column the text came out of.
For instance both of these lines  (| indicates column breaks)
2/1/2005   |        24       |    88      |            |      380     |     100    |
2/1/2005   |        24       |              |    88    |      380     |     100    |

Results in
2/1/2005 24 88 380 100
with a single space between strings, with no way to tell that there were empty columns.

Is there a way to determine which column the text came from, or perhaps the fixed X-Y coordinates of each string?

Deyan
Telerik team
answered on 18 Feb 2015, 04:11 PM
Hello Jesse,

Because tables in a PDF file are stored as a set of lines and glyphs, there is no easy way to recognize table content in the document. With the current version of RadPdfProcessing, the only way to get the text content of the document is to use TextFormatProvider, which converts the whole document to a string.
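
To illustrate the X-coordinate idea raised in the question: if the horizontal position of each text fragment were available from some source (TextFormatProvider does not provide it), the fragments of one row could be assigned to columns by comparing each position with known column boundaries. The sketch below is plain C# and uses no Telerik API. The class name, method names, coordinates and boundaries are all hypothetical.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of any Telerik library.
public static class ColumnAssigner
{
    // Returns the index of the column whose [left, right) range contains x,
    // or -1 when x falls outside every column.
    public static int ColumnIndex(double x, IReadOnlyList<double> boundaries)
    {
        for (int i = 0; i + 1 < boundaries.Count; i++)
        {
            if (x >= boundaries[i] && x < boundaries[i + 1])
            {
                return i;
            }
        }
        return -1;
    }

    // Places each (X, Text) fragment of one row into its column.
    // Columns that receive no fragment stay as empty strings.
    public static string[] BuildRow(
        IEnumerable<(double X, string Text)> fragments,
        IReadOnlyList<double> boundaries)
    {
        var cells = new string[Math.Max(0, boundaries.Count - 1)];
        for (int i = 0; i < cells.Length; i++)
        {
            cells[i] = "";
        }

        foreach (var (x, text) in fragments)
        {
            int col = ColumnIndex(x, boundaries);
            if (col >= 0)
            {
                // Join fragments that land in the same column with a space.
                cells[col] = cells[col].Length == 0 ? text : cells[col] + " " + text;
            }
        }
        return cells;
    }
}
```

With assumed boundaries of 0, 100, 200, 300, 400, 500 and 600, a fragment "88" at X = 220 lands in the third column, while the same text at X = 320 lands in the fourth, so the two rows from the question are no longer indistinguishable.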

If you have other questions please contact us again.

Regards,
Deyan
the Telerik team
 

Dmitri
Top achievements
Rank 1
answered on 28 Feb 2019, 01:40 PM

Hello,

I have the following problem: when I use your code, I get an error saying that FormatProviderSettings does not exist in the current context. I have already included references to Telerik.Windows.Documents.Core, Telerik.Windows.Documents.Fixed, Telerik.Windows.Zip, Telerik.Windows.Documents.Fixed.FormatProviders.Pdf, Telerik.Windows.Documents.Fixed.FormatProviders.Text and Telerik.Windows.Documents.Fixed.Model.
How can I resolve this problem?

Peshito
Telerik team
answered on 05 Mar 2019, 11:25 AM
Hi Dmitri,

It is not clear why this issue would appear in the first place. I would first assume that you might have mixed the versions of the assemblies, but since you have checked this, I suggest sending us a sample runnable project that reproduces the issue. As this is a forum post, you could either attach the project by submitting a support ticket or share it through a third-party file-sharing site.

Regards,
Peshito
Progress Telerik
Get quickly onboarded and successful with your Telerik and/or Kendo UI products with the Virtual Classroom free technical training, available to all active customers. Learn More.
Marco
Top achievements
Rank 1
answered on 18 Mar 2019, 02:57 PM

Hi, 

Sorry to dig this up, but I receive the following error when executing this code:

using (var fileStream = File.OpenRead(path + "DemoPdfTelerik.pdf"))
{
    RadFixedDocument document = new PdfFormatProvider(fileStream, FormatProviderSettings.ReadOnDemand).Import();
    string textDocument = new TextFormatProvider().Export(document);

    // Write the exported text to disk before trying to open it.
    File.WriteAllText(path + "DemoPdfTelerik.txt", textDocument);
    System.Diagnostics.Process.Start(path + "DemoPdfTelerik.txt");
}

'Telerik.Windows.Documents.Fixed.FormatProviders.Old.Pdf.DocumentModel.Data.PdfNameOld' to type 'Telerik.Windows.Documents.Fixed.FormatProviders.Old.Pdf.DocumentModel.Data.PdfDataStream'

 

Thanks 

 

 

Tanya
Telerik team
answered on 21 Mar 2019, 10:03 AM
Hello Marco,

Similar errors are usually caused by something in the contents of the document. However, I am afraid that I cannot say what exactly the reason for the exception is without testing an example document. Would it be possible to provide one? Please, note that the public forums allow you to attach only images. If you would like to share the file privately, you can do so through our support ticketing system.

Regards,
Tanya
Progress Telerik
Marco
Top achievements
Rank 1
answered on 21 Mar 2019, 10:19 AM

Hi Tanya, 

Thanks for the update. My main goal is to be able to pull out the data from each AcroForm field in the PDF, which, to be honest, I'm not sure is possible. (If you have any examples of this, that would be great.)

Please use the following link to download the example PDF from my OneDrive account.

https://1drv.ms/b/s!AtJC-pYQjnotgjnCG5wmfuJ7j2zn

Best regards,

Marco

 

Marco
Top achievements
Rank 1
answered on 21 Mar 2019, 11:15 AM

I don't know if this is the right approach to extracting form field names and values.

It can pull out the form field name, but FormField doesn't seem to have a Value property?

P.S. val holds placeholder text because I could not find a Value property on FormField.

using (var fileStream = File.OpenRead(path + "DemoPdfTelerik.pdf"))
{
    RadFixedDocument document = new PdfFormatProvider(fileStream, FormatProviderSettings.ReadOnDemand).Import();

    // Collect each form field's name; the value lookup is still missing.
    Dictionary<string, string> dictionary = new Dictionary<string, string>();
    foreach (var widget in document.AcroForm.FormFields)
    {
        var name = widget.Name.ToString();
        string val = "Need to find my value";

        dictionary.Add(name, val);
    }

    // Write one "[name value]" line per field.
    using (TextWriter tw = File.CreateText(path + "Output New.txt"))
    {
        foreach (var entry in dictionary)
        {
            tw.WriteLine("[{0} {1}]", entry.Key, entry.Value);
        }
    }
}

Tanya
Telerik team
answered on 25 Mar 2019, 10:37 AM
Hi Marco,

Thank you for sharing the document.

I tested it and it seems like everything is working as expected. I also noticed that the document is created with PdfProcessing - is this the file causing you issues? If so, can you please confirm the version you use?

When it comes to working with the values of the form fields - they all implement the FormField base class but expose different properties depending on their specific type. You can find an example showing how to work with the values of the form fields in a PDF document in our SDK repository: Modify Forms.
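
As a rough illustration of that pattern only (a shared base type, with the value living on the concrete type), here is a self-contained C# sketch. The Demo* classes are hypothetical stand-ins written for this example and are not the PdfProcessing types; the real class and property names should be taken from the SDK example mentioned above.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-ins for illustration only, NOT Telerik types.
public abstract class DemoField
{
    public string Name { get; set; } = "";
}

public sealed class DemoTextField : DemoField
{
    public string Text { get; set; } = "";
}

public sealed class DemoCheckField : DemoField
{
    public bool Checked { get; set; }
}

public static class FieldReader
{
    // The base type has no value, so the concrete type decides how to read it.
    public static string ReadValue(DemoField field)
    {
        switch (field)
        {
            case DemoTextField text:
                return text.Text;
            case DemoCheckField check:
                return check.Checked ? "true" : "false";
            default:
                return ""; // unknown field type: nothing to read
        }
    }

    // Builds a name-to-value map, similar in shape to the dictionary in the question.
    public static Dictionary<string, string> ReadAll(IEnumerable<DemoField> fields)
    {
        var result = new Dictionary<string, string>();
        foreach (DemoField field in fields)
        {
            result[field.Name] = ReadValue(field);
        }
        return result;
    }
}
```

The same shape should carry over to a real document: iterate the fields, check each field's concrete type, and read the property that type exposes.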

Regards,
Tanya
Progress Telerik
Marco
Top achievements
Rank 1
answered on 25 Mar 2019, 01:32 PM

Hi Tanya, 

It seems to be the file. I'm using Runtime version v4.0.30319 and Version 2018.3.904.40.

 

Best regards,

Marco

Tanya
Telerik team
answered on 26 Mar 2019, 12:26 PM
Hi Marco,

I tested the provided file with the specific version in the SDK example I linked in my previous reply but am still unable to find an issue in the functionality. I am attaching the test project - could you please test it on your end and let me know if you are still receiving an error? Is there something I am missing?

Regards,
Tanya
Progress Telerik
Marco
Top achievements
Rank 1
answered on 09 Apr 2019, 10:10 AM

Hi Tanya, 

 

Sorry for the late reply. I'm still facing the same issue even with your project. 

Here's a link to another PDF file that I cannot seem to figure out.

 

https://swordoffice-my.sharepoint.com/:b:/r/personal/marco_wheller_sword-group_com/Documents/Documents/Contact%20Form%20test.pdf?csf=1&e=Gpakcw

Best regards,

Marco

 

 

Marco
Top achievements
Rank 1
answered on 10 Apr 2019, 01:23 PM

Hi Tanya,

It seems that I have to open and re-save the PDF for it to work. Would you be able to help with this?

 

best regards,

Marco

Tanya
Telerik team
answered on 12 Apr 2019, 07:30 AM
Hi Marco,

It seems like I don't have the rights to see the contents of the link. Can you share it again? 

Regards,
Tanya
Progress Telerik
Marco
Top achievements
Rank 1
answered on 12 Apr 2019, 08:02 AM

Hi Tanya, 

Sorry, please try this link https://1drv.ms/b/s!AtJC-pYQjnotgjvLxm2jKyPsOYXx

 

Best regards,

Marco

Tanya
Telerik team
answered on 16 Apr 2019, 08:10 PM
Hello Marco,

Thank you for the upload.

I was now able to download the document and test it. It turned out that the offset of the cross-reference table start has an invalid value. We have logged a task to handle similar cases: PdfProcessing: Handle import of documents with invalid cross-reference table offsets. A workaround is attached to the public item, but please note that since this document is more complex (it contains incremental updates), the workaround won't be applicable to it. Re-saving the document actually repairs it and may also remove the incremental updates inside, which is why re-saving helps PdfProcessing import it.
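
For readers who want to check whether a file has this kind of problem before importing it, here is a rough diagnostic sketch in plain C#. It does not use PdfProcessing and does not repair anything. It relies on the standard PDF layout, in which the file ends with the keyword startxref followed by the byte offset of the last cross-reference section, and it only checks that this one offset points at either the xref keyword or an object header (as used by cross-reference streams). The class and method names are made up for illustration, and earlier cross-reference sections in a file with incremental updates are not checked.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical diagnostic helper, not part of any Telerik library.
public static class StartXrefCheck
{
    // ISO-8859-1 maps each byte to exactly one char, so string indices equal byte offsets.
    private static readonly Encoding Latin1 = Encoding.GetEncoding("ISO-8859-1");

    // Returns the number written after the last "startxref" keyword, or -1 if it cannot be read.
    public static long ReadStartXrefOffset(byte[] pdf)
    {
        string text = Latin1.GetString(pdf);
        int keyword = text.LastIndexOf("startxref", StringComparison.Ordinal);
        if (keyword < 0)
        {
            return -1;
        }

        // Skip whitespace after the keyword, then collect the digits of the offset.
        int i = keyword + "startxref".Length;
        while (i < text.Length && char.IsWhiteSpace(text[i])) i++;
        int start = i;
        while (i < text.Length && char.IsDigit(text[i])) i++;

        long offset;
        return long.TryParse(text.Substring(start, i - start), out offset) ? offset : -1;
    }

    // True when the offset points at "xref" (classic table) or at an
    // "N G obj" header (cross-reference stream).
    public static bool OffsetLooksValid(byte[] pdf)
    {
        long offset = ReadStartXrefOffset(pdf);
        if (offset < 0 || offset >= pdf.Length)
        {
            return false;
        }

        int index = (int)offset;
        string head = Latin1.GetString(pdf, index, Math.Min(32, pdf.Length - index));
        return head.StartsWith("xref", StringComparison.Ordinal)
            || Regex.IsMatch(head, @"^\d+\s+\d+\s+obj");
    }
}
```

A call such as StartXrefCheck.OffsetLooksValid(System.IO.File.ReadAllBytes(filePath)), where filePath is a placeholder, returning false would suggest re-saving the file in a PDF editor before importing it.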

Hope this is helpful.

Regards,
Tanya
Progress Telerik