I would like to be able to read a PDF document and import the contents of tables within it. Do you have an example that shows how I would:
- Open an existing PDF document
- Read the document a line at a time from top to bottom
- Find tables in the document
- Read the cell contents of those tables
?
17 Answers, 1 is accepted
Thank you for contacting us.
To export the text content of a PDF file you can use the TextFormatProvider class. The following code snippet shows how to import a RadFixedDocument from a file stream and then export the document to text:
RadFixedDocument document = new PdfFormatProvider(fileStream, FormatProviderSettings.ReadOnDemand).Import();
string textDocument = new TextFormatProvider().Export(document);
You should be aware that PDF is a fixed document format, and its text is preserved neither as flowing nor as tabular content. To export the file as a string we internally use TextRecognizer, which groups the text fragments into text lines, so the resulting text may not always be as accurate as you expect.
I hope this is helpful. If you have any other questions or concerns please do not hesitate to contact us again.
Regards,
Deyan
the Telerik team
Check out the Telerik Platform - the only platform that combines a rich set of UI tools with powerful cloud services to develop web, hybrid and native mobile apps.
Reading the text looks to be a good start, but it does not seem to preserve which column the text came out of.
For instance both of these lines (| indicates column breaks)
2/1/2005 | 24 | 88 | | 380 | 100 |
2/1/2005 | 24 | | 88 | 380 | 100 |
result in
2/1/2005 24 88 380 100
with a single space between strings, with no way to tell that there were empty columns.
Is there a way to determine which column the text came out of, or maybe the fixed X-Y coordinates of the string?
Because tables in a PDF file are preserved only as a set of lines and glyphs, there is no easy way to recognize table content in the document. With the current version of RadPdfProcessing, the only way to get the text content of the document is to use the TextFormatProvider, which converts the whole document to a string.
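To illustrate why coordinates matter for the question above: if the X position of each text fragment were known, empty cells could be recovered by matching fragments against the column boundaries. The sketch below is independent of RadPdfProcessing. The `Fragment` type, the coordinates, and the column edges are all hypothetical, and it does not show how to obtain positions from the library.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ColumnSketch
{
    // Hypothetical text fragment: a string plus the X coordinate of its left edge.
    public sealed class Fragment
    {
        public double X { get; }
        public string Text { get; }
        public Fragment(double x, string text) { X = x; Text = text; }
    }

    // Places each fragment into the column whose [left, right) range contains its X.
    // columnEdges holds the left edge of every column plus the right edge of the last one.
    public static string[] ToCells(IEnumerable<Fragment> line, double[] columnEdges)
    {
        var cells = Enumerable.Repeat(string.Empty, columnEdges.Length - 1).ToArray();
        foreach (var fragment in line)
        {
            for (int i = 0; i < cells.Length; i++)
            {
                if (fragment.X >= columnEdges[i] && fragment.X < columnEdges[i + 1])
                {
                    cells[i] = fragment.Text;
                    break;
                }
            }
        }
        return cells;
    }

    public static void Main()
    {
        // Assumed column boundaries, e.g. measured from the table's vertical rules.
        double[] edges = { 0, 80, 120, 160, 200, 250, 300 };

        var first = new[]
        {
            new Fragment(5, "2/1/2005"), new Fragment(85, "24"), new Fragment(125, "88"),
            new Fragment(205, "380"), new Fragment(255, "100")
        };
        var second = new[]
        {
            new Fragment(5, "2/1/2005"), new Fragment(85, "24"), new Fragment(165, "88"),
            new Fragment(205, "380"), new Fragment(255, "100")
        };

        Console.WriteLine(string.Join(" | ", ToCells(first, edges)));
        Console.WriteLine(string.Join(" | ", ToCells(second, edges)));
    }
}
```

With these made-up coordinates, "88" lands in the third cell for the first row and in the fourth cell for the second, so the two rows no longer collapse into the same string.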
If you have other questions please contact us again.
Regards,
Deyan
the Telerik team
Hello,
I have the following problem: when I write your code, I get an error saying that FormatProviderSettings does not exist in the current context. I have already included the references Telerik.Windows.Documents.Core, Telerik.Windows.Documents.Fixed, Telerik.Windows.Zip, Telerik.Windows.Documents.Fixed.FormatProviders.Pdf, Telerik.Windows.Documents.Fixed.FormatProviders.Text and Telerik.Windows.Documents.Fixed.Model.
How can I resolve this problem?
It is not clear why this issue would appear in the first place. My first assumption would be that the versions of the assemblies are mixed, but since you have checked this, I suggest sending us a runnable sample project that reproduces the issue. As this is a forum post, you can either attach the project by submitting a support ticket or share it through a third-party file-sharing site.
Regards,
Peshito
Progress Telerik
Hi,
Sorry to dig this up, but I receive the exception below when executing the following code:
using (var fileStream = File.OpenRead(path + "DemoPdfTelerik.pdf"))
{
    RadFixedDocument document = new PdfFormatProvider(fileStream, FormatProviderSettings.ReadOnDemand).Import();
    string textDocument = new TextFormatProvider().Export(document);
    // Write the exported text to disk so that the file opened below exists.
    File.WriteAllText(path + "DemoPdfTelerik.txt", textDocument);
    System.Diagnostics.Process.Start(path + "DemoPdfTelerik.txt");
}
Unable to cast object of type 'Telerik.Windows.Documents.Fixed.FormatProviders.Old.Pdf.DocumentModel.Data.PdfNameOld' to type 'Telerik.Windows.Documents.Fixed.FormatProviders.Old.Pdf.DocumentModel.Data.PdfDataStream'
Thanks
Similar errors are usually caused by something in the contents of the document. However, I am afraid that I cannot say what exactly the reason for the exception is without testing an example document. Would it be possible to provide one? Please, note that the public forums allow you to attach only images. If you would like to share the file privately, you can do so through our support ticketing system.
Regards,
Tanya
Progress Telerik
Hi Tanya,
Thanks for the update. My main goal is to be able to pull out the data from each AcroForm field in the PDF, although to be honest I'm not sure whether that is possible. (If you have any examples of this, that would be great.)
Please use the following link to download the example PDF from my OneDrive account.
https://1drv.ms/b/s!AtJC-pYQjnotgjnCG5wmfuJ7j2zn
Best regards,
Marco
I don't know if this is the right approach to extracting the name and value of each form field.
It can pull out the FormField name, but FormField doesn't seem to have a Value property?
P.S. val holds that placeholder text because there is no Value property on FormField.
using (var fileStream = File.OpenRead(path + "DemoPdfTelerik.pdf"))
{
    RadFixedDocument document = new PdfFormatProvider(fileStream, FormatProviderSettings.ReadOnDemand).Import();
    Dictionary<string, string> dictionary = new Dictionary<string, string>();

    // Collect one entry per form field; the value is still a placeholder.
    foreach (var widget in document.AcroForm.FormFields)
    {
        var name = widget.Name.ToString();
        string val = "Need to find my value";
        dictionary.Add(name, val);
    }

    using (TextWriter tw = File.CreateText(path + "Output New.txt"))
    {
        foreach (var entry in dictionary)
        {
            // Write the field name followed by its value.
            tw.WriteLine("[{0} {1}]", entry.Key, entry.Value);
        }
    }
}
Thank you for sharing the document.
I tested it and it seems like everything is working as expected. I also noticed that the document is created with PdfProcessing - is this the file causing you issues? If so, can you please confirm the version you use?
When it comes to working with the values of the form fields - they all implement the FormField base class but expose different properties depending on their specific type. You can find an example showing how to work with the values of the form fields in a PDF document in our SDK repository: Modify Forms.
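As a rough illustration of the point above, the sketch below switches on the field type and casts to the concrete class to read its value. It is not the SDK example itself. The class and property names (TextBoxField.Value, CheckBoxField.IsChecked, the Value of RadioButtonField, ComboBoxField and ListBoxField, and the Telerik.Windows.Documents.Fixed.Model.InteractiveForms namespace) follow my reading of the PdfProcessing documentation and should be treated as assumptions to verify against the installed version. FormFieldReader and ReadValues are hypothetical names.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Linq;
using Telerik.Windows.Documents.Fixed.FormatProviders.Pdf;
using Telerik.Windows.Documents.Fixed.Model;
using Telerik.Windows.Documents.Fixed.Model.InteractiveForms;

// Hypothetical helper; requires the Telerik PdfProcessing assemblies.
public static class FormFieldReader
{
    // Returns a map from field name to a string form of the field's value.
    public static Dictionary<string, string> ReadValues(string pdfPath)
    {
        var values = new Dictionary<string, string>();
        using (Stream stream = File.OpenRead(pdfPath))
        {
            RadFixedDocument document = new PdfFormatProvider().Import(stream);
            foreach (FormField field in document.AcroForm.FormFields)
            {
                values[field.Name] = GetValue(field);
            }
        }
        return values;
    }

    // Assumed property names per field type; unhandled types yield null.
    private static string GetValue(FormField field)
    {
        switch (field.FieldType)
        {
            case FormFieldType.TextBox:
                return ((TextBoxField)field).Value;
            case FormFieldType.CheckBox:
                return ((CheckBoxField)field).IsChecked.ToString();
            case FormFieldType.RadioButton:
                return ((RadioButtonField)field).Value?.Value;
            case FormFieldType.ComboBox:
                return ((ComboBoxField)field).Value?.Value;
            case FormFieldType.ListBox:
                var selected = ((ListBoxField)field).Value;
                return selected == null ? null : string.Join(", ", selected.Select(o => o.Value));
            default:
                return null;
        }
    }
}
```

Returning strings keeps the result compatible with the Dictionary<string, string> used earlier in the thread. Buttons and signature fields fall through to null here because they carry no simple value.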
Regards,
Tanya
Progress Telerik
Hi Tanya,
It seems to be the file. I'm using runtime version v4.0.30319 and version 2018.3.904.40.
Best regards,
Marco
I tested the provided file with the specific version in the SDK example I linked in my previous reply but am still unable to find an issue in the functionality. I am attaching the test project - could you please test it on your end and let me know if you are still receiving an error? Is there something I am missing?
Regards,
Tanya
Progress Telerik
Hi Tanya,
Sorry for the late reply. I'm still facing the same issue even with your project.
Here's a link to another PDF file that I cannot seem to figure out.
https://swordoffice-my.sharepoint.com/:b:/r/personal/marco_wheller_sword-group_com/Documents/Documents/Contact%20Form%20test.pdf?csf=1&e=Gpakcw
Best regards,
Marco
Hi Tanya,
It seems that I have to open and re-save the PDF for it to work. Would you be able to help with this?
best regards,
Marco
It seems like I don't have the rights to see the contents of the link. Can you share it again?
Regards,
Tanya
Progress Telerik
Hi Tanya,
Sorry, please try this link https://1drv.ms/b/s!AtJC-pYQjnotgjvLxm2jKyPsOYXx
Best regards,
Marco
Thank you for the upload.
I was now able to download the document and test it. It turned out that the offset of the cross-reference table start has an invalid value. We have logged a task to handle similar cases: PdfProcessing: Handle import of documents with invalid cross-reference table offsets. A workaround is attached to the public item, but please note that because this document is more complex (it contains incremental updates), the workaround is not applicable to it. Re-saving the document repairs it and can also remove the incremental updates, which is why re-saving can help PdfProcessing import it.
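For readers who want to check a file for this kind of problem before importing it, here is a minimal sketch that reads the offset following the last startxref keyword and tests whether the bytes at that offset spell "xref". It uses only the .NET base class library, not PdfProcessing, and XrefOffsetCheck is a hypothetical name. It is a rough diagnostic under two stated limitations: files that use cross-reference streams (PDF 1.5 and later) legitimately point at an object instead of the "xref" keyword, and for files with incremental updates only the last section is inspected.

```csharp
using System;
using System.IO;
using System.Text;

public static class XrefOffsetCheck
{
    // Returns the offset written after the last "startxref" keyword, or -1 if none is found.
    public static long ReadStartXref(byte[] pdf)
    {
        // Latin-1 maps every byte to one char, so string indexes equal byte offsets.
        string text = Encoding.GetEncoding("ISO-8859-1").GetString(pdf);
        int keyword = text.LastIndexOf("startxref", StringComparison.Ordinal);
        if (keyword < 0) return -1;

        // Skip whitespace after the keyword, then collect the decimal digits.
        int pos = keyword + "startxref".Length;
        while (pos < text.Length && char.IsWhiteSpace(text[pos])) pos++;
        int start = pos;
        while (pos < text.Length && char.IsDigit(text[pos])) pos++;

        long offset;
        return long.TryParse(text.Substring(start, pos - start), out offset) ? offset : -1;
    }

    // True when the offset points at the "xref" keyword of a classic cross-reference table.
    public static bool PointsAtXrefTable(byte[] pdf, long offset)
    {
        const string marker = "xref";
        if (offset < 0 || offset + marker.Length > pdf.Length) return false;
        return Encoding.ASCII.GetString(pdf, (int)offset, marker.Length) == marker;
    }

    public static void Main(string[] args)
    {
        if (args.Length != 1)
        {
            Console.WriteLine("usage: XrefOffsetCheck <file.pdf>");
            return;
        }
        byte[] pdf = File.ReadAllBytes(args[0]);
        long offset = ReadStartXref(pdf);
        Console.WriteLine("startxref offset: " + offset);
        Console.WriteLine("points at xref table: " + PointsAtXrefTable(pdf, offset));
    }
}
```

A result of false is only a hint that the offset may be wrong, for the reasons given above. It does not repair the file; re-saving it in a PDF editor, as described in the reply, remains the fix.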
Hope this is helpful.
Regards,
Tanya
Progress Telerik