Import PDF data

4 posts, 0 answers
  1. Jesse
    6 posts
    Member since:
    Jan 2008

    Posted 13 Feb 2015

    I would like to be able to read a PDF document and import the contents of the tables within it. Do you have an example that shows how I would do the following?

    • Open an existing PDF document
    • Read the document a line at a time, from top to bottom
    • Find tables in the document
    • Read the cell contents of those tables

     

  2. Deyan
    Admin
    136 posts

    Posted 17 Feb 2015

    Hello Jesse,

    Thank you for contacting us.

    To export the text content of a PDF file, you can use the TextFormatProvider class. The following code snippet shows how to import a RadFixedDocument from a file stream and then export the document to text:
    RadFixedDocument document = new PdfFormatProvider(fileStream, FormatProviderSettings.ReadOnDemand).Import();
    string textDocument = new TextFormatProvider().Export(document);

    You should be aware that PDF is a fixed document format, so its text is preserved neither as flowing content nor as tabular content. To export the file as a string, we internally use TextRecognizer, which groups the text fragments into text lines, so the resulting text may not always be as accurate as you expect.
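
    For reference, here is a minimal sketch of how the two lines above might sit in a small console program that then walks the exported text one line at a time. The `using` namespaces, the class name, and taking the file path from `args[0]` are assumptions added for illustration; only the `PdfFormatProvider` and `TextFormatProvider` calls come from the snippet above. As a precaution, the stream is kept open until the export has finished, since the document is imported with `ReadOnDemand`.

```csharp
using System;
using System.IO;
// Namespaces below are assumed; adjust them to match your RadPdfProcessing version.
using Telerik.Windows.Documents.Fixed.FormatProviders;
using Telerik.Windows.Documents.Fixed.FormatProviders.Pdf;
using Telerik.Windows.Documents.Fixed.FormatProviders.Text;
using Telerik.Windows.Documents.Fixed.Model;

class PdfTextDump
{
    static void Main(string[] args)
    {
        // args[0]: path to an existing PDF file (hypothetical input).
        using (FileStream fileStream = File.OpenRead(args[0]))
        {
            // Same calls as in the snippet above.
            RadFixedDocument document =
                new PdfFormatProvider(fileStream, FormatProviderSettings.ReadOnDemand).Import();
            string textDocument = new TextFormatProvider().Export(document);

            // Read the exported text line by line, in the order the export returns it.
            using (StringReader reader = new StringReader(textDocument))
            {
                string line;
                while ((line = reader.ReadLine()) != null)
                {
                    Console.WriteLine(line);
                }
            }
        }
    }
}
```

    Each line printed this way is plain text only; no column or position information is attached to it.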

    I hope this is helpful. If you have any other questions or concerns please do not hesitate to contact us again.

    Regards,
    Deyan
    the Telerik team
     

    Check out the Telerik Platform - the only platform that combines a rich set of UI tools with powerful cloud services to develop web, hybrid and native mobile apps.

     
  3. Jesse

    Posted 17 Feb 2015 in reply to Deyan

    Thanks for the reply Deyan.

    Reading the text looks like a good start, but it does not seem to preserve which column the text came from.
    For instance, both of these lines (| indicates a column break)
    2/1/2005   |        24       |    88      |            |      380     |     100    |
    2/1/2005   |        24       |              |    88    |      380     |     100    |

    result in
    2/1/2005 24 88 380 100
    with a single space between strings and no way to tell that there were empty columns.

    Is there a way to determine which column the text came from, or perhaps the fixed X-Y coordinates of each string?

  4. Deyan
    Admin

    Posted 18 Feb 2015

    Hello Jesse,

    Because tables in a PDF file are preserved only as a set of lines and glyphs, there is no easy way to recognize table content in the document. With the current version of RadPdfProcessing, the only way to get the text content of the document is to use the TextFormatProvider, which converts the whole document to a string.
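
    To illustrate why positions would be needed to tell the two example rows apart, here is a purely hypothetical sketch. It assumes each piece of text comes with the X coordinate of its left edge and that the column start positions are already known. The `PositionedText` type, the `ColumnAssigner` helper, and all coordinate values are invented for illustration and are not part of RadPdfProcessing, which, as noted above, currently only returns the text as a string.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical fragment: a string plus the X coordinate of its left edge.
// NOT a RadPdfProcessing type; it exists only for this illustration.
class PositionedText
{
    public double X;
    public string Text;
    public PositionedText(double x, string text) { X = x; Text = text; }
}

static class ColumnAssigner
{
    // columnStarts: ascending X coordinates where each column begins (assumed known).
    public static string[] ToCells(IEnumerable<PositionedText> row, double[] columnStarts)
    {
        var cells = new string[columnStarts.Length];
        for (int i = 0; i < cells.Length; i++) cells[i] = "";

        foreach (PositionedText fragment in row)
        {
            // Choose the last column that starts at or to the left of the fragment.
            int column = 0;
            for (int i = 0; i < columnStarts.Length; i++)
            {
                if (fragment.X >= columnStarts[i]) column = i;
            }
            cells[column] = cells[column].Length == 0
                ? fragment.Text
                : cells[column] + " " + fragment.Text;
        }
        return cells;
    }
}

class Demo
{
    static void Main()
    {
        // Made-up column boundaries and coordinates for the two example rows.
        double[] columnStarts = { 0, 100, 200, 300, 400, 500 };
        var rowA = new[]
        {
            new PositionedText(5, "2/1/2005"), new PositionedText(120, "24"),
            new PositionedText(210, "88"), new PositionedText(410, "380"),
            new PositionedText(510, "100")
        };
        var rowB = new[]
        {
            new PositionedText(5, "2/1/2005"), new PositionedText(120, "24"),
            new PositionedText(310, "88"), new PositionedText(410, "380"),
            new PositionedText(510, "100")
        };

        Console.WriteLine(string.Join(" | ", ColumnAssigner.ToCells(rowA, columnStarts)));
        Console.WriteLine(string.Join(" | ", ColumnAssigner.ToCells(rowB, columnStarts)));
    }
}
```

    With coordinates available, the value 88 lands in the third cell for the first row and the fourth cell for the second, which is exactly the distinction that is lost once the rows are flattened to space-separated text.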

    If you have other questions please contact us again.

    Regards,
    Deyan
    the Telerik team
     
