This is a migrated thread and some comments may be shown as answers.

Get X,Y coordinates of word/text on PDF

15 Answers 755 Views
PdfViewer and PdfViewerNavigator
André asked on 26 Mar 2019, 07:16 AM

Hi,

I need to build a solution that exports text from a PDF together with its X,Y coordinates.

Can this be done with PdfViewer or another Telerik product?

Best regards,
André

15 Answers, 1 is accepted

Tanya
Telerik team
answered on 28 Mar 2019, 01:22 PM
Hi André,

You can traverse the content of a PDF document with the PdfProcessing library and inspect only the elements that contain text. Import the document, then iterate over the Content collection of each RadFixedPage:
// pdfDocument is the already imported document.
foreach (var page in this.pdfDocument.Pages)
{
    foreach (var item in page.Content)
    {
        // Only TextFragment elements carry text; other content yields null here.
        var textFragment = item as Telerik.Windows.Documents.Fixed.Model.Text.TextFragment;
        if (textFragment != null)
        {
            // Position of the fragment on the page.
            var position = textFragment.Position;
        }
    }
}
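
The loop assumes that pdfDocument has already been imported. A minimal sketch of that step follows, assuming the file is read from disk with PdfFormatProvider. The class name TextFragmentDump and the path "invoice.pdf" are placeholders, and the namespaces and the exact Import overload should be checked against the PdfProcessing version in use.

```csharp
using System;
using System.IO;
using Telerik.Windows.Documents.Fixed.FormatProviders.Pdf;
using Telerik.Windows.Documents.Fixed.Model;
using Telerik.Windows.Documents.Fixed.Model.Text;

public static class TextFragmentDump // hypothetical class name
{
    public static void Main()
    {
        var provider = new PdfFormatProvider();

        // "invoice.pdf" is a placeholder path.
        using (Stream input = File.OpenRead("invoice.pdf"))
        {
            RadFixedDocument pdfDocument = provider.Import(input);

            int pageNumber = 0;
            foreach (RadFixedPage page in pdfDocument.Pages)
            {
                pageNumber++;
                foreach (var item in page.Content)
                {
                    // Skip everything that is not text.
                    var fragment = item as TextFragment;
                    if (fragment != null)
                    {
                        Console.WriteLine("page {0}: {1}", pageNumber, fragment.Text);
                    }
                }
            }
        }
    }
}
```

The iteration is kept inside the using block so the stream is still open while the page content is read.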

Hope this is helpful.

Regards,
Tanya
Progress Telerik
Get quickly onboarded and successful with your Telerik and/or Kendo UI products with the Virtual Classroom free technical training, available to all active customers. Learn More.
André
answered on 10 Feb 2020, 12:20 PM

I was unable to test it earlier, but I think this is what I am looking for.

I am still a beginner, so this may be a simple question:

How can I convert "var position" into position values in a string? I read about Matrix and IPosition, but only found how to set them ({set}), not how to read them ({get}).

I hope you can help me.

Best regards,
André

Tanya
Telerik team
answered on 12 Feb 2020, 02:40 PM

Hi André,

Let me first clarify some specifics of the format so we are on the same page. PDF is a fixed-document format, which means that all elements inside it are represented by separate geometries and glyphs, each positioned at a fixed place. Each word in a PDF document consists of several glyphs drawn at adjacent positions.

The Position property of the TextFragment class represents the starting position of the fragment. Please note that a TextFragment instance might contain a single letter of a word or several words, depending on how the PDF document was generated. The position-related properties have getters, so you can use them to obtain the position of a specific element in the document.

Can you share more information about the exact scenario you are trying to achieve? Why do you need the coordinates, and exactly which coordinates would work for your case: one set per letter, or one per word?

Regards,
Tanya
Progress Telerik

André
answered on 13 Feb 2020, 11:37 AM

Hi Tanya,

I want to make an XML export of each letter or word in PDF invoices so I can scan for InvoiceNo, InvoiceDate, Total Amount, etc.
With the coordinates I can build a template per client for future invoices.

The export will be something like:
<PDFTekst>
  <Woord>
    <Pagina>1</Pagina>
    <BeginX>42,51968</BeginX>
    <BeginY>115,1646</BeginY>
    <EindeX>90,52768</EindeX>
    <EindeY>121,2126</EindeY>
    <Tekst>InvoiceNo</Tekst>
  </Woord>
  <Woord>
    <Pagina>1</Pagina>
    <BeginX>92,75168</BeginX>
    <BeginY>115,1646</BeginY>
    <EindeX>103,4237</EindeX>
    <EindeY>121,2126</EindeY>
    <Tekst>202017283</Tekst>
  </Woord>
  <Woord>
    <Pagina>1</Pagina>
    <BeginX>42,51968</BeginX>
    <BeginY>123,6685</BeginY>
    <EindeX>85,63969</EindeX>
    <EindeY>129,7325</EindeY>
    <Tekst>Date</Tekst>
  </Woord>
  <Woord>
    <Pagina>1</Pagina>
    <BeginX>87,86369</BeginX>
    <BeginY>123,6685</BeginY>
    <EindeX>92,31168</EindeX>
    <EindeY>129,7325</EindeY>
    <Tekst>2020-02-13</Tekst>
  </Woord>
</PDFTekst>

I have already succeeded with another (non-Telerik) library, but I want to build this with Telerik components.
I can assemble the words or lines from individual characters based on their positions, so positions per letter are fine for me. Positions per word would be even better, so I prefer that if it is possible.

I hope this clarifies my question.

Best regards,
André

Tanya
Telerik team
answered on 14 Feb 2020, 12:03 PM

Hello André,

You can create templates using interactive forms and this would be the easiest way to achieve the desired functionality. The template can be generated using the API of PdfProcessing and then visualized in PdfViewer. Would that be an option for you?

Regards,
Tanya
Progress Telerik

André
answered on 14 Feb 2020, 05:50 PM

Hello Tanya,

I think you misunderstood my project. The templates I mentioned are only a record of the text fragment positions, stored in a database. That is handled by another program.

What I need for my Telerik project is the position of every text fragment. In your first reply you already gave me an example of how to get it.

My only problem is how to convert "var position" into X-Begin, X-End, Y-Begin and Y-End values. The variable seems to be a matrix, and I don't know how to get the coordinates out of it.

Can you show me, with a small example, how to get the positions out of the matrix as strings?

Thanks for helping me.

Best regards,
André

Tanya
Telerik team
answered on 17 Feb 2020, 10:48 AM

Hi André,

Please excuse the misunderstanding.

The MatrixPosition object obtained from the TextFragment exposes the OffsetX and OffsetY properties. These properties determine the start position of the fragment. Its end position, however, is determined dynamically by the specific font settings applied to the content. That is why no information about the end position of the content is available.

Hope this answers your question.

Regards,
Tanya
Progress Telerik

André
answered on 17 Feb 2020, 10:56 AM

Hi Tanya,

I had already found this link. It is all about creating content. I need to GET the values, not SET them.

Can you show me, with a small example, how to get the positions out of the matrix as strings?

Best regards,
André

Tanya
Telerik team
answered on 17 Feb 2020, 12:46 PM

Hi André,

You can use the following code to get the positions and the content of a TextFragment as a string:

string startX = textFragment.Position.Matrix.OffsetX.ToString();
string startY = textFragment.Position.Matrix.OffsetY.ToString();
string text = textFragment.Text;
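
Note that ToString() without arguments formats numbers with the current culture, so on a system with Dutch regional settings the offsets come out with a decimal comma, the same style seen in the sample export earlier in this thread. A small sketch, assuming the offsets have already been read as doubles, that writes them culture-independently using the element names from that export. PositionXml and ToWoord are hypothetical names and not part of any Telerik API. The end coordinates are left out because only the start position is available from the fragment.

```csharp
using System.Globalization;
using System.Xml.Linq;

public static class PositionXml // hypothetical helper, not part of Telerik
{
    // Builds one <Woord> element from a start position and its text.
    public static XElement ToWoord(int page, double startX, double startY, string text)
    {
        return new XElement("Woord",
            new XElement("Pagina", page),
            // InvariantCulture always uses '.' as the decimal separator.
            new XElement("BeginX", startX.ToString(CultureInfo.InvariantCulture)),
            new XElement("BeginY", startY.ToString(CultureInfo.InvariantCulture)),
            new XElement("Tekst", text));
    }
}
```

Inside the loop over the page content this could be called as PositionXml.ToWoord(pageNumber, textFragment.Position.Matrix.OffsetX, textFragment.Position.Matrix.OffsetY, textFragment.Text), with pageNumber being a counter you maintain yourself.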

Regards,
Tanya
Progress Telerik

André
answered on 17 Feb 2020, 01:52 PM

Hi Tanya,

Thanks. This was what I needed. I had tried this before, but did not succeed. To use Matrix I had to add a reference to WindowsBase. That seems to have been my biggest problem.

Regards
André

Tanya
Telerik team
answered on 18 Feb 2020, 09:01 AM

Hi André,

Can you please share more details on what is preventing you from adding a reference to WindowsBase? The assembly should be available with your .NET Framework installation and it is a dependency of the PdfProcessing library.

Regards,
Tanya
Progress Telerik

André
answered on 18 Feb 2020, 09:06 AM

Hi Tanya,

Nothing is preventing me from adding a reference to WindowsBase. I had the problem with Matrix because the reference was missing. Adding WindowsBase solved it.

I succeeded yesterday in making my application work, thanks to you.

Regards,
André

Tanya
Telerik team
answered on 18 Feb 2020, 09:59 AM

Hello André,

Thank you for the clarification. I am glad to hear that you managed to achieve the desired result.

Regards,
Tanya
Progress Telerik

André
answered on 18 Feb 2020, 05:05 PM

Hi Tanya,

I have just run into a problem when opening a file, so I have one more question:
Is it possible to check whether a PDF file I want to open is encrypted, and is there a way to open it when it uses standard encryption?

I know how to encrypt a PDF file when creating it, but not how to handle encryption when opening one. The examples I found were all about creating/writing.

Regards,
André


Tanya
Telerik team
answered on 19 Feb 2020, 01:57 PM

Hello André,

You can open encrypted PDF documents using the ImportSettings of PdfProcessing's PdfFormatProvider. While an encrypted document is being imported, the UserPasswordNeeded event is fired so you can provide the password.
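
A minimal sketch of that approach, assuming the password is known up front. The class name EncryptedPdfImport and the method Open are hypothetical, and the namespaces and the Import overload should be verified against the PdfProcessing version in use.

```csharp
using System.IO;
using Telerik.Windows.Documents.Fixed.FormatProviders.Pdf;
using Telerik.Windows.Documents.Fixed.FormatProviders.Pdf.Import;
using Telerik.Windows.Documents.Fixed.Model;

public static class EncryptedPdfImport // hypothetical class name
{
    public static RadFixedDocument Open(string path, string password)
    {
        var settings = new PdfImportSettings();

        // Raised during import when the document asks for a user password.
        settings.UserPasswordNeeded += (sender, args) =>
        {
            args.Password = password;
        };

        var provider = new PdfFormatProvider();
        provider.ImportSettings = settings;

        using (Stream input = File.OpenRead(path))
        {
            return provider.Import(input);
        }
    }
}
```

Because the event is raised only when a password is requested, setting a flag inside the handler is presumably one way to detect that a file is encrypted.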

On a side note, we are always trying to keep the sections and different conversations in the public forums in good order. Thus, I would like to ask you to submit different topics or raise support tickets for the different questions you might have. We believe that such a separation would be beneficial for both sides. Thank you for understanding.

Regards,
Tanya
Progress Telerik
