This is a migrated thread and some comments may be shown as answers.

Mismatch when exporting HTML to DOCX and vice versa

WordsProcessing
Karl-Heinz asked on 19 Dec 2017, 01:33 PM

Hi,

 

I have an issue with round-tripping content between HTML and DOCX. I export an HTML unordered list to DOCX and then convert the DOCX back to HTML, using DocxFormatProvider and HtmlFormatProvider. The condensed code below illustrates the problem.

The Telerik.Windows.Documents.* libraries are version 2017.2.428.40, and DocumentFormat.OpenXml is version 2.5.5631.0.

 

using System;
using Telerik.Windows.Documents.Flow.FormatProviders.Docx;
using Telerik.Windows.Documents.Flow.FormatProviders.Html;

var html = "<ul><li>1</li><li>2</li></ul>";
Console.WriteLine("Original HTML: " + html);
var docxFormatProvider = new DocxFormatProvider();
var htmlFormatProvider = new HtmlFormatProvider();
// Round trip: HTML -> RadFlowDocument -> DOCX bytes -> RadFlowDocument
var document = htmlFormatProvider.Import(html);
var bytes = docxFormatProvider.Export(document);
document = docxFormatProvider.Import(bytes);
// Export only a fragment, with no style information and no indentation
htmlFormatProvider.ExportSettings.DocumentExportLevel = DocumentExportLevel.Fragment;
htmlFormatProvider.ExportSettings.StylesExportMode = StylesExportMode.None;
htmlFormatProvider.ExportSettings.IndentDocument = false;
html = htmlFormatProvider.Export(document);
Console.WriteLine("New HTML: " + html);
Console.ReadKey();

 

The console output is:

Original HTML: <ul><li>1</li><li>2</li></ul>
New HTML: <body><ul style="list-style-type: disc;"><li style="font-family: Symbol;" value="1"><span style="font-family: Times New Roman;">1</span></li><li style="font-family: Symbol;" value="2"><span style="font-family: Times New Roman;">2</span></li></ul></body>

 

The OOXML generated when exporting the HTML to DOCX is:

  <w:body>
    <w:p>
      <w:pPr>
        <w:pStyle w:val="NormalWeb" />
        <w:numPr>
          <w:ilvl w:val="0" />
          <w:numId w:val="1" />
        </w:numPr>
        <w:rPr />
      </w:pPr>
      <w:r>
        <w:rPr />
        <w:t>1</w:t>
      </w:r>
    </w:p>
    <w:p>
      <w:pPr>
        <w:pStyle w:val="NormalWeb" />
        <w:numPr>
          <w:ilvl w:val="0" />
          <w:numId w:val="1" />
        </w:numPr>
        <w:rPr />
      </w:pPr>
      <w:r>
        <w:rPr />
        <w:t>2</w:t>
      </w:r>
    </w:p>
    <w:sectPr />
  </w:body>
</w:document>

 

What puzzles me is the style definitions in the exported HTML, for instance font-family: Symbol; on the list items. We use the JavaScript API for Office to inject the exported HTML into content controls in Word documents. The example HTML here injects a bullet list into the content control, but if the user then adds new bullets to the list in Word, their font is set to Symbol. I also don't understand why spans with font-family: Times New Roman are added. This causes some line spacing issues, and we use Arial as our standard font.

 

Does anyone have some input on this? Thanks.

 

Best regards,

Geir Morten Hagen

7 Answers, 1 is accepted

Karl-Heinz answered on 19 Dec 2017, 01:41 PM

I'm not sure if it is relevant, but the htmlFormatProvider.ImportSettings.DefaultStyleSheet value is:

b, strong { font-weight: bold; }
 
s, del, strike { text-decoration: line-through; }
 
i, em, dfn, var, cite { font-style: italic; }
 
u, ins { text-decoration: underline; }
 
sub { vertical-align: sub; }
 
sup { vertical-align: super; }
 
center { text-align: center; }
 
code, kbd, samp, tt, pre
{
    font-family: monospace;
    font-size: 10pt;
}
 
a {
    color: #06C;
    text-decoration: underline;
}
 
h1 {
    font-size: 24pt;
    font-weight: bold;
    margin-top: 14pt;
    margin-bottom: 14pt;
}
 
h2 {
    font-size: 18pt;
    font-weight: bold;
    margin-top: 14pt;
    margin-bottom: 14pt;
}
 
h3 {
    font-size: 13.55pt;
    font-weight: bold;
    margin-top: 14pt;
    margin-bottom: 14pt;
}
 
h4 {
    font-weight: bold;
    margin-top: 14pt;
    margin-bottom: 14pt;
}
 
h5 {
    font-size: 10pt;
    font-weight: bold;
    margin-top: 14pt;
    margin-bottom: 14pt;
}
 
h6 {
    font-size: 7.55pt;
    font-weight: bold;
    margin-top: 14pt;
    margin-bottom: 14pt;
}

Karl-Heinz answered on 20 Dec 2017, 08:42 AM
Note that the issue with font-family: Symbol; only occurs on unordered lists. The same does not happen on ordered lists.

Tanya (Telerik team) answered on 22 Dec 2017, 10:47 AM
Hi Karl-Heinz,

Thank you for the detailed information.

The Times New Roman font family is applied as a default font family by the Normal style of the generated RadFlowDocument. You can change the font family that is used by default by changing the properties of the respective style:
document.StyleRepository.GetStyle("Normal").CharacterProperties.FontFamily.LocalValue = new ThemableFontFamily("Arial");

The font family is actually exported in the document to guarantee that it will look as expected when opened in any other application. The bullet used for the list is a character defined in the Symbol font, thus it needs to be defined to avoid issues in the different applications. Then, the font family of the content inside the list item is set so it can reset the font family of the bullet. I couldn't reproduce an issue with the list items in MS Word. Can you please share more details on how you insert the content into the content control?

The FontFamily property, however, doesn't affect the line spacing in any way and, since a similar setting is not applied to the document, I believe it is a default styling of the application that is showing the content (MS Word, if I properly understand the scenario).

Regards,
Tanya
Progress Telerik

Karl-Heinz answered on 03 Jan 2018, 08:23 AM

Hi,

 

Thank you for your answer. After some experimenting, it turned out that the issue with bullet lists was fixed simply by setting the styles export mode to Inline on the HTML format provider.

The font issue was also resolved using your suggestion.

Regarding line spacing I will do some further investigation and will get back to you if I can't figure out what the issue is.

 

Best regards,

Geir Morten Hagen

Tanya (Telerik team) answered on 04 Jan 2018, 04:11 PM (accepted answer)
Hi Karl-Heinz,

I am happy to hear that the issues are resolved.

Setting the StylesExportMode to Inline forces the format provider to apply all the styling locally to the elements. Thus, the styles are available in the elements and the behavior you observed is not present anymore. Using the same approach for the line spacing could be helpful as well - in the exported HTML, it should be defined as margin-top and margin-bottom values for the spacing before and after, and as line-height for the line spacing.
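
The two fixes discussed in this thread (Arial on the Normal style, and inline style export) can be sketched together as follows. This is a minimal sketch, not a definitive implementation: the class and method names are hypothetical, and the namespace assumed for ThemableFontFamily may differ between Telerik versions.

```csharp
using Telerik.Windows.Documents.Flow.FormatProviders.Docx;
using Telerik.Windows.Documents.Flow.FormatProviders.Html;
using Telerik.Windows.Documents.Flow.Model;
// Assumption: ThemableFontFamily lives in this namespace in the 2017 libraries.
using Telerik.Windows.Documents.Spreadsheet.Model;

// Hypothetical helper, named for illustration only.
public static class RoundTripSketch
{
    public static string DocxToInlineHtml(byte[] docxBytes)
    {
        RadFlowDocument document = new DocxFormatProvider().Import(docxBytes);

        // Replace the default Times New Roman of the Normal style with Arial.
        var normal = document.StyleRepository.GetStyle("Normal");
        if (normal != null)
        {
            normal.CharacterProperties.FontFamily.LocalValue = new ThemableFontFamily("Arial");
        }

        var htmlProvider = new HtmlFormatProvider();
        htmlProvider.ExportSettings.DocumentExportLevel = DocumentExportLevel.Fragment;
        // Inline instead of None: styling is written onto the elements themselves.
        htmlProvider.ExportSettings.StylesExportMode = StylesExportMode.Inline;
        htmlProvider.ExportSettings.IndentDocument = false;

        return htmlProvider.Export(document);
    }
}
```

The null check on the Normal style is a defensive assumption; the thread does not say whether the style is always present after a DOCX import.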

Regards,
Tanya
Progress Telerik

Karl-Heinz answered on 10 Jan 2018, 07:49 AM

Thanks for all your help, Tanya!

 

Our scenario is a bit complex, with the need to go back and forth between HTML and OOXML, but all the issues have now been resolved. The line spacing issue was not really a style issue after all. In the HTML generated from OOXML, every <li> item was given a value attribute. I removed this attribute using HtmlAgilityPack and it's working fine now.
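
That cleanup step could look roughly like the sketch below. It assumes the HtmlAgilityPack NuGet package is referenced; the class and method names are hypothetical, and the actual code used in the project may differ.

```csharp
using HtmlAgilityPack;

// Hypothetical helper, named for illustration only.
public static class ListItemCleanup
{
    // Removes the value attribute from every <li> in the given HTML fragment.
    public static string RemoveListItemValues(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var items = doc.DocumentNode.SelectNodes("//li[@value]");
        if (items != null)
        {
            foreach (var li in items)
            {
                li.Attributes.Remove("value");
            }
        }

        return doc.DocumentNode.OuterHtml;
    }
}
```

It would be called on the string returned by htmlFormatProvider.Export(document) before the HTML is passed on to Word.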

 

Best regards,

Geir Morten Hagen

Tanya (Telerik team) answered on 10 Jan 2018, 08:03 AM
Hi Geir Morten Hagen,

Thank you for the feedback. I am happy to hear that everything is now working as desired.

Regards,
Tanya
Progress Telerik
