For the Q1 2013 release of Telerik’s controls for ASP.NET AJAX we had an important objective: let end users paste documents from MS Word into our rich-text Editor and get nice, clean HTML that still preserves the original appearance. Read on to see the bonus we added :) But first, some background:
What MS Word content looks like
Let’s take a look at the HTML below, which is generated by pasting a paragraph from MS Word into a simple editable iframe. The original text is bold and red.
<!--[if gte mso 9]><xml>
</xml><![endif]--> <p class
Some text</span></b></p> <!--[if gte mso 9]><xml>
<w:LatentStyles DefLockedState="false" DefUnhideWhenUsed="true"
DefSemiHidden="true" DefQFormat="false" DefPriority="99"
<w:LsdException Locked="false" Priority="39" QFormat="true" Name="TOC Heading"/>
</xml><![endif]--><!--[if gte mso 10]>
/* Style Definitions */
And all of that is a trimmed-down version of the actual content generated for just two words.
The code behind the magic
We can see that most of the content is XML formatting wrapped in comments. So we can strip the comments with a regular expression, which simplifies the content considerably:
content = content.replace(/<!--[\s\S]*?-->/gm, "");
Now the content will be only the paragraph:
<p class="MsoNormal"><b style="mso-bidi-font-weight:normal"><span style="color:red;mso-ansi-language:EN-US">Some text</span></b></p>
This is, obviously, still not clean enough - it contains MS Word formatting which is not valid CSS (or HTML markup in more complex cases).
Now, can we clean it further by using only regular expressions? Unfortunately, the answer is “no”. MS Word styles vary greatly and there is no way to be sure that we have accounted for every possible case. Needless to say, they also differ between versions of MS Word.
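To see both what the comment-stripping expression achieves and where it stops, here is a minimal sketch that can be run outside the Editor. The function name `stripHtmlComments` and the sample string are assumptions made for illustration; only the regular expression and the paragraph markup come from the example above.

```javascript
// Hypothetical helper (not part of the Editor API): removes every HTML
// comment, including Word's conditional blocks such as
// <!--[if gte mso 9]> ... <![endif]-->.
function stripHtmlComments(html) {
  // [\s\S] matches any character, newlines included; *? is lazy, so each
  // match ends at the first closing "-->".
  return html.replace(/<!--[\s\S]*?-->/g, "");
}

// Shortened stand-in for pasted Word content: one conditional comment
// followed by the paragraph from the example above.
const pasted =
  "<!--[if gte mso 9]><xml>\n</xml><![endif]-->" +
  '<p class="MsoNormal"><b style="mso-bidi-font-weight:normal">' +
  '<span style="color:red;mso-ansi-language:EN-US">Some text</span></b></p>';

const cleaned = stripHtmlComments(pasted);

// The comment block is gone, but the mso-* declarations are still there.
console.log(cleaned);
console.log(cleaned.includes("mso-ansi-language")); // true
```

The comments disappear, yet the Word-specific declarations inside the `style` attributes survive, which is exactly the part a regular expression cannot handle reliably.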
What else do we do?
Customize the stripping options
We keep a simple array of all the CSS properties we want to preserve and compare the style rules on each DOM node against its items. Only the properties present in the array are kept, which effectively removes all the invalid properties that MS Word gave us.
This is far more reliable than relying on regular expressions alone. Its only downside is that it is a bit slower, but the trade-off is well worth it.
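The whitelist idea can be sketched in a few lines. This is not the Editor's actual implementation, which works on DOM nodes; it is a simplified, string-based illustration that runs without a browser. The names `propertiesToKeep`, `toCamelCase` and `filterStyle` are hypothetical, and splitting on `;` assumes that no property value contains a semicolon.

```javascript
// Hypothetical whitelist, using DOM-style camelCase property names.
const propertiesToKeep = ["color", "backgroundColor"];

// Converts a CSS property name such as "background-color" to "backgroundColor".
function toCamelCase(name) {
  return name.replace(/-([a-z])/g, (_, letter) => letter.toUpperCase());
}

// Returns the style text with every declaration that is not whitelisted removed.
// Simplification: assumes no value contains a ";" (e.g. no data URIs).
function filterStyle(styleText, keep) {
  return styleText
    .split(";")
    .map((declaration) => declaration.trim())
    .filter((declaration) => {
      const colon = declaration.indexOf(":");
      if (colon === -1) return false; // skip empty or malformed pieces
      const name = declaration.slice(0, colon).trim().toLowerCase();
      return keep.includes(toCamelCase(name));
    })
    .join(";");
}

console.log(filterStyle("color:red;mso-ansi-language:EN-US", propertiesToKeep));
// prints "color:red"
console.log(filterStyle("mso-bidi-font-weight:normal", propertiesToKeep));
// prints an empty line: nothing in this style is whitelisted
```

Because unknown properties are dropped by default, a new or undocumented `mso-*` property needs no special handling, which is what makes a whitelist more robust than trying to match every Word-specific pattern.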
The array is named Telerik.Web.UI.Editor.Utils.cssPropertiesToKeep and you can override it on your page. This gives you detailed control over the formatting that will be kept:
- If you don’t need to keep any styles, you can just set it to an empty array: Telerik.Web.UI.Editor.Utils.cssPropertiesToKeep = [];
- Or you can list only the styles that you want to remain: Telerik.Web.UI.Editor.Utils.cssPropertiesToKeep = ['color', 'backgroundColor'];
Place this line of code at the end of the form, right before the closing </form> tag, and you are set for the entire page! Easy as pie!
With a server property
What I would like to show you is the new member of the enum it takes – MSWordNoMargins. Setting
is one of the best combinations and this is why it is the default value. It will give you clean HTML content that will preserve the original formatting from the MS Word document. Here is how our original example looks with it:
Give it a try and share your feedback
The good news is that this will work for complex content as well. Give it a shot – try it with a couple of bulleted lists, or colored text, or tables in the online demos. There is still work to be done, so let us know what you need the most – add a comment, post in the forums or open a private support ticket.