For the Q1 2013 release of Telerik’s controls for ASP.NET AJAX we had an important objective – let the end user paste documents from MS Word in our rich-text Editor and produce nice, clean HTML while still preserving the original appearance. Read on to see the bonus we added :) But first, some background:
Let’s take a look at the HTML below, which is generated by pasting a paragraph from MS Word into a simple editable iframe. The original document has bold and red color styles applied.
And all of that is a trimmed down version of the actual content for just two words.
We can see that most of the content is commented-out XML formatting. So, we can clear the comments with a regular expression, which will simplify the content:
content = content.replace(/<!--[\s\S]*?-->/gm, "");
Now the content will be only the paragraph:
<p class="MsoNormal"><b style="mso-bidi-font-weight:normal"><span style="color:red;mso-ansi-language:EN-US">Some text</span></b></p>
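If you want to try this step in isolation, here is a small sketch. The sample string is hand-written to mimic the shape of pasted Word content (it was not captured from Word), and stripComments is a hypothetical helper name used only for illustration:

```javascript
// Hypothetical helper: removes every HTML comment, including multi-line ones.
function stripComments(content) {
  // [\s\S] matches any character, newlines included; *? keeps the match lazy,
  // so each comment is removed on its own instead of everything between
  // the first "<!--" and the last "-->".
  return content.replace(/<!--[\s\S]*?-->/g, "");
}

// Hand-written sample that only imitates what Word puts on the clipboard.
const pasted =
  "<!--[if gte mso 9]><xml>\n<w:WordDocument></w:WordDocument>\n</xml><![endif]-->" +
  '<p class="MsoNormal"><b><span style="color:red">Some text</span></b></p>' +
  "<!--EndFragment-->";

const cleaned = stripComments(pasted);
console.log(cleaned);
// → <p class="MsoNormal"><b><span style="color:red">Some text</span></b></p>
```

The lazy quantifier is the important part: with a greedy `*` the paragraph between the two comments would be removed as well.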
This is, obviously, still not clean enough - it contains MS Word formatting which is not valid CSS (or HTML markup in more complex cases).
Now, can we clean it further by using only regular expressions? Unfortunately, the answer is “no”. The MS Word styles can vary greatly and there is no way to make sure that we have accounted for all possible cases. On top of that, they differ between versions of MS Word.
What else do we do?
There is a simple array with all the CSS properties we want to keep, and we compare the actual style rules of our DOM nodes against its items. Only the properties present in the array are kept, which effectively removes all the invalid properties that MS Word gave us.
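The idea can be sketched in a few lines. This is not the Editor's actual implementation: it works on a plain style string instead of DOM nodes so that it runs outside a browser, and the propertiesToKeep contents and the filterStyle name are assumptions made up for the example:

```javascript
// Illustrative allowlist; the contents of the Editor's real array are not
// shown in this post, so these entries are an assumption.
const propertiesToKeep = ["color", "font-weight", "font-style", "text-decoration"];

// Hypothetical helper: keeps only the allowlisted declarations of a style string.
function filterStyle(styleText, allowed) {
  return styleText
    .split(";")
    .map((declaration) => declaration.trim())
    .filter((declaration) => {
      // The property name is everything before the first colon.
      const name = declaration.split(":")[0].trim().toLowerCase();
      return declaration.includes(":") && allowed.includes(name);
    })
    .join(";");
}

console.log(filterStyle("color:red;mso-ansi-language:EN-US", propertiesToKeep));
// → color:red
console.log(filterStyle("mso-bidi-font-weight:normal", propertiesToKeep));
// → (empty string, nothing survives)
```

Splitting on semicolons is a simplification that would break on values containing a semicolon, such as data URIs. Reading the declarations from the DOM node's style object, as described above, avoids that problem.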
This is far more reliable than relying on regular expressions alone. Its only downside is that it is a bit slower, but the trade-off is well worth it.
You can override this array on your page; its name is Telerik.Web.UI.Editor.Utils.cssPropertiesToKeep. This gives you detailed control over the formatting that will be kept:
Just place this line of code at the end of the form, just before the closing </form> tag, and you are set for the entire page! Easy as pie!
What I would like to show you is the new member of the enum it takes – MSWordNoMargins. Setting
is one of the best combinations and this is why it is the default value. It will give you clean HTML content that will preserve the original formatting from the MS Word document. Here is how our original example looks with it:
The good news is that this will work for complex content as well. Give it a shot – try it with a couple of bulleted lists, or colored text, or tables in the online demos. There is still work to be done, so let us know what you need the most – add a comment, post in the forums or open a private support ticket.
Marin Bratanov is a Principal Technical Support Engineer in the ASP.NET AJAX division. Ever since he joined Telerik in early 2011 as a novice, his main focus has been improving the services and customer care the company offers. Apart from work, Marin is an avid reader and usually enjoys the worlds of fantasy and Sci-Fi literature. You can find him on Twitter, Goodreads, LinkedIn and GooglePlus.