Convert Word to clean HTML

February 8, 2008

Despite the HTML filter option, Word documents saved as HTML are bloated with unnecessary code.

One product I recommend that produces squeaky-clean HTML from Word is WordCleaner (available from http://www.convertwordtohtml.com/ [was Zapadoo]; US$99; free trial version).

WordCleaner does lots of other stuff too, and you can set up your own templates to dictate what HTML to keep from Word, if any.

I was so impressed with my evaluation, I bought my own copy!

Update (29 February 2008): Version 4 has now been released. It cleans up RTF and TXT files too, and doesn’t require Word to be installed. It also supports Word 2007 formats.

Update (7 January 2010): You can convert a Word doc of up to 20 KB for free at http://textism.com/wordcleaner/. If your file is larger than 20 KB, you have to pay a small one-off fee; regular users can pay an annual subscription to use the service. I’ve only done some minor testing of this one: it reduced a 14 KB HTML file saved from a 30 KB Word document to a 4 KB file, and the code was very clean. Please note: Textism won’t directly convert a Word file. You have to save it as HTML first, then upload the HTML file.

Update (5 Sept 2013): Another free online Word cleaner is available at http://wordoff.org http://www.targetlocal.co.uk/wordoff. For this one, save the Word document as HTML first, open it in a text editor, copy/paste the code into Word Off, then click Clean It Up. This will strip out all the class attributes and span tags from Word’s bloated HTML. However, be careful: when I opened the HTML for a single-page Word document in a text editor (I tried Notepad and EditPlus), each line of text was on a new line. Word Off’s cleanup process strips out those hidden line breaks and thus joins some words together, so you’ll still have some cleaning up to do to separate them. Fortunately, their web interface puts red squiggly lines under the ‘spelling’ errors (in Firefox at least; I didn’t test on any other browsers), so you can identify these joins easily.
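If you'd rather see what this kind of cleanup involves, here is a minimal sketch in Python using only the standard library. It is not how Word Off or any of the tools above work internally; the class name `WordHtmlCleaner`, the function `clean_word_html`, and the choice of which attributes to drop are my own assumptions for illustration. It removes span tags and the class, style and lang attributes from a fragment of Word's HTML, and it turns line breaks inside text into single spaces rather than deleting them, so words don't get joined together.

```python
import re
from html import escape
from html.parser import HTMLParser


class WordHtmlCleaner(HTMLParser):
    """Hypothetical helper: drops <span> tags and presentational attributes."""

    DROP_TAGS = {"span"}                     # assumption: spans carry no meaning
    DROP_ATTRS = {"class", "style", "lang"}  # assumption: attributes to discard

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.DROP_TAGS:
            return
        kept = [(k, v) for k, v in attrs if k not in self.DROP_ATTRS]
        attr_text = "".join(
            f' {k}="{escape(v, quote=True)}"' if v is not None else f" {k}"
            for k, v in kept
        )
        self.parts.append(f"<{tag}{attr_text}>")

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are written out as a plain start tag.
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        if tag not in self.DROP_TAGS:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        # Collapse line breaks and runs of whitespace to one space,
        # instead of deleting them, so adjacent words stay separate.
        self.parts.append(escape(re.sub(r"\s+", " ", data), quote=False))

    # Comments (including Word's conditional comments) and the doctype are
    # dropped, because the default handlers for them do nothing.


def clean_word_html(html):
    """Return a cleaned copy of an HTML fragment saved from Word."""
    cleaner = WordHtmlCleaner()
    cleaner.feed(html)
    cleaner.close()
    return "".join(cleaner.parts)


sample = (
    '<p class="MsoNormal"><span lang="EN-AU" style="font-size:11pt">Clean\n'
    'HTML</span> from <b>Word</b></p>'
)
print(clean_word_html(sample))
```

This is meant for the body of the document only. A full page saved from Word also contains a large style block in the head, which this sketch does not remove.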

Update (11 April 2010): Back in January 2007, in an article in InformIT (http://www.informit.com/articles/article.aspx?p=691502), Ivan Pepelnjak described (in six pages!) how you can use WordProcessingML together with XSL transformations (XSLT) to generate clean, strict (X)HTML-compliant documents from Word sources. Overkill, surely? In my opinion, yes. But perhaps not if you have used non-standard styles for your headings, or want to include some of the document property information in the resulting web pages. Pages 1 and 2 of this article are straightforward enough, but by page 4, you’ll need to have a bit of a clue about code…

[This article was first published in the March 2006 CyberText Newsletter; price and links last checked September 2013]

One comment

  1. Thank you so much for this. I’ve been struggling to convert Word docs to HTML. The HTML filter option gives so much unnecessary code that it really drives me up the wall. Abhishek http://www.dibugs.com
