[Building Sakai] word to HTML

Adam Marshall adam.marshall at oucs.ox.ac.uk
Wed May 20 04:38:10 PDT 2009


I asked the same question about converting Word to HTML here at Oxford and
was deluged by good suggestions. Here they are head to toe.

==

I find Word 2007 pretty good. It's a bit verbose but functional. Choose the
unfiltered version. If you save as filtered HTML, there is still a lot of
grunge (custom CSS classes and in-line CSS), but there are no MS
extensions.

==

I've yet to find anything that does a better job than the venerable Office
2000 HTML Filter 2.0, which can still be obtained from:
http://www.microsoft.com/downloads/details.aspx?FamilyID=209ADBEE-3FBD-482C-83B0-96FB79B74DED&displaylang=EN
This still deals perfectly well with the verbose nonsense churned out by
Word 2007's 'Save As | Web page, filtered' option.

You can see how well it does by taking a peek at:
- the original Word document at http://denning.law.ox.ac.uk/temp/A_Test.docx
- the Word Save As | Web page, filtered version of it at
http://denning.law.ox.ac.uk/temp/A_Test(filtered).htm, and
- the final version as cleaned by the filter at
http://denning.law.ox.ac.uk/temp/A_Test(stripped).htm

The conspicuous failing here is that the bulleted and numbered lists aren't
proper <ul> or <ol>, but that's an error in the original export, not in the
filtering.

While the filter comes as a little standalone application, it can be run
from the command line. As I had to deal with so many of these conversions at
one point, I ended up adding it as a right-click item in Windows and calling
it 'Strip Word HTML'.

At the risk of teaching grandmothers to suck eggs, to add it as a context
menu item, do this:
Fire up an Explorer window
Do Tools | Folder Options...|File Types
Scroll to HTM (and/or HTML)
Click Advanced
Click New...
In the Action box, type 'Strip Word HTML' (or whatever)
In the 'Application used to perform action' box, type (including the quote
marks):
"C:\Program Files\Microsoft Office\Filters\filter.exe" -bcflmrst "%1"
(or similar, depending on the location of your downloaded filter and your
selection of switches*).
Check the Use DDE box
Under Application, type 'filter' (without the quote marks).
Hit OK
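
For anyone who would rather script the clean-up than right-click each file,
here is a rough Python sketch that runs the same command line over a whole
folder. The path and switches are the ones quoted above. The names
build_command and strip_folder are made up for illustration, and the sketch
assumes the filter behaves the same when launched this way as it does from
the context menu, so try it on copies of your files first.

```python
import subprocess
from pathlib import Path

# Path and switches are copied from the message above; adjust both to match
# where your copy of the filter lives and which switches you want.
FILTER_EXE = r"C:\Program Files\Microsoft Office\Filters\filter.exe"
SWITCHES = "-bcflmrst"


def build_command(html_file, exe=FILTER_EXE, switches=SWITCHES):
    """Return the argument list equivalent to: "<exe>" <switches> "<file>"."""
    return [exe, switches, str(html_file)]


def strip_folder(folder, run=subprocess.run):
    """Run the filter once for every .htm/.html file directly inside folder.

    Returns the list of files that were handed to the filter.
    """
    done = []
    for path in sorted(Path(folder).iterdir()):
        if path.suffix.lower() in (".htm", ".html"):
            # check=False: the filter's exit codes are not documented here,
            # so a non-zero status is not treated as an error.
            run(build_command(path), check=False)
            done.append(path)
    return done
```

Calling strip_folder(r"C:\temp\exports") would hand each exported page in
that (hypothetical) folder to the filter in turn.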


* The switches are listed on http://support.microsoft.com/kb/291325, a page
which discusses the installation of an additional DLL to allow Word to save
in this stripped form directly. I believe that this Add-in doesn't work in
versions of Word later than W2000, while of course the standalone filter
method described above is just working on the html, so is pretty much
indifferent to the source of the material.

==

What I do is use Word's own HTML export function, and then send it
through the Word HTML Cleaner at:
http://textism.com/wordcleaner/

==

Creating _decent_ markup from the MS version of XML can be quite difficult.
It is an awful format. Frankly, if it is a one-time conversion you need, then
you could just take it into OpenOffice, saveAs HTML, and then use tidy to
clean it up to XHTML.  But I recognise that isn't a good pipeline if you want
to continue editing in Word.
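
The tidy step could be scripted along these lines. This is only a sketch: it
assumes the HTML Tidy command-line program is installed and on the PATH as
"tidy", and the function names tidy_command and to_xhtml are invented for
the example. The OpenOffice export itself is left as a manual step.

```python
import subprocess


def tidy_command(src, dest):
    """Build an HTML Tidy command line that writes src as XHTML to dest."""
    # -asxhtml: convert to well-formed XHTML; -clean: replace presentational
    # markup with CSS; -quiet: suppress non-essential output; -o: output file.
    return ["tidy", "-quiet", "-asxhtml", "-clean", "-o", str(dest), str(src)]


def to_xhtml(src, dest):
    """Run tidy and report whether it finished without errors.

    Tidy exits with 0 when clean, 1 when there were warnings and 2 on errors,
    so only a status of 2 or more is treated as failure here.
    """
    result = subprocess.run(tidy_command(src, dest))
    return result.returncode < 2
```

For example, to_xhtml("export.html", "export.xhtml") would clean up a page
saved from OpenOffice under the (hypothetical) name export.html.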

Now if it was TEI XML you wanted to create, then you may be surprised to
find there has been work done on this here inside OUCS itself. Sebastian's
team created a Word -> TEI XML -> Word converter for the ISO, for use with
standards documents. (And of course you
can create HTML/PDF/etc. from your TEI document.) Additionally there are the
OpenOffice to TEI filters that we maintain as well, which allow you just to
SaveAs TEI XML.

[NB - we have written a TEI-XML to HTML translator (called VESTA,
http://wiki.tei-c.org/index.php/Vesta). Believe it or not the OUCS website
is authored in TEI-XML!]

==

I'd load the Word doc into OpenOffice and save as HTML from there, cos it's
slightly less icky.

==

AbiWord does quite a nice job of converting MS Word and RTF to HTML.
Although the output was clean HTML, I did want to remove a lot of the
repetitive in-line CSS. It can be run from the command line.
http://www.crazy-wormhole.com/AlinaMeridon/AbiWord/abidocs/howto/howtoexporthtml.html#templates
http://www.abisource.com/wiki/PluginMatrix#Import_Export_Filters
http://www.abisource.com/wiki/AbiCommand
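
Removing the repetitive in-line CSS afterwards could look something like the
Python sketch below. It is not part of AbiWord; the function name
strip_inline_styles is made up, and the regular expression is a blunt
instrument that assumes quoted style attributes and would also hit matching
text outside tags, so check the result by eye.

```python
import re

# Matches a style attribute with a double- or single-quoted value.
STYLE_ATTR = re.compile(r"""\s+style\s*=\s*(?:"[^"]*"|'[^']*')""", re.IGNORECASE)


def strip_inline_styles(html):
    """Remove every style="..." attribute from the given HTML text."""
    return STYLE_ATTR.sub("", html)


print(strip_inline_styles('<p style="margin:0">Hi</p>'))  # prints <p>Hi</p>
```

Other attributes such as class are left alone, so any stylesheet rules that
the export relies on keep working.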

==

Hope you find this useful

Adam 

| -----Original Message-----
| > | On May 18, 2009, at 5:14 AM, Adam Marshall wrote:
| > |
| > | > With all this talk of Word & FCK I was wondering if anybody
| > | > successfully auto-converts from Word to HTML using something like
| > | > Wimba (formerly Course Genie) or perhaps some XSLT transformation
| > | > on the new Office XML 'standards'?
| > | >
| > | > Pray do tell.
| > | >
| > | > Adam



