Blogging from Word 2007, crossing the chasm

The other day I wrote:

…as someone who is composing this blog entry as XHTML, in emacs, using a semantic CSS tag that will enable me to search for quotes by Mike Linksvayer and find the above fragment, I’m obviously all about metadata coexisting with human-readable HTML.

Operating in that mode for years has given me a deep understanding of how documents, and collections of documents, are also databases. It has led me to imagine and prototype a way of working with documents that’s deeply informed by that duality. But none of this is apparent to most people and, if it requires them to write semantic CSS tags in XHTML using emacs, it never will become apparent.
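To make the documents-as-databases idea concrete, here is a minimal sketch in Python that queries an XHTML fragment the way you might query a table. The markup conventions (a class of "quote" and a title attribute naming the speaker), the placeholder text, and the function name find_quotes are all assumptions made for illustration, not the markup this blog actually uses.

```python
import xml.etree.ElementTree as ET

# Hypothetical XHTML fragment: the class/title conventions are invented for this sketch.
XHTML = """<div>
  <p class="quote" title="Speaker A">First quoted passage.</p>
  <p>An ordinary paragraph.</p>
  <p class="quote" title="Speaker B">Second quoted passage.</p>
</div>"""

def find_quotes(xhtml, speaker):
    """Return the text of every element marked class="quote" and attributed to speaker."""
    root = ET.fromstring(xhtml)
    return [
        "".join(el.itertext()).strip()
        for el in root.iter()
        if el.get("class") == "quote" and el.get("title") == speaker
    ]

print(find_quotes(XHTML, "Speaker A"))
```

The same text serves both readers and queries: a browser renders the paragraphs, while the attributes act like columns you can filter on.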

So it’s time to cross the chasm and find out how to make these effects happen for people in editors that they actually use. Here’s how I’m writing this entry:

This is the display you get when you connect Word 2007 to a blog publishing system, in my case WordPress, and when you use the technique shown in this screencast to minimize the ribbon.

Here’s a summary of the tradeoffs between my homegrown approach and the Word-to-WordPress system I’m using here:

My homegrown approach

  Pros:
  • Can use any text editor
  • Source is inherently web-ready
  • Easy to create and use new semantic features
  • Low barrier to XML processing

  Cons:
  • Only for geeks

Word 2007

  Pros:
  • A powerful editor that anyone can use

  Cons:
  • Source is not inherently web-ready
  • Harder to create and use new semantic features
  • Higher barrier to XML processing

These are two extreme ends of a continuum, to be sure, but there aren’t many points in between. For example, I claim that if I substitute OpenOffice Writer for Word 2007 in the above chart, nothing changes. So I’m going to try to find a middle ground between the extremes.

To that end, I’m developing some Python code to help me wrangle Word’s default .docx format, which is a zip file containing the document in WordML and a bunch of other stuff. At the end of this entry you can see what I’ve got so far. I’m using this code to explore what kind of XML I can inject programmatically into a Word 2007 document, what kind comes back after a round trip through the application, how that XML relates to the HTML that gets published to WordPress, and which of these representations will be the canonical one that I’ll want to store and process.

So far my conclusion is that none of these representations will be the canonical one, and that I’ll need to find (or more likely create) a transform to and from the canonical representation where I’ll store and process all my stuff. We’ll see how it goes.
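One way to explore what comes back after a round trip is to compare two copies of a package part by part. The following is a minimal sketch, not the code used for this entry: the function names read_parts, diff_parts, and make_package are hypothetical, and the in-memory packages contain stand-in content rather than real WordML, so that the example runs without Word.

```python
import io
import zipfile

def read_parts(docx):
    """Map each part name in a .docx (a zip archive) to its raw bytes."""
    with zipfile.ZipFile(docx, "r") as arc:
        return {name: arc.read(name) for name in arc.namelist()}

def diff_parts(before, after):
    """Report which parts were added, removed, or changed between two packages."""
    a, b = read_parts(before), read_parts(after)
    return {
        "added": sorted(set(b) - set(a)),
        "removed": sorted(set(a) - set(b)),
        "changed": sorted(n for n in set(a) & set(b) if a[n] != b[n]),
    }

def make_package(parts):
    """Build an in-memory zip so the sketch runs without a real Word file."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as arc:
        for name, data in parts.items():
            arc.writestr(name, data)
    buf.seek(0)
    return buf

# Stand-in content, not real WordML.
injected = make_package({
    "word/document.xml": b"<doc>v1</doc>",
    "docProps/core.xml": b"<core/>",
})
round_tripped = make_package({
    "word/document.xml": b"<doc>v2</doc>",
    "docProps/core.xml": b"<core/>",
    "docProps/app.xml": b"<app/>",
})
print(diff_parts(injected, round_tripped))
```

With real files you would pass two .docx paths to diff_parts instead of the in-memory buffers, then inspect the changed parts by hand.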

Meanwhile here’s one immediately useful result. The tagDocx method shown below parallels the picture-tagging example I showed last week. Here, the truth is also in the file. When you use the Vista explorer to tag a Word 2007 file, the tag gets shoved into one of the XML subdocuments stored inside the document. But any application can read and write the tag. Watch.

Before:

Run this code:

$ python
>>> import wordxml
>>> wordxml.tagDocx('Blogging from Word2007.docx', 'word2007 blogging tagging')

 

After:

Here’s why this might matter to me. In my current workflow, I manage my blog entries in an XML database (really just a file). I extract the tags from that XML and inject them into del.icio.us. That enables great things to happen. I can explore my own stuff in a tag-oriented way. And I can exploit the social dimension of del.icio.us to see how my stuff relates to other people’s stuff.

But in del.icio.us the truth is not in the file, it’s in a database that asserts things about the file: its location on the web, its tags. If I revise my tag vocabulary in del.icio.us, the new vocabulary will be out of sync with what’s in my XML archive. So I have to do those revisions in my archive. I can, and I do, but it’s all programmatic work; there’s no user interface to assist me.
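The kind of programmatic revision described above might look like the following minimal sketch. The archive layout (an entries root, entry elements, and a space-separated tags element) and the function name rename_tag are assumptions for illustration; the real archive's schema is not shown in this entry.

```python
import xml.etree.ElementTree as ET

# Hypothetical archive layout, invented for this sketch.
ARCHIVE = """<entries>
  <entry><title>First</title><tags>word2007 blogging</tags></entry>
  <entry><title>Second</title><tags>tagging blogging</tags></entry>
</entries>"""

def rename_tag(archive_xml, old, new):
    """Replace one tag with another in every entry's space-separated tag list."""
    root = ET.fromstring(archive_xml)
    for tags_el in root.iter("tags"):
        tags = (tags_el.text or "").split()
        tags_el.text = " ".join(new if t == old else t for t in tags)
    return ET.tostring(root, encoding="unicode")

updated = rename_tag(ARCHIVE, "blogging", "weblogging")
print(updated)
```

Matching whole tags rather than doing a plain string replace keeps a rename of "blog" from also rewriting "blogging".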

What I’m discovering about Vista and the Office apps is that they offer a nice combination of programmatic and user interfaces for doing these kinds of things. This blog entry uses three photos, for example. It’s easy for me to assign them the same tags I’m assigning this entry. If I do, I can interactively search for both the entry and the photos in the Vista shell. And I can build an alternate interface that runs that same search on the web and correlates results to published blog entries.

That’s still not the endgame. At heart I’m a citizen of the cloud, and I don’t want any dependencies on local applications or local storage. Clearly Vista and Office entail such dependencies. But they can also cooperate with the cloud and, over time, will do so in deeper and more sophisticated ways. It’s my ambition to do everything I can to improve that cooperation.

Note: There will be formatting problems in this HTML rendering which, for now, painful though it is, I am not going to try to fix by hacking around in the WordPress editor. There are a lot of moving parts here: Word, WordPress, the editor embedded in WordPress (which itself has a raw mode, a visual mode, and a secret/advanced visual mode). I haven’t sorted all this out yet, and I’m not sure I can. (Formatting source code. Why is that always the toothache?)

Anyway, if you want to follow along, I’ve posted the original .docx version of this file here.

Here’s wordxml.py, which was imported in the above example. Note that this is CPython, not IronPython. That’s because I’m relying here on CPython’s zipfile module, which in turn relies on a compiled DLL.

import zipfile, re

 

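def readDocx(docx):
    # Read every part of the .docx zip archive into a dict of name -> bytes.
    inarc = zipfile.ZipFile(docx, 'r')
    parts = {}
    for name in inarc.namelist():
        parts[name] = inarc.read(name)
    inarc.close()  # release the file so the archive can be rewritten later
    print(list(parts.keys()))  # show the part names found in the package
    return parts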

 

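def readDocumentFromDocx(docx):
    # Extract the main WordML part and save a copy as document.xml.
    arc = zipfile.ZipFile(docx, 'r')
    s = arc.read('word/document.xml')
    arc.close()
    f = open('document.xml', 'wb')  # binary mode: the part is raw bytes
    f.write(s)
    f.close()
    return s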

 

def updateDocumentInDocx(docx, doc):
    # Rewrite the archive, replacing word/document.xml with doc.
    parts = readDocx(docx)
    archive = zipfile.ZipFile(docx, 'w')
    for name in parts.keys():
        if name == 'word/document.xml':
            parts[name] = doc
        archive.writestr(name, parts[name])
    archive.close()

 

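def tagDocx(docx, tags):
    # Replace the contents of <cp:keywords> in docProps/core.xml with tags.
    # The substitution only applies if the element already exists with
    # separate open and close tags.
    parts = readDocx(docx)
    archive = zipfile.ZipFile(docx, 'w')
    for name in parts.keys():
        if name == 'docProps/core.xml':
            xml = parts[name].decode('utf-8')
            xml = re.sub('<cp:keywords>(.*?)</cp:keywords>',
                         '<cp:keywords>%s</cp:keywords>' % tags, xml)
            parts[name] = xml.encode('utf-8')
        archive.writestr(name, parts[name])
    archive.close()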
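The claim that any application can read the tag can be checked with an XML parser instead of a regular expression. This is a minimal sketch rather than part of wordxml.py: the function name read_keywords is hypothetical, and the package built in memory is a stripped-down stand-in for a real docProps/core.xml so that the example runs without Word.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace that the cp: prefix is bound to in docProps/core.xml.
CP_NS = "http://schemas.openxmlformats.org/package/2006/metadata/core-properties"

def read_keywords(docx):
    """Return the text of <cp:keywords>, or an empty string if it is absent or empty."""
    with zipfile.ZipFile(docx, "r") as arc:
        root = ET.fromstring(arc.read("docProps/core.xml"))
    el = root.find("{%s}keywords" % CP_NS)
    return (el.text or "") if el is not None else ""

# Build a tiny stand-in package in memory (not a complete Word document).
core = ('<cp:coreProperties xmlns:cp="%s">'
        '<cp:keywords>word2007 blogging tagging</cp:keywords>'
        '</cp:coreProperties>' % CP_NS)
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as arc:
    arc.writestr("docProps/core.xml", core)

print(read_keywords(buf))
```

Looking the element up by namespace rather than by the literal cp: prefix means the read still works if another tool writes the same element with a different prefix.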

 

 


30 thoughts on “Blogging from Word 2007, crossing the chasm”

  1. Hi, Jon,
    As always, I love your stuff. As you present your marvels to us using WordPress’s stuff, have you managed to solve the huge problem they create by turning every quotation mark (single or double) into ‘smartquote’ (usually) or something even weirder (occasionally)? If you have not got a suggested solution to that (or a clever comeback, perhaps) there is no need to expend any time and effort on a reply.
    Thanks,
    Geof (Calgary, Canada)

  2. Makes me think of the question I was asking 3 – 4 years ago: “Where’s the FrontPage of XML?”

    Where’s the basic, GUI-based program that normal workers (95% of the workforce) can use to write and publish XML docs to the network?

    A big opportunity there. Although having Word produce XML docs is another way to go about that.

  3. MS has tried producing a go-between – Windows Live Writer (PC only at the mo)… mind you, I didn’t get on with it – or, rather, my blog didn’t. Then I nuked my laptop with a Vista upgrade and haven’t reinstalled it since. Have to confess, I use Expression (having just upgraded from FrontPage) to write blog posts offline… but I like to flip between lazy editing and HTML code

  4. @Detroit Hack – the Xml system you’re talking about is called InfoPath. It’s been around for a while now. The InfoPath team has their own blog with a ton of great information.

  5. “That’s still not the endgame. At heart I’m a citizen of the cloud, and I don’t want any dependencies on local applications or local storage. Clearly Vista and Office entail such dependencies. But they can also cooperate with the cloud and, over time, will do so in deeper and more sophisticated ways.”

    That’s what I’m afraid of! (I responded over at Scoble’s too).

    After being a part of the Borg for years I stopped using Office, then Windows a few years back. I made MILLIONS of dollars for MS. But more and more I found I couldn’t stand the all or nothing nature of most MS products. My next big sale into the Federal Government was a drawing tool (whose name I forget at the moment), but before I could get the approvals MS bought the company and practically buried the product. No phone support, no promised new features, watered down next release.

    I don’t want an Office Live that only works if you have Office. I run Linux dammit and I’m HAPPY with it. I’d not be opposed to trying Office Live. I tried Live.com and still prefer the Google offerings so far, but I give everything out there a fair shake. I’ve told several people to try the NEW Hotmail as it has features that Gmail doesn’t and is faster than Yahoo or AOL (but it would be nice if you could also POP your mail for free).

    Maybe you don’t realize how every post like this looks like yet another bait and switch move by MS. I even know “lifelong” Microsoft product users who are starting to feel a bit used by the constant bombardment of propaganda. I’m fairly sure this wasn’t your intent, but keep in mind that for many of us there is an immediate visceral reaction: “NEVER!” to suggestions that we abandon perfectly good special purpose tools for anything as bloated as Word.

  6. Jon: What about Windows Live Writer? Although it is definitely still a beta product, I’ve been very impressed with it. Its handling of (X)HTML is the first I’ve found that is acceptable. For example, if I write in Word and paste into the WordPress WYSIWYG editor, I get lots of crap along with it. If I cut from Word and paste to WLW, it gets rid of the crap; it’s actually quite nice.

    macbeach: To be fair to Jon, I think he explicitly wrote this to be about people who are not like you; the 90% of people who happily use Word and would rather stick a needle in their eye than have to learn what they would have to learn to run Linux. At the beginning of his post Jon states: “But none of this is apparent to most people and, if it requires them to write semantic CSS tags in XHTML using emacs, it never will become apparent.” Jon’s prior way of doing things *was* apparent to people like you and me, but not to most people as he properly states. So give him credit for his appropriate intentions, unless of course your goal actually was to hijack the post so you could rant… ;-)

  7. “I’m fairly sure this wasn’t your intent,”

    Correct. It was not.

    “but keep in mind that for many of us there is an immediate visceral reaction: “NEVER!” to suggestions that we abandon perfectly good special purpose tools for anything as bloated as Word.”

    I’m not suggesting that anyone abandon anything. If you examine my record you’ll see that I’ve long advocated and used (and sometimes created) lightweight special-purpose tools. I don’t flatter myself into thinking that I’ll have much (if any) influence on product development at Microsoft, but if I could, that’d be one major direction I’d like to see things go.

    Then there’s Office. I won’t pretend to have been an avid user of it in the past, because it’s clear from my record that I have not been. But it is what people in very great numbers do use, and will continue to use, for a long time to come. Now I want to understand more about what aspects of the product people use and don’t use, and why or why not.

    Meanwhile, Office isn’t standing still. While at InfoWorld I deeply explored the infusion of XML capabilities into the product. Jean Paoli is a man with a vision that I share. There are technical challenges that stand in the way but the larger challenge is making the marriage of documents and data work for millions of people in ways that make sense to them. I want to see that play out. And no, not to the exclusion of alternatives. Like Brian Jones[1] I’m “pro-XML formats in general.”

    1. http://blogs.msdn.com/brian_jones/archive/2007/02/15/you-re-either-with-us-or-you-re-against-us.aspx

  8. “What about Windows Live Writer.”

    I’ll be exploring that too. Until I have done so more thoroughly I can’t usefully comment, but I am interested to hear about other folks’ impressions.

    Tools for structured writing are a huge and long-standing passion of mine. There will never be a one-size-fits-all solution. My agenda is to promote structured writing in many different forms so that we can all enjoy the collective benefits that flow from it: structured search, more portable and more useful document/data hybrids.

  9. “The XML system you’re talking about is called InfoPath”

    Yes, it’s part of the story. I said often at InfoWorld, and still believe, that it’s a shame InfoPath was made available only in the enterprise version of Office rather than much more widely.

  10. Nice article!

    Now if only Word 2007 could connect to and blog on “new” Blogger.

  11. In cocoon, and presumably any other environment which can act as a webdav proxy, you can automatically snag XML content at the point of both storage and retrieval from both the web and the desktop. This is done through a generator in cocoon, e.g. “zip:file:mydocument.odt!/content.xml”, at the beginning of a pipeline that can optionally go through XSL contortions to become HTML or some other representation. My example, and only experience, has been with OpenOffice, but the zip archive file format is incredibly flexible for this kind of thing, as long as the underlying XML is understandable.

  12. @2 – Geoff, I use TextControl [http://dev.wp-plugins.org/wiki/TextControl] to avoid the special formatting such as the quotation changes you mention.

    Jon – I’ve been using Windows Live Writer for a while, and it is a great program. It really integrates nicely with WordPress and provides a nice WYSIWYG environment for composing posts. The code produced by it is pretty clean also.

    Cheers!
    Kirupa

  13. “In cocoon, and presumably any other environment which can act as a webdav proxy, you can automatically snag XML content at the point of both storage and retrieval from both the web and the desktop.”

    Hi Art,

    Can you elaborate on the desktop scenario?

    > the zip archive file format is incredibly flexible for this kind
    > of thing, as long as the underlying XML is understandable.

    I think I’m coming around to that point of view, though I initially had, and still have, concerns about the need for tools and/or APIs to crack open the ZIP being an impediment.

  14. The webdav connection would allow the conduit to be a drive or folder on the desktop. The process of saving or dragging a file to the webdav folder could be the point where the XML layer is utilized, and the hope would be that the end user would only have to deal with setting up a web folder or whatever is the most appropriate webdav option on the desktop. The zip handling would still have to be set up on the webdav side, but hopefully this could be done once for a group rather than requiring everyone to learn and install new tools. OpenOffice has a very clean XML syntax; this is where the XML underpinnings would start to be crucial.

  15. As a number of people have already suggested, you should really take a look at WLW. While I like the WYSIWYG editor, I also like the fact that when using it with blogging platforms like WordPress the application is “aware” of the extra attributes available, like Categories, and dynamically updates the WLW interface.

  16. “I blogged about some of the possibilities for Word as a way of producing HTML.”

    Peter,

    I’m going to quote from your entry if you don’t mind:

    “Last year, when Microsoft were talking about the new blogging feature in Word 2007 I tried, via Brian Jones’s blog to get a conversation started about styles in Word, and how a standard set of styles for common formatting in Word could really, really improve their HTML export. Not just for blogging but for ‘Save as web page…’ as well. I was frustrated that this blogging feature was seen as quite separate from general web export, that Word doesn’t come with a single template with a decent comprehensive HTML compatible stylesheet, and that nobody had the time to explain why Word has three different ways of expressing style on a list item (same problems over at Sun, re OpenOffice.org).

    Word could so easily ship with a ‘use HTML compatible styles’ mode, that turned off all the formatting buttons and had a nice clean interface that only did HTML exportable things, all using styles.”

    This is an excellent point.

  17. Where can I download the wordxml Python module? I could not find it on the web. Is it available only for the Windows version of Python? I’m using Ubuntu Linux.


  19. I need to understand how to post pictures using Word 2007. I want to use pictures that are captured and stored on my PC. The picture options box is requesting the following info:
    Picture Provider: I selected “My own server”
    Then it requests an Upload URL. I do not know what this is or where to find it.
    Then it says I need to provide a Source URL. I do not know what this is or where to find that.
    Could someone please help me figure this thing out?


  21. I’m searching the web for info on advanced usages of the blogging abilities of Word 2007 in conjunction with WordPress. Specifically, I’m having trouble with how Word produces PNGs from drawing canvases. My problem is that the PNGs seem to get a maximum width of 611 pixels. Charts and other graphics are created correctly and can be much wider.

    Do you have any ideas on how this could be solved, or where to look for more info?

    Regards


     

  23. HELP!! I have tried everything!! I’m still having problems posting my pictures from word. I use blogger and as far as I know they are my image provider. I don’t know what to put under the “upload” and “source” URL fields it asks for.

    thanks for your help.
