In last week’s item on social scripting, I suggested that CoScripter’s automation strategy — based on simple English instructions that people can easily read, write, and share — could in theory work across the continuum of application styles. And arguably it will need to, because we’re increasingly likely to mix those styles. If you begin to rely on an automation sequence for your bank’s web application, for example, you’ll be sorry to have it broken by an upgrade that introduces AJAX, Flash, or Silverlight components.
What enables CoScripter to work in the web domain is the document object model (DOM) of which every web page is a rendering. Because JavaScript code can explore and interact with the DOM’s tree of user-interface objects, the browser can be driven semantically, by object names and properties, rather than literally, by mouse clicks and keystrokes. The literal method is workable, and there are many tools that make excellent use of it. The semantic method is more reliable when it is available, but it isn’t always available. So the literal method winds up being the common denominator, because every style of application will respond to mouse clicks and keystrokes.
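To make the literal/semantic distinction concrete, here is a minimal sketch in Python. It is a toy model, not CoScripter's code or a real DOM: the `Node` class, the `find` and `node_at` helpers, and the sample pages are all hypothetical names invented for illustration. The point it tries to show is that a lookup by role and name survives a page redesign, while a lookup by recorded screen position does not.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A toy UI object (hypothetical; stands in for a DOM element)."""
    role: str                     # what the control is, e.g. "button"
    name: str = ""                # the label a person would read
    x: int = 0                    # screen position, used only by the literal approach
    y: int = 0
    children: list = field(default_factory=list)


def walk(node):
    """Yield the node and all of its descendants, depth first."""
    yield node
    for child in node.children:
        yield from walk(child)


def find(root, role, name):
    """Semantic lookup: locate a control by what it is and what it is called."""
    for node in walk(root):
        if node.role == role and node.name == name:
            return node
    return None


def node_at(root, x, y):
    """Literal lookup: whatever happens to sit at a recorded screen position."""
    for node in walk(root):
        if (node.x, node.y) == (x, y):
            return node
    return None


# The page as it looked when the script was recorded.
old_page = Node("page", "Bank", -1, -1, [
    Node("textbox", "Amount", 100, 50),
    Node("button", "Transfer", 100, 90),
])

# After a redesign, a banner occupies the button's old position.
new_page = Node("page", "Bank", -1, -1, [
    Node("banner", "New look!", 100, 90),
    Node("textbox", "Amount", 100, 130),
    Node("button", "Transfer", 100, 170),
])

literal_hit = node_at(new_page, 100, 90)               # recorded click position
semantic_hit = find(new_page, "button", "Transfer")    # recorded intent
```

Against `new_page`, the literal lookup lands on the banner while the semantic lookup still finds the Transfer button. A real tool would of course deal with far messier trees and ambiguous names; this only illustrates why the semantic method tends to be more reliable when it is available.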
There is another kind of semantic technique long supported by desktop applications that define object models, notably the Mac’s AppleScript object model and Windows’ Component Object Model. These technologies enable automation scripts to reach below the user interface of applications, and to work with their internal machinery.
Using the Word object model, for example, you can automate a mail merge. If you run this program, you’ll see Word launch, you’ll see a data document written by an invisible hand, and then you’ll see a mail merge appear. What you won’t see are the user-interface actions required to produce these effects, because this level of automation bypasses the user interface.
So let’s distinguish between two flavors of semantic automation. The mail merge script does what I’ll call engine-based semantic automation. And CoScripter does what I’ll call UI-based semantic automation.
These two flavors are useful in quite different ways. With the engine-based approach, an automation script uses the application as if it (the application) were a service. In this case you don’t want windows and dialog boxes popping up all over the place, you just want to feed inputs and harvest outputs. The engine-based approach works accurately and efficiently, but it doesn’t yield a representation of task knowledge that a normal person could use, learn from, adapt, or share.
With the UI-based approach, an automation script uses the application as if it (the script) were a human being. It sees and touches exactly what the human sees and touches. This is not the optimal way to crank out a thousand mailing labels. But the UI-based approach does yield a representation of task knowledge that a normal person could use, learn from, adapt, or share.
Shareable representations of task knowledge are incredibly useful and powerful. Screencasts are one such representation, and as many people have noticed in recent years, they can radically outperform traditional forms of documentation. But you can’t interact with a screencast or concisely describe it. You can only watch and learn and imitate. Although that’s way better than not being able to watch and learn and imitate, interaction and concise description would be better still.
CoScripter delivers that superior experience of interaction and concise description. It does so by means of UI-based semantic automation which, in turn, is enabled by the browser’s document object model.
What might enable a more comprehensive flavor of UI-based semantic automation? Noodling on this question I arrived at one possible answer: the Windows UI Automation API, which is part of .NET Framework 3.0. I’d heard of it, but hadn’t connected the dots. In this June 2005 article for the ACM’s Special Interest Group on Accessible Computing, Rob Haverty lays out the rationale for this relatively new mechanism:
Windows UI Automation unifies disparate UI Frameworks such as Avalon [Windows Presentation Foundation], Trident [the browser], and Win32 so that code can be written against one API rather than several.
The basis of this unification is a tree of automation elements that is, in effect, a generic document object model. Automation providers map various specific object models, notably those of the browser and of Windows, into the generic tree. The API provides mechanisms for searching the tree and interacting with its elements.
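The shape of that idea can be sketched in a few lines of Python. This is not the System.Windows.Automation API: the `AutomationElement` class here, the two provider functions, and the sample data are hypothetical stand-ins, and the tag and window-class mappings are chosen only for illustration. The sketch shows two providers translating different native object models into one generic tree, which a single search function can then query.

```python
from dataclasses import dataclass, field


@dataclass
class AutomationElement:
    """A node in the generic tree (toy model, not the real .NET class)."""
    control_type: str
    name: str
    children: list = field(default_factory=list)


def from_dom(dom):
    """Hypothetical provider: map a DOM-like dict into the generic tree."""
    tag_to_type = {"body": "Document", "button": "Button", "input": "Edit"}
    return AutomationElement(
        tag_to_type.get(dom["tag"], "Custom"),
        dom.get("label", ""),
        [from_dom(child) for child in dom.get("children", [])],
    )


def from_win32(window):
    """Hypothetical provider: map (window_class, caption, children) tuples."""
    class_to_type = {"#32770": "Window", "BUTTON": "Button", "EDIT": "Edit"}
    window_class, caption, children = window
    return AutomationElement(
        class_to_type.get(window_class, "Custom"),
        caption,
        [from_win32(child) for child in children],
    )


def find_all(root, control_type):
    """Search the generic tree without caring which provider built each node."""
    found = [root] if root.control_type == control_type else []
    for child in root.children:
        found.extend(find_all(child, control_type))
    return found


# Two very different native representations (illustrative data only).
web_page = {"tag": "body", "children": [{"tag": "button", "label": "Sign in"}]}
dialog = ("#32770", "Save As", [("EDIT", "File name", []), ("BUTTON", "Save", [])])

# One tree, one search, regardless of where each element came from.
desktop = AutomationElement("Desktop", "root", [from_dom(web_page), from_win32(dialog)])
buttons = [element.name for element in find_all(desktop, "Button")]
```

Here `buttons` holds the names of both the web button and the native dialog button, found by the same call. The real API adds much more (patterns for invoking controls, events, conditions for searching), which is presumably where the fiddliness comes from.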
It’s a powerful system that is also accurately described by John Robbins as “intensively fiddly.” So in this March 2007 MSDN article, he provides and illustrates the use of a set of convenience wrappers around the raw System.Windows.Automation classes. The sample program included with that article drives Notepad through a few basic operations. Could it be extended in the direction of CoScripter, in a way that realizes UI Automation’s ambition to uniformly control Windows and web applications?
I took a crack at that, and concluded that creating even a proof-of-concept will require more time and more programming chops than I can muster. But I’d be interested to hear from anyone who’s gone further down that path. I think this is potentially a very big deal. Although I suspect most programmers see UI Automation in the context of software testing, for which it is indeed well suited, Rob Haverty’s article suggests that it was primarily motivated by the need for better assistive technologies and improved accessibility.
When Tessa Lau says that accessibility guidelines are the lifeblood of CoScripter, she’s talking about affordances for people who cannot otherwise use the full capability of their software. But consider Rob Haverty’s definition of accessible technology:
Accessible technology enables individuals to adjust their computers to meet their visual, hearing, dexterity, cognitive, and speech needs.
I like his use of the word cognitive because in some sense we are all cognitively impaired when we try to use software. For most people, most of the time, the concept count is way too high. We don’t normally think of automation as an assistive technology. But arguably it is one. And when automation yields interactive documentation that lives in shared information spaces, it becomes a really potent assistive technology.
In case it’s not obvious, I am not claiming that Windows UI Automation can realize this vision of assistive automation across the spectrum of application types. It’s currently only available by default for Vista, and optionally for Windows XP if enhanced with the .NET Framework 3.0. It is not part of Silverlight or Moonlight, though conceivably one day it might be. And it clearly has nothing to do with Mac OS X, or Java, or Flash, or the Linux desktop.
But the idea of UI-based semantic automation is something that could apply in all these domains. A proof-of-concept CoScripter-like application-plus-service spanning two major domains — Windows desktop apps and browser-based apps running on Windows — would be a big step toward that broader vision.
I still haven’t experimented with CoScripter but …
surely Windows Automation is conceptually just like a Unix batch file except with an unnecessary screencast to keep you informed/entertained/drain your resources. In that scenario the UI reveals itself as just extra baggage – indicating the folly of obliging an OS to always run a GUI when it starts up – something you’d rather avoid in clustering environments – where you can’t use Windows without opening … erm … any windows!
Also I doubt a merge between OS and browser scripting will produce dividends since it will mess with the user’s perceptions of security – ActiveX and all that. In fact, knowing that CoScripter runs in FF, which isn’t integrated into the OS, gives me some confidence to try it out.
“surely Windows Automation is conceptually just like a Unix batch file”
What I’m calling engine-based automation is akin to the Unix way of connecting software components in a pipeline.
What I’m calling UI-based automation is a different beast. It’s a way of working with the machinery that people use to directly manipulate software components.