Screencasting and scripting

I was chatting the other day with Jim Hugunin about an earlier posting on automation and accessibility, and Jim highlighted a point that’s worth calling out separately. If you had a script that could drive an application through all of the things shown in a screencast, you wouldn’t need the screencast. The script would not only embody the knowledge contained passively in the screencast, but would also activate that knowledge, combining task demonstration with task performance.

Of course this isn’t an either/or kind of thing. There would still be reasons to want a screencast too. As James MacLennon pointed out yesterday:

All too often, the classic on-the-job training technique has been “just follow Jim around, and do what he does for the next three weeks …”. This kind of unstructured training doesn’t lend itself easily to written documentation – it’s the nature of the process as well as the nature of the people. Video, however, allows us to simulate this “follow him around” approach.

Citing Chris Gemignani’s Excel recreation of a New York Times graphic, James says:

This kind of approach clicked with me, because this was my preferred method for learning a new programming environment. If I could just get an experienced programmer to take me through the edit / compile / debug / build cycle, I would be off and running.

So you’d really want both the screencast and the script — for extra credit, synchronized to work together.

What stands in the way of doing this? Don’t the Office applications, for example, already have the ability to record scripts? Yes, they do, but that flavor of scripting targets what I called engine-based rather than UI-based automation. Try this: Launch Word, turn on macro recording, and then perform the following sequence of actions:

  1. Mailings
  2. Recipients
  3. Type New List

Now switch off the recorder and look at your script. It’s empty, because you haven’t yet done anything with the engine that’s exposed by Word’s automation interface; you’ve only interacted with the user interface in preparation for doing something with the engine.

It would be really useful to be able to capture and replay that interaction. And in fact, I’ve written a little IronPython script that does replay it, using the UI Automation mechanism I discussed in the earlier posting. It’s not yet even really a proof of concept, but it does contain three lines of code that correspond exactly to the above sequence. Each line animates the corresponding piece of Word’s user interface. So when you run the script, the Mailings ribbon is activated, then the Recipients button is highlighted and selected, and then the Type New List menu choice appears and is selected.
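
To make the shape of such a script concrete, here is a minimal sketch in plain Python. It is not the IronPython script described above, and it does not call the real UI Automation API. `Element`, `FakeUI`, `build_word_ui`, and `replay` are hypothetical names, and the in-memory tree is an assumed stand-in for Word's accessibility tree. What it shows is the idea that each step names a UI object rather than a screen coordinate or keystroke.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Element:
    """A named UI element (hypothetical stand-in for an accessibility-tree node)."""
    name: str
    children: Dict[str, "Element"] = field(default_factory=dict)

    def add(self, name: str) -> "Element":
        child = Element(name)
        self.children[name] = child
        return child


class FakeUI:
    """In-memory driver (an assumption for illustration, not a real backend)."""

    def __init__(self, root: Element) -> None:
        self.current = root
        self.log: List[str] = []

    def invoke(self, name: str) -> None:
        # Find the element by name among what the UI currently exposes.
        if name not in self.current.children:
            raise LookupError(f"no element named {name!r} under {self.current.name!r}")
        self.current = self.current.children[name]
        self.log.append(name)


def build_word_ui() -> Element:
    """Build only the part of the tree that the three steps touch."""
    word = Element("Word")
    word.add("Mailings").add("Recipients").add("Type New List")
    return word


def replay(ui: FakeUI, steps: List[str]) -> List[str]:
    """Invoke each named step in order and return the list of steps performed."""
    for step in steps:
        ui.invoke(step)
    return ui.log


# One entry per UI action, matching the sequence listed earlier.
STEPS = ["Mailings", "Recipients", "Type New List"]

if __name__ == "__main__":
    print(replay(FakeUI(build_word_ui()), STEPS))
```

Because the steps are names rather than pixel positions, a step that no longer matches the interface fails with a clear lookup error instead of clicking the wrong thing. A real driver would replace `FakeUI.invoke` with calls into the platform's accessibility layer.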

What I’m envisioning here is UI-based semantic automation. I call it UI-based to distinguish it from the engine-based approach that bypasses the user interface. I call it semantic because it deals with named objects in addition to keystrokes and mouseclicks. Is this even possible? I think so, but so far I’ve only scratched the surface. Deeper in there be dragons, some of which John Robbins contends with in the article I cited. I’d be curious to know who else has fought those dragons and what lessons have been learned.


8 thoughts on “Screencasting and scripting”

  1. Jon,

    It sounds a little like you’re looking for something like IBM’s Rational Robot or even Mercury’s WinRunner. From the different aspect of testing, they need to try to solve a similar problem of recording/playing back direct UI-based interaction for the development/testing cycle.

    Maybe a light-weight version of those kinds of tools will emerge? Goodness knows the Behavior/Test Driven and Unit Testing styles would benefit from better tools like this on the development side…

  2. A product called Epiplex from a company called Epiance has been able to do what you are suggesting for years. They used to have full UI-based automation that they essentially promoted as a cross-application macro tool but found it to be fragile enough (e.g., it tends to break when screens are updated) that they appear to have stepped back from that particular application of the technology. (It’s been a while since I’ve had contact with the company, so I’m going by what they have on their web site.) However, the product still does support interactive cue cards that walk a person through the steps of a process and monitor their progress using the same basic UI automation technology. The guy who was the real genius behind the product design, Gary Dickelman, is no longer with Epiance but runs his own company, EPSSCentral. You’ll want to talk to him; he knows more about what’s been done with this stuff than anyone.

    In the same vein as Epiplex, if you’re content to work only with HTML-based UIs, there is ActiveGuide by Rocketools.

    At any rate, the idea that you are describing is good, and it works.

  3. “”

    Actually CoScripter is where this whole discussion started, here:

    and before that here:

    The follow-on question is: How to achieve these effects across application styles (desktop / RIA / browser)?

    “IBM’s Rational Robot or even Mercury’s WinRunner”

    Are those tools semantic in the sense I mean here, and not just mouse-click and keystroke based?

  4. To take matters a little off your intended target, I was struck by Jim’s thought: “If you had a script that could drive an application through all of the things shown in a screencast, you wouldn’t need the screencast.”. A very programmery thought, indeed. Theoretically, the two are semantically identical and can be substituted with no loss of generality, etc., etc.

    I think a lot of the value in an expert’s screencast is capturing missteps and grumblings. What the expert has, and the novice doesn’t, is a sense of the limitations of the tool. Conventional training covers the positive space of what a tool can accomplish, without illuminating the negative space. Little places where you get stuck and grumble–e.g. my bitching about font handling on Windows–may convey as much meaning to the audience as the smooth parts.

    I can convey this meaning with my voice. Maybe these semantic tools need to be capturing my blood pressure in real time? ;-)

  5. “I think a lot of the value in an expert’s screencast is capturing missteps and grumblings. What the expert has, and the novice doesn’t, is a sense of the limitations of the tool.”

    That’s a great point. I would have edited out some of the missteps and grumblings that you left in your NY Times screencast. But it’s clearly better with them left in, for exactly the reason you say.

    OTOH the 11 minutes I extracted from my 50-minute session with the Resolver folks, which elides missteps as well as less-than-compelling feature demonstrations, is exactly right for something that’s so new and experimental. There will be plenty of time later to illuminate the negative space, but for now the positive space — which is large and uncharted — wants to be lit up.

    “Maybe these semantic tools need to be capturing my blood pressure in real time?”

    Dunno about that but it would be, no joke, a terrific enhancement for tools that seek to capture and evaluate users’ experiences with software, e.g.:
