From PDF to PWP: A vision for compound web documents

I’ve been in the web publishing game since it began, and for all this time I’ve struggled to make peace with the refusal of the Portable Document Format (PDF) to wither and die. Why, in a world of born-digital documents mostly created and displayed on computers and rarely printed, would we cling to a format designed to emulate sheets of paper bound into books?

For those of us who labor to extract and repurpose the contents of PDF files, it’s a nightmare. You can get the text out of a PDF file but you can’t easily reconstruct the linear stream that went in. That problem is worse for tabular data. For web publishers, it’s a best practice to separate content assets (text, lists, tables, images) from presentation (typography, layout) so the assets can be recombined for different purposes and reused in a range of formats: print, screens of all sizes. PDF authoring tools could, in theory, enable some of that separation, but in practice they don’t. Even if they did, it probably wouldn’t matter much.

Consider a Word document. Here the tools for achieving separation are readily available. If you want to set the size of a heading you don’t have to do it concretely, by setting it directly. Instead you can do it abstractly, by defining a class of heading, setting properties on the class, and assigning the class to your heading. This makes perfect sense to programmers and zero sense to almost everyone else. Templates help. But when people need to color outside the lines, it’s most natural to do so concretely (by adjusting individual elements) not abstractly (by defining and using classes).

It is arguably a failure of software design that our writing tools don’t notice repetition of concrete patterns and guide us to corresponding abstractions. That’s true for pre-web tools like Word. It’s equally true for web tools — like Google Docs — that ape their ancestors. Let’s play this idea out. What if, under the covers, the tools made a clean separation of layout and typography (defined in a style sheet) from text, images, and data (stored in a repository)? Great! Now you can restyle your document, and print it or display it on any device. And you can share with others who work with you on any of their devices.
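
To make that separation concrete, here is a minimal sketch in Python. It is not how Word or Google Docs work internally; the role names, the two style sheets, and the `render` function are all hypothetical, chosen only to show content tagged with abstract roles being combined with interchangeable styles.

```python
from html import escape

# Content: each element names a role (an abstract class), not concrete formatting.
CONTENT = [
    ("heading", "Quarterly report"),
    ("body", "Sales rose in the third quarter."),
]

# Two hypothetical style sheets; properties are set once per role.
SCREEN_STYLE = {"heading": "font-size: 24px", "body": "font-size: 14px"}
PRINT_STYLE = {"heading": "font-size: 18pt", "body": "font-size: 11pt"}

# Assumed mapping from role to HTML tag.
TAGS = {"heading": "h1", "body": "p"}


def render(content, style):
    """Combine role-tagged content with a style sheet into HTML."""
    parts = []
    for role, text in content:
        tag = TAGS[role]
        parts.append(f'<{tag} style="{style[role]}">{escape(text)}</{tag}>')
    return "\n".join(parts)


print(render(CONTENT, SCREEN_STYLE))
print(render(CONTENT, PRINT_STYLE))
```

The content list never changes between the two calls. Only the style sheet does, which is the whole point of defining properties on a class rather than on each element.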

What does sharing mean, though? It gets complicated. The statements “I’ll send you the document” or “I’ll share the document with you” can sometimes mean: “Here is a link to the document.” But they can also mean: “Here is a copy of the document.” The former is cognitively unnatural for the same reason that defining abstract styles is. We tend to think concretely. We want to manipulate things in the digital world directly. Although we’re learning to appreciate how the link enables collaboration and guarantees we see the same version, sending or sharing a copy (which affords neither advantage) feels more concrete and therefore more natural than sending or sharing a link.

Psychology notwithstanding, we can’t (yet) be sure that the recipient of a document we send or share will be able to use it online. So, often, sending or sharing can’t just mean transferring a link. It has to mean transferring a copy. The sender attaches the copy to a message, or makes the copy available to the recipient for download.

That’s where the PDF file shines. It bundles a set of assets into a single compound document. You can’t recombine or repurpose those assets easily, if at all. But transfer is a simple transaction. The sender does nothing extra to bundle it for transmission, and the recipient does nothing extra to unbundle it for use.

I’ve been thinking about this as I observe my own use of Google Docs. Nowadays I create lots of them. My web publishing instincts tell me to create sets of reusable assets and then link them together. Instead, though, I find myself making bigger and bigger Google Docs. One huge driver of this behavior has been the ability to take screenshots, crop them, and copy/paste them into a doc. It’s massively more efficient than the corresponding workflow in, say, WordPress, where the process entails saving a file, uploading to the Media Library, and then sourcing the image from there.

Another driver has been the Google Docs table of contents feature. I have a 100-page Google Doc that’s pushing the limits of the system and really ought to be a set of interlinked files. But the workflow for that is also a pain: capture the link to A, insert it into B, capture the link to B, insert it into A. I’ve come to see the table of contents feature — which builds the TOC as a set of links derived from doc headings — as a link automation tool.
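
The idea of a TOC as link automation can be sketched in a few lines of Python. This is an illustration only, not Google Docs' mechanism: the `slugify` and `build_toc` functions, the readable anchor format, and the example URL are all assumptions.

```python
import re


def slugify(title):
    """Turn a heading title into a URL-fragment-friendly anchor."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def build_toc(headings, doc_url):
    """Return indented TOC lines, one link per (level, title) heading."""
    lines = []
    for level, title in headings:
        indent = "  " * (level - 1)  # nest sub-headings under their parents
        lines.append(f"{indent}{title} -> {doc_url}#{slugify(title)}")
    return lines


# Hypothetical headings and document URL.
headings = [(1, "Intro"), (2, "Why PDF persists")]
for line in build_toc(headings, "https://example.com/doc"):
    print(line)
```

Every heading yields a link without anyone capturing and pasting a URL by hand, and the indentation carries the outline structure along with it.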

As the Google Drive at work accumulates more stuff, I’m finding it harder to find and assemble bits and pieces scattered everywhere. It’s more productive to work with fewer but larger documents that bundle many bits and pieces together. If I send you a link to a section called out in the TOC, it’s as if I sent you a link to an individual document. But you land in a context that enables you to find related stuff by scanning the TOC. That can be a more reliable method of discovery, for you, than searching the whole Google Drive.

Can’t I just keep an inventory of assets in a folder and point you to the folder? Yes, but I’ve tried, and it feels way less effective. I think there are two reasons why. First, there’s the overhead of creating and naming the assets. Second, the TOC conveys outline structure that the folder listing doesn’t.

This method is woefully imperfect for all kinds of reasons. A 100-page Google Doc is an unwieldy construct. Anonymous assets can’t be found by search. Links to headings lack human-readable information. And yet it’s effective because, I am coming to realize, there’s an ancient and powerful technology at work here. When I create a Google Doc in this way I am creating something like a book.

This may explain why the seeming immortality of the PDF format is less crazy than I have presumed. Even so, I’m still not ready to ante up for Acrobat Pro. I don’t know exactly what a book that’s born digital and read on devices ought to be. I do know a PDF file isn’t the right answer. Nor is a website delivered as a zip file. We need a thing with properties of both.

I think a W3C Working Draft entitled Portable Web Publications for the Open Web Platform (PWP) points in the right direction. Here’s the manifesto:

Our vision for Portable Web Publications is to define a class of documents on the Web that would be part of the Digital Publishing ecosystem but would also be fully native citizens of the Open Web Platform.

PWP usefully blurs distinctions along two axes: local versus remote, and packed versus unpacked.

That’s exactly what’s needed to achieve the goal. We want compound documents to be able to travel as packed bundles. We want to address their parts individually. And we want both modes available to us regardless of whether the documents are local or remote.
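
Here is a rough Python sketch of what "both modes" could mean for a reader of such a document. It is not drawn from the PWP draft: the zip archive is only a convenient stand-in for a packed state, and `pack`, `read_part`, and the part paths are hypothetical.

```python
import io
import zipfile


def pack(parts):
    """Bundle a {path: bytes} mapping into a single in-memory archive."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for path, data in parts.items():
            zf.writestr(path, data)
    return buf.getvalue()


def read_part(publication, path):
    """Fetch one part by path, whether the publication is packed or not."""
    if isinstance(publication, bytes):  # packed: a single bundle
        with zipfile.ZipFile(io.BytesIO(publication)) as zf:
            return zf.read(path)
    return publication[path]  # unpacked: individually stored parts


# Hypothetical publication with two parts.
parts = {"chapter1.html": b"<h1>One</h1>", "img/fig1.png": b"\x89PNG"}
bundle = pack(parts)

# The same address works in either state.
print(read_part(parts, "chapter1.html") == read_part(bundle, "chapter1.html"))
```

The caller asks for `chapter1.html` the same way in both cases. Extending the same idea to local versus remote would mean resolving that path against a file or a URL, which this sketch leaves out.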

Because a PWP will be made from an inventory of managed assets, it will require professional tooling that’s beyond the scope of Google Docs or Word Online. Today it’s mainly commercial publishers who create such tools and use them to take apart and reconstruct the documents — typically still Word files — sent to them by authors. But web-native authoring tools are emerging, notably in scientific publishing. It’s not a stretch to imagine such tools empowering authors to create publication-ready books in PWP. It’s more of a stretch to imagine successors to Google Docs and Word Online making that possible for those of us who create book-like business documents. But we can dream.

4 thoughts on “From PDF to PWP: A vision for compound web documents”

  1. This:

    “Can’t I just keep an inventory of assets in a folder and point you to the folder?”

    You are exploring the idea of doing that for a single “document”. A related thing I have noticed, in corporate use of Google Drive, is a sizable faction of folks who seek to organize all the things into folders. My theory is that their mental immune systems have failed to eradicate the Microsoft Exchange public folders meme.

    To be fair, there is an important tradition of trying to organize all the things (into folders) on the web. It was very popular from maybe 1998 to as recently as 2006. See DMOZ, Yahoo! Directory. Perhaps those were the result of the same meme. Regardless, Google’s organic, automated, searchability won out. Statistically no one uses directories any more on the web.

    Got 50 employee profile Docs? Maybe you should put them in the “Employee Profiles” folder. But wait, you want to see profiles for each of your “Groups” too. Well you could make an org-chart hierarchy (of folders) and put the profile Docs in the leaf folders. Oh but we’d also like to be able to look up employees by role e.g. “Development Manager” or “Account Representative”.

    This can be done. Though only a small fraction of users know about it, Google Drive supports heterarchy (a doc can be in more than one folder via option-dragging), so you are free to place docs into multiple taxonomies. It seems to me, though, that something is lost. Placing the docs in the heterarchy does not inform search. The heterarchy is only in play during navigation.

    In the corporate setting, and specifically with Google Drive, I have been advocating searchability over folders. Google Docs ToC is a great tool. Also Docs linking to other Docs is great. You can make source docs that serve the purpose of folders if you want. Isn’t it better to have a doc called “Employee Profiles” and another called “FORTRAN Programmers” and another called “Widget Product Development Group”?

    Whether docs or folders are used for organization, the maintenance effort is about the same. But using docs instead of folders gains you intersectional queries. If Sally is a FORTRAN programmer in Widget product development, a search for “Sally” will turn that up. I suppose you could get there with the folder approach but you’d have to do it through a combination of searching (search box) and navigation (clicking on folders). But there is power and speed in doing it all through the search box. That’s why Google has never strayed from their single-field UI and that’s why all web browsers have now put search into their address bars.

  2. “heterarchy is only in play during navigation” Ideally it aids navigation of search results but that depends, as you say, on a little-known feature that has always been cognitively unnatural: a thing in more than one folder. Which is really isomorphic to assigning tags, a thing interestingly absent from Google Docs (maybe for that reason).

  3. multi-taxonomy structures do work – but be prepared to put on your librarian boots first – i depend on UDC to inform my hierarchy building.

    in a real sense, the objects (docs, sitting in root) are already in multiple ‘folders’; those being the text (s) (fragments) themselves. any phrase/title/heading (sufficiently discrete concept) provides access by way of in-built indexing through mechanisms such as faceted search like

    or, you may know of Tabbles: called by (at least) one wag, “what WinFS should have been” –
