While updating my home page today, I noticed that the page listing my InfoWorld articles had become a graveyard of broken links. The stuff is all still there, but at some point the site switched to another content management system without redirecting old URLs. This happens to me from time to time. It's always annoying. In some cases I've moved archives to my own personal web space. But I prefer to keep them alive in their original contexts, if possible. This time around, I came up with a quick and easy way to do that. I'll describe it here because it illustrates a few simple and effective strategies.
My listing page looks like this:
<p><a href="http://www.infoworld.com/article/06/11/15/47OPstrategic_1.html">XQuery and the power of learning by example | Column | 2006-11-15</a></p>
<p><a href="http://www.infoworld.com/article/06/11/08/46OPstrategic_1.html">Web apps, just give me the data | Column | 2006-11-08</a></p>
It’s easy to see the underlying pattern:
LINK | CATEGORY | DATE
When I left InfoWorld I searched the site for everything I’d written there and made a list, in the HTML format shown above, that conformed to the pattern. Today I needed to alter all the URLs in that list. My initial plan was to search for each title using this pattern:
site:infoworld.com “jon udell” “TITLE”
For example, try this in Google or Bing:
site:infoworld.com “jon udell” “xquery and the power of learning by example”
Either way, you bypass the now-broken original URL (http://www.infoworld.com/article/06/11/15/47OPstrategic_1.html) and are led to the current one (http://www.infoworld.com/article/2660595/application-development/xquery-and-the-power-of-learning-by-example.html).
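(For the curious, here is roughly what that query looks like when assembled into a URL. A Python sketch just for illustration; the function and its defaults are mine, not part of the original workflow.)

from urllib.parse import quote_plus

def search_url(title, site="infoworld.com", author="jon udell"):
    # Build the query: site:infoworld.com "jon udell" "TITLE"
    query = f'site:{site} "{author}" "{title}"'
    # quote_plus percent-encodes the colon (%3A) and the quotes (%22)
    # and turns spaces into plus signs
    return "https://www.bing.com/search?q=" + quote_plus(query)

print(search_url("xquery and the power of learning by example"))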
The plan was then to write a script that would robotically perform those searches and extract the current URL from each result. But life’s short, I’m lazy, and I realized a couple of things. First, the desired result is usually but not always first, so the script would need to deal with that. Second, what if the URLs change yet again?
That led to an interesting conclusion: the search URLs themselves are good enough for my purposes. I just needed to transform the page of links to broken URLs into a page of links to title searches constrained to infoworld.com and my name. So that's what I did. It works nicely, and the page is future-proofed against the next round of URL breakage.
I could have written code to do that transformation, but I'd rather not. Also, contrary to popular belief, I don't think everyone can or should learn to write code. There are other ways to accomplish a task like this, ways that are easier for me and — more importantly — accessible to non-programmers. I alluded to one of them in A web of agreements and disagreements, which shows how to translate from one wiki format to another just by recording and using a macro in a text editing program. I used that same strategy in this case.
Of course recording a macro is a kind of coding. It’s tricky to get it to do what you intend. So here’s a related strategy: divide a complex transformation into a series of simpler steps. Here are the steps I used to fix the listing page.
Step 1: Remove the old URLs
The old URLs are useless clutter at this point, so just get rid of them.
old: <p><a href="http://www.infoworld.com/article/06/11/15/47OPstrategic_1.html">XQuery and the power of learning by example | Column | 2006-11-15</a></p>
new: <p><a href="">XQuery and the power of learning by example | Column | 2006-11-15</a></p>
how: Search for href=", mark the spot, search for ">, delete the highlighted selection between the two search targets, go to the next line.
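(If you did want to script it, step 1 is roughly this one-line substitution. A Python sketch, not what I actually did; the sample line is from the listing above.)

import re

line = ('<p><a href="http://www.infoworld.com/article/06/11/15/'
        '47OPstrategic_1.html">XQuery and the power of learning by example'
        ' | Column | 2006-11-15</a></p>')

# Empty out whatever sits between href=" and the closing quote
line = re.sub(r'href="[^"]*"', 'href=""', line)
print(line)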
Step 2: Add query templates
We've already seen the pattern we need: site:infoworld.com "jon udell" "TITLE". Now we'll replace the empty URLs with URLs that include the pattern. To create the template, search Google or Bing for the pattern. (I used Bing but you can use Google the same way.) You'll see some funny things in the URLs they produce, things like %3A and %22. These are alternate ways of representing the colon and the double quote. They make things harder to read, but you need them to preserve the integrity of the URL. Copy this URL from the browser's location window to the clipboard.
old: <p><a href="">XQuery and the power of learning by example | Column | 2006-11-15</a></p>
new: <p><a href="http://www.bing.com/search?q=site%3Ainfoworld.com+%22jon+udell%22+%22%5BTITLE%5D%22">XQuery and the power of learning by example | Column | 2006-11-15</a></p>
how: Copy the template URL to the clipboard. Then for each line, search for href="", put the cursor after the first double quote, paste, and go to the next line.
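(The scripted equivalent of step 2, again only a sketch. I show the placeholder as a literal [TITLE] here; in the pasted template the brackets may appear percent-encoded as %5B and %5D, and either form works as a marker.)

# Bing query template copied from the browser's location bar,
# with [TITLE] standing in for the article title
template = ('http://www.bing.com/search?q=site%3Ainfoworld.com'
            '+%22jon+udell%22+%22[TITLE]%22')

line = ('<p><a href="">XQuery and the power of learning by example'
        ' | Column | 2006-11-15</a></p>')
line = line.replace('href=""', 'href="' + template + '"')
print(line)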
Step 3: Replace [TITLE] in each template with the actual title
old: <p><a href="http://www.bing.com/search?q=site%3Ainfoworld.com+%22jon+udell%22+%22%5BTITLE%5D%22">XQuery and the power of learning by example | Column | 2006-11-15</a></p>
new: <p><a href="http://www.bing.com/search?q=site%3Ainfoworld.com+%22jon+udell%22+%22XQuery and the power of learning by example%22">XQuery and the power of learning by example | Column | 2006-11-15</a></p>
how: For each line, search for ">, mark the spot, search for |, copy the highlighted selection between the two search targets, then search for [TITLE], put the cursor at [, delete the next 7 characters, and paste from the clipboard.
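(And step 3 as a sketch: pull the title out from between the "> and the first |, then drop it into the placeholder.)

line = ('<p><a href="http://www.bing.com/search?q=site%3Ainfoworld.com'
        '+%22jon+udell%22+%22[TITLE]%22">XQuery and the power of learning'
        ' by example | Column | 2006-11-15</a></p>')

# The title sits between the "> that closes the opening tag and the first |
title = line.split('">', 1)[1].split(' |', 1)[0]
line = line.replace('[TITLE]', title)
print(line)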
Now that I’ve written all this down, I’ll admit it looks daunting, and doesn’t really qualify as a “no coding required” solution. It is a kind of coding, to be sure. But this kind of coding doesn’t involve a programming language. Instead you work out how to do things interactively, and then capture and replay those interactions.
I'll also admit that, even though word processors like Microsoft Word and LibreOffice can do capture and replay, you'll be hard pressed to pull off a transformation like this using those tools. They're not set up to do incremental search, or to switch between searching and editing while recording. So I didn't use a word processor; I used a programmer's text editor. Mine is an ancient one from Lugaru Software; there are many others, all of which will be familiar only to programmers. Which, of course, defeats my argument for accessibility. If you are not a programmer, you are not going to want to acquire and use a tool made for programmers.
So I’m left with a question. Are there tools — preferably online tools — that make this kind of text transformation widely available? If not, there’s an opportunity to create one. What IFTTT is doing for manual integration of web services is something that could also be done for manual transformation of text. If you watch over an office worker’s shoulder for any length of time, you’ll see that kind of manual transformation happening. It’s a colossal waste of time (and source of error). I could have spent hours reformatting that listing page. Instead it took me a few minutes. In the time I saved I documented how to do it. I wasn’t able to give you a reusable and modifiable online recipe, but that’s doable and would be a wonderful thing to enable.
Search is the new/old link, very clever.
Is it worth perhaps adding a parenthetical search link to the Internet Archive? I've always dreamed of having built into the browser, or into HTTP 404 messages, a message beyond "NOT FOUND" with a link to the Wayback Machine (there are some extensions and such).
Could do. I notice, though, that this — https://archive.org/search.php?query=XQuery%20and%20the%20power%20of%20learning%20by%20example — comes up empty.
Cool post. One thought that comes to mind is Nimble Text: http://nimbletext.com/
Having finished your post, I don't think it's the right tool for this issue, but it's in the right direction.
The really interesting part was the link you provided to Lugaru Software; they're in Pittsburgh, PA, not that far from me. It's nice to see small shops running a business on the net.
Nice! Thanks for pointing that out. I agree that it’s in the right direction for programmers, who can express transformations declaratively with regular expressions, but not for non-programmers who can express them procedurally/interactively.
I was thinking of coding a link using the original URL, e.g. http://web.archive.org/web/*/http://www.infoworld.com/article/06/11/15/47OPstrategic_1.html
that should be easy to insert?
<a href="http://web.archive.org/web/*/LINK">Wayback Machine</a>
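Something like this, say (a rough Python sketch, assuming the original URL is still at hand):

original = "http://www.infoworld.com/article/06/11/15/47OPstrategic_1.html"
text = "XQuery and the power of learning by example | Column | 2006-11-15"

# Parenthetical Wayback link alongside the entry, keyed by the old URL
wayback = '<a href="http://web.archive.org/web/*/' + original + '">Wayback Machine</a>'
print('<p>' + text + ' (' + wayback + ')</p>')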
Yes, that’d be easy. In this case, though, Wayback doesn’t seem to have the articles. Maybe that’s because they’ve been 404 for a long time?
I really like OpenRefine (http://openrefine.org) for manipulating text. Although it has a specific focus on tabular data, it includes HTML parsing, and I think it could work for the example you give above.
The great thing about OpenRefine is that you can build up complex transformations through small steps, rewinding steps as you want. At the end you can export a JSON representation of the steps, which you can share with others or re-apply to a future project with similarly structured data.
In a project I’m working on (http://gokb.org) we adopted OpenRefine as a mechanism for manipulating data from a wide variety of sources to a common format. We chose OpenRefine exactly because we needed non-coders to be able to easily transform text files on a repeated basis – with the potential of then automating the process for any particular data source using the JSON output by OpenRefine.
N.B. OpenRefine is browser-based but not web-based, although a startup, RefinePro, is looking to offer OpenRefine as an online service.
Thanks for the reminder. I used that tool back when it started, as Freebase Gridworks. I hadn’t thought of it in this context, because as you say it is (or was) oriented to tabular data. Now I’m curious to revisit it.
I couldn’t resist giving it a go – http://www.meanboyfriend.com/overdue_ideas/2014/12/using-openrefine-to-manipulate-html/
Outstanding! Thanks so much for that demo, for reminding me that the tool I first knew (and fell in love with) as Gridworks still thrives, and for introducing me to the GOKb project.
Jon, I think this is clever and directly related to Martin Klein's PhD thesis, summarized as an article in http://dx.doi.org/10.1007/s00799-014-0108-0 . But I also think it is sad that the original URI of a resource – its web currency – is thrown away in the process. See, snapshots of many of these resources are available in web archives around the world and can be found there by means of that original URI, and typically not via text searches. The original URI can be used as a key in individual web archives, and it can be used with the Memento protocol (RFC 7089) and related infrastructure to find snapshots in web archives around the world. Getting rid of the original URI prevents such discovery.
The problem of throwing away the original URI of a resource also exists when referencing a snapshot of that resource in a web archive, BTW: common practice is to throw away the original URI and replace it by the URI of the snapshot in a web archive. Unfortunately, web archives don't have the gift of eternal life either. And so, by throwing away the original URI one has also thrown away the possibility of getting a snapshot from another archive. In the context of the Hiberlink project, I have written the Missing Link document about this issue (see http://mementoweb.org/missing-link/) and proposed to annotate links to make sure the key is not thrown away. Independently of our thinking in this regard in the Memento and Hiberlink projects, the Internet Robustness group at Harvard's Berkman Center reached the same conclusion: in order to make links robust over time they need to be annotated; the original URI should not be thrown away.

In essence, three pieces of information should be available: the original URI, the URI of a snapshot (in your case a refind URI), and the datetime of linking. Our current implementation of this approach uses the attribute extensibility mechanism provided by HTML5: all attributes starting with data- are legitimate extensions. So, we use data-originalurl, data-versionurl, and data-versiondate to convey the info. In some use cases, the URI of the snapshot (refind URI) goes in href and the original URI and datetime are provided in data-originalurl and data-versiondate, respectively. In other cases, the original URI goes in href and the snapshot URI and datetime go in data-versionurl and data-versiondate, respectively. The latter approach is already supported by the Memento extension for Chrome, see http://bit.ly/memento-for-chrome . You can see it at work with bookmarks on BibSonomy and at the mock-up page http://mementoweb.org/missing-link-uri_references.html . See also the presentation "Creating Pockets of Persistence": http://www.slideshare.net/mobile/hvdsomp/creating-pockets-of-persistence
Anyhow, this was a long response just to say that you propose a cool solution but shouldn’t throw away the original URI ;-)
Well said, thanks. Of course I still have the originals. Alan Levine suggested encoding them in alternate links to the Wayback Machine; however, it doesn't seem to include those URLs. How would you recommend presenting them? For now, I've just put up a parallel page with the original URLs.
Jon, it looks like my response lacked clarity. What I was suggesting is that Step 1, removing the old URLs, is in my opinion not a great thing to do. They are not clutter. Quite to the contrary, they are keys to re-find the pages in web archives around the world. For example, there are several captures of your XQuery article in the Internet Archive (e.g. http://web.archive.org/web/20061117152526/http://www.infoworld.com/article/06/11/15/47OPstrategic_1.html), and other web archives may have captures too. So, instead of throwing that old URL away, I think it would be good to annotate the new link with it. And, while you're at it, annotate the link also with the date of the article. The combination of the old URL and that date will lead to snapshots in web archives, either by searching those archives one by one, or by searching them all in one go using a Memento client such as Memento for Chrome (http://bit.ly/memento-for-chrome).
Now, the question becomes how exactly to annotate the links. There is a discussion document that explores options (http://mementoweb.org/missing-link/). Of the presented options, we have decided (at least for now) to go with the data- attribute approach. Using your example, the link would become:
<a href="http://www.bing.com/search?q=site%3Ainfoworld.com+%22jon+udell%22+%22XQuery and the power of learning by example%22"
data-originalurl="http://www.infoworld.com/article/06/11/15/47OPstrategic_1.html"
data-versiondate="2006-11-15">text</a>
In addition, one could annotate with the URI of a preferred snapshot in a web archive:
<a href="http://www.bing.com/search?q=site%3Ainfoworld.com+%22jon+udell%22+%22XQuery and the power of learning by example%22"
data-originalurl="http://www.infoworld.com/article/06/11/15/47OPstrategic_1.html"
data-versiondate="2006-11-15"
data-versionurl="http://web.archive.org/web/20061117152526/http://www.infoworld.com/article/06/11/15/47OPstrategic_1.html">text</a>
The Memento extension for Chrome currently already supports data-versionurl and data-versiondate. Right clicking on a link makes those data- attributes actionable. Doing so leads to a capture of the original resource (value of data-originalurl) available in one of the public web archives around the world that has a snapshot datetime closest to the conveyed one (value of data-versiondate). The extension does not yet support data-originalurl (it currently expects that to be in href) but will soon.
I hope this clarifies my comment. In the course of January 2015, we hope to release more information and tools related to all this via http://mementoweb.org .
Ok, thanks, that's what I was looking for: guidance on how to annotate. I like data-originalurl and data-versiondate as attributes. That'd be easy to do. data-versionurl seems a lot harder (i.e. not easily automatable) because it would entail manual discovery of archived copies (which may or may not exist). Or is there, in fact, a way to automate such discovery?
Actually, though, in this case since it's my own stuff I'd just pull the articles down and archive them myself. But again, that's not easily automatable, they do still exist at InfoWorld for now, and life's short, so I like the annotations idea.
Jon, I totally understand that – in your use case – data-versionurl is not easy to do. There’s other cases where it is, e.g. when one pushes a resource into a web archive and receives a snapshot URL in return. But, here’s the good news: some time in January 2015 we will launch a Memento related service that would be useful in your case: as the value of data-versionurl you will be able to use baseURL/%datetime%/%original-url% (e.g. baseURL/20061115/http://www.infoworld.com/article/06/11/15/47OPstrategic_1.html). This service will redirect to the snapshot for the original URL that is temporally closest to the conveyed time. In deciding which snapshot that is, a whole range of publicly accessible web archives around the world are consulted.
Nice! Thanks very much, Herbert!
I'd like to reiterate Herbert's points about keeping the original URI, annotation, etc., and then provide another reason why rewriting the links to be queries to a search engine is bad from a web archiving perspective. When a robot (e.g., from the Internet Archive) crawls the page with this HTML:
<p><a href="http://www.bing.com/search?q=site%3Ainfoworld.com+%22jon+udell%22+%22XQuery and the power of learning by example%22">XQuery and the power of learning by example | Column | 2006-11-15</a></p>
it will check: http://www.bing.com/robots.txt and see that URIs with the prefix “/search” are disallowed. Thus the Internet Archive would not archive the search engine result page (SERP), and thus the “how-we-do-it-now”* URI would be less likely to be archived (unless the IA discovers the direct link from somewhere else).
* http://www.w3.org/Provider/Style/URI.html
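(A quick way to check, as a sketch using Python's standard-library robots.txt parser; whether a particular archive crawler honors the same rules as the catch-all user-agent is a separate question.)

from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://www.bing.com/robots.txt")
rp.read()

serp = ("http://www.bing.com/search?q=site%3Ainfoworld.com"
        "+%22jon+udell%22+%22XQuery and the power of learning by example%22")

# If /search is disallowed under the catch-all rules, a generic crawler
# (here an arbitrary user-agent string) should get False
print(rp.can_fetch("ExampleArchiveBot", serp))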
Very good point, thanks. In this case, since the stuff all lives at infoworld.com in an altered namespace, I would of course prefer they'd kept the originals intact and used rewrites to redirect on their end. A deluxe solution would be for me to do that server-side, on a server I control. Or I could do it right in the page, client-side, I guess. Maybe I should and will. I'd need to repeat the exercise if/when things change again, of course. Sigh.
I totally agree that the macro approach is the pragmatic solution here. You’ll be finished by the time you would have got half way through writing a script to do it “programmatically” – almost certainly having to stop by a library/language reference to refresh your knowledge of regular expressions in the process!
As you mentioned, even macros are a form of programming and unfortunately out of reach of many people.
The tool/technique I've found lately that might allow mere mortals to bridge the gap is multiple-cursors mode (currently only found in programmers' editors), e.g. http://www.youtube.com/watch?v=jNa3axo40qM in Emacs; something similar is also available in Notepad++.
It has much of the power of macros but with an immediate feedback loop that is Bret Victor-esque.
Stuck on an iPad so can’t provide exact steps, but it would be something like:
1. Select href=
2. Create additional cursors at each of the other occurrences
3. Select the url and delete it (or insert ” data-originalurl=”)
4. Move the cursors and copy the title.
5. Go back to the href, type the new search url base and paste in the title.
6. Usually, I smile at that point.
Watch some screencasts of the mode in action; it's very neat, and you can imagine what a really refined version could achieve.
OMG. That is astonishingly wonderful!
FYI: I’ve expanded on some of the web archiving aspects of this approach in a separate post: http://ws-dl.blogspot.com/2014/12/2014-12-20-using-search-engine-queries.html
Excellent summary, Michael, thank you!
The elephant in the room here, for me, is the BYTE archive. I’m probably the only person who would restore (at least some of) it. The (repeated) loss of it was so painful that I’ve avoided thinking about it. But maybe it should be a holiday project.