I had been meaning to explore PubSubHubbub, a protocol that enables near-realtime consumption of data feeds. Then somebody asked me: “Can OData feeds update through PubSubHubbub?” OData, which recently made a splash at the MIX conference, is based on Atom feeds. And PubSubHubbub works with Atom feeds. So I figured it would be trivial for an OData producer to hook into a PubSubHubbub cloud.
I’ve now done the experiment, and the answer is: Yes, it is trivial. In an earlier post I described how I’m exporting health and performance data from my elmcity service as an OData feed. In theory, enabling that feed for PubSubHubbub should only require me to add a single XML element to that feed. If the hub that connects publishers and subscribers is Google’s own reference implementation of the protocol, at http://pubsubhubbub.appspot.com, then that element is:
<link rel="hub" href="http://pubsubhubbub.appspot.com"/>
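For a feed producer that post-processes its Atom output, inserting that element is a one-liner. Here's a minimal sketch using Python's standard-library XML tools; the feed ID and table name are hypothetical stand-ins, not the actual elmcity feed:

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
ET.register_namespace("", ATOM_NS)

# A minimal Atom envelope standing in for the OData feed.
feed_xml = """<feed xmlns="http://www.w3.org/2005/Atom">
  <id>http://example.table.core.windows.net/monitor</id>
  <title>monitor</title>
</feed>"""

feed = ET.fromstring(feed_xml)

# Advertise the hub: a single <link rel="hub"> element is all
# PubSubHubbub asks of the publisher's feed.
hub_link = ET.SubElement(feed, f"{{{ATOM_NS}}}link")
hub_link.set("rel", "hub")
hub_link.set("href", "http://pubsubhubbub.appspot.com")

print(ET.tostring(feed, encoding="unicode"))
```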
So I added that to my OData feed. To verify that it worked, I tried using the publish and subscribe tools at pubsubhubbub.appspot.com, at first with no success. That was OK, because it forced me to implement my own publisher and my own subscriber, which helped me understand the protocol. Once I worked out the kinks, I was able to use my own subscriber to tell Google’s hub that I wanted my subscriber to receive near-realtime updates when the feed was updated. And I was able to use my own publisher to tell Google’s hub that the feed had been updated, thus triggering a push to the subscriber.
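The mechanics of those two roles are small. A publisher sends the hub a form-encoded POST saying a topic has changed; a subscriber sends the hub a subscription request, and then must echo the hub's `hub.challenge` when the hub verifies the callback. A sketch of those three messages, assuming Google's reference hub and hypothetical topic/callback URLs (the requests are only built here, not sent):

```python
from urllib.parse import urlencode
from urllib.request import Request

HUB = "http://pubsubhubbub.appspot.com"  # Google's reference hub
FORM = {"Content-Type": "application/x-www-form-urlencoded"}

def publish_ping(topic_url):
    """Publisher -> hub: the topic changed; the hub will re-fetch the
    feed and push new entries to subscribers."""
    body = urlencode({"hub.mode": "publish", "hub.url": topic_url})
    return Request(HUB, data=body.encode("ascii"), headers=FORM)

def subscribe_request(topic_url, callback_url):
    """Subscriber -> hub: please push updates for topic_url to
    callback_url."""
    body = urlencode({"hub.mode": "subscribe",
                      "hub.topic": topic_url,
                      "hub.callback": callback_url,
                      "hub.verify": "sync"})
    return Request(HUB, data=body.encode("ascii"), headers=FORM)

def verification_response(query_params):
    """Hub -> subscriber: the hub verifies a subscription with a GET to
    the callback, which must echo hub.challenge to confirm intent."""
    return query_params.get("hub.challenge", "")
```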
In this case, the feed is produced by the Azure Table service. It could also have been produced by the SQL Azure service, or by any other data service — based on SQL or not — that knows how to emit Atom feeds. And in this case, the feed URL (or, as the spec calls it, the topic URL) expresses query syntax that passes through to the underlying Azure Table service. Here’s one variant of that URL:
That query asks for the whole table. But even though the service that populates that table only adds a new record every 10 minutes, the total number of records becomes unwieldy after a few days. So the query URL can also restrict the results to just recent records, like so:
The result of that query, however, is different from the result of this one:
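The post's actual query URLs aren't reproduced here, but the general shape of the family is easy to illustrate. An Azure Table OData endpoint accepts a `$filter` expression in the query string, so the whole-table topic and a recent-records topic differ only by that parameter. A hedged sketch, with a hypothetical account and table name:

```python
from urllib.parse import quote

# Hypothetical Azure Table endpoint; not the post's real URL.
base = "http://example.table.core.windows.net/monitor()"

# Whole-table query: every record the service has ever written.
whole_table = base

# Restricted query: only records newer than a cutoff, via OData $filter.
cutoff = "2010-04-01T00:00:00Z"
recent = base + "?$filter=" + quote(f"Timestamp gt datetime'{cutoff}'")

print(recent)
```

Both URLs are valid PubSubHubbub topics, which is exactly what sets up the question below: they name overlapping slices of one underlying table.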
Now here’s my question. The service knows, when it updates the table, that any URL referring to that table is now stale. How does it tell the hub that? The spec says that the topic URL “MUST NOT contain an anchor fragment” but “can otherwise be free-form.” If the feed producer is a data service that supports a query language, and the corresponding OData service supports RESTful query, there is a whole family of topic URLs that can be subscribed to. How do publishers and subscribers specify the parent?
6 thoughts on “OData and PubSubHubbub: An answer and a question”
Really happy to hear about your experiments with OData and PubSubHubbub! Cool stuff.
Check out the non-normative section of the spec:
Hubs are supposed to determine feed equivalence by looking at stuff like redirect final URLs and Atom feed IDs. The reference hub code will treat pings on a feed URL with atom ID X as if *all* feeds with atom ID X got pinged (up to some reasonable maximum of fan-out). Thus if you ping on one variant of the feed, it should fetch, update, and push events for all variants, as long as the Atom ID is the same.
That approach is clever and useful for public feeds, but it doesn't scale well and doesn't address the private-feeds case either. Going forward, we're looking to solve this with an architectural approach to fan-out. Essentially, feed publishers who have massive fan-out of events based on many variants of the same feed need to be able to do this fan-out at publish time; that way they can quickly evaluate new result data against all standing queries and notify only the required callbacks of the new entry. The alternative is re-polling *every* standing query on *every* data insert or update, which isn't going to work. The hard part is that publishers need to look at feed A and feed B and determine that they have some kind of relationship. This also requires publishers to be tightly integrated with their hubs (for fat pinging).
I realize that’s not perfectly clear. In the coming weeks I plan to publicize some more technical docs on how this approach should work and how it’s going for us inside of Google on services like Buzz. You’ve hit the nail on the head though: This is the key problem to solve for the scalability of a rich notifications API built on PubSubHubbub. The good news is I think there’s a great solution for it.
“The reference hub code will treat pings on a feed URL with atom ID X as if *all* feeds with atom ID X got pinged”
So in my simple case, where the feed is based on a single table in a datastore, and where every update invalidates every standing query, it sounds like the problem is already solved with respect to the reference hub. The OData feed produced by the Azure Table service includes a feed-level Atom ID, which is the name (actually, the URI) of the table. So pinging with any query against that table should update all subscribers to any query against that table?
I didn’t think I saw that happening, but will look again and try to observe it. Thanks!
How does HTTP handle this generally?
For example, if I GET http://example.com/items (an index), then PUT an updated document to http://example.com/items/23, how do caches know that http://example.com/items (and any other views that include http://example.com/items/23) are out of date?
I don’t think HTTP has anything to say about it. Rather, I think cache implementations need to do the right thing.
I recently did an implementation in which all items from item/1 to item/n depend on a single resource. When it changes, all those dependencies fire, and all items — and views involving them — are evicted from the cache.
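That dependency-firing scheme can be sketched as a small in-process cache. This is a toy model under my own assumed names, not the implementation described above: each cached view records which underlying resource it was built from, and invalidating that resource evicts every dependent view at once.

```python
class DependentCache:
    """Toy HTTP cache that tracks which cached views depend on
    which underlying resource, and evicts them together."""

    def __init__(self):
        self.store = {}   # url -> cached response body
        self.deps = {}    # resource url -> set of dependent view urls

    def put(self, url, body, depends_on=None):
        self.store[url] = body
        if depends_on:
            self.deps.setdefault(depends_on, set()).add(url)

    def invalidate(self, resource):
        """A PUT to `resource` evicts it and every view built from it."""
        self.store.pop(resource, None)
        for view in self.deps.pop(resource, set()):
            self.store.pop(view, None)

cache = DependentCache()
cache.put("/items", "<index page>", depends_on="/items/23")
cache.put("/items?recent=1", "<recent view>", depends_on="/items/23")
cache.invalidate("/items/23")   # item 23 changed: both views are evicted
```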
But I am a novice when it comes to this kind of thing, and would like to hear other views on the subject.
I don’t know if you found the answer to your question, but the reality is that there isn’t any way to specify topic URL relations today.
Even if two topic URLs point to the same REST resource with different parameters, the hub has no notion of this.
So the solution is that your service needs to track the subscriptions to each topic URL internally. Even if two topic URLs are related, your service needs to ping the hub for each topic URL. Does that make sense?
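In other words, the fan-out lives in the publisher: when the table changes, the service walks its own registry of topic URLs and builds one publish ping per URL. A minimal sketch, with hypothetical topic URLs:

```python
from urllib.parse import urlencode

HUB = "http://pubsubhubbub.appspot.com"

# Topic URLs the service itself tracks; to the hub these are
# unrelated topics, even though both resolve to the same table.
topics = [
    "http://example.com/feed",
    "http://example.com/feed?$filter=recent",
]

def pings_for_update(topic_urls):
    """One publish ping body per topic URL: the service, not the hub,
    knows these topics are variants of one resource."""
    return [urlencode({"hub.mode": "publish", "hub.url": t})
            for t in topic_urls]

bodies = pings_for_update(topics)
```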
We could look at figuring out a way to describe the relations between these virtual topic URLs. Feel free to discuss it on the PubSubHubbub mailing list. It’s definitely a generic problem we’ll face as we roll out PubSubHubbub support for APIs.