As I was synching the podcast feed for this LibriVox essay collection, to keep me company on a long walk tonight, I was reminded of a wart in the feed generator. The auto-generated filenames are just auto-incremented book names. That’s not so bad when you’re listening to a chapter book, but pretty lame when you’re sampling a collection. I don’t want to see:
Instead I want to see:
What Is Enlightenment? by Immanuel Kant
Deity and Design by Chapman Cohen
Escape by Christopher Benson
What idiot wrote that feed generator?
Oh yeah. Me.
If someone wants to improve this before I can find the time, just go for it. The LibriVox crew would really appreciate it, and so would I.
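For anyone picking this up, here is a rough sketch of the idea in Python. It assumes the title and author of each recording are already available as plain strings. The function name `item_title`, the fallback format, and the sample book name are illustrations of mine, not the generator's actual code.

```python
def item_title(work_title, author, book_name, index):
    """Prefer a descriptive 'Title by Author'; fall back to an auto-incremented name."""
    if work_title and author:
        return f"{work_title} by {author}"
    if work_title:
        return work_title
    # Hypothetical fallback format; the real generator's naming scheme may differ.
    return f"{book_name} {index:02d}"


tracks = [
    ("What Is Enlightenment?", "Immanuel Kant"),
    ("Deity and Design", "Chapman Cohen"),
    ("Escape", "Christopher Benson"),
]
titles = [
    item_title(title, author, "Essay Collection", i)
    for i, (title, author) in enumerate(tracks, start=1)
]
```

The fallback keeps the current behavior for any track whose metadata is missing, so the feed never ends up with an empty title.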
11 thoughts on “Want to help improve LibriVox?”
yeah, on the list of things to do… generate proper metadata in the rss feeds… help would be appreciated. all the data is there in standard format, just not rss-ized.
Yup, that’d sure be nice. A lot of books have lovely descriptive chapter titles, too. It would be great for those to get into the feed somehow.
What language is this script in? I’m doing some stuff w/ MP3 tags in .Net & might be able to help.
“What language is this script in?”
Making steady progress, I think.
Python is pretty strange. This statement freaked me out:
return x, y, z
and of course:
(x, y, z) = func(p)
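For anyone else coming from .Net, what seems to be going on is tuple packing and unpacking. A small sketch with a made-up function (`min_mean_max` is only an illustration, not part of the feed script):

```python
def min_mean_max(values):
    # The commas build a single tuple; no parentheses are needed.
    return min(values), sum(values) / len(values), max(values)


# The returned tuple is unpacked into three names; the parentheses are optional.
(lo, mean, hi) = min_mean_max([2, 4, 9])

# Without unpacking, the caller simply receives one tuple object.
result = min_mean_max([2, 4, 9])
```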
But there are currently a couple of issues:
1) request for the mp3 from archive.org returns a 302 – redirect, which shouldn’t be hard to deal with,
2) how is it that reading a 6K chunk in the middle of the file gives you minutes & seconds? Pretty cool. But I believe that ID3 tags are at the end of the MP3 file, so currently I can retrieve title & artist only when I pull down the entire file, which is significantly larger than 6K.
Unless I can somehow just pull down the portion of the file that contains the ID3 tags.
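A sketch of what reading just the tag could look like, assuming the file carries an ID3v1 tag, which is a fixed 128-byte block at the very end of the file (ID3v2 tags sit at the start instead, so this would not find those). The function name and the sample bytes are made up for illustration:

```python
def parse_id3v1(tail):
    """Parse an ID3v1 tag from the last 128 bytes of an MP3; return None if absent."""
    if len(tail) < 128:
        return None
    tag = tail[-128:]
    if tag[:3] != b"TAG":
        return None

    def text(raw):
        # Fields are fixed width and padded with NUL bytes or spaces.
        return raw.split(b"\x00", 1)[0].decode("latin-1").strip()

    return {
        "title": text(tag[3:33]),
        "artist": text(tag[33:63]),
        "album": text(tag[63:93]),
        "year": text(tag[93:97]),
    }


# Fake 128-byte tag (invented values) to try the parser without a real file.
sample = (
    b"TAG"
    + b"Escape".ljust(30, b"\x00")
    + b"Christopher Benson".ljust(30, b"\x00")
    + b"Essays".ljust(30, b"\x00")
    + b"2006"
    + b"\x00" * 31  # comment (30 bytes) plus genre (1 byte)
)
info = parse_id3v1(sample)
```

So in principle only the last 128 bytes are needed, not the whole MP3.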
“request for the mp3 from archive.org returns a 302 – redirect, which shouldn’t be hard to deal with”
Originally I had the script follow that redirect, but the LibriVox folks found it was better to let the RSS reader do that at feed fetch time.
“I believe that ID3 tags are at the end of the MP3 file”
Of course all the metadata comes from the LibriVox database. It could be scraped from the page, or perhaps LibriVox can publish it in a more tractable form.
our database holds the id3tags, i believe, so we could publish those too i think.
The HTTP spec defines byte ranges. Some web servers don’t support them, but archive.org does.
Section 14.35.1 is what you want.
httplib2 is better than httplib2 for this kind of thing: http://code.google.com/p/httplib2/
Example of using httplib against archive.org to get part of the file:
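A minimal sketch of a byte-range request along these lines, written with Python 3's standard `urllib.request` rather than httplib. The URL is a placeholder, not a real archive.org path, and the helper names are made up:

```python
import urllib.request


def range_request(url, first=0, last=None, suffix=None):
    """Build a GET request for part of a resource using an HTTP/1.1 byte range."""
    if suffix is not None:
        spec = f"bytes=-{suffix}"        # the final `suffix` bytes
    elif last is None:
        spec = f"bytes={first}-"         # from `first` to the end
    else:
        spec = f"bytes={first}-{last}"   # inclusive on both ends
    return urllib.request.Request(url, headers={"Range": spec})


def fetch(req):
    # A server that honours the range answers 206 Partial Content;
    # one that ignores it answers 200 with the whole body.
    with urllib.request.urlopen(req) as resp:
        return resp.status, resp.read()


# Placeholder URL: substitute the real location of an MP3.
req = range_request("http://example.org/example.mp3", suffix=128)
# status, tail = fetch(req)  # needs network access
```

Asking for `suffix=128` requests only the last 128 bytes of the file, and `range_request(url, 0, 6143)` asks for the first 6K.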
Correction: *httplib2* is better than *httplib*.
Come to think of it, the title & artist info for each track can be acquired by screen scraping alone. No need to go to the MP3s themselves.
“title & artist info for each track can be acquired by screen scraping alone”
That is true. However I would recommend that LibriVox publish this metadata as a distinct XML fragment for each work. Not only for the purposes of the feed generator, but for use by other aggregators that will want to get hold of what are, in effect, bibliographic records.