As I was synching the podcast feed for this LibriVox essay collection, to keep me company on a long walk tonight, I was reminded of a wart in the feed generator. The auto-generated filenames are just auto-incremented book names. That’s not so bad when you’re listening to a chapter book, but pretty lame when you’re sampling a collection. I don’t want to see:
01_librivox_nonfiction_collection
02_librivox_nonfiction_collection
03_librivox_nonfiction_collection
Instead I want to see:
What Is Enlightenment? by Immanuel Kant
Deity and Design by Chapman Cohen
Escape by Christopher Benson
What idiot wrote that feed generator?
Oh yeah. Me.
If someone wants to improve this before I can find the time, just go for it. The LibriVox crew would really appreciate it, and so would I.
October 16, 2007 at 10:06 pm
yeah, on the list of things to do… generate proper metadata in the rss feeds… help would be appreciated. all the data is there in standard format, just not rss-ized.
October 16, 2007 at 10:16 pm
Yup, that’d sure be nice. A lot of books have lovely descriptive chapter titles, too. It would be great for those to get into the feed somehow.
October 17, 2007 at 8:01 am
Jon,
What language is this script in? I’m doing some stuff w/ MP3 tags in .Net & might be able to help.
October 17, 2007 at 12:26 pm
“What language is this script in?”
Python.
http://jonudell.net/librivox.py
http://jonudell.net/mp3info.py
October 17, 2007 at 10:00 pm
Hey Jon,
Making steady progress, I think.
Python is pretty strange. This statement freaked me out:
return x, y, z
and of course:
(x, y, z) = func(p)
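For anyone else coming from .Net, here is a small sketch of what those two statements do together. The function name `duration_parts` and its input are made up for illustration and are not taken from librivox.py.

```python
def duration_parts(total_seconds):
    """Split a duration in seconds into hours, minutes, and seconds."""
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    # The commas build a single tuple; the parentheses are optional.
    return hours, minutes, seconds


# The caller unpacks that tuple into three names in one assignment.
(h, m, s) = duration_parts(3725)
print(h, m, s)  # prints: 1 2 5
```

So the function still returns one value, a tuple, and the assignment on the calling side takes it apart again.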
But there are currently a couple of issues:
1) request for the mp3 from archive.org returns a 302 – redirect, which shouldn’t be hard to deal with,
2) how is it that reading a 6K chunk in the middle of the file gives you minutes & seconds? Pretty cool. But I believe that ID3 tags are at the end of the MP3 file, so currently, I can retrieve title & artist when I pull down the entire file …. which is significantly larger than 6K.
Unless I can somehow just pull down the portion of the file that contains the ID3 tags.
We’ll see…
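A note on where the tags live, with a rough sketch. The older ID3v1 format is a fixed 128-byte block at the very end of the file, starting with the bytes "TAG", so the "end of the file" belief holds for that version. ID3v2 tags are normally placed at the start of the file instead. Which version the LibriVox files carry is not stated in this thread, so the parser below assumes ID3v1 only. The function name `parse_id3v1` and the sample bytes are invented for illustration.

```python
def parse_id3v1(tail):
    """Parse an ID3v1 tag from the final 128 bytes of an MP3 file.

    Returns a dict of text fields, or None if no ID3v1 tag is found.
    """
    if len(tail) < 128:
        return None
    tag = tail[-128:]
    if tag[:3] != b"TAG":
        return None

    def text(raw):
        # Fields are fixed width and padded with NUL bytes or spaces.
        return raw.split(b"\x00", 1)[0].decode("latin-1").strip()

    return {
        "title": text(tag[3:33]),    # 30 bytes
        "artist": text(tag[33:63]),  # 30 bytes
        "album": text(tag[63:93]),   # 30 bytes
        "year": text(tag[93:97]),    # 4 bytes
    }


# Hand-built sample tag (not real LibriVox data) to exercise the parser.
fake_tail = (
    b"TAG"
    + b"Escape".ljust(30, b"\x00")
    + b"Christopher Benson".ljust(30, b"\x00")
    + b"Sample Album".ljust(30, b"\x00")
    + b"2007"
    + b"\x00" * 30   # comment field
    + b"\xff"        # genre byte
)
print(parse_id3v1(fake_tail))
```

If the files do use ID3v1, only those last 128 bytes would be needed rather than the whole MP3.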
October 18, 2007 at 7:37 am
“request for the mp3 from archive.org returns a 302 – redirect, which shouldn’t be hard to deal with”
Originally I had the script follow that redirect, but the LibriVox folks found it was better to let the RSS reader do that at feed fetch time.
“I believe that ID3 tags are at the end of the MP3 file”
Of course all the metadata comes from the LibriVox database. It could be scraped from the page, or perhaps LibriVox can publish it in a more tractable form.
October 18, 2007 at 11:43 am
our database holds the id3tags, i believe, so we could publish those too i think.
October 18, 2007 at 7:47 pm
Minh:
The HTTP spec defines byte ranges. Some web servers don’t support them, but archive.org does.
http://www.ietf.org/rfc/rfc2616.txt
Section 14.35.1 is what you want.
httplib2 is better than httplib2 for this kind of thing: http://code.google.com/p/httplib2/
docs: http://bitworking.org/projects/httplib2/ref/http-objects.html
Example of using httplib against archive.org to get part of the file:
http://dpaste.com/22843/
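In case the paste goes away, here is a rough sketch of the same idea. It uses Python's standard `urllib.request` rather than httplib2, and it is not the code from the paste. The URL is a placeholder, and the helper names `ranged_request` and `fetch` are made up. The `Range` header forms are the ones described in the RFC section above: `bytes=0-6143` for an explicit span, `bytes=-128` for the last 128 bytes.

```python
import urllib.request


def ranged_request(url, start=0, end=None, last=None):
    """Build (but do not send) a GET request for part of a resource."""
    if last is not None:
        spec = "bytes=-%d" % last            # final `last` bytes
    elif end is None:
        spec = "bytes=%d-" % start           # from `start` to the end
    else:
        spec = "bytes=%d-%d" % (start, end)  # inclusive span
    return urllib.request.Request(url, headers={"Range": spec})


def fetch(request):
    """Send the request and return (status, body)."""
    with urllib.request.urlopen(request) as response:
        # 206 means the server honored the range; 200 means it sent it all.
        return response.status, response.read()


# Placeholder URL, not a real LibriVox or archive.org file.
req = ranged_request("http://example.org/sample.mp3", last=128)
print(req.get_header("Range"))  # prints: bytes=-128
```

Calling `fetch(req)` against a server that supports byte ranges should come back with status 206 and only the requested bytes. A server that ignores the header will typically answer 200 with the full file, so it is worth checking the status before trusting the body.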
October 18, 2007 at 7:47 pm
Correction: *httplib2* is better than *httplib*.
:)
October 18, 2007 at 11:03 pm
Come to think of it, the title & artist info for each track can be acquired by screen scraping alone. No need to go to the MP3s themselves.
October 19, 2007 at 7:34 am
“title & artist info for each track can be acquired by screen scraping alone”
That is true. However I would recommend that LibriVox publish this metadata as a distinct XML fragment for each work. Not only for the purposes of the feed generator, but for use by other aggregators that will want to get hold of what are, in effect, bibliographic records.
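To make that concrete, here is a sketch of how the feed generator might use such a fragment. The XML layout below (`work`, `track`, `title`, `author`) is purely hypothetical, since LibriVox has not published a format, and the function name `item_titles` is invented. Only the three titles and authors come from the post above.

```python
import xml.etree.ElementTree as ET

# Hypothetical per-work metadata fragment; this schema is an assumption.
SAMPLE = """
<work>
  <track>
    <title>What Is Enlightenment?</title>
    <author>Immanuel Kant</author>
  </track>
  <track>
    <title>Deity and Design</title>
    <author>Chapman Cohen</author>
  </track>
  <track>
    <title>Escape</title>
    <author>Christopher Benson</author>
  </track>
</work>
"""


def item_titles(xml_text):
    """Return one descriptive feed item title per track."""
    root = ET.fromstring(xml_text.strip())
    titles = []
    for track in root.findall("track"):
        title = track.findtext("title", default="").strip()
        author = track.findtext("author", default="").strip()
        # Fall back to the bare title when no author is recorded.
        titles.append("%s by %s" % (title, author) if author else title)
    return titles


for line in item_titles(SAMPLE):
    print(line)
```

With something like this in place, the feed items would carry "What Is Enlightenment? by Immanuel Kant" instead of an auto-incremented filename, and other aggregators could read the same fragment.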