Efficiently Reading ID3v2 Tags Over HTTP?
Paul Crowley asks: "Given an HTTP URL for an MP3 file, what's the best way to read its ID3 tags on a GNU/Linux system? It shouldn't be necessary to fetch the whole file: HTTP byteranges should make it possible to fetch only the tiny fraction that's needed, for a big saving in network bandwidth. However, existing ID3v2 libraries are designed to read local files. Extending these libraries for this purpose, or implementing a new one, would be a big job. What's the clean solution - is FUSE the best way, or is there a simpler way that doesn't require root privs? Can I do it using the existing id3lib binary?"
Perhaps not this simple (Score:3, Interesting)
Re:Perhaps not this simple (Score:1, Insightful)
Would not require downloading the whole file. (Score:3, Insightful)
If you really -must- download only the tag and not a byte more, then clearly you'd have to (A.) know the offset in each file where the tag ends. This is not possible without storing that in some sort of database. Which won't work if you aren't the person in control of the server. Or (B.) download the file and scan it as you download lookin
You'd have to extend the API (Score:5, Interesting)
There's no point adding http:// support without also adding ftp:// URL support. FTP supports range fetching as well.
So you have handlers for http:// URLs, ftp:// URLs, and file:// URLs.
Then you'd have to map all the old (compatibility) file-oriented APIs into the new function handlers for file://. (Or maybe the opposite, map file:// into the old API, leaving the old implementation intact)
Re:Not really answering your question.. (Score:3, Informative)
Re:Not really answering your question.. (Score:2)
No need for an index (Score:2)
Re:Not really answering your question.. (Score:1)
Re:Not really answering your question.. (Score:1)
Silly Question (Score:1, Interesting)
Without looking and without knowing, I'm willing to bet there's a Perl module for processing MP3 ID3v2 tags. The whole project can probably be done in Perl in a very small number of lines.
Re:Silly Question - NOT (Score:1, Informative)
If you had any knowledge of ID3v2, you'd be aware that you DON'T know where the tag will be. It can be placed pretty much anywhere within the file.
Re:Silly Question (Score:2)
Of course there's a package for handling ID3 tags in Perl. Heck, I wrote one in PHP. This is about efficiently reading tags over HTTP, where getting the tags requires multiple requests, and not just downloading the whole file. T
Re:Silly Question (Score:2)
Yes, yes you were.... (Score:2)
Yes, it was unclear because you provided too much information.
What you appear to actually want is a generic way to wrap a library that reads a file or stream of some type and be able to feed it from an HTTP stream doing efficient requests, by getting byte ranges over HTTP.
The fact that you want ID3 isn't totally relevant, as you want a way to wrap the existing ID3 class to read from http instead of, say, a file. This probably confused a lot of people.
Short answer is that no, I don't th
Oops, typo... (Score:2)
Re:Yes, yes you were.... (Score:2)
But in future I'd write a "general" thing about HTTP as a read-only network filesystem, and then a second "specific" paragraph about why I'm interested.
HTTP 499 (Score:5, Interesting)
I haven't actually done it, but speaking as a server operator, when I look through my server logs, I see some hits that end with status code 499, meaning that the transfer was aborted. So you just have the client software you're writing close the HTTP connection after it locates the end of the ID3 tag. It's probably not 100% efficient, but obviously a lot better than reading the whole MP3 file.
I'm assuming you're doing this in C/C++, but I'll try to do a prototype in perl.
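To know when to close the connection, the client has to learn how long the tag is. A rough Python sketch (not the Perl prototype mentioned above), assuming the tag sits at the start of the file: the 10-byte ID3v2 header stores the tag length as a 4-byte "syncsafe" integer, 7 bits per byte. The function names here are made up for illustration.

```python
from typing import Optional


def syncsafe_to_int(raw: bytes) -> int:
    """Decode a 4-byte ID3v2 syncsafe integer (top bit of each byte is 0)."""
    if len(raw) != 4 or any(b & 0x80 for b in raw):
        raise ValueError("not a 4-byte syncsafe integer")
    return (raw[0] << 21) | (raw[1] << 14) | (raw[2] << 7) | raw[3]


def id3v2_total_size(header: bytes) -> Optional[int]:
    """Total length in bytes of a leading ID3v2 tag, or None if there is none.

    `header` is the first 10 bytes of the file: "ID3", two version bytes,
    one flags byte, then the syncsafe size of everything after the header.
    """
    if len(header) < 10 or header[:3] != b"ID3":
        return None
    flags = header[5]
    body = syncsafe_to_int(header[6:10])
    footer = 10 if flags & 0x10 else 0  # optional 10-byte footer (ID3v2.4)
    return 10 + body + footer
```

Once the client has read that many bytes it can drop the connection, or better, ask for exactly that range in a second request.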
Re:HTTP 499 (Score:2)
Re:HTTP 499 (Score:5, Informative)
That's the problem -- it could be at the end, requiring you to spin through all x bytes (most likely megs) until you get to the end.
Re:HTTP 499 (Score:4, Informative)
Yeah, that could be true, but if it's not within say, the first 100KB, then the smart thing to do is to stop trying to find it and just return an error.
If it's not at the beginning, you could then use byte ranges to try to fast forward to the end and guess that it will be within the last say, 50 KB of the end.
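For what it's worth, the two requests described here map onto HTTP/1.1 Range headers fairly directly. A small Python sketch, with helper names invented for the example; the 100 KB and 50 KB figures are just the guesses from the comment above.

```python
def first_bytes(n: int) -> dict:
    """Header asking for the first n bytes (byte positions are inclusive)."""
    return {"Range": f"bytes=0-{n - 1}"}


def last_bytes(n: int) -> dict:
    """Header asking for the final n bytes, using a suffix byte range."""
    return {"Range": f"bytes=-{n}"}


head_probe = first_bytes(100 * 1024)  # {'Range': 'bytes=0-102399'}
tail_probe = last_bytes(50 * 1024)    # {'Range': 'bytes=-51200'}
```

A server that honours the header answers 206 Partial Content; one that ignores it answers 200 with the whole body, so the client should check the status before trusting what it got.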
Re:HTTP 499 (Score:5, Insightful)
1. read first 3 bytes with an HTTP byte range
2. if ID3, process tag from byte 0
3. else read last 10 bytes
4. if 3DI, process tag backwards from there
5. else, see if there is an ID3v1 tag at the end
6. if yes, read last 10 bytes before the ID3v1 tag
7. if 3DI, then process backwards
So it is possible. He just needs to read the fricking id3 tag definitions.
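The steps above can be sketched in Python roughly as follows. This is only an outline under some assumptions: `fetch(start, end)` is a hypothetical callable that returns bytes `[start, end)` of the remote file (for instance via a byte-range request), the file size is already known, and an ID3v1 tag is taken to be the last 128 bytes starting with "TAG".

```python
from typing import Callable, Optional

Fetch = Callable[[int, int], bytes]  # hypothetical: bytes [start, end) of the file


def locate_id3v2(fetch: Fetch, file_size: int) -> Optional[str]:
    """Report where an ID3v2 tag appears to be, following the numbered steps."""
    # Steps 1-2: a tag at the front starts with "ID3".
    if fetch(0, 3) == b"ID3":
        return "start"
    # Steps 3-4: a tag at the very end finishes with a 10-byte "3DI" footer.
    if file_size >= 10 and fetch(file_size - 10, file_size)[:3] == b"3DI":
        return "end"
    # Steps 5-7: an ID3v1 tag may follow, so look just before it.
    if file_size >= 138 and fetch(file_size - 128, file_size - 125) == b"TAG":
        if fetch(file_size - 138, file_size - 128)[:3] == b"3DI":
            return "before-id3v1"
    return None
```

In the worst case that is four tiny requests, regardless of how big the MP3 is.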
Source code is stupid (Score:2)
Re:Source code is stupid (Score:3, Insightful)
If he needs to scan hundreds or thousands of files, that WILL add up in a hurry. Also, if he's clever, he can take advantage of persistent HTTP connections, disable Nagle's algorithm and really get a performance boost. Especially over a "slow" link.
Just realized I was nearly offtopic (Score:2)
As for how you might use an existing library without extending it, it seems like you should be able to save the first x bytes of HTTP (MP3) data to a local temp file and then have the pre-existing ID3 library run over that data. I would think that this doesn't necessarily require root.
ID3v2 Sucks (Score:5, Informative)
The number of checks you have to do is phenomenal. The biggest worry is buffer overflow where the length given is greater than the actual length of the tag and you read more than is in the file. There are just hundreds of such edge cases. Libraries for ID3v2 are likely to be buggy, crashy, and just no fun.
Re:ID3v2 Sucks (Score:4, Informative)
It does, however, support arbitrary character sets and arbitrary binary formats. Not sure there's another way to do it. Vorbis-comments are ASCII only, right?
I look forward to your reply.
Re:ID3v2 Sucks (Score:5, Informative)
And before anyone goes off on one because it's non-standard, I'll point out that MP3 has *no* provision for metadata. ID3v1 and 2's are just as arbitrary addons as APEv2; they're just older (and lamer, either in big limitations or extreme overcomplication).
I believe the recommended *standard* way of attaching metadata to an MP3 now is to put it in an MP4 container, which has its own more sensible format. Again, I'm pretty sure foobar2000 (maybe with some plugin in the Special Installer) can put them in, and I think they should play on anything which knows about MP4. Fully reversible too.
Re:ID3v2 Sucks (Score:1)
Formats currently supporting/using APEv2: MPC, WavPack, APE (Monkey's Audio), MP3
Vorbis comments UTF-8 (Score:5, Informative)
Vorbis-comments are ASCII only, right?
No. The field names are ASCII only (actually a printable subset minus '=') but the contents of the fields are specified [xiph.org] as UTF-8.
The intention was you could put arbitrary binary data in there too, but there's no general mechanism for marking it as anything else. So any non-UTF-8 use would be application specific.
Re:ID3v2 Sucks (Score:2)
It reminded me of nothing so much as ASN.1/BER/DER.
If they were going to do something similar to ASN.1, they should have just used ASN.1 BER. Then writing tag manipulation tools would be easy. ASN.1 BER is complex and a pain in the butt to write from scratch, but lots of good tools exist so writing it from scratch wouldn't be necessary.
CDDB: Feel the Pain (Score:2, Informative)
Here is Netscape's JWZ hilariously sad-but-true rant about the ID3 header format:
CDDB: Feel the Pain [jwz.org]
In case you didn't know, the file format that CDDB (and FreeDB) use is complete garbage. In addition to random idiotic crap like it being impossible to unambiguously represent a song title that has a slash in it, it's rocket science to figure out how long a song is supposed to be. I need this info not only to display it in Gronk (my MP3 jukebox software), but also for some error-checking that my CD-ripp
Completely irrelevant (Score:2)
Re:ID3v2 Sucks (Score:3, Interesting)
I wrote a class for handling ID3v1/2 tags, and it works fine. I use it nearly every day, and it's processed nearly 5000 songs without fail (various versions of v2 tags, mixed in with the old classic v1), from Apples, *nixes and windows.
The format is so specific you can code for almost any eventuality. It's one of the easier binary formats I've worked with, and I think it's a gre
Re:ID3v2 Sucks (Score:3, Informative)
...and never followed. In particular the bit about text being either ISO Latin 1 or UTF-16 (or, in later versions of ID3v2, UTF-8), which is a very sensible idea, is always completely ignored; the overwhelming majority of tag writers, both on Windows and Linux, write text in arbitrary 8-bit encodings (shift-JIS, GBK, whatever) and then mark them as being Latin 1. There's nothing a tag reader can do about that, as there's no way to work out what the writer's locale
how I would do it (Score:3, Insightful)
This doesn't seem so complicated to me
Er... (Score:2, Insightful)
Re:Er... (Score:3, Informative)
Re:Er... (Score:1)
Grab the header, file the rest (Score:1)
Continue? (Score:2)
MP3::Info Perl module (Score:4, Informative)
Basically every web jukebox out there does something like this so I'm sure there's plenty of other code available to work from. The mod_perl way is to put SetHandler perl-script then PerlHandler [name of module] in your httpd.conf file so when a URL request falls within that Location or Directory, the perl module handles returning whatever you want it to return.
Hmmm... (Score:1)
The easy solution... (Score:1)
2. Check to make sure you have the whole tag. If it's bigger than what you downloaded, download the rest of the tag.
3. Write to a temporary file
4. Run existing libraries and/or tools against temporary file
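Steps 3 and 4 might look something like this in Python. It is a sketch only: `reader` stands in for whatever existing library call or tool wrapper you would run on a local path, and the bytes are assumed to have been downloaded already.

```python
import os
import tempfile
from typing import Callable, TypeVar

T = TypeVar("T")


def with_temp_copy(tag_bytes: bytes, reader: Callable[[str], T]) -> T:
    """Write the downloaded bytes to a temp file, run reader(path), clean up."""
    fd, path = tempfile.mkstemp(suffix=".mp3")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(tag_bytes)
        # `reader` is hypothetical: any function that opens a local MP3 path.
        return reader(path)
    finally:
        os.remove(path)
```

Nothing here needs root, but it only works if the library is happy with a file that stops right after the tag.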
Tried using PHP? (Score:2)
fopen() can open local files & URLs - look at the http:// example:
fopen() [php.net]
fgets() will read in data from the stream - you can pick how many bytes you want to read in:
fgets() [php.net]
Don't forget to use fclose() afterwards!
When you get those functions working, it's just a matter of interpreting the content returned. PHP has many useful string functions [zend.com] - many more than ASP does.
These functions are analogou
Re:Tried using PHP? (Score:2)
The author is German, as are a couple of his comments, but the PHP code is tidy, with English variables. The script handles ID3v1 - ID3v2.3 and is LGPL.
No need to reinvent the wheel :-)
Re:Tried using PHP? (Score:1)
Fun problem .. (Score:1)
The simple choice seems to be "read a range of 0-50k" to see if the data is at the start of the file. If it is then you get lucky and win!
If it isn't then you assume it's at the end, and then ideally you just want to say "give me the last 50k".
Unfortunately you can't do that as there isn't a notion of negative offsets from the end of a file in HTTP. So in the general case you cannot do better than read the whole thing.
I guess if you have a directory index you can parse the filesize from that and th
Re:Fun problem .. (Score:2)
The *real* problem is this: if I were writing this in Java, and the ID3v2 library were in Java, I could easily provide a seekable InputStream object representing a file which I have a URL for, and the ID3v2 library would read only the parts of the file it needed. It wouldn't have to think about the fact that the file was remote, and I wouldn't have to anticipate what it was going to want to read using the cumbersom
Re:Fun problem .. (Score:2)
Of course, this relies on the library seeking in a sensible way, and you might have to hack it to use seek to determine file size rather than fstat.
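The same idea is easy enough to sketch in Python as a seekable file-like object, assuming the tag library accepts file objects rather than paths. Here `fetch(start, end)` is a hypothetical callable returning bytes `[start, end)`, which a real version would implement with a `Range: bytes=start-(end-1)` request; the size has to be known up front, and there is no caching.

```python
import io
from typing import Callable, List, Tuple


class RangeFile(io.RawIOBase):
    """Read-only, seekable view of a remote file that fetches only on read()."""

    def __init__(self, fetch: Callable[[int, int], bytes], size: int):
        super().__init__()
        self._fetch = fetch
        self._size = size
        self._pos = 0
        self.requests: List[Tuple[int, int]] = []  # log of ranges fetched

    def readable(self) -> bool:
        return True

    def seekable(self) -> bool:
        return True

    def tell(self) -> int:
        return self._pos

    def seek(self, offset: int, whence: int = io.SEEK_SET) -> int:
        # Seeking costs nothing on the network; only the position changes.
        base = {io.SEEK_SET: 0, io.SEEK_CUR: self._pos, io.SEEK_END: self._size}
        if whence not in base:
            raise ValueError("invalid whence")
        pos = base[whence] + offset
        if pos < 0:
            raise ValueError("negative seek position")
        self._pos = pos
        return pos

    def readinto(self, buf) -> int:
        end = min(self._pos + len(buf), self._size)
        if end <= self._pos:
            return 0  # at or past end of file
        chunk = self._fetch(self._pos, end)
        buf[:len(chunk)] = chunk
        self.requests.append((self._pos, end))
        self._pos += len(chunk)
        return len(chunk)
```

A library that does `seek(0, SEEK_END)` to find the size and then reads a few small pieces would trigger only those few fetches; one that reads the whole file front to back gains nothing.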
Re:Fun problem .. (Score:2)
Re:Fun problem .. (Score:2)
Of course, there's the pathological worst case where the ID3 tag is ID3v1 or ID3v2.4, i.e. at the end of the file, and the HTTP server doesn't support HTTP/1.1 byte range requests. In that case, you fetch the entire file. But that's no worse than not having the middle caching layer at all, and it's hard to see how you could do better in that case.
Re:Fun problem .. (Score:2)
Re:Fun problem .. (Score:2)
There is, however, a nifty thing called a HEAD request, which gets all the headers and none of the data. Observe:
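A rough Python sketch of that kind of request, using only the standard library. The function names are made up for the example, and it assumes the server reports Content-Length and (optionally) Accept-Ranges on a HEAD response, which not every server does.

```python
import http.client
from typing import Optional, Tuple
from urllib.parse import urlsplit


def interpret_head(status: int, content_length: Optional[str],
                   accept_ranges: Optional[str]) -> Tuple[Optional[int], bool]:
    """Turn HEAD response fields into (size in bytes or None, ranges advertised?)."""
    size = None
    if status == 200 and content_length is not None and content_length.isdigit():
        size = int(content_length)
    ranges_ok = (accept_ranges or "").strip().lower() == "bytes"
    return size, ranges_ok


def head_info(url: str) -> Tuple[Optional[int], bool]:
    """Issue a HEAD request (no body is transferred) and interpret the headers."""
    parts = urlsplit(url)
    conn_cls = (http.client.HTTPSConnection if parts.scheme == "https"
                else http.client.HTTPConnection)
    conn = conn_cls(parts.hostname, parts.port, timeout=10)
    try:
        target = parts.path or "/"
        if parts.query:
            target += "?" + parts.query
        conn.request("HEAD", target)
        resp = conn.getresponse()
        return interpret_head(resp.status,
                              resp.getheader("Content-Length"),
                              resp.getheader("Accept-Ranges"))
    finally:
        conn.close()
```

With the size in hand, "the last 50k" becomes an ordinary start-end range computed from the file length.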
Amazed that no-one's really tried to attack this (Score:4, Insightful)
The problems are these:
1) Reading ID3v2 tags on an MP3 file is a complex business. I have no desire to re-implement the libraries that do that, or even to wade deep into the existing codebases, if I can avoid that. And it should be possible to avoid that.
2) Even knowing the size and location of ID3v2 tags is complex. Contrary to popular belief here, those tags can appear at either the beginning or the end of a file, and can be arbitrary size. I already implemented the "fetch some stuff at the beginning and some stuff at the end and feed that to the library" approach, and it sort-of works, but you have to guess the size of the tag. Guess too big, you fetch lots of data unnecessarily. Guess too small, you get breakage or wrong results. By contrast, the libraries that read ID3v2 tags know exactly where and how much to read to glean the appropriate data, and it should be possible to make use of that.
3) I want to read existing data - changing the format of that data is not an option.
So that's why I was suggesting solutions like "FUSE". With FUSE, when the library does a seek and a read, I can arrange for just the relevant portion of the file to be fetched. I don't have to include any knowledge about ID3 in my application - the library does all the work. But the library doesn't have to worry about HTTP byte ranges - FUSE handles that. And the code will always be correct.
The only trouble is that FUSE requires a kernel patch and root privs. The question is, is there a way to do the same trick without those limitations? Or is there a library for reading ID3v2 tags in an object-oriented language that will let me put an efficient back-end for fetching data on request using HTTP byteranges in place of the file?
The best information I've got out of this is that there's a pure-Python implementation of ID3v2 (most implementations appear to be built on top of the C library). This may be hackable to solve my problem.
Those of you who didn't think reading or thinking was necessary before posting - please don't do the next "Ask Slashdot" post the same discourtesy. Thanks.
Re:Amazed that no-one's really tried to attack thi (Score:2)
ID3v2 tags are very interesting, in my opinion :)
You still don't get the fundamental problem? (Score:3, Informative)
If you want a solution that will allow you to escape downloading the whole file, just check for ID3 in the first 3 bytes and 3DI @ 10 bytes back from the end. Download a couple K in the
Re:You still don't get the fundamental problem? (Score:2)
Think about the actual calls to "read" and "seek" on the filehandle the library does. Now imagine that in the background, you fetch parts of the file only at the moment the application calls "read". You'll see that the application does not "read" every last byte from the file - usually much, much less.
FUSE does exactly this "sorcerer's magic".
Or think about what would happen if the file were served by NFS, rather than by HTTP. Again, only the parts of the file that were needed
Alrighty then. (Score:2)
Yes, this involves doing real work. No, Ask Slashdot rarely does real work for people.
My "sorcerer's magic" comment, by the way, was trying to communicate the idea that even these lib
Re:Amazed that no-one's really tried to attack thi (Score:1)
Couple of thoughts:
LD_PRELOAD might help you to override open, seek, read, etc calls. You can probably do a HEAD on the URL to get the actual size of the MP3, without downloading the entire file and fake stat results from that.
Seek and read can be faked with Byte-Ranges, as you have already indicated.
Problems that I see are convincing the application to open "http://host:port/path" using the filesyst