Many of us are prone to using the Shazam music-identification service whenever we encounter unfamiliar songs. After all, it's just so easy to whip out our phones, open an app, and know everything about a mystery song in seconds. But how does Shazam gives us all this information so quickly?
There is a cool service called Shazam, which take a short sample of music, and identifies the song. There are couple ways to use it, but one of the more convenient is to install their free app onto an iPhone. Just hit the "tag now" button, hold the phone's mic up to a speaker, and it will usually identify the song and provide artist information, as well as a link to purchase the album.
What is so remarkable about the service, is that it works on very obscure songs and will do so even with extraneous background noise. I've gotten it to work sitting down in a crowded coffee shop and pizzeria.
So I was curious how it worked, and luckily there is a paper written by one of the developers explaining just that. Of course they leave out some of the details, but the basic idea is exactly what you would expect: it relies on fingerprinting music based on the spectrogram.
Here are the basic steps:
1. Beforehand, Shazam fingerprints a comprehensive catalog of music, and stores the fingerprints in a database.
2. A user "tags" a song they hear, which fingerprints a 10 second sample of audio.
3. The Shazam app uploads the fingerprint to Shazam's service, which runs a search for a matching fingerprint in their database.
4. If a match is found, the song info is returned to the user, otherwise an error is returned.
If a specific song is hit multiple times (based on examples in the paper I think it needs about 1 frequency hit per second), it then checks to see if these frequencies correspond in time. They actually have a clever way of doing this They create a 2d plot of frequency hits, on one axis is the time from the beginning of the track those frequencies appear in the song, on the other axis is the time those frequencies appear in the sample. If there is a temporal relation between the sets of points, then the points will align along a diagonal. They use another signal processing method to find this line, and if it exists with some certainty, then they label the song a match.
No comments:
Post a Comment