Music Information Retrieval: Introduction
Music information retrieval is transforming how even the casual listener accesses and interacts with music documents and recordings.
Automatic identification of recordings over a cell phone lets listeners buy the music (.mp3 files or ringtones) they want as soon as they hear it. New ways of aligning musical scores with acoustic performances make possible interactive music performance systems, or karaoke that follows the singer instead of the other way round. Query-by-humming systems let users identify the songs they can't get out of their minds. Music scholars use content-based search to find examples of desired themes in a corpus of music. Beat-matching algorithms help DJs blend audio recordings to create new works. And music recommendation systems introduce people to music they would never have discovered on their own. All this is the result of advances in music information retrieval.
Music collections comprise one of the most popular categories of multimedia content, and the collections to be searched are increasingly being moved online. Examples of online music collections include the 29,000 pieces of popular American sheet music in the Lester S. Levy Collection of Sheet Music at Johns Hopkins University (levysheetmusic.mse.jhu.edu/) and the million-plus recordings available through Apple Computer's popular iTunes download service and online repository (www.apple.com/itunes/). Whether online or offline, music librarians index these collections with text-based metadata tags that describe identifying features (such as title, composer, and performer). Finding the desired recording or sheet music this way is a problem for users who do not already know the metadata for the desired piece. And finding the right place in a recording (skipping to where, say, the trumpet begins to play) has been impossible without human intervention.
Even users who know none of the metadata for a desired piece of music may still know a lot about its content, recalling lyrics, instrumentation, melodies, rhythms, harmonies, and more. They would be better off if music databases were searchable based on musical content. The first content-addressable music databases focused on melody. Two notable examples are Themefinder (www.themefinder.org/) and Meldex (www.sadl.uleth.ca/nz/cgi-bin/music/musiclibrary). Themefinder, sponsored by the Center for Computer Assisted Research in the Humanities at Stanford University, contains themes from 20,000 pieces of Classical, Baroque, and Renaissance music, as well as European folk songs. Expert users search for themes by entering information about melodic contours in a text-based interface. Meldex, part of the New Zealand Digital Library Project at the University of Waikato, lets users search a database of more than 9,000 European and Chinese folk melodies by playing query examples on a virtual keyboard on the Meldex Web page. Melodically similar passages are found and returned in a ranked list to the user.
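To make contour-based melodic search concrete, here is a minimal Python sketch (purely illustrative; not Themefinder's or Meldex's actual code). Each melody is reduced to a Parsons-style contour string (U for up, D for down, R for repeat), and a query matches any piece whose contour contains the query's contour as a substring.

```python
def contour(pitches):
    """Reduce a melody (MIDI note numbers) to a Parsons-style
    contour string: U = up, D = down, R = repeated pitch."""
    steps = []
    for prev, cur in zip(pitches, pitches[1:]):
        steps.append("U" if cur > prev else "D" if cur < prev else "R")
    return "".join(steps)

def search(corpus, query_pitches):
    """Return titles of pieces whose melodic contour contains the
    query's contour. corpus maps title -> list of MIDI pitches."""
    q = contour(query_pitches)
    return [title for title, pitches in corpus.items()
            if q in contour(pitches)]
```

Because only pitch direction is kept, the query matches regardless of key or exact intervals, which is exactly why contour is a forgiving first representation for hummed or remembered melodies.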
Melodic search engines are just the first step toward making music documents content-searchable. Instrumentation, harmony, lyrics, meter, and genre are other musical structures a listener might wish to find. Indexing audio documents (such as .mp3 and .wav files) involves significant digital signal processing. It also requires advances in source separation (extracting individual sound sources from a recording with multiple simultaneous sources) and source identification (recognizing the individual vocalist, instrument, or other sound in an audio recording). Higher-level structures (such as verse, chorus, meter, and cadence) must then be inferred from this analysis.
Identifying the instruments in a recording is a special case of a more general problem—source identification—that applies to all audio. Other tasks (such as spotting melody, identifying genre, and finding musical meter) are fundamentally musical, since the information to be extracted from the audio relates exclusively to music.
Daniel P.W. Ellis starts us off by investigating how information is found in and extracted from audio recordings, from individual notes to chords to higher-level structures (such as verse and chorus). Outlining the kinds of information researchers are generally interested in extracting, he explores the technical problems that are likely to be solved soon and those that are especially challenging; he also provides pointers to some interesting applications (such as music recommendation) of music signal processing.
When music listeners ponder the technology and process behind music information retrieval, they often assume that finding the relevant document—a written score, a Musical Instrument Digital Interface (MIDI) file, or an audio file—is the goal (and end) of the process. However, for scholars and musicians alike, finding a document is only the beginning—a tool to achieve the task at hand—and often involves musical performance. Roger Dannenberg and Christopher Raphael focus on how to align machine-readable encoding of a musical score to the audio of an acoustic performance of the same score. For live music, automatic score alignment allows computer-controlled accompaniment that adjusts tempo and volume to fit the nuances of a musician's performance. Moreover, score alignment enables music coaching in which the computer analyzes a performance, noting places where the performer missed a note, skipped a section, or did not play the dynamics shown in the score. It also enables random access to a recording, letting the listener select any measure in the score to begin playback.
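Score-to-audio alignment of the kind Dannenberg and Raphael discuss is commonly computed with dynamic time warping (DTW), which finds a minimum-cost correspondence between a score-derived feature sequence and a performance-derived one. The sketch below is a simplified illustration using scalar features and an absolute-difference cost; real systems align richer spectral features, but the dynamic-programming core is the same.

```python
import math

def dtw_align(score, perf, dist=lambda a, b: abs(a - b)):
    """Dynamic time warping: return the optimal alignment path
    between two feature sequences as (score_index, perf_index) pairs."""
    n, m = len(score), len(perf)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(score[i - 1], perf[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # score event held
                                 cost[i][j - 1],      # performance lingers
                                 cost[i - 1][j - 1])  # both advance
    # Backtrack from the end, always moving to the cheapest predecessor.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    return path[::-1]
```

The returned path is exactly what enables random access: once each score event is tied to a time in the recording, selecting a measure in the score selects a playback position in the audio.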
Prerecorded music is routinely heard outside the controlled environments of home and library, including in noisy cars, restaurants, and bars. People wishing to learn more (such as composer or title) about the music or buy a copy for a personal collection have been out of luck. This potential market, combined with the hundreds of millions of cell phone users worldwide, has prompted development of several pioneering music recognition services that let users identify songs by sampling a few seconds of the audio through a mobile phone. Avery Wang describes one such service—the Shazam music recognition service—a commercially available audio-fingerprinting system that helps users identify specific music recordings by recognizing a 10-second sample delivered by phone, even in noisy environments.
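Wang's published description of Shazam rests on hashing pairs of spectrogram peaks ("landmarks") and voting on consistent time offsets between a query and each stored track. The following is a toy reconstruction of that idea, not Shazam's implementation; the peak lists would come from a short-time Fourier transform peak-picker, which is omitted here.

```python
from collections import defaultdict

def fingerprints(peaks, fan_out=3):
    """Hash pairs of nearby spectral peaks (freq, time) into
    (f1, f2, delta_t) landmarks; yield (hash, anchor_time) pairs."""
    peaks = sorted(peaks, key=lambda p: p[1])
    for i, (f1, t1) in enumerate(peaks):
        for f2, t2 in peaks[i + 1:i + 1 + fan_out]:
            yield (f1, f2, t2 - t1), t1

def match(db, sample_peaks):
    """db maps hash -> list of (track, time). A true match produces
    many hits sharing one time offset, so vote on (track, offset)."""
    votes = defaultdict(int)
    for h, t in fingerprints(sample_peaks):
        for track, t_db in db.get(h, []):
            votes[(track, t_db - t)] += 1
    return max(votes, key=votes.get)[0] if votes else None
```

Voting on offsets rather than raw hash counts is what makes the scheme robust to noise: background chatter produces scattered, inconsistent offsets, while even a short clip of the right recording produces a tall spike at one offset.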
Melody matching is an important area for researchers involved in music information retrieval. William Birmingham et al. describe a search engine called VocalSearch (vocalsearch.org) that aims to find relevant documents based on sung melodic samples, exploring several aspects of musical search: construction of search keys for the database, how to handle singer error, and how to measure similarity among different versions of the same piece of music.
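One simple way to tolerate both transposition and singer error (a stand-in for VocalSearch's more sophisticated probabilistic matching, sketched here only for illustration) is to compare pitch-interval sequences with an edit distance whose substitution cost grows with how far off the sung interval is.

```python
def intervals(pitches):
    """Successive pitch intervals in semitones; invariant to the
    key the user happens to sing in."""
    return [b - a for a, b in zip(pitches, pitches[1:])]

def melody_distance(query, target):
    """Edit distance between interval sequences. A slightly mis-sung
    interval costs less than an insertion, deletion, or wholesale change."""
    q, t = intervals(query), intervals(target)
    n, m = len(q), len(t)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = float(i)
    for j in range(1, m + 1):
        d[0][j] = float(j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = min(abs(q[i - 1] - t[j - 1]) / 2.0, 1.0)  # graded cost
            d[i][j] = min(d[i - 1][j] + 1,       # singer added a note
                          d[i][j - 1] + 1,       # singer skipped a note
                          d[i - 1][j - 1] + sub)  # note sung, maybe off-key
    return d[n][m]
```

Ranking candidates by this distance returns exact and transposed performances first, slightly off-key ones next, and unrelated melodies last, which mirrors the ranked-list behavior users expect from a sung-query engine.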
Jon Dunn et al. conclude with a look at the Variations project at Indiana University. This music information system for university music students, teachers, and researchers also functions as a framework for integrating many of the technologies explored throughout this section. It supports exploratory search by connecting visual documents (scores) with audio documents (recorded performances) and provides tools for computer-aided analysis, query-by-example, and score-following functions.
A good resource for learning more about these themes and technologies is the online proceedings of the annual International Conference on Music Information Retrieval (ISMIR) at www.ismir.net/all-papers.html. Comparative evaluations of algorithms for artist identification, genre classification, melody extraction, and beat tracking are available through the annual Music Information Retrieval Evaluation eXchange (MIREX), which takes place in conjunction with ISMIR. Descriptions of evaluation procedures and test collections for all related tasks are available at the MIREX Wiki (www.music-ir.org/mirexwiki/).
To deliver more music information retrieval applications to the public, researchers must still develop content-addressable music search so users can provide samples based on timbre and cross-modal combinations, mixing lyrics, melody, and timbre. Also needed are models of how humans perceive, recall, and reproduce these musical dimensions. Search based on timbre depends on, for example, finding ways to map measurable quantities (such as spectral tilt) to perceptual qualities (such as scratchy sounds). Being able to recognize lyrics in sung text is, perhaps, the most difficult case of speech recognition, given the noisy environment and greatly distorted speech of a typical performance.
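As a concrete example of such a measurable quantity, spectral tilt can be estimated as the least-squares slope of log magnitude against log frequency; relating that single number to a perceptual label like "bright" or "scratchy" is the open problem. (Illustrative sketch; the function name and interface are hypothetical.)

```python
import math

def spectral_tilt(magnitudes, freqs):
    """Least-squares slope of log-magnitude vs. log-frequency.
    A steeply negative slope means energy concentrated in low
    frequencies (a 'dull' sound); a shallow slope, a 'bright' one."""
    xs = [math.log(f) for f in freqs]
    ys = [math.log(m) for m in magnitudes]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den
```

A spectrum whose magnitude falls off as 1/f yields a tilt of exactly -1, while a flat spectrum yields 0; the research question is which ranges of this scale listeners actually hear as distinct timbral qualities.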
Creating systems with a deep understanding of musical structure is an area informed by research on how humans process hierarchical structures. Many music-processing tasks (such as automated transcription, instrument recognition, and score alignment) require separating musical recordings into their component sources. Source separation in a musical context is far from a solved problem, and reverberation and related acoustic effects pose yet another research challenge. The results may shed light on how humans parse complex auditory scenes, as well as provide an even richer set of tools for music listeners, researchers, and creators interacting with music documents.
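A standard reference point in source-separation research is the ideal binary mask: assign each time-frequency bin of a mixture to whichever source dominates it. The toy sketch below assumes oracle knowledge of the per-source magnitudes, which real systems do not have; estimating the mask from the mixture alone is precisely where the difficulty lies.

```python
def binary_mask(mix, src_a, src_b):
    """Ideal-binary-mask separation. Each argument is a magnitude
    spectrogram as a list of frames (lists of per-bin magnitudes).
    Every bin of the mixture goes to the source that dominates it."""
    est_a, est_b = [], []
    for mix_f, a_f, b_f in zip(mix, src_a, src_b):
        est_a.append([m if a >= b else 0.0
                      for m, a, b in zip(mix_f, a_f, b_f)])
        est_b.append([m if a < b else 0.0
                      for m, a, b in zip(mix_f, a_f, b_f)])
    return est_a, est_b
```

When the two sources occupy disjoint time-frequency regions the mask recovers them exactly; when they overlap, as simultaneous instruments in a reverberant room routinely do, binary assignment necessarily loses energy, which is one reason musical source separation remains open.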
Bryan Pardo (firstname.lastname@example.org) is an assistant professor in the Department of Electrical Engineering and Computer Science with a courtesy appointment in the School of Music at Northwestern University, Evanston, IL.
©2006 ACM 0001-0782/06/0800 $5.00