Audio generation leverages generative machine learning models (e.g., Variational Autoencoders or Generative Adversarial Networks) to create an audio waveform or a symbolic representation of audio (e.g., MIDI). This includes tasks such as music generation and text-to-speech (TTS). These generative models can be unconditioned (e.g., generating any kind of music without user input) or conditioned (e.g., generating jazz-rock played on a cello where the first eight bars are the same as Beethoven’s Fifth Symphony). Conditional audio generation has the potential to enable novel tools for composers, dialogue editors for film and podcasts, and sound designers. For further publications in this area, see our publications page.

  • Controllable Speech Generation

    Nuances in speech prosody (i.e., the pitch, timing, and loudness of speech) are a vital part of how we communicate. We develop generative machine learning models that use interpretable, disentangled representations of speech to give control over these nuances and generate speech reflecting user-specified prosody.

  • MaskMark - Robust Neural Watermarking for Real and Synthetic Speech

    High-quality speech synthesis models may be used to spread misinformation or impersonate voices. Audio watermarking can combat misuse by embedding a traceable signature in generated audio. However, existing audio watermarks typically demonstrate robustness to only a small set of transformations of the watermarked audio. To address this, we propose MaskMark, a neural network-based digital audio watermarking technique optimized for speech.

  • VampNet - Music Generation via Masked Acoustic Token Modeling

    We introduce VampNet, a masked acoustic token modeling approach to music audio generation. VampNet lets us sample coherent music from the model by applying a variety of masking approaches (called prompts) during inference. Prompting VampNet appropriately enables music compression, inpainting, outpainting, continuation, and looping with variation (vamping). This makes VampNet a powerful music co-creation tool.

  • Symbolic music generation

    Symbolic music generation uses machine learning to produce music in a symbolic form, such as the Musical Instrument Digital Interface (MIDI) format. Generating music in a symbolic format has the advantages of being both interpretable (e.g., as pitch, duration, and loudness values) and editable in standard digital audio workstations (DAWs).
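
As a minimal sketch of what "symbolic" means in the last item above, the Python snippet below represents a short phrase as note events with pitch, onset, duration, and loudness (MIDI velocity). The `Note` class and the `transpose` helper are hypothetical names chosen for illustration and are not taken from any of the lab's code; only the MIDI conventions used (note numbers 0-127, note 69 = A4 = 440 Hz in standard tuning) are standard.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Note:
    """One symbolic note event (hypothetical structure for illustration)."""
    pitch: int        # MIDI note number, 0-127 (60 = middle C)
    start: float      # onset time in beats
    duration: float   # length in beats
    velocity: int     # MIDI velocity (loudness), 1-127


def pitch_to_hz(pitch: int) -> float:
    """Convert a MIDI note number to a frequency in Hz (A4 = 69 = 440 Hz)."""
    return 440.0 * 2.0 ** ((pitch - 69) / 12)


def transpose(notes: List[Note], semitones: int) -> List[Note]:
    """Return a copy of the phrase shifted by a number of semitones.

    No range check is done here; a real tool would keep pitches in 0-127.
    """
    return [replace(n, pitch=n.pitch + semitones) for n in notes]


# A C major arpeggio: every value is directly readable and editable.
phrase = [
    Note(pitch=60, start=0.0, duration=1.0, velocity=80),
    Note(pitch=64, start=1.0, duration=1.0, velocity=80),
    Note(pitch=67, start=2.0, duration=2.0, velocity=96),
]
up_a_fifth = transpose(phrase, 7)
```

Because each note is a small record of numbers, an edit such as transposition is a one-line operation, which is much harder to do directly on an audio waveform.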


Improving audio production tools meaningfully enhances the creative output of musicians, podcasters, producers, and videographers. We focus on bridging the gap between the intentions of creators and the interfaces of the audio recording and manipulation tools they use. Our work in this area has a strong human-centered machine learning component. Representative projects in this area are below. For further publications in this area, see our publications page.

  • Deep Learning Tools for Audacity

    We provide a software framework that lets deep learning practitioners easily integrate their own PyTorch models into the open-source Audacity DAW. This lets ML audio researchers put tools in the hands of sound artists without doing DAW-specific development work.

  • Eyes Free Audio Production

    This project focuses on building novel accessible tools for creating audio-based content like music or podcasts. The tools should support the needs of blind creators, whether working independently or on teams with sighted collaborators.

  • HARP - Bringing Deep Learning to the DAW with Hosted, Asynchronous, Remote Processing

    HARP is an ARA plug-in that allows for hosted, asynchronous, remote processing of audio with deep learning models. HARP works by routing audio from a digital audio workstation (DAW) through Gradio endpoints. Because Gradio apps can be hosted locally or in the cloud (e.g., HuggingFace Spaces), HARP lets users of DAWs (e.g., Reaper) access large state-of-the-art models in the cloud without breaking their within-DAW workflow.

  • Audio production interfaces that learn from user interaction

    We use metaphors and techniques familiar to musicians to produce customizable environments for music creation, with a focus on bridging the gap between the intentions of both amateur and professional musicians and the audio manipulation tools available through software.


Neural network-based audio interfaces should be robust to various input distortions, especially in sensitive applications. We study the behavior of audio models under maliciously crafted inputs, called adversarial examples, to better understand how to secure audio interfaces against bad-faith actors and naturally occurring distortions. For further publications in this area, see our publications page.

  • Audio adversarial examples with adaptive filtering

    We demonstrate a novel audio-domain adversarial attack that modifies benign audio using an interpretable and differentiable parametric transformation: adaptive filtering. Unlike existing state-of-the-art attacks, our proposed method does not require a complex optimization procedure or generative model, relying only on a simple variant of gradient descent to tune filter parameters.

  • VoiceBlock - Privacy through Real-Time Adversarial Attacks with Audio-to-Audio Models

    As governments and corporations adopt deep learning systems to apply voice ID at scale, concerns about security and privacy naturally emerge. We propose a neural network model capable of imperceptibly modifying a user’s voice in real time to prevent speaker recognition systems from identifying their voice.


Audio source separation is the process of extracting a single sound (e.g., one violin) from a mixture of sounds (e.g., a string quartet). This is an ongoing research area in the lab. Source separation is the audio analog of scene segmentation in computer vision and is a foundational technology that improves or enables speech recognition, sound object labeling, music transcription, hearing aids, and other technologies. For further publications in this area, see our publications page.
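
To make the idea of pulling one sound out of a mixture concrete, the sketch below separates two sinusoids by masking in the frequency domain. This is a deliberately simplified toy and not a description of the lab's methods: the two sources are chosen so that they occupy different frequency bins, the cutoff is set by hand, and the naive `dft` and `idft` helpers are written out only to keep the example dependency-free. Real sources usually overlap in frequency, which is what makes the problem hard.

```python
import cmath
import math
from typing import List

N = 64  # number of samples in the toy signal


def dft(x: List[float]) -> List[complex]:
    """Naive O(N^2) discrete Fourier transform."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum: List[complex]) -> List[complex]:
    """Naive O(N^2) inverse discrete Fourier transform."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def sine(cycles: int, amplitude: float) -> List[float]:
    """A sinusoid that completes `cycles` periods over the N samples."""
    return [amplitude * math.sin(2 * math.pi * cycles * t / N) for t in range(N)]


low = sine(4, 1.0)     # source we want to extract
high = sine(12, 0.5)   # interfering source
mixture = [a + b for a, b in zip(low, high)]

# Binary mask: keep bins below the cutoff and their mirrored
# negative-frequency counterparts, zero out everything else.
cutoff = 8
spectrum = dft(mixture)
mask = [1.0 if min(k, N - k) < cutoff else 0.0 for k in range(N)]
estimate = [v.real for v in idft([m * s for m, s in zip(mask, spectrum)])]

# How far the estimate is from the true low-frequency source.
max_error = max(abs(e - s) for e, s in zip(estimate, low))
```

Here the mask is fixed by hand because the sources are known in advance; in practice the mask (or the separated signal itself) has to be estimated from the mixture alone.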

Content-addressable search through large collections of audio files (thousands) or lengthy audio files (hours) is an ongoing research area. In this work, we develop and apply cutting-edge techniques in machine learning, signal processing, and interface design. This is part of a collaboration with the University of Rochester AIR lab and is supported by the National Science Foundation. Representative recent projects in this area are below. For further publications in this area, see our publications page.

  • I-SED

    Interactive Sound Event Detector (I-SED) is a human-in-the-loop interface for sound event annotation that helps users quickly label sound events of interest within a lengthy recording. Annotation is performed collaboratively by the user and the machine.

  • Leveraging Hierarchical Structures for Few-Shot Musical Instrument Recognition

    In this work, we exploit hierarchical relationships between instruments in a few-shot learning setup to enable classification of a wider set of musical instruments, given a few examples at inference.

  • Voogle

    Voogle is an audio search engine that lets users search a database of sounds by vocally imitating or providing an example of the sound they are searching for.
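
Query-by-example search of the kind described in the last item can be sketched as a nearest-neighbor lookup over feature vectors. The snippet below is an illustrative assumption and not Voogle's implementation: the file names and three-dimensional feature vectors are made up, and the `cosine_similarity` and `search` functions are hypothetical helpers. An actual system would compute the features from audio, for example with a learned embedding model.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two feature vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(
    query: List[float], database: Dict[str, List[float]], top_k: int = 2
) -> List[Tuple[str, float]]:
    """Rank database sounds by similarity to the query, best match first."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in database.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Made-up feature vectors standing in for features extracted from audio.
database = {
    "dog_bark.wav": [0.9, 0.1, 0.0],
    "door_slam.wav": [0.1, 0.9, 0.2],
    "bird_song.wav": [0.0, 0.2, 0.9],
}

# Made-up features for a user's vocal imitation of a barking dog.
query = [0.8, 0.2, 0.1]
results = search(query, database)
```

The same ranking step works whether the query is a vocal imitation or an example recording, as long as both are mapped into the same feature space as the database sounds.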