Controllable Speech Generation

Max Morrison, Cameron Churchwell, Nathan Pruyne, Bryan Pardo


Nuances in speech prosody (i.e., the pitch, timing, and loudness of speech) are a vital part of how we communicate. We develop generative machine learning models that use interpretable, disentangled representations of speech to give control over these nuances and generate speech reflecting user-specified prosody.
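As a rough illustration of the kind of interpretable prosody representation this involves, the sketch below extracts a pitch contour and frame-wise loudness from a recording using librosa. The library choice and file name are assumptions for illustration only; this is not the lab's models or code.

```python
# Minimal sketch: an interpretable prosody representation
# (pitch contour and frame-wise loudness) extracted with librosa.
# Illustrative only; not the generative models described above.
import librosa
import numpy as np

audio, sr = librosa.load("speech.wav", sr=22050)  # hypothetical input file

# Pitch contour via probabilistic YIN (unvoiced frames come back as NaN)
f0, voiced_flag, voiced_prob = librosa.pyin(
    audio,
    fmin=librosa.note_to_hz("C2"),
    fmax=librosa.note_to_hz("C6"),
    sr=sr,
    frame_length=1024,
    hop_length=256,
)

# Frame-wise loudness as RMS energy in decibels
rms = librosa.feature.rms(y=audio, frame_length=1024, hop_length=256)[0]
loudness_db = librosa.amplitude_to_db(rms, ref=np.max)

print(f"{np.count_nonzero(voiced_flag)} voiced frames of {len(f0)}")
```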

Fine-grained prosody control

Given a speech recording and a target prosody (e.g., a pitch contour and phoneme durations), how can we transform the recording to match that target? This task is called fine-grained prosody editing, and it is useful for speech editing applications such as podcast post-production. Audio examples can be found here.
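For a concrete sense of the task, the sketch below imposes a modified pitch contour on a recording using the classical WORLD vocoder (via pyworld). This is an assumed DSP baseline for illustration; it is not the neural editing systems described in the publications below, and the input file and constant two-semitone shift are hypothetical.

```python
# Minimal sketch: editing the pitch contour of a recording with the
# WORLD vocoder (pyworld), a classical DSP baseline for pitch editing.
import numpy as np
import pyworld
import soundfile as sf

audio, sr = sf.read("speech.wav")   # hypothetical input file
audio = audio.astype(np.float64)    # pyworld expects float64 samples

# Analyze: pitch (f0), spectral envelope, and aperiodicity
f0, times = pyworld.harvest(audio, sr)
envelope = pyworld.cheaptrick(audio, f0, times, sr)
aperiodicity = pyworld.d4c(audio, f0, times, sr)

# Replace the pitch contour with a target (here, a constant +2 semitone
# shift on voiced frames; a real editor would supply a user-drawn contour)
target_f0 = np.where(f0 > 0, f0 * 2 ** (2 / 12), 0.0)

# Resynthesize the speech with the edited prosody
edited = pyworld.synthesize(target_f0, envelope, aperiodicity, sr)
sf.write("edited.wav", edited, sr)
```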

Speech prosody correction

In speech prosody correction, speech with unnatural pitch, phoneme durations, or loudness is automatically adjusted to sound more natural. Such unnaturalness can arise, for instance, when audio waveforms from the same speaker are naively copied and pasted together. Prosody correction systems have applications in film and podcast dialogue post-production. Audio examples can be found here.
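As a toy illustration of what correction means here, the sketch below smooths a pitch-contour discontinuity at a splice point by interpolation. The splice index and window width are hypothetical, and the published context-aware approach goes well beyond this kind of local smoothing.

```python
# Minimal sketch: a naive prosody "correction" at a copy-paste boundary,
# bridging the pitch contour across the splice by linear interpolation.
import numpy as np

def smooth_splice(f0, splice_frame, width=10):
    """Linearly interpolate f0 over a window around a splice point.

    f0: pitch contour in Hz (0 for unvoiced frames), one value per frame.
    splice_frame: frame index where the pasted region begins (hypothetical).
    width: number of frames on each side of the splice to re-interpolate.
    """
    f0 = f0.copy()
    left = max(splice_frame - width, 0)
    right = min(splice_frame + width, len(f0) - 1)
    if f0[left] > 0 and f0[right] > 0:  # only bridge between voiced frames
        f0[left:right + 1] = np.linspace(f0[left], f0[right], right - left + 1)
    return f0

# Example: an octave jump at frame 200 is flattened into a smooth transition
contour = np.concatenate([np.full(200, 120.0), np.full(200, 240.0)])
corrected = smooth_splice(contour, splice_frame=200)
```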

[pdf] M. Morrison, L. Rencker, Z. Jin, N. J. Bryan, J.-P. Caceres, and B. Pardo, “Context-Aware Prosody Correction for Text-Based Speech Editing,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021.

[pdf] M. Morrison, Z. Jin, N. J. Bryan, J.-P. Caceres, and B. Pardo, “Neural Pitch-Shifting and Time-Stretching with Controllable LPCNet,” Submitted to IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022.

[pdf] M. Morrison, P. Pawar, N. Pruyne, J. Cole, and B. Pardo, “Crowdsourced and Automatic Speech Prominence Estimation,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, April 14-19, 2024.

C. Churchwell, M. Morrison, and B. Pardo, “High Fidelity Neural Phonetic Posteriorgrams,” in ICASSP 2024 Workshop on Explainable AI for Speech and Audio, 2024.