
Controllable Speech Generation

Max Morrison and Bryan Pardo


Nuances in speech prosody (i.e., the pitch, timing, and loudness of speech) are a vital part of how we communicate. We use generative machine learning models both to generate prosody that users can control at this level of nuance and to generate speech that reflects a user-specified prosody.

Fine-grained prosody control

Given a speech recording and a target prosody (e.g., a pitch contour and phoneme durations), how can we transform the speech recording to have the target prosody? This task is called fine-grained prosody editing, and is useful for speech editing applications such as podcast post-production. Audio examples can be found here.
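To make the notion of a "target prosody" concrete, here is a minimal sketch of how the gap between a source and a target prosody might be expressed as per-phoneme time-stretch ratios and per-frame pitch shifts. This is not the method used in the papers listed below; the function names, the use of 0 Hz to mark unvoiced frames, and the example values are all assumptions made for illustration.

```python
import math


def stretch_ratios(source_durations, target_durations):
    """Per-phoneme time-stretch ratio (target / source); > 1 means slower.

    Durations are in seconds, one entry per phoneme (hypothetical layout).
    """
    if len(source_durations) != len(target_durations):
        raise ValueError("phoneme sequences must have the same length")
    return [t / s for s, t in zip(source_durations, target_durations)]


def pitch_shift_cents(source_hz, target_hz):
    """Per-frame pitch shift in cents (1200 cents = one octave).

    Assumes both contours are already aligned frame by frame and that
    unvoiced frames are marked with 0 Hz; those frames yield None.
    """
    shifts = []
    for s, t in zip(source_hz, target_hz):
        if s > 0 and t > 0:
            shifts.append(1200.0 * math.log2(t / s))
        else:
            shifts.append(None)
    return shifts


# Illustrative values only: three phonemes and three pitch frames.
ratios = stretch_ratios([0.08, 0.12, 0.10], [0.16, 0.12, 0.05])
shifts = pitch_shift_cents([220.0, 0.0, 110.0], [440.0, 0.0, 110.0])
```

A prosody editing system would then have to realize these stretches and shifts in the audio itself without degrading speech quality, which is the hard part of the task and is not attempted in this sketch.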

Speech prosody correction

Speech prosody correction is the task of automatically adjusting speech with unnatural pitch, phoneme durations, or loudness so that it sounds more natural. Unnatural prosody can arise, for instance, when audio waveforms of the same speaker are naively copied and pasted together. Prosody correction systems have applications in film and podcast dialogue post-production. Audio examples can be found here.

[pdf] M. Morrison, L. Rencker, Z. Jin, N. J. Bryan, J.-P. Caceres, and B. Pardo, “Context-Aware Prosody Correction for Text-Based Speech Editing,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021.

[pdf] M. Morrison, Z. Jin, N. J. Bryan, J.-P. Caceres, and B. Pardo, “Neural Pitch-Shifting and Time-Stretching with Controllable LPCNet,” Submitted to IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022.