Controllable Speech Generation

Max Morrison and Bryan Pardo

Nuances in speech prosody (i.e., the pitch, timing, and loudness of speech) are a vital part of how we communicate. We use generative machine learning models to generate prosody that users can control at this level of nuance, and to synthesize speech that reflects the user-specified prosody.

Speech prosody correction is one example: a computer adjusts speech with unnatural pitch, phoneme durations, or loudness so that it sounds more natural. Such unnatural prosody can arise, for instance, when audio waveforms of the same speaker are naively copied and pasted together. Prosody correction systems have applications in film and podcast dialogue post-production.

M. Morrison, L. Rencker, Z. Jin, N. J. Bryan, J.-P. Caceres, and B. Pardo, “Context-Aware Prosody Correction for Text-Based Speech Editing,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021.