Manuel Sam Ribeiro
I currently lead applied research on large foundation models for spoken conversational agents and speech-to-speech systems, with a focus on language expansion and multilingual voices. I owned the speech generation technical roadmap for speech-to-speech, from early exploration to product launch, delivering state-of-the-art voices in 5 languages. Earlier, as technical lead, I brought TTS voices, voice conversion systems, and ultra-lightweight on-device voices from research to production, all built on limited training data.
Experience
As a postdoctoral researcher, I conducted independent and collaborative research in speech recognition, speaker diarization, and ultrasound tongue imaging. My work focused on machine learning methods that help speech therapists diagnose and treat speech sound disorders in children. As Principal Investigator, I was awarded a Carnegie Trust research grant to study automatic speech recognition from ultrasound images of the tongue.
I improved the prosody of text-to-speech voices by modeling long-term intonation patterns.
I developed text-to-speech voices for European and Brazilian Portuguese and language models for ASR, and I led a project that delivered text normalization and inverse text normalization rules for 10 European languages.
Education
Research
Selected publications. The full list is available on Google Scholar & Amazon Science.
Interspeech 2023
Improving Grapheme-to-Phoneme Conversion by Learning Pronunciations from Speech Recordings
A method that improves G2P conversion by leveraging pronunciation information extracted directly from speech recordings, boosting performance on out-of-vocabulary and domain-specific words.
Interspeech 2023
Comparing normalizing flows and diffusion models for prosody and acoustic modelling in text-to-speech
A comparison of flow-based, diffusion-based, and L1/L2 approaches for prosody and acoustic modelling in TTS. Flow-based models achieve the best spectrogram quality, while both diffusion and flow-based prosody predictors significantly outperform standard L2-trained models.
Speech Communication 2021
Exploiting ultrasound tongue imaging for the automatic detection of speech articulation errors
A system for the automatic detection of speech sound disorders in children, combining audio and ultrasound modalities. It correctly identified 86.6% of articulation errors flagged by clinicians and has potential for integration into ultrasound-based therapy software for automated progress monitoring.
Research Grant (PI)
Silent Speech Interfaces for all - recognising speech from ultrasound images of the tongue
Carnegie Trust-funded research project investigating silent speech recognition from ultrasound tongue imaging. Produced multi-speaker corpora and models with applications for individuals with speech and communication disabilities.
ICASSP 2019
Speaker-Independent Classification of Phonetic Segments from Raw Ultrasound in Child Speech
CNN-based classification of phonetic segments directly from raw ultrasound tongue images in child speech, supporting automatic speech therapy tools.
PhD Thesis
Suprasegmental representations for the modeling of fundamental frequency in statistical parametric speech synthesis
Novel representations of fundamental frequency for natural prosody generation in statistical parametric speech synthesis. Contributions include wavelet and cosine-based f0 representations, linguistic feature exploration, and hierarchical deep neural network models for TTS.
ICASSP 2015
A multi-level representation of f0 using the continuous wavelet transform and the discrete cosine transform
A compact f0 representation combining the Continuous Wavelet Transform and Discrete Cosine Transform, capturing prosodic variation across multiple scales of the prosodic hierarchy. Improves f0 prediction over traditional short-term approaches with fewer model parameters.