Manuel Sam Ribeiro

Senior Applied Scientist
Amazon AGI · Gdańsk, Poland

I am a speech and language researcher working at the boundary between research ambition and production reality. I am currently a Senior Applied Scientist at Amazon AGI, where I build multimodal LLMs and spoken conversational systems. Previously, I was at Apple and Microsoft developing speech synthesis and speech recognition products; and at the University of Edinburgh, where I completed my PhD and held a senior postdoctoral research position.
Research Interests
Multimodal LLMs Spoken conversational AI Speech synthesis & voice conversion Speech-to-speech systems

Experience

Amazon AGI · Gdańsk, Poland 2021 — present
Senior Applied Scientist Apr 2025 — present
Applied Scientist 2021 — Apr 2025

I lead applied research on large foundation models for spoken conversational agents and speech-to-speech systems, currently focusing on language expansion and multilingual voices. I owned the speech generation technical roadmap for speech-to-speech, from early exploration to product launch, delivering state-of-the-art voices in 5 languages. Earlier, as technical lead, I brought TTS voices, voice conversion systems, and ultra-lightweight on-device voices from research to production, built on limited training data.

University of Edinburgh · School of Informatics 2017 — 2020
Senior Postdoctoral Researcher Aug 2020 — Dec 2020
Postdoctoral Researcher 2017 — Aug 2020

As a postdoctoral researcher, I conducted independent and collaborative research in speech recognition, speaker diarization, and ultrasound tongue imaging. My work focused on developing machine learning solutions to help speech therapists diagnose and treat speech sound disorders in children. I was awarded a Carnegie Trust research grant as Principal Investigator, studying automatic speech recognition from ultrasound images of the tongue.

Apple · Siri Speech · Cupertino, CA 2016
Research Engineer Intern

I improved the prosody of text-to-speech voices by modeling long-term intonation patterns.

Microsoft Language Development Center · Lisbon, Portugal 2007 — 2012
Speech Scientist / Language Expert

I developed text-to-speech voices for European and Brazilian Portuguese and language models for ASR, and led a project that delivered text normalization and inverse text normalization rules for 10 European languages.

Education

Research

Selected publications. Full list on Google Scholar & Amazon Science.

2023

Interspeech 2023

Improving Grapheme-to-Phoneme Conversion by Learning Pronunciations from Speech Recordings

A method to improve G2P conversion by leveraging pronunciation information extracted directly from speech recordings, improving performance on out-of-vocabulary and domain-specific words.

2023

Interspeech 2023

Comparing normalizing flows and diffusion models for prosody and acoustic modelling in text-to-speech

A comparison of flow-based, diffusion-based, and L1/L2 approaches for prosody and acoustic modelling in TTS. Flow-based models achieve the best spectrogram quality, while both diffusion and flow-based prosody predictors significantly outperform standard L2-trained models.

2022

ICASSP 2022

Cross-Speaker Style Transfer for Text-to-Speech Using Data Augmentation

A data augmentation framework for cross-speaker style transfer in neural TTS, enabling a system to adopt a target speaker's style without parallel style data.

2021

Speech Communication 2021

Exploiting ultrasound tongue imaging for the automatic detection of speech articulation errors

A system for automatic detection of speech sound disorders in children, combining audio and ultrasound modalities. Correctly identified 86.6% of articulation errors flagged by clinicians, with potential for integration into ultrasound-based therapy software for automated progress monitoring.

2021

SLT 2021 · Dataset

TaL: Tongue and Lips Corpus

Synchronised ultrasound tongue and lip video from 82 native English speakers. A large-scale resource for speaker-independent articulatory modelling and silent speech research.

2021

Interspeech 2021

Silent versus modal multi-speaker speech recognition from ultrasound and video

Speaker-independent models for silent and modal speech recognition from ultrasound tongue imaging and lip video, demonstrating viability of silent speech recognition at scale.

2019

Research Grant (PI)

Silent Speech Interfaces for all - recognising speech from ultrasound images of the tongue

Carnegie Trust-funded research project investigating silent speech recognition from ultrasound tongue imaging. Produced multi-speaker corpora and models with applications for individuals with speech and communication disabilities.

2019

ICASSP 2019

Speaker-Independent Classification of Phonetic Segments from Raw Ultrasound in Child Speech

CNN-based classification of phonetic segments directly from raw ultrasound tongue images in child speech, supporting automatic speech therapy tools.

2018

Interspeech 2018 · Dataset

UltraSuite Repository

Synchronised ultrasound and acoustic data from child speech therapy sessions. Three datasets including children with speech sound disorders, for automatic clinical analysis tools.

2018

Edinburgh Datashare 2018 · Dataset

Parallel Audiobook Corpus

~121 hours across 4 books and 59 speakers. Parallel readings designed for speech synthesis, voice conversion, and prosody modelling research.

2018

PhD Thesis

Suprasegmental representations for the modeling of fundamental frequency in statistical parametric speech synthesis

Novel representations of fundamental frequency for natural prosody generation in statistical parametric speech synthesis. Contributions include wavelet and cosine-based f0 representations, linguistic feature exploration, and hierarchical deep neural network models for TTS.

2015

ICASSP 2015

A multi-level representation of f0 using the continuous wavelet transform and the discrete cosine transform

A compact f0 representation combining the Continuous Wavelet Transform and Discrete Cosine Transform, capturing prosodic variation across multiple scales of the prosodic hierarchy. Improves f0 prediction over traditional short-term approaches with fewer model parameters.