Clippings - March 2026

This March 2026 edition of Clippings covers a busy month in speech and audio AI, with launches of new TTS models and evaluation frameworks alongside the usual mix of LLM engineering guides and research.


Highlights

TTS Speech Mistral
Mistral’s Voxtral TTS is an expressive multilingual text-to-speech model that generates natural-sounding speech from as little as 3 seconds of reference audio. It combines autoregressive generation of semantic speech tokens with flow matching for acoustic tokens — a hybrid design that aims to capture both the high-level structure of speech and its fine-grained acoustic detail. The model is zero-shot, requiring no fine-tuning for new speakers or languages, and the paper reports strong results across multiple languages. It’s an interesting architectural choice at a moment when the field is debating the right way to combine discrete and continuous representations for speech synthesis.
Speech Product Google
Google’s Gemini 3.1 Flash Live is described as a low-latency streaming audio model that handles natural conversation. The model is marketed for interactive voice applications — live calls, voice assistants, and real-time transcription — with a focus on reliability and naturalness rather than raw capability alone. The post is light on architectural details, but the model card suggests that this model shares its architecture with Gemini 3 Pro, a sparse mixture-of-experts (MoE) transformer-based model.

Sparse MoE models learn to route each input token to a small subset of expert sub-networks, so only a fraction of the total parameters are active per token; this decouples total model capacity from the compute and serving cost per token.
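To make the routing concrete, here is a minimal top-k MoE layer in PyTorch. It is illustrative only: generic sparse routing, not Gemini's implementation, and all sizes are arbitrary.

```python
# Minimal top-k sparse MoE layer (illustrative; not Gemini's implementation).
# Each token is routed to k of n_experts MLPs, so per-token compute scales
# with k rather than with the total parameter count.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    def __init__(self, d_model=512, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)   # learned gating network
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, 4 * d_model),
                nn.GELU(),
                nn.Linear(4 * d_model, d_model),
            )
            for _ in range(n_experts)
        )

    def forward(self, x):                      # x: (n_tokens, d_model)
        gate_logits = self.router(x)           # (n_tokens, n_experts)
        weights, expert_idx = gate_logits.topk(self.k, dim=-1)
        weights = F.softmax(weights, dim=-1)   # renormalise over the k chosen
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = expert_idx[:, slot] == e    # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

tokens = torch.randn(16, 512)
mixed = SparseMoE()(tokens)   # only 2 of 8 experts run for each token
```

Real deployments batch tokens by expert and add load-balancing losses so tokens spread evenly; the double loop here is purely for readability.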

TTS GitHub Research
VoxCPM from OpenBMB is a TTS system that eliminates the discrete speech tokenizer entirely. Instead of converting audio into a fixed vocabulary of tokens before modeling, it operates in a continuous representation space, which the authors claim enables better context sensitivity and more faithful voice cloning. The repository includes code and model weights, and a link to the technical report. It’s a notable entry in the growing debate about whether tokenization is the right inductive bias for speech generation.
TTS Voice Cloning GitHub
LuxTTS is built on top of ZipVoice, a non-autoregressive zero-shot spoken dialogue generation model based on flow matching (paper). LuxTTS distills sampling down to 4 steps with an improved sampling technique, and swaps the default 24 kHz vocoder for a custom 48 kHz one.
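As a rough picture of what few-step sampling looks like, here is a plain Euler sampler for a flow-matching model with a 4-step budget. `velocity_model` is a hypothetical stand-in for the learned velocity field, not LuxTTS's actual API.

```python
# Few-step Euler sampling of a flow-matching model (sketch; names are
# hypothetical, not LuxTTS's API). A distilled model is trained so that
# a handful of coarse steps stay close to the full ODE trajectory.
import torch

@torch.no_grad()
def sample(velocity_model, cond, shape, n_steps=4):
    x = torch.randn(shape)                    # Gaussian noise at t = 0
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = torch.full((shape[0],), i * dt)
        v = velocity_model(x, t, cond)        # predicted velocity dx/dt
        x = x + dt * v                        # one Euler step along the flow
    return x                                  # approximate sample at t = 1

dummy_velocity = lambda x, t, cond: -x        # placeholder field for the demo
latent = sample(dummy_velocity, cond=None, shape=(1, 256))
```
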
Speech Speaker Extraction ICASSP 2026
Target speaker extraction pulls a single voice out of a noisy, multi-speaker recording. AD-FlowTSE, presented at ICASSP 2026, applies deterministic flow matching to this task. The project page covers the approach and links to the associated paper. Flow matching continues to appear as one of the core techniques in speech generation.
Generative Models Maths Deep Dive
Flow matching has become one of the dominant generative modeling frameworks, underpinning a growing share of both audio and image models. This post from the Cambridge Machine Learning Group is among the clearer technical introductions available. It builds carefully from normalising flows up to the flow matching objective, with a focus on mathematical structure rather than intuition alone.
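For orientation, the objective the post builds up to can be stated compactly: under the common straight-line probability path, training reduces to regressing a velocity field onto a constant target. A minimal sketch (mine, not code from the post):

```python
# Conditional flow matching loss under the linear interpolation path
# x_t = (1 - t) * x0 + t * x1, whose target velocity is x1 - x0.
# Illustrative sketch, not code from the post.
import torch

def cfm_loss(velocity_model, x1, cond=None):
    x0 = torch.randn_like(x1)                 # noise endpoint at t = 0
    t = torch.rand(x1.shape[0], *([1] * (x1.dim() - 1)))  # per-example time
    x_t = (1 - t) * x0 + t * x1               # point on the straight-line path
    target_v = x1 - x0                        # constant velocity of that path
    pred_v = velocity_model(x_t, t.flatten(), cond)
    return (pred_v - target_v).pow(2).mean()

dummy_velocity = lambda x, t, cond: x         # placeholder network for the demo
loss = cfm_loss(dummy_velocity, torch.randn(8, 256))
```
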
LLM Quantization ICLR 2026
Google Research’s TurboQuant, presented at ICLR 2026, achieves 5× KV cache compression at 3-bit precision while maintaining 99.5% attention fidelity — a result that, if it generalises well, has direct implications for serving large models at scale. The blog post is a readable overview of the approach; for implementers there’s also an independent PyTorch reimplementation (tonbistudio/turboquant-pytorch) and a vLLM integration (mitkox/vllm-turboquant). Quantization of attention caches is an increasingly competitive research area, and TurboQuant appears to be pushing the boundary meaningfully.
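For context on what the baseline looks like, below is a generic round-to-nearest 3-bit quantizer for a KV cache. This is not TurboQuant's algorithm; it only shows the mechanics and the rough fp16-to-3-bit size arithmetic that such work improves on.

```python
# Baseline per-channel round-to-nearest KV-cache quantization (NOT
# TurboQuant's method; just the naive scheme research like this beats).
# At 3 bits vs fp16, raw storage shrinks by roughly 16/3 ≈ 5.3x.
import torch

def quantize_kv(kv, bits=3):
    # kv: (n_tokens, n_heads, head_dim); statistics per (head, dim) channel
    levels = 2 ** bits - 1
    lo = kv.amin(dim=0, keepdim=True)
    hi = kv.amax(dim=0, keepdim=True)
    scale = (hi - lo).clamp_min(1e-8) / levels
    codes = ((kv - lo) / scale).round().clamp(0, levels).to(torch.uint8)
    return codes, scale, lo

def dequantize_kv(codes, scale, lo):
    return codes.float() * scale + lo

kv = torch.randn(128, 8, 64)                     # toy cache: 128 tokens
codes, scale, lo = quantize_kv(kv)
recon_err = (dequantize_kv(codes, scale, lo) - kv).abs().mean()
```

The hard part in results like TurboQuant's is holding attention fidelity at that bit budget, which naive round-to-nearest does not.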
Agentic Guide Tooling
Simon Willison’s guide collects hard-won patterns for getting reliable results out of coding agents like Claude Code and Codex. It covers the principles behind effective agentic workflows, common failure modes, and how to think about agent autonomy in real software projects. The framing is practical rather than theoretical. It’s a strong starting point for anyone learning to get real leverage out of agentic engineering.
Agentic Engineering Opinion
Bassim Eledath maps out a progression from tab-completion to fully autonomous agent teams, arguing that the gap between AI coding capability and actual developer productivity is a skills and workflow problem rather than a model problem. Each level describes a different mode of human-AI collaboration and what it demands from the developer. This post pairs well with the Willison guide above.

Research Papers

Speech & Audio

  • VoXtream2: Full-stream TTS with Dynamic Speaking Rate Control — Full-stream TTS systems must begin speaking before the full input text is available, but existing approaches lose controllability once streaming has started. VoXtream2 introduces dynamic speaking-rate control that can be updated mid-utterance in response to incremental text input. This makes it more practical for interactive dialogue systems where latency and real-time adaptability both matter.
  • VoiceSculptor: Your Voice, Designed By You — Open-source TTS systems have lagged behind commercial ones in offering fine-grained, instruction-following control over speech attributes. VoiceSculptor bridges this gap with a unified system that accepts natural-language instructions to simultaneously control pitch, speaking rate, age, emotion, and style — without requiring structured parameter inputs. It’s notable as an open-source system that attempts to match the expressiveness typically found only in proprietary products.
  • TADA: A Generative Framework for Speech Modeling via Text-Acoustic Dual Alignment — LLM-based TTS systems typically rely on fixed-frame-rate acoustic tokenization, which produces very long token sequences — a poor fit for sequence modeling at scale. TADA introduces a dual text-acoustic alignment objective that reduces this mismatch, yielding more compact representations without sacrificing zero-shot generation quality. The approach allows LLM-based TTS to scale more efficiently while retaining the expressiveness of token-based synthesis.
  • MAEB: Massive Audio Embedding Benchmark — Audio embedding models have been evaluated inconsistently across fragmented benchmarks, making cross-model comparison unreliable. MAEB unifies evaluation across 30 tasks spanning speech, music, environmental sounds, and audio-text reasoning in 100+ languages, covering 50+ models under a common framework. A key finding is that no single model dominates across all tasks, suggesting the field still lacks a general-purpose audio embedding.
  • Audio MultiChallenge: A Multi-Turn Evaluation of Spoken Dialogue Systems on Natural Human Interaction — End-to-end spoken dialogue systems are predominantly benchmarked on synthetic speech and single-turn exchanges, which don’t reflect real conversational conditions. Audio MultiChallenge introduces a multi-turn evaluation suite using natural human speech, covering disfluencies, interruptions, and contextual dependencies that arise in practice. This exposes systematic weaknesses in current E2E systems that single-turn synthetic benchmarks consistently miss.
  • Soft Clustering Anchors for Self-Supervised Speech Representation Learning in JEPA — Joint Embedding Predictive Architectures (JEPA) are promising for self-supervised speech learning but prone to representation collapse without explicit grounding. GMM-Anchored JEPA addresses this by fitting a Gaussian Mixture Model on log-mel spectrograms once and using the resulting soft cluster assignments as stable anchor targets during training. This prevents collapse without requiring a momentum encoder or a discrete speech tokenizer (a minimal sketch of the anchoring step follows after this list).
  • The Design Space of Tri-Modal Masked Diffusion Models — Most multimodal diffusion models are bimodal and initialised from a pretrained unimodal base, limiting their joint modelling capacity. This paper introduces the first tri-modal masked diffusion model — covering text, image, and audio — pretrained from scratch rather than adapted from a single-modality checkpoint. It systematically explores the architecture and training choices specific to tri-modal pretraining, establishing a new design baseline for joint discrete diffusion.
  • LongCat-AudioDiT — Audio diffusion transformer from Meituan targeting long-context audio generation (paper PDF on GitHub; limited metadata available).
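
To make the GMM-anchoring idea above concrete, here is a minimal sketch: fit a Gaussian Mixture once on pooled log-mel frames, then read off soft posteriors as fixed targets. All names and sizes are illustrative, not the paper's implementation.

```python
# Sketch of soft-cluster anchors for JEPA-style training (illustrative;
# not the paper's code). The GMM is fit once, offline, on log-mel frames;
# its posteriors then serve as stable targets that resist collapse.
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_anchor_gmm(log_mel_frames, n_components=64):
    # log_mel_frames: (n_frames, n_mels) pooled across the training corpus
    gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
    gmm.fit(log_mel_frames)
    return gmm

def anchor_targets(gmm, log_mel_frames):
    # Soft assignments (responsibilities) as regression targets for the
    # JEPA predictor, instead of a momentum encoder's outputs.
    return gmm.predict_proba(log_mel_frames)      # (n_frames, n_components)

frames = np.random.randn(10_000, 80)              # stand-in log-mel features
anchor_gmm = fit_anchor_gmm(frames)
targets = anchor_targets(anchor_gmm, frames[:100])
```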

LLM Architecture & Training

Other


Also Worth a Look

