Clippings - March 2026

This March 2026 edition of Clippings covers a busy month in speech and audio AI, with launches of new TTS models and evaluation frameworks alongside the usual mix of LLM engineering guides and research.


Highlights

TTS Speech Mistral
Mistral’s Voxtral TTS is an expressive multilingual text-to-speech model that generates natural-sounding speech from as little as 3 seconds of reference audio. It combines autoregressive generation of semantic speech tokens with flow matching for acoustic tokens — a hybrid design that aims to capture both the high-level structure of speech and its fine-grained acoustic detail. The model is zero-shot, requiring no fine-tuning for new speakers or languages, and the paper reports strong results across multiple languages. It’s an interesting architectural choice at a moment when the field is debating the right way to combine discrete and continuous representations for speech synthesis.
Speech Product Google
Google’s Gemini 3.1 Flash Live is described as a low-latency streaming audio model that handles natural conversation. The model is marketed for interactive voice applications — live calls, voice assistants, and real-time transcription — with a focus on reliability and naturalness rather than raw capability alone. The post is light on architectural details, but the model card suggests that this model shares its architecture with Gemini 3 Pro, a sparse mixture-of-experts (MoE) transformer-based model.

Sparse MoE models learn to route each input token to a small subset of expert sub-networks, so only a fraction of the total parameters are active per token; this decouples total model capacity from the compute and serving cost per token.
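To make the routing concrete, here is a minimal top-k MoE layer in PyTorch. It is illustrative only: generic sparse routing, not Gemini's implementation, and all sizes are arbitrary.

```python
# Minimal top-k sparse MoE layer (illustrative; not Gemini's implementation).
# Each token is routed to k of n_experts MLPs, so per-token compute scales
# with k rather than with the total parameter count.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    def __init__(self, d_model=512, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)   # learned gating network
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, 4 * d_model),
                nn.GELU(),
                nn.Linear(4 * d_model, d_model),
            )
            for _ in range(n_experts)
        )

    def forward(self, x):                      # x: (n_tokens, d_model)
        gate_logits = self.router(x)           # (n_tokens, n_experts)
        weights, expert_idx = gate_logits.topk(self.k, dim=-1)
        weights = F.softmax(weights, dim=-1)   # renormalise over the k chosen
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = expert_idx[:, slot] == e    # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

tokens = torch.randn(16, 512)
mixed = SparseMoE()(tokens)   # only 2 of 8 experts run for each token
```

Real deployments batch tokens by expert and add load-balancing losses so tokens spread evenly; the double loop here is purely for readability.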

TTS GitHub Research
VoxCPM from OpenBMB is a TTS system that eliminates the discrete speech tokenizer entirely. Instead of converting audio into a fixed vocabulary of tokens before modeling, it operates in a continuous representation space, which the authors claim enables better context sensitivity and more faithful voice cloning. The repository includes code and model weights, and a link to the technical report. It’s a notable entry in the growing debate about whether tokenization is the right inductive bias for speech generation.
TTS Voice Cloning GitHub
LuxTTS is built on top of ZipVoice, a non-autoregressive zero-shot spoken dialogue generation model based on flow matching (paper). LuxTTS distills sampling down to 4 steps with an improved sampling technique, and swaps the default 24 kHz vocoder for a custom 48 kHz one.
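As a rough picture of what few-step sampling looks like, here is a plain Euler sampler for a flow-matching model with a 4-step budget. `velocity_model` is a hypothetical stand-in for the learned velocity field, not LuxTTS's actual API.

```python
# Few-step Euler sampling of a flow-matching model (sketch; names are
# hypothetical, not LuxTTS's API). A distilled model is trained so that
# a handful of coarse steps stay close to the full ODE trajectory.
import torch

@torch.no_grad()
def sample(velocity_model, cond, shape, n_steps=4):
    x = torch.randn(shape)                    # Gaussian noise at t = 0
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = torch.full((shape[0],), i * dt)
        v = velocity_model(x, t, cond)        # predicted velocity dx/dt
        x = x + dt * v                        # one Euler step along the flow
    return x                                  # approximate sample at t = 1

dummy_velocity = lambda x, t, cond: -x        # placeholder field for the demo
latent = sample(dummy_velocity, cond=None, shape=(1, 256))
```
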
Speech Speaker Extraction ICASSP 2026
Target speaker extraction pulls a single voice out of a noisy, multi-speaker recording. AD-FlowTSE, presented at ICASSP 2026, applies deterministic flow matching to this task. The project page covers the approach and links to the associated paper. Flow matching continues to appear as one of the core techniques in speech generation.
Generative Models Maths Deep Dive
Flow matching has become one of the dominant generative modeling frameworks, underpinning a growing share of both audio and image models. This post from the Cambridge Machine Learning Group is among the clearer technical introductions available. It builds carefully from normalising flows up to the flow matching objective, with a focus on mathematical structure rather than intuition alone.
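For orientation, the objective the post builds up to can be stated compactly: under the common straight-line probability path, training reduces to regressing a velocity field onto a constant target. A minimal sketch (mine, not code from the post):

```python
# Conditional flow matching loss under the linear interpolation path
# x_t = (1 - t) * x0 + t * x1, whose target velocity is x1 - x0.
# Illustrative sketch, not code from the post.
import torch

def cfm_loss(velocity_model, x1, cond=None):
    x0 = torch.randn_like(x1)                 # noise endpoint at t = 0
    t = torch.rand(x1.shape[0], *([1] * (x1.dim() - 1)))  # per-example time
    x_t = (1 - t) * x0 + t * x1               # point on the straight-line path
    target_v = x1 - x0                        # constant velocity of that path
    pred_v = velocity_model(x_t, t.flatten(), cond)
    return (pred_v - target_v).pow(2).mean()

dummy_velocity = lambda x, t, cond: x         # placeholder network for the demo
loss = cfm_loss(dummy_velocity, torch.randn(8, 256))
```
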
LLM Quantization ICLR 2026
Google Research’s TurboQuant, presented at ICLR 2026, achieves 5× KV cache compression at 3-bit precision while maintaining 99.5% attention fidelity — a result that, if it generalises well, has direct implications for serving large models at scale. The blog post is a readable overview of the approach; for implementers there’s also an independent PyTorch reimplementation (tonbistudio/turboquant-pytorch) and a vLLM integration (mitkox/vllm-turboquant). Quantization of attention caches is an increasingly competitive research area, and TurboQuant appears to be pushing the boundary meaningfully.
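For context on what the baseline looks like, below is a generic round-to-nearest 3-bit quantizer for a KV cache. This is not TurboQuant's algorithm; it only shows the mechanics and the rough fp16-to-3-bit size arithmetic that such work improves on.

```python
# Baseline per-channel round-to-nearest KV-cache quantization (NOT
# TurboQuant's method; just the naive scheme research like this beats).
# At 3 bits vs fp16, raw storage shrinks by roughly 16/3 ≈ 5.3x.
import torch

def quantize_kv(kv, bits=3):
    # kv: (n_tokens, n_heads, head_dim); statistics per (head, dim) channel
    levels = 2 ** bits - 1
    lo = kv.amin(dim=0, keepdim=True)
    hi = kv.amax(dim=0, keepdim=True)
    scale = (hi - lo).clamp_min(1e-8) / levels
    codes = ((kv - lo) / scale).round().clamp(0, levels).to(torch.uint8)
    return codes, scale, lo

def dequantize_kv(codes, scale, lo):
    return codes.float() * scale + lo

kv = torch.randn(128, 8, 64)                     # toy cache: 128 tokens
codes, scale, lo = quantize_kv(kv)
recon_err = (dequantize_kv(codes, scale, lo) - kv).abs().mean()
```

The hard part in results like TurboQuant's is holding attention fidelity at that bit budget, which naive round-to-nearest does not.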
Agentic Guide Tooling
Simon Willison’s guide collects hard-won patterns for getting reliable results out of coding agents like Claude Code and Codex. It covers the principles behind effective agentic workflows, common failure modes, and how to think about agent autonomy in real software projects. The framing is practical rather than theoretical. It’s a strong starting point for anyone learning to get real leverage out of agentic engineering.
Agentic Engineering Opinion
Bassim Eledath maps out a progression from tab-completion to fully autonomous agent teams, arguing that the gap between AI coding capability and actual developer productivity is a skills and workflow problem rather than a model problem. Each level describes a different mode of human-AI collaboration and what it demands from the developer. This post pairs well with the Willison guide above.

Research Papers

Speech & Audio

  • VoXtream2: Full-stream TTS with Dynamic Speaking Rate Control — Full-stream TTS systems must begin speaking before the full input text is available, but existing approaches lose controllability once streaming has started. VoXtream2 introduces dynamic speaking-rate control that can be updated mid-utterance in response to incremental text input. This makes it more practical for interactive dialogue systems where latency and real-time adaptability both matter.
  • VoiceSculptor: Your Voice, Designed By You — Open-source TTS systems have lagged behind commercial ones in offering fine-grained, instruction-following control over speech attributes. VoiceSculptor bridges this gap with a unified system that accepts natural-language instructions to simultaneously control pitch, speaking rate, age, emotion, and style — without requiring structured parameter inputs. It’s notable as an open-source system that attempts to match the expressiveness typically found only in proprietary products.
  • TADA: A Generative Framework for Speech Modeling via Text-Acoustic Dual Alignment — LLM-based TTS systems typically rely on fixed-frame-rate acoustic tokenization, which produces very long token sequences — a poor fit for sequence modeling at scale. TADA introduces a dual text-acoustic alignment objective that reduces this mismatch, yielding more compact representations without sacrificing zero-shot generation quality. The approach allows LLM-based TTS to scale more efficiently while retaining the expressiveness of token-based synthesis.
  • MAEB: Massive Audio Embedding Benchmark — Audio embedding models have been evaluated inconsistently across fragmented benchmarks, making cross-model comparison unreliable. MAEB unifies evaluation across 30 tasks spanning speech, music, environmental sounds, and audio-text reasoning in 100+ languages, covering 50+ models under a common framework. A key finding is that no single model dominates across all tasks, suggesting the field still lacks a general-purpose audio embedding.
  • Audio MultiChallenge: A Multi-Turn Evaluation of Spoken Dialogue Systems on Natural Human Interaction — End-to-end spoken dialogue systems are predominantly benchmarked on synthetic speech and single-turn exchanges, which don’t reflect real conversational conditions. Audio MultiChallenge introduces a multi-turn evaluation suite using natural human speech, covering disfluencies, interruptions, and contextual dependencies that arise in practice. This exposes systematic weaknesses in current E2E systems that single-turn synthetic benchmarks consistently miss.
  • Soft Clustering Anchors for Self-Supervised Speech Representation Learning in JEPA — Joint Embedding Predictive Architectures (JEPA) are promising for self-supervised speech learning but prone to representation collapse without explicit grounding. GMM-Anchored JEPA addresses this by fitting a Gaussian Mixture Model on log-mel spectrograms once and using the resulting soft cluster assignments as stable anchor targets during training. This prevents collapse without requiring a momentum encoder or a discrete speech tokenizer (a minimal sketch of the anchoring step follows after this list).
  • The Design Space of Tri-Modal Masked Diffusion Models — Most multimodal diffusion models are bimodal and initialised from a pretrained unimodal base, limiting their joint modelling capacity. This paper introduces the first tri-modal masked diffusion model — covering text, image, and audio — pretrained from scratch rather than adapted from a single-modality checkpoint. It systematically explores the architecture and training choices specific to tri-modal pretraining, establishing a new design baseline for joint discrete diffusion.
  • LongCat-AudioDiT — Audio diffusion transformer from Meituan targeting long-context audio generation (paper PDF on GitHub; limited metadata available).
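
To make the GMM-anchoring idea above concrete, here is a minimal sketch: fit a Gaussian Mixture once on pooled log-mel frames, then read off soft posteriors as fixed targets. All names and sizes are illustrative, not the paper's implementation.

```python
# Sketch of soft-cluster anchors for JEPA-style training (illustrative;
# not the paper's code). The GMM is fit once, offline, on log-mel frames;
# its posteriors then serve as stable targets that resist collapse.
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_anchor_gmm(log_mel_frames, n_components=64):
    # log_mel_frames: (n_frames, n_mels) pooled across the training corpus
    gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
    gmm.fit(log_mel_frames)
    return gmm

def anchor_targets(gmm, log_mel_frames):
    # Soft assignments (responsibilities) as regression targets for the
    # JEPA predictor, instead of a momentum encoder's outputs.
    return gmm.predict_proba(log_mel_frames)      # (n_frames, n_components)

frames = np.random.randn(10_000, 80)              # stand-in log-mel features
anchor_gmm = fit_anchor_gmm(frames)
targets = anchor_targets(anchor_gmm, frames[:100])
```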

LLM Architecture & Training

Other


Also Worth a Look

