Episode Details

Mistral: Voxtral TTS, Forge, Leanstral, & what's next for Mistral 4 — w/ Pavan Kumar Reddy & Guillaume Lample

Published 8 hours ago

Description

Mistral has been on an absolute tear. With frequent successful model launches, it is easy to forget that they raised the largest European AI round in history last year. We were long overdue for a Mistral episode, and we were very fortunate to work with Sophia and Howard to catch up with Pavan (Voxtral lead) and Guillaume (Chief Scientist, Co-founder) on the occasion of this week’s Voxtral TTS launch.

Mistral can’t say it directly, but the benchmarks imply that this is basically an open-weights, ElevenLabs-level TTS model. (Technically, it is a 4B, Ministral-based, multilingual, low-latency, open-weights TTS model with a 68.4% win rate vs ElevenLabs Flash v2.5.) The contributions are not just in the open weights but also in open research: we spend a decent amount of the pod talking about their architecture, which combines auto-regressive generation of semantic speech tokens with flow matching for acoustic tokens (a technique typically applied only in image generation, as seen in the Flow Matching NeurIPS workshop from the principal authors that we reference in the pod).
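For readers new to flow matching, here is a minimal toy sketch of the sampling side of the idea: start from noise and integrate a velocity field from t=0 to t=1 with Euler steps. It is not Mistral's implementation. The names `velocity` and `sample` are hypothetical, and the closed-form velocity below is an assumption that stands in for a learned network: it is the exact field for a straight-line path whose target is a single known point, so the sketch runs without any training.

```python
import random


def velocity(x, t, target):
    # Toy stand-in for a learned velocity network (assumption, not Voxtral's model).
    # For the straight-line path x_t = (1 - t) * x0 + t * x1 with a single known
    # target x1, the exact velocity at (x, t) is (x1 - x) / (1 - t).
    return [(m - xi) / (1.0 - t) for xi, m in zip(x, target)]


def sample(target, steps=8, seed=0):
    # Draw the starting point from standard Gaussian noise.
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]
    dt = 1.0 / steps
    # Euler integration of dx/dt = velocity(x, t) from t = 0 up to t = 1.
    for k in range(steps):
        t = k * dt  # t stays strictly below 1, so the division above is safe
        v = velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


if __name__ == "__main__":
    # A made-up 3-dimensional "acoustic" target, purely for illustration.
    print(sample([0.5, -1.0, 2.0]))
```

In a real system the velocity would come from a trained network conditioned on context (here, presumably the semantic tokens), and the number of Euler steps trades quality against latency.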

You can catch up on the paper here and the full episode is live on YouTube!

Timestamps

00:00 Welcome and Guests
00:22 Announcing Voxtral TTS
01:41 Architecture and Codec
02:53 Understanding vs Generation
05:39 Flow Matching for Audio
07:27 Real Time Voice Agents
13:40 Efficiency and Model Strategy
14:53 Voice Agents Vision
17:56 Enterprise Deployment and Privacy
23:39 Fine Tuning and Personalization
25:22 Enterprise Voice Personalization
26:09 Long-Form Speech Models
26:58 Real-Time Encoder Advances
27:45 Scaling Context for TTS
28:53 What Makes Small Models
30:37 Merging Modalities Tradeoffs
33:05 Open Source Mission
35:51 Lean and Formal Proofs
38:40 Reasoning Transfer and Agents
40:25 Next Frontiers in Training
42:20 Hiring and AI for Science
44:19 Forward Deployed Engineering
46:22 Customer Feedback Loop
48:29 Wrap Up and Thanks

Transcript

swyx: Okay, welcome to Latent Space. We’re here in the studio with our guest co-host Vibhu. Welcome. Thanks. Excited for this one as well as Guillaume and Pavan from Mistral. Welcome. Excited to be here.

Guillaume: Thank you.

swyx: Pavan, you are leading audio research at Mistral, and Guillaume, you’re chief scientist.

Announcing Voxtral TTS

swyx: What are we announcing today? We’re coordinating this release with you guys.

Guillaume: Yeah, so we are releasing Voxtral TTS. So it’s our first audio model that generates speech. It’s not our first audio model. We had a couple of releases before. We had one in the summer that was Voxtral, our first audio model, but it was a transcription model, ASR. A few months later we released an update on top of this, supporting more languages.

Also a lot of table-stakes features for our customers: context biasing, diarization, timestamping in the transcription. We also have a real-time model that can transcribe not just at the end; you don’t need to feed your entire audio file, it can also come in [00:01:00] real time. And here, this is the natural extension in the audio.

So basically speech generation. So yeah. So we support nine languages. And this is a pretty small mode
