Gemini 3.1 Flash TTS: Google’s new audio model gives you actual control over AI speech

Google just dropped Gemini 3.1 Flash TTS, its latest text-to-speech model, and honestly, it’s the first time I’ve been genuinely interested in AI voice generation in a while. Not because it sounds better—though it does—but because they finally added something I’ve been wanting for years: granular control.

Let me explain.

Audio tags: the feature that actually matters

The headline here is what Google calls “audio tags.” You embed natural language commands directly into the text input, and the model adjusts vocal style, pace, and delivery accordingly. Want the AI to sound excited? Write “[excited]” before the line. Need it to slow down for dramatic effect? Same deal. It’s simple, but it works.

Here’s a real example from the announcement:

Input: “[calm] The experiment was a success. [excited] We broke every record!”

Output: The first sentence is delivered in a measured, even tone. The second one actually sounds like someone who just won something. No robotic monotone. No weird pitch jumps. Just… natural.
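If you’re generating scripts programmatically, the tagged input above is easy to assemble. Here’s a minimal sketch, assuming only the bracketed tag syntax from the example (the `tag_script` helper is my own, not part of any SDK):

```python
def tag_script(segments):
    """Join (tag, text) pairs into a single tagged input string."""
    return " ".join(f"[{tag}] {text}" for tag, text in segments)

script = tag_script([
    ("calm", "The experiment was a success."),
    ("excited", "We broke every record!"),
])
# script == "[calm] The experiment was a success. [excited] We broke every record!"
```

Keeping tags and text as structured pairs like this also makes it trivial to swap delivery styles later without rewriting the script.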

This is a big step forward. Most TTS models force you to accept whatever default delivery they give you. You can maybe tweak speed or pitch, but that’s about it. Google is now letting you control the emotional arc of the speech. That’s not just incremental improvement—that’s a new capability.

Quality and benchmarks

On the Artificial Analysis TTS leaderboard, which uses thousands of blind human preferences, 3.1 Flash TTS scored an Elo of 1,211. For context, that’s competitive with the best models out there. Artificial Analysis also placed it in their “most attractive quadrant” for balancing high-quality speech generation with low cost.

I’ve played around with it a bit (yes, I got early access), and the quality is noticeably better than the previous Flash model. Less robotic, fewer artifacts, better prosody. It’s not perfect—sometimes the emphasis lands on the wrong word—but it’s close enough that I’d use it for production work.

Multi-speaker and multilingual

The model supports native multi-speaker dialogue, which means you can generate conversations with different voices without stitching separate clips together. That’s useful for audiobooks, podcasts, or any application where multiple characters need distinct voices.
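For multi-speaker input, a labeled transcript is the natural representation. A quick sketch, assuming a “speaker name, colon, line” convention—the helper and the speaker names here are illustrative, not an official format:

```python
def dialogue_transcript(turns):
    """Format (speaker, line) pairs as a labeled transcript, one turn per line."""
    return "\n".join(f"{speaker}: {line}" for speaker, line in turns)

transcript = dialogue_transcript([
    ("Narrator", "[calm] The lab fell silent."),
    ("Dr. Chen", "[excited] It worked!"),
])
```

Note that audio tags compose with speaker labels, so each character can carry their own delivery instructions within a single generation request.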

Language support covers 70+ languages, which is about what you’d expect from Google at this point. They’ve been building multilingual capabilities for years, and it shows.

Availability and pricing

Gemini 3.1 Flash TTS is rolling out in preview across:

  • Gemini API and Google AI Studio (for developers)
  • Vertex AI (for enterprises)
  • Google Vids (for Workspace users)
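For the developer path, here’s a hedged sketch of what a `generateContent` request body might look like. The model id `gemini-3.1-flash-tts` and the voice name are my assumptions (neither is confirmed in the announcement); the `responseModalities`/`speechConfig` shape follows the Gemini API’s existing TTS request format:

```python
import json

def build_tts_request(text: str, voice: str = "Kore") -> dict:
    """Build a hypothetical Gemini API request body for audio output."""
    return {
        "contents": [{"parts": [{"text": text}]}],
        "generationConfig": {
            "responseModalities": ["AUDIO"],
            "speechConfig": {
                "voiceConfig": {"prebuiltVoiceConfig": {"voiceName": voice}}
            },
        },
    }

body = build_tts_request("[calm] The experiment was a success.")
print(json.dumps(body, indent=2))
```

Check the official Gemini API docs for the actual model id and parameter names once the preview documentation lands.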

Pricing details weren’t shared in the announcement, but given that Google positions this as “low cost” alongside high quality, I’d expect it to be competitive with other API-based TTS services like ElevenLabs or Amazon Polly.

The SynthID watermark

All audio generated with 3.1 Flash TTS is watermarked with SynthID, Google’s AI-generated content identification tool. This is important. Deepfake audio is a real problem, and having a reliable watermark that survives compression, re-encoding, and other transformations is a win for everyone.

SynthID isn’t perfect—no watermarking system is—but it’s better than nothing, and it’s good to see Google baking it into the model by default rather than as an optional add-on.

What I’m still waiting for

Audio tags are great, but I’d love to see:

  • Voice cloning: Being able to generate custom voices from a short sample would make this much more useful for branding and personalization.
  • Real-time streaming: The model is fast, but I want to see latency numbers for live applications like voice assistants or real-time translation.
  • More granular control: Audio tags let you control style and pace, but what about emotion, accent, or age? Those would be killer features.

Bottom line

Gemini 3.1 Flash TTS is a solid step forward for AI speech generation. The audio tags feature is genuinely useful and something I haven’t seen done this well before. The quality is competitive, the multilingual support is broad, and the SynthID watermarking shows Google is thinking about responsible deployment.

If you’re building anything that needs expressive, controllable AI voices, this is worth trying. Head over to Google AI Studio and give it a spin. I think you’ll be pleasantly surprised.
