Mistral voice experiments (calm vs multi-tag)
Mistral Voxtral intro tests: matter-of-fact and calm single-tag passes vs multi-tag, with Eleven ref clips archived on the post.
2026-05-17 · AI, Dev, Video · by The Silicon Based Life Form
this is how i generate voice for channel intros: mistral voxtral tts, a set of elevenlabs reference clips, and a small node script. the output is usable for drafts. it is not broadcast quality. listen before you assume otherwise.
why mistral at all
main reason: cost. narrator saas typically runs $20–30/month whether you publish or not. mistral voxtral bills per api call once you are on a paid tier, not per seat.
i am on mistral's experiment plan: no credit card, full api access for prototyping. text, vision, and voxtral tts are available while you test. this post cost $0 in mistral fees. elevenlabs ref clips were a separate one-time cost.
the experiment tier is for prototyping, not production traffic. limits:
- rate limits: about 1 request per second, ~30 per minute across the platform.
- volume: up to ~1 billion text tokens per month (audio counts toward usage differently, and is tighter on this tier).
- purpose: testing and prototyping only. a public app will hit rate-limit errors quickly. our tagged pipeline uses one api call per script line, so batch runs add up.
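to get a feel for how long a batch takes under those limits, here is a minimal sketch that computes the earliest start time for each call. it assumes the numbers quoted above (1 request per second, 30 per minute) and treats the per-minute cap as a sliding window, which is my assumption and not something mistral documents in this post. `scheduleCalls` is a hypothetical helper, not part of the pipeline.

```typescript
// Earliest start offsets (in ms) for a batch of one-call-per-line requests.
// Assumptions: at least `minSpacingMs` between calls, and at most
// `maxPerWindow` calls inside any sliding window of `windowMs`.
function scheduleCalls(
  lineCount: number,
  minSpacingMs: number = 1000,
  maxPerWindow: number = 30,
  windowMs: number = 60000
): number[] {
  const starts: number[] = [];
  let previous = -minSpacingMs; // so the first call starts at 0
  for (let i = 0; i < lineCount; i++) {
    let start = previous + minSpacingMs;
    // The call `maxPerWindow` positions back must be a full window old.
    const oldest = starts[i - maxPerWindow]; // undefined until the window fills
    if (oldest !== undefined) {
      start = Math.max(start, oldest + windowMs);
    }
    starts.push(start);
    previous = start;
  }
  return starts;
}
```

with these defaults the 31st call cannot start before the 60 second mark, and a 60-line script cannot start its last call before 89 seconds. real runs take longer because each call also has to finish.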
for higher throughput, add a payment method on the mistral console billing page and move to a paid commercial tier.
elevenlabs was used once to generate short mood clips (neutral, calm, excited, sigh, etc.). copies for this post are in public/blog/mistral-tagged-tts/elevenlabs-refs/ (12 clips + manifest.json) so they survive if the video repo paths move.
single-tag tests (full intro, one api call)
same script text for each test: one tag, one line, one mistral call. no ref changes mid-read. punctuation carries the pacing. the scripts below omit exclamation marks — with them, mistral tended to overshoot energy and the takes were less representative.
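if you would rather strip the exclamation marks automatically than by hand, something like this works. it is a sketch of one way to do it, not how the scripts in this post were prepared. `softenPunctuation` is a hypothetical name.

```typescript
// Flatten emphatic punctuation before sending text to the TTS call.
function softenPunctuation(text: string): string {
  return text
    // Any run of ! and ? that contains a question mark stays a question.
    .replace(/[!?]*\?[!?]*/g, "?")
    // Remaining exclamation runs become a plain full stop.
    .replace(/!+/g, ".");
}
```

for example, `softenPunctuation("Hi Guys!! How are you!?")` returns `"Hi Guys. How are you?"`.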
test 1 — matter-of-fact
default lane in our manifest. the eleven ref is a short neutral read (el-ref-01-neutral.mp3):
elevenlabs ref — matter-of-fact (source)
mistral output from that ref. this is the closest to "one voice, one take" we got — still tts, still not a mic, but it tracks the ref better than the calm pass below:
mistral — matter-of-fact tag, full intro
[matter-of-fact] Hi Guys. How are you? I'm Oliver. This is my channel. I talk about AI, automation, stuff like that. Right. I hope you find something useful here. Oh, by the way, this is not my real voice. I mean it is my real voice. It's processed through text-to-speech. Make sense? More about me. I love coffee, and I have a cat. OK? Cool. Glad you're here.
test 2 — calm
same pipeline, [calm] ref instead. the source clip sounds calm:
elevenlabs ref — calm (source)
mistral then renders the full intro from that ref. same tag, same pipeline — but the output does not read as calm. too much coffee, maybe. or the model treating a long paragraph as neutral-forward. judge for yourself:
mistral — calm tag, full intro
[calm] Hi Guys. How are you? I'm Oliver. This is my channel. I talk about AI, automation, stuff like that. Right. I hope you find something useful here. Oh, by the way, this is not my real voice. I mean it is my real voice. It's processed through text-to-speech. Make sense? More about me. I love coffee, and I have a cat. OK? Cool. Glad you're here.
test 3 — multiple refs, one tag per line
each line starts with a tag — [bored], [excited], [nervous] — which selects an eleven ref clip from manifest.json. mistral does not parse tags as directions; they only route reference audio. one api call per line. ffmpeg concatenates segments with ~180ms between them.
intent: more expression per beat. result: timbre shifts that often read as a different speaker, not a mood change. some segments are fine. others do not match the line before or after.
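the routing step described above can be sketched like this. the parser splits the script at every `[tag]` and the lookup maps a tag to a ref clip. the manifest shape (a flat tag to filename map) and the names `parseTaggedScript`, `refFor` and `exampleManifest` are assumptions for illustration. the real manifest.json may look different.

```typescript
interface Segment {
  tag: string;  // which ref clip to use
  text: string; // what gets sent to the tts call
}

type Manifest = { [tag: string]: string };

// Split a script into one segment per tag. Text before the first tag
// gets `defaultTag`. Tags followed by no text are skipped.
function parseTaggedScript(script: string, defaultTag: string): Segment[] {
  const segments: Segment[] = [];
  const tagPattern = /\[([a-z][a-z-]*)\]/g;
  let currentTag = defaultTag;
  let cursor = 0;
  let match: RegExpExecArray | null = tagPattern.exec(script);
  while (match !== null) {
    const text = script.slice(cursor, match.index).trim();
    if (text.length > 0) {
      segments.push({ tag: currentTag, text: text });
    }
    currentTag = match[1] || defaultTag;
    cursor = tagPattern.lastIndex;
    match = tagPattern.exec(script);
  }
  const tail = script.slice(cursor).trim();
  if (tail.length > 0) {
    segments.push({ tag: currentTag, text: tail });
  }
  return segments;
}

// Look up the ref clip for a tag, falling back when the tag is unknown.
function refFor(tag: string, manifest: Manifest, fallbackRef: string): string {
  const ref = manifest[tag];
  return ref !== undefined ? ref : fallbackRef;
}

// Hypothetical manifest. Only the matter-of-fact filename appears in this
// post; the calm filename is a placeholder.
const exampleManifest: Manifest = {
  "matter-of-fact": "el-ref-01-neutral.mp3",
  "calm": "calm-ref.mp3",
};
```

each returned segment is one api call, so `parseTaggedScript(script, "matter-of-fact").length` is also the call count for a run.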
multi-tag
[bored] Hi Guys!! [excited] How are you!? [calm] I'm Oliver. This is my channel. I talk about AI, automation. stuff like that. [sighs] Right. [excited] I hope you find something useful here. [nervous] Oh, by the way, this is not my real voice. [relieved] I mean it is my real voice. [chuckles] But it's processed through text-to-speech. [gentle] Make sense? [nervous] More about me. I love coffee, and [excited] I have a cat. OK? Cool. Glad you're here!
later pass: matter-of-fact and calm only, longer lines, fewer tag changes at beat points. improved consistency. still not studio narration.
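a rough sketch of that "fewer tags, longer lines" idea in code: collapse every tag outside an allowed set to a fallback, then merge neighbours that end up on the same tag. the later pass in this post was not necessarily produced this way, and sending every other tag to one fallback is my assumption. `collapseTags` is a hypothetical helper.

```typescript
interface Segment {
  tag: string;
  text: string;
}

// Map tags outside `allowed` to `fallbackTag`, then join consecutive
// segments that share a tag so each run becomes one longer line.
function collapseTags(
  segments: Segment[],
  allowed: string[],
  fallbackTag: string
): Segment[] {
  const merged: Segment[] = [];
  for (const segment of segments) {
    const tag = allowed.indexOf(segment.tag) >= 0 ? segment.tag : fallbackTag;
    const last = merged.length > 0 ? merged[merged.length - 1] : undefined;
    if (last !== undefined && last.tag === tag) {
      // Same ref as the previous line: extend it instead of adding a call.
      last.text = last.text + " " + segment.text;
    } else {
      // New objects only, so the input segments are never mutated.
      merged.push({ tag: tag, text: segment.text });
    }
  }
  return merged;
}
```

fewer segments also means fewer api calls and fewer ref swaps, which is where the timbre jumps came from.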
how the pipeline works
- write a script with optional [tag] prefixes
- tag → eleven ref mp3 (repo: video/_media/audio/refs/el/, blog archive: elevenlabs-refs/)
- mistral voxtral: text + ref audio → mp3 chunk
- ffmpeg concat (~180ms gap between chunks)
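for the last step, here is one way to build the ffmpeg arguments. it pads every chunk except the last with silence using the `apad` filter, then joins them with the `concat` filter. this is a sketch of the technique, not the exact command the script runs, and `buildConcatArgs` is a hypothetical name.

```typescript
// Build ffmpeg arguments that join mp3 chunks with a silent gap between them.
function buildConcatArgs(chunks: string[], gapMs: number, output: string): string[] {
  if (chunks.length === 0) {
    throw new Error("need at least one chunk");
  }
  const args: string[] = ["-y"]; // overwrite the output file if it exists
  const filters: string[] = [];
  const gapSeconds = (gapMs / 1000).toFixed(3);
  let labels = "";
  chunks.forEach((chunk, i) => {
    args.push("-i", chunk);
    const isLast = i === chunks.length - 1;
    // apad with pad_dur appends a fixed amount of silence; anull passes through.
    const filter = isLast ? "anull" : "apad=pad_dur=" + gapSeconds;
    filters.push("[" + i + ":a]" + filter + "[a" + i + "]");
    labels += "[a" + i + "]";
  });
  // Audio-only concat: n inputs, zero video streams, one audio stream.
  filters.push(labels + "concat=n=" + chunks.length + ":v=0:a=1[out]");
  args.push("-filter_complex", filters.join(";"), "-map", "[out]", output);
  return args;
}
```

the returned array can be passed to node's `child_process.spawn("ffmpeg", args)`. building an argument array instead of a shell string avoids quoting problems with file names.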
cd video
npm run eleven:refs-all   # once, to mint ref clips
npm run mistral:tagged -- _media/scripts/mistral-tagged-oliver-intro-calm-only.txt mistral-oliver-intro-calm-only.mp3
remotion output stays in video/public/generated/. copies for this post live in public/blog/mistral-tagged-tts/.
what we found
- fewer tags, longer lines — single matter-of-fact or calm pass beats a tag per sentence.
- ref swaps change timbre more than delivery; plan for that or stay on one ref.
- [sighs] and similar tags only select a clip; mistral does not perform them as ssml.
- fine for drafts and internal video. not a replacement for recording, if you care about consistency.