
Mistral voice experiments (calm vs multi-tag)

Mistral Voxtral intro tests: matter-of-fact and calm single-tag passes vs multi-tag, with Eleven ref clips archived on the post.

2026-05-17 · AI, Dev, Video · by The Silicon Based Life Form

this is how i generate voice for channel intros: mistral voxtral tts, a set of elevenlabs reference clips, and a small node script. the output is usable for drafts. it is not broadcast quality. listen before you assume otherwise.

why mistral at all

main reason: cost. narrator saas typically runs $20–30/month whether you publish or not. mistral voxtral bills per api call when you are on a paid tier — not a seat license.

i am on mistral's experiment plan: no credit card, full api access for prototyping. text, vision, and voxtral tts are available while you test. this post cost $0 in mistral fees. elevenlabs ref clips were a separate one-time cost.

the experiment tier is for prototyping, not production traffic. limits:

rate limits: about 1 request per second, ~30 per minute across the platform.
volume: up to ~1 billion text tokens per month (audio counts toward usage differently, and is tighter on this tier).
purpose: testing and prototyping only. a public app will hit rate-limit errors quickly. our tagged pipeline uses one api call per script line, so batch runs add up.
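
a minimal sketch of pacing a batch under those limits. the numbers are the approximate figures quoted above, not official values, and minIntervalMs, estimateBatchMs, and runPaced are hypothetical helper names, not part of any mistral sdk.

```javascript
// Pacing sketch for a batch of per-line TTS calls.
// ASSUMPTION: limits are the approximate experiment-tier figures from this post
// (about 1 request/second and ~30 requests/minute); check your own console.
const LIMITS = { perSecond: 1, perMinute: 30 };

// Smallest gap between calls that respects both limits at a steady pace.
function minIntervalMs(limits = LIMITS) {
  const bySecond = 1000 / limits.perSecond;
  const byMinute = 60000 / limits.perMinute;
  return Math.max(bySecond, byMinute);
}

// Rough time spent waiting between calls for a script with `lineCount`
// segments (one API call per line; render time itself is not included).
function estimateBatchMs(lineCount, limits = LIMITS) {
  if (lineCount <= 0) return 0;
  return (lineCount - 1) * minIntervalMs(limits);
}

// Runs async tasks one at a time, sleeping between them.
async function runPaced(tasks, intervalMs = minIntervalMs()) {
  const results = [];
  for (let i = 0; i < tasks.length; i++) {
    if (i > 0) await new Promise((resolve) => setTimeout(resolve, intervalMs));
    results.push(await tasks[i]());
  }
  return results;
}
```

with these figures the per-minute cap is the binding one: a steady pace means one call every 2 seconds, so 16 calls spend about 30 seconds just waiting between requests.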

for higher throughput, add a payment method on the mistral console billing page and move to a paid commercial tier.

elevenlabs was used once to generate short mood clips (neutral, calm, excited, sigh, etc.). copies for this post are in public/blog/mistral-tagged-tts/elevenlabs-refs/ (12 clips + manifest.json) so they survive if the video repo paths move.

single-tag tests (full intro, one api call)

same script text for each test: one tag, one line, one mistral call. no ref changes mid-read. punctuation carries the pacing. the scripts below omit exclamation marks — with them, mistral tended to overshoot energy and the takes were less representative.

test 1 — matter-of-fact

default lane in our manifest. the eleven ref is a short neutral read (el-ref-01-neutral.mp3):

elevenlabs ref — matter-of-fact (source)

el-ref-01-neutral.mp3 · eleven v3 ref → mistral ref_audio

mistral output from that ref. this is the closest to "one voice, one take" we got — still tts, still not a mic, but it tracks the ref better than the calm pass below:

mistral — matter-of-fact tag, full intro

single segment · voxtral-mini-tts

[matter-of-fact] Hi Guys. How are you? I'm Oliver. This is my channel. I talk about AI, automation, stuff like that. Right. I hope you find something useful here. Oh, by the way, this is not my real voice. I mean it is my real voice. It's processed through text-to-speech. Make sense? More about me. I love coffee, and I have a cat. OK? Cool. Glad you're here.


test 2 — calm

same pipeline, [calm] ref instead. the source clip sounds calm:

elevenlabs ref — calm (source)

el-ref-11-calm.mp3 · short v3 ref, used as mistral ref_audio

mistral then renders the full intro from that ref. same tag, same pipeline, but the output does not read as calm. too much coffee, maybe. or the model flattens a long paragraph toward a neutral read. judge for yourself:

mistral — calm tag, full intro

single segment · same ref · voxtral-mini-tts

[calm] Hi Guys. How are you? I'm Oliver. This is my channel. I talk about AI, automation, stuff like that. Right. I hope you find something useful here. Oh, by the way, this is not my real voice. I mean it is my real voice. It's processed through text-to-speech. Make sense? More about me. I love coffee, and I have a cat. OK? Cool. Glad you're here.


test 3 — multiple refs, one tag per line

lines start with a tag such as [bored], [excited], or [nervous], which selects an eleven ref clip from manifest.json. mistral does not parse tags as directions; they only route reference audio. one api call per line. ffmpeg concatenates the segments with a ~180ms gap between them.
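
the routing described above can be sketched roughly like this. it is not the actual script: the manifest shape, the matter-of-fact fallback, and the rule that untagged lines reuse the previous tag are all assumptions made for illustration. only the two clip file names come from this post.

```javascript
// Tag routing sketch: a leading [tag] picks a reference clip, and the tag is
// stripped before the text would be sent for synthesis.
// ASSUMPTIONS: the manifest shape, the "matter-of-fact" fallback, and the rule
// that untagged lines reuse the previous tag are illustrative, not the real script.
const manifest = {
  "matter-of-fact": "el-ref-01-neutral.mp3",
  calm: "el-ref-11-calm.mp3",
};
const DEFAULT_TAG = "matter-of-fact";

// Turns script text into { tag, text } segments, one per non-empty line.
function parseScript(source) {
  const segments = [];
  let currentTag = DEFAULT_TAG;
  for (const raw of source.split("\n")) {
    const line = raw.trim();
    if (line === "") continue;
    const match = line.match(/^\[([^\]]+)\]\s*(.*)$/);
    const text = match ? match[2] : line;
    if (match) currentTag = match[1].toLowerCase();
    if (text === "") continue; // tag-only line: switches the ref, renders nothing
    segments.push({ tag: currentTag, text });
  }
  return segments;
}

// Looks up the reference clip for a tag.
function refFor(tag, refs = manifest) {
  // Unknown tags fall back to the default clip instead of failing the batch.
  return refs[tag] ?? refs[DEFAULT_TAG];
}
```

each segment then becomes one api call with its text and the clip returned by refFor. the tag itself never reaches the model, which is why [sighs] produces a different ref and not a sigh.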

intent: more expression per beat. result: timbre shifts that often read as a different speaker, not a mood change. some segments are fine. others do not match the line before or after.

multi-tag

many segments · one ref per tag

[bored] Hi Guys!!

[excited] How are you!? 

[calm] I'm Oliver. 

This is my channel. 

I talk about AI, automation. stuff like that. 

[sighs] Right. 

[excited] I hope you find something useful here.

[nervous] Oh, by the way, this is not my real voice. 

[relieved] I mean it is my real voice. 

[chuckles]

But it's processed through text-to-speech. 

[gentle] Make sense? 

[nervous] More about me. 

I love coffee, and 

[excited] I have a cat.

OK? Cool. Glad you're here!


later pass: matter-of-fact and calm only, longer lines, fewer tag changes at beat points. improved consistency. still not studio narration.

how the pipeline works

1. write a script with optional [tag] prefixes
2. tag → eleven ref mp3 (repo: video/_media/audio/refs/el/, blog archive: elevenlabs-refs/)
3. mistral voxtral: text + ref audio → mp3 chunk
4. ffmpeg concat (~180ms gap between chunks)
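
a rough sketch of the last step, the concat with a gap. the post does not say how the ~180ms gap is produced, so the pre-rendered silence clip here is an assumption, as are the file names and the helper names silenceArgs, concatList, and concatArgs. the ffmpeg pieces used (the lavfi anullsrc source and the concat demuxer) are standard.

```javascript
// Concat sketch: interleave rendered chunks with a short silence clip and
// describe the result in ffmpeg's concat-demuxer list format.
// ASSUMPTIONS: file names are hypothetical, and a pre-rendered silence file is
// only one way to get the ~180ms gap. The silence clip should use the same
// sample rate and channel layout as the chunks (44100 Hz mono is a guess).
const GAP_SECONDS = 0.18;

// ffmpeg arguments for rendering the silence clip once.
function silenceArgs(outFile, seconds = GAP_SECONDS) {
  return ["-y", "-f", "lavfi", "-i", "anullsrc=r=44100:cl=mono",
          "-t", String(seconds), outFile];
}

// Contents of the list file: chunk, gap, chunk, gap, chunk.
function concatList(chunks, silenceFile) {
  const lines = [];
  chunks.forEach((chunk, i) => {
    if (i > 0) lines.push(`file '${silenceFile}'`);
    lines.push(`file '${chunk}'`);
  });
  return lines.join("\n") + "\n";
}

// ffmpeg arguments for joining everything named in the list file.
function concatArgs(listFile, outFile) {
  // No "-c copy": the output is re-encoded instead of stream-copied.
  return ["-y", "-f", "concat", "-safe", "0", "-i", listFile, outFile];
}
```

in a node script these argument arrays would go to execFileSync("ffmpeg", args) from node:child_process, after writing the concatList output to the list file.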

cd video
npm run eleven:refs-all   # once — mint ref clips
npm run mistral:tagged -- _media/scripts/mistral-tagged-oliver-intro-calm-only.txt mistral-oliver-intro-calm-only.mp3

remotion output stays in video/public/generated/. copies for this post live in public/blog/mistral-tagged-tts/.

what we found

fewer tags, longer lines: a single matter-of-fact or calm pass beats a tag per sentence.
ref swaps change timbre more than delivery; plan for that or stay on one ref.
[sighs] and similar tags only select a clip; mistral does not perform them the way an ssml direction would be performed.
fine for drafts and internal video. not a replacement for recording if you care about consistency.


next: more calm-only renders, or a microphone. the microphone is still winning on consistency.
