
The 25-model gauntlet: picking the LLM that interviews your grandparents

I’m building Storykept, a voice-first app that records family stories. You talk, an AI listens, works out what’s missing from the memory, and asks one good follow-up: “what year was that?”, “who else was there?” Then it stitches the answers into a clean chapter.

The part doing the listening is what we call the analyzer. After every spoken turn it grades how complete the story is, pulls out the people, places and dates, and writes the next question, all as strict JSON. It fires on every turn, while a 78-year-old sits there waiting for a reply. So the model behind it has to clear three bars at once:

Picking it took a few days and about 25 models across five providers. Here’s what shook out.
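To make the "strict JSON" requirement concrete, here is a minimal sketch of how a harness might validate one analyzer reply. The field names (`completeness`, `people`, `places`, `dates`, `next_question`) are hypothetical stand-ins for the grade, the entities, and the follow-up question described above, not Storykept's actual schema.

```python
import json
from dataclasses import dataclass
from typing import List

# Hypothetical field names and types; the real analyzer schema is not shown in this post.
REQUIRED_FIELDS = {
    "completeness": (int, float),
    "people": list,
    "places": list,
    "dates": list,
    "next_question": str,
}


@dataclass
class AnalyzerResult:
    completeness: float
    people: List[str]
    places: List[str]
    dates: List[str]
    next_question: str


def parse_analyzer_reply(raw: str) -> AnalyzerResult:
    """Parse one reply and reject anything that is not exactly the expected shape."""
    data = json.loads(raw)  # broken JSON raises json.JSONDecodeError, a ValueError
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    extra = set(data) - set(REQUIRED_FIELDS)
    if extra:
        raise ValueError(f"unexpected fields: {sorted(extra)}")
    for key, expected in REQUIRED_FIELDS.items():
        if key not in data:
            raise ValueError(f"missing field: {key}")
        value = data[key]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            raise ValueError(f"wrong type for {key}")
        if isinstance(value, list) and not all(isinstance(v, str) for v in value):
            raise ValueError(f"{key} must be a list of strings")
    data["completeness"] = float(data["completeness"])
    return AnalyzerResult(**data)
```

A strict parser like this turns "broken JSON" from something you notice by eye into a countable failure per model.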

A bench, not a vibe check

A clean transcript (“I was born in 1952 in Vienna”) tells you nothing. Every model handles that. So I wrote deliberately dirty transcripts in seven languages, each carrying:

A model passes a cell only if it keeps the corrected value (1967, never 1965), merges the name and nickname into one person, and answers in the storyteller’s language. Bulgarian is a hard gate. If it can’t hold my dad’s story in Bulgarian, it doesn’t ship.
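The three pass conditions can be sketched as a single check. This is an illustration under assumptions: the output field names are hypothetical, and the language test is a crude script check (it can tell Cyrillic from Latin, but not Bulgarian from Russian), so a real harness would need something stronger there.

```python
import unicodedata
from typing import Set


def uses_script(text: str, script: str) -> bool:
    """Crude language proxy: do most letters belong to the given Unicode script?"""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return False
    hits = sum(1 for ch in letters if unicodedata.name(ch, "").startswith(script))
    return hits / len(letters) > 0.5


def grade_cell(result: dict, corrected: str, retracted: str,
               aliases: Set[str], script: str) -> bool:
    """Pass only if all three conditions from the text hold."""
    dates = result["dates"]
    kept_correction = corrected in dates and retracted not in dates
    # Name and nickname must land in exactly one person entry, e.g. "Harold (Harry)".
    entries_for_person = [p for p in result["people"] if any(a in p for a in aliases)]
    merged = len(entries_for_person) == 1
    in_language = uses_script(result["next_question"], script)
    return kept_correction and merged and in_language


good = {"dates": ["1967"], "people": ["Harold (Harry)"],
        "next_question": "Кой друг беше там?"}
bad = {"dates": ["1965", "1967"], "people": ["Harold", "Harry"],
       "next_question": "Who else was there?"}
print(grade_cell(good, "1967", "1965", {"Harold", "Harry"}, "CYRILLIC"))
print(grade_cell(bad, "1967", "1965", {"Harold", "Harry"}, "CYRILLIC"))
```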

The harness runs the same prompt against any model and logs latency, token cost, and the raw JSON, so I can read the outputs side by side instead of trusting a leaderboard.
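A harness like that can be quite small. The sketch below is an assumption about its shape, not the actual code: each model is wrapped as a plain callable that returns the raw text plus token counts, and the per-million-token prices are passed in by the caller.

```python
import json
import time
from typing import Callable, Tuple

# Hypothetical adapter shape: prompt in, (raw_text, input_tokens, output_tokens) out.
ModelFn = Callable[[str], Tuple[str, int, int]]


def run_case(name: str, model: ModelFn, prompt: str,
             usd_per_mtok_in: float, usd_per_mtok_out: float) -> dict:
    """Run one prompt against one model and log latency, cost and the raw reply."""
    start = time.perf_counter()
    raw, tokens_in, tokens_out = model(prompt)
    latency_ms = (time.perf_counter() - start) * 1000
    try:
        json.loads(raw)
        valid_json = True
    except json.JSONDecodeError:
        valid_json = False
    cost_usd = (tokens_in * usd_per_mtok_in + tokens_out * usd_per_mtok_out) / 1_000_000
    return {
        "model": name,
        "latency_ms": latency_ms,
        "cost_usd": cost_usd,
        "valid_json": valid_json,
        "raw": raw,  # kept verbatim so outputs can be read side by side
    }


def fake_model(prompt: str) -> Tuple[str, int, int]:
    """Stand-in for a real provider call, with made-up token counts."""
    return '{"next_question": "What year was that?"}', 1000, 200


row = run_case("fake", fake_model, "dirty transcript here", 1.0, 2.0)
print(row["valid_json"], round(row["cost_usd"], 6))
```

Because every provider sits behind the same callable, adding a model means writing one adapter rather than touching the scoring code.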

Speed, by language

This is the table that decided it. Median latency, milliseconds, seven languages:

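| Model | EN | BG | DE | ES | FR | RU | ZH |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Groq Llama 3.3 70B | 565 | 645 | 559 | 459 | 552 | 638 | 545 |
| Gemini 2.5 Flash | 1898 | 1527 | 1329 | 1455 | 1497 | 1434 | 1384 |
| gpt-4.1-nano | 1311 | 1319 | 1312 | 1101 | 970 | 1192 | 963 |
| gpt-4o-mini | 1866 | 2033 | 2019 | 1816 | 1849 | 1831 | 2415 |
| Claude Haiku 4.5 | 3113 | 3688 | 3333 | 2568 | 3015 | 2957 | 2705 |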

Groq is in a class of its own. It’s about 2x faster than the next-fastest model (gpt-4.1-nano) and 5x faster than Claude, and it doesn’t wobble by language. When the call fires on every turn, the gap between 570ms and 1.5s is the difference between “the app is listening” and “the app is buffering.” (Gemini 3.1 Flash-Lite isn’t in the table because its latency swung from 1.2s to 7s across runs. Too jumpy to trust live.)
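The 2x and 5x figures can be checked directly from the table. The sketch below averages each model's seven per-language medians and divides by the fastest; the numbers are copied from the table, and the choice of a plain mean across languages is an assumption about how the summary figures were derived.

```python
from statistics import mean

# Per-language median latency in ms (EN, BG, DE, ES, FR, RU, ZH), from the table.
LATENCY_MS = {
    "Groq Llama 3.3 70B": [565, 645, 559, 459, 552, 638, 545],
    "Gemini 2.5 Flash": [1898, 1527, 1329, 1455, 1497, 1434, 1384],
    "gpt-4.1-nano": [1311, 1319, 1312, 1101, 970, 1192, 963],
    "gpt-4o-mini": [1866, 2033, 2019, 1816, 1849, 1831, 2415],
    "Claude Haiku 4.5": [3113, 3688, 3333, 2568, 3015, 2957, 2705],
}

avg_ms = {model: mean(values) for model, values in LATENCY_MS.items()}
fastest = min(avg_ms, key=avg_ms.get)

# Print models from fastest to slowest with their slowdown relative to the fastest.
for model, ms in sorted(avg_ms.items(), key=lambda kv: kv[1]):
    print(f"{model:<20} {ms:6.0f} ms  {ms / avg_ms[fastest]:.1f}x")
```

The averages land at roughly 566ms for Groq, 1,167ms for gpt-4.1-nano and 3,054ms for Claude Haiku 4.5, which works out to slowdowns of about 2.1x and 5.4x.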

Price

All of them are cheap. The per-call cost of the analyzer rounds to nothing next to the speech-to-text and the final polish:

| Model | Cost / call | Cost / 1,000 calls | Avg latency |
| --- | --- | --- | --- |
| gpt-5-nano | $0.00013 | $0.13 | ~2,100ms |
| gpt-4.1-nano | $0.00014 | $0.14 | ~1,170ms |
| Gemini 3.1 Flash-Lite | $0.00017 | $0.17 | spiky |
| gpt-4o-mini | $0.00021 | $0.21 | ~1,980ms |
| Groq Llama 3.3 70B | $0.00060 | $0.60 | ~570ms |
| Gemini 2.5 Flash | $0.00078 | $0.78 | ~1,500ms |
| Claude Haiku 4.5 | $0.00340 | $3.40 | ~3,050ms |

The cheapest model is ~25x cheaper than the priciest, yet the gap is about a third of a cent per call, which adds up to pennies across a whole recording. Price didn’t make the decision. I list it so you can see it stop mattering.
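For anyone who wants to redo the arithmetic, this sketch recomputes the per-1,000 column, the cheapest-to-priciest ratio and the absolute per-call gap, using only the per-call prices from the table.

```python
# Cost per analyzer call in USD, copied from the table.
COST_PER_CALL_USD = {
    "gpt-5-nano": 0.00013,
    "gpt-4.1-nano": 0.00014,
    "Gemini 3.1 Flash-Lite": 0.00017,
    "gpt-4o-mini": 0.00021,
    "Groq Llama 3.3 70B": 0.00060,
    "Gemini 2.5 Flash": 0.00078,
    "Claude Haiku 4.5": 0.00340,
}

per_thousand = {model: cost * 1000 for model, cost in COST_PER_CALL_USD.items()}
cheapest = min(COST_PER_CALL_USD, key=COST_PER_CALL_USD.get)
priciest = max(COST_PER_CALL_USD, key=COST_PER_CALL_USD.get)

# Relative gap (how many times cheaper) and absolute gap (cents per call).
ratio = COST_PER_CALL_USD[priciest] / COST_PER_CALL_USD[cheapest]
gap_cents_per_call = (COST_PER_CALL_USD[priciest] - COST_PER_CALL_USD[cheapest]) * 100

print(f"{priciest} vs {cheapest}: {ratio:.1f}x, {gap_cents_per_call:.3f} cents per call")
```

The ratio comes out a little over 26x, while the absolute gap is about a third of a cent per call.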

Quality, by language

Pass marks on the dirty transcripts. ✅ clean (dropped the mistake, merged the names, stayed in language), 🟡 correct but rough, ⚠️ a real miss (kept the wrong value, or split one person into two):

Model EN BG DE ES FR RU ZH
Groq Llama 3.3 70B † ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️
Gemini 2.5 Flash 🟡
gpt-4.1-nano ⚠️ ⚠️ ⚠️
gpt-4o-mini ⚠️ 🟡 ⚠️ ⚠️ ⚠️
Claude Haiku 4.5

† That row of red is on the bare test prompt, with no dedup instruction. More on that below, because it’s the whole twist.

Two things jump out. Claude Haiku is the best extractor of anything I tested, in every language. And the cheap-and-fast tiers (gpt-4.1-nano, gpt-5-nano) leak: they kept the wrong year in three of seven languages. Cheap but unreliable is worse than nothing.

Four things the bench taught me

1. Reasoning models are a trap for this job. Our first analyzer was gpt-5-nano, and it ran ~16 seconds a turn because it’s a reasoning model burning thousands of thinking tokens on what is basically classification. Gemini bit me the same way: gemini-3.5-flash looked terrible (5-8s, broken JSON) until I set reasoning_effort: 'none' and it dropped to ~1.1s and clean. If you put a reasoning-capable model on a fast structured task, turn the thinking off before you judge it. OpenAI’s gpt-5 line has the same switch, with a trap: gpt-5.1 dropped the 'minimal' value the others use and only takes 'none', which 400’d every call until I caught it.
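One way to avoid tripping over that switch again is to centralize it. The helper below is a hypothetical sketch that encodes only the values reported above ('none' for Gemini and gpt-5.1, 'minimal' for the rest of the gpt-5 line). Matching on model-name prefixes and the OpenAI-style request body are assumptions, so check each provider's current documentation before relying on it.

```python
from typing import List, Optional


def thinking_off_value(model: str) -> Optional[str]:
    """Return the reasoning_effort value that disables thinking, or None if not applicable."""
    name = model.lower()
    if name.startswith("gpt-5.1"):
        return "none"      # gpt-5.1 rejects 'minimal'
    if name.startswith("gpt-5"):
        return "minimal"   # the rest of the gpt-5 line
    if name.startswith("gemini"):
        return "none"
    return None            # not treated as a reasoning model: send no parameter


def build_request(model: str, messages: List[dict]) -> dict:
    """Build a chat-style request body with thinking switched off where supported."""
    body = {"model": model, "messages": messages}
    effort = thinking_off_value(model)
    if effort is not None:
        body["reasoning_effort"] = effort
    return body


for m in ("gpt-5-nano", "gpt-5.1", "gemini-3.5-flash", "llama-3.3-70b"):
    print(m, build_request(m, []).get("reasoning_effort"))
```

Keeping the mapping in one function means a renamed or removed value fails in a single, testable place instead of 400ing every call.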

2. Clean benchmarks lie. On an earlier, tidier test set, Llama 4 Scout looked like the winner: fastest, cheapest, deduped names on its own. Then I ran it through the dirty transcripts, with stacked corrections and two different people both named George. It split “Harold/Harry” into two people and kept both the wrong and right value on corrections. It scored 2 out of 5. The tidy benchmark had hidden the exact failure that matters for real speech.

3. The bench found a bug in my own production prompt. Testing the challengers, I noticed my shipped analyzer did the same thing: given “we lived in Перник, no, in Кремиковци,” it saved both cities. My prompt literally said “never the mistaken one,” but the models honored that for a lone name or year and dropped it on places and stacked corrections. Those entities become the tags on someone’s family story. I rewrote the instruction to be specific about places, streets and counts, added a final re-scan, and the leak closed. A mean benchmark is a bug-finder for your own system, not just a model picker. That red row above is Llama without the fix. With the production prompt, it deduplicates and corrects cleanly.
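The fix described here lives in the prompt, but the same leak can also be watched for in code. As a hedged sketch, a harness could re-scan every entity list in the analyzer output for values the speaker took back. The function and field names are hypothetical.

```python
from typing import Dict, List


def rescan_for_retracted(result: dict, retracted: List[str]) -> Dict[str, List[str]]:
    """Return, per entity field, any values that still contain a retracted term."""
    leaks: Dict[str, List[str]] = {}
    for field, values in result.items():
        if not isinstance(values, list):
            continue  # skip scalars such as the follow-up question
        hits = [
            v for v in values
            if any(r.casefold() in str(v).casefold() for r in retracted)
        ]
        if hits:
            leaks[field] = hits
    return leaks


# "we lived in Перник, no, in Кремиковци": only the second city should survive.
leaky = {"people": [], "places": ["Перник", "Кремиковци"], "next_question": "?"}
fixed = {"people": [], "places": ["Кремиковци"], "next_question": "?"}
print(rescan_for_retracted(leaky, ["Перник"]))
print(rescan_for_retracted(fixed, ["Перник"]))
```

Scanning all list fields rather than one catches the case where a correction is honored for years but dropped for places.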

4. Cap your agents. I run a lot of this through coding agents, and I let one off the leash with “if it errors, retry and debug” and no time limit. A thinking-model JSON failure sent it into a 35-minute loop writing probe scripts before I killed it. Now every run has a hard timeout and a per-call abort, and no agent gets “debug until it works.” An agent without a time budget doesn’t stop on its own.
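A cap like that can be sketched with the standard library alone. This is an assumption about its shape, not the actual setup: each attempt runs as a subprocess with its own timeout (the per-call abort), and the whole retry loop stops at an attempt cap or a wall-clock budget, whichever comes first.

```python
import subprocess
import sys
import time
from typing import List


def run_capped(cmd: List[str], per_call_timeout_s: float,
               total_budget_s: float, max_attempts: int = 3) -> str:
    """Retry cmd until it succeeds, the attempt cap is hit, or the time budget runs out."""
    deadline = time.monotonic() + total_budget_s
    for _ in range(max_attempts):
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # hard stop: the overall budget is spent
        try:
            done = subprocess.run(
                cmd,
                capture_output=True,
                text=True,
                timeout=min(per_call_timeout_s, remaining),
            )
        except subprocess.TimeoutExpired:
            continue  # subprocess.run kills the child on timeout; try again
        if done.returncode == 0:
            return done.stdout
    raise RuntimeError("gave up: attempt cap or time budget exhausted")


if __name__ == "__main__":
    # A trivial command that succeeds on the first attempt.
    print(run_capped([sys.executable, "-c", "print('ok')"], 10, 20).strip())
```

The point of the design is that nothing inside the loop can extend its own deadline: "retry and debug" becomes "retry at most three times within this many seconds, then fail loudly".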

What I shipped

I kept Groq Llama 3.3 70B in production. It’s the fastest by a wide margin, its quality gaps close with prompting (and I closed them), and it runs on the same provider as my speech-to-text, so there’s one less network hop. Claude Haiku 4.5 is the documented fallback when I want maximum extraction quality and can spend the latency. Gemini 2.5 Flash is the balanced option if I ever switch providers.

The bigger keeper was the harness. It lists every provider’s current models, scores latency, cost and per-language quality, and runs a dirty-content tier that already caught one shipped bug and saved me from one bad swap. Next time a model drops, picking it is one command instead of three more days.

If you’re putting an LLM on a hot path, build the mean little benchmark first. Test it dirty, turn the thinking off, and measure speed where the user actually feels it.


Storykept records family stories by voice. Private by default, built to outlast its users, tested in Bulgarian because that’s the language my dad tells his stories in. storykept.app.