July 1, 2026 · 6 min read · Guides

When AI Translations Break: What Actually Fails, and the Graduation Path

The honest ending of most "should we buy a translation tool?" debates in 2026 is: run the JSON through an LLM in CI and move on. It is fast, near-free, and for a small app it produces something genuinely usable. We have said this ourselves, in public, on our "why not just use AI" page.

This post is about what happens next, because there is a next. AI-only translation pipelines do not fail loudly on day one. They fail quietly, months in, in a handful of specific and predictable ways. If you run one, this is a field guide to the failure modes, and to the graduation path that fixes them without throwing the pipeline away.

Key facts
  • What breaks: terminology drift, formality flips, plural/ICU edge cases, legal strings shipped unreviewed, and no answer to "who approved this?"
  • Why prompts don't fix it: consistency and accountability are data and workflow problems. A stateless model call has no glossary, no memory of last month's decisions, and no audit trail.
  • The graduation path: keep your LLM (bring your own key), add glossary + styleguide context, score every translation (Quality Estimation), route only low-confidence strings to human review, deliver over a CDN.
  • When not to graduate: small project, few locales, no external users. AI-only is a legitimate choice there.

What actually breaks

Terminology drift. Each model call is stateless. Nothing remembers that "Drive" is your product name, that "Abo" was the word you chose for subscription in German, or that your Spanish uses "tú" and not "usted". Across hundreds of strings and months of incremental translation runs, the same concept accumulates three or four renderings. Users notice before you do, because they see the screens side by side.
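A minimal sketch of how drift like this can be caught mechanically, assuming you keep a small glossary of approved renderings. The glossary contents, the key names, and the `find_term_drift` helper are all hypothetical and only illustrate the idea:

```python
import re
from collections import defaultdict

# Hypothetical glossary: source term -> the single approved German rendering.
GLOSSARY_DE = {"subscription": "Abo", "Drive": "Drive"}


def contains_term(text, term):
    """Whole-word, case-insensitive match, so "Abonnement" does not count as "Abo"."""
    return re.search(rf"\b{re.escape(term)}\b", text, re.IGNORECASE) is not None


def find_term_drift(pairs, glossary):
    """Return {term: [keys]} where the source uses a glossary term
    but the translation does not contain the approved rendering."""
    drift = defaultdict(list)
    for key, (source, target) in pairs.items():
        for term, approved in glossary.items():
            if contains_term(source, term) and not contains_term(target, approved):
                drift[term].append(key)
    return dict(drift)


# Hypothetical (source, German translation) pairs from incremental runs.
pairs = {
    "billing.title": ("Your subscription", "Dein Abo"),
    "billing.cancel": ("Cancel subscription", "Abonnement kündigen"),
    "nav.drive": ("Open Drive", "Laufwerk öffnen"),
}

drift = find_term_drift(pairs, GLOSSARY_DE)
# drift == {"subscription": ["billing.cancel"], "Drive": ["nav.drive"]}
```

Whole-word matching is deliberately strict: it will also flag legitimate inflected forms, so treat the output as a review list, not as a verdict.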

Formality and voice flips. Related, but nastier: the register changes mid-app. One screen addresses the user formally, the next casually. In languages where this distinction is grammatical (German, French, Japanese, Korean), an inconsistent register reads as broken, not as a style choice.

Plurals and ICU edge cases. English has two plural forms; under the CLDR plural rules Polish has four and Arabic six, and a model that applies those rules correctly in isolation applies them inconsistently in bulk. Interpolated variables inside ICU MessageFormat strings are easy to mangle in a batch run, and a broken placeholder is not a style problem; it is a runtime bug.
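Placeholder breakage is the one failure here that a few lines of code can catch before it ships. The sketch below compares the argument names in a source string and its translation. It is a simplified reader of ICU-style messages (it ignores apostrophe quoting and does not validate plural categories), and `icu_arguments` and `placeholder_mismatch` are illustrative names, not part of any library:

```python
def icu_arguments(message):
    """Collect argument names from an ICU-style message.

    A brace in message text opens an argument; a brace inside an
    argument opens a sub-message (e.g. the "one {...}" branch of a plural).
    """
    names = set()
    stack = []  # True = inside an argument, False = inside a sub-message
    for i, ch in enumerate(message):
        if ch == "{":
            if stack and stack[-1]:
                stack.append(False)  # sub-message of a plural/select
            else:
                j = i + 1
                while j < len(message) and message[j] not in ",}":
                    j += 1
                names.add(message[i + 1:j].strip())
                stack.append(True)
        elif ch == "}":
            if not stack:
                raise ValueError("unbalanced '}'")
            stack.pop()
    if stack:
        raise ValueError("unclosed '{'")
    return names


def placeholder_mismatch(source, target):
    """Report argument names lost or invented during translation."""
    src, tgt = icu_arguments(source), icu_arguments(target)
    return {"missing": sorted(src - tgt), "extra": sorted(tgt - src)}


source = "{count, plural, one {# file} other {# files}} in {folder}"
# Hypothetical batch output where the variable name itself got "translated".
target = "{count, plural, one {# plik} few {# pliki} many {# plików} other {# pliku}} w {folderze}"

report = placeholder_mismatch(source, target)
# report == {"missing": ["folder"], "extra": ["folderze"]}
```

Run as a CI step, a non-empty report or a `ValueError` for unbalanced braces can fail the build before the string reaches a user.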

Strings that carry risk. Legal notices, medical wording, pricing terms, accessibility labels. These are exactly the strings where "the model is usually right" is not an acceptable quality bar, and exactly the ones an unreviewed batch pipeline ships like any other string.

No answer to "who approved this?" The first time a customer, a lawyer, or an auditor asks why the app said what it said in Italian, an AI-only pipeline has one answer: a git commit by a bot. No reviewer, no decision trail, no quality score. For teams facing the EU AI Act's transparency obligations (Article 50 applies from August 2, 2026), that question stops being hypothetical; our Article 50 readiness check covers what machine-translated content does and does not trigger.

The public example of the whole pattern arrived in November 2025, when Mozilla moved its support content to AI-first localization and its long-standing Japanese volunteer community resigned in response. The most-agreed criticism in the very long discussion thread that followed was not "AI translated it". It was that nothing enforced terminology and style guidelines, and native speakers found the result worse than nothing. That is the failure mode in one sentence: not translation quality on average, but ungoverned quality variance.

Why better prompts don't fix it

The instinctive fix is prompt engineering: paste the glossary into the prompt, add style instructions, re-run. It helps, and it is also a treadmill:

  • The context does not scale. Your glossary, style rules, and past decisions grow; context windows and attention do not keep up with "here are 400 terminology decisions, apply all of them consistently across 3,000 strings".
  • There is no memory between runs. Last month's careful fixes are not training data for this month's batch. Fixed strings regress when a source string changes and gets re-translated.
  • There is still no gate. Even a perfect prompt produces output that ships unreviewed. The problem was never only translation quality; it is that nothing stands between the model and production.

Consistency is a data problem (glossary, translation memory), quality is a measurement problem (scoring), and accountability is a workflow problem (review with history). None of the three is a prompting problem.
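To make the "data problem" concrete, here is a minimal sketch of a translation memory keyed by a fingerprint of the source text. Unchanged strings reuse the approved translation and never go back through the model; changed strings carry their previous approved translation along as context. The record layout and the `plan_batch` helper are assumptions for illustration, not any tool's actual format:

```python
import hashlib


def fingerprint(text):
    """Stable fingerprint of a source string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_batch(sources, memory):
    """Split keys into translations to reuse and strings to re-translate.

    `retranslate` maps each key to the previously approved translation
    (or None for new keys) so it can be passed to the model as context.
    """
    reuse, retranslate = {}, {}
    for key, text in sources.items():
        entry = memory.get(key)
        if entry is not None and entry["source_hash"] == fingerprint(text):
            reuse[key] = entry["target"]
        else:
            retranslate[key] = entry["target"] if entry else None
    return reuse, retranslate


# Hypothetical memory of approved German translations.
memory = {
    "cta.save": {"source_hash": fingerprint("Save"), "target": "Speichern"},
    "plan.title": {"source_hash": fingerprint("Your subscription"), "target": "Dein Abo"},
}
sources = {
    "cta.save": "Save",                      # unchanged
    "plan.title": "Your subscription plan",  # source edited
    "cta.delete": "Delete",                  # new key
}

reuse, retranslate = plan_batch(sources, memory)
# reuse == {"cta.save": "Speichern"}
# retranslate == {"plan.title": "Dein Abo", "cta.delete": None}
```

The point of the design is that last month's fixes become stored data, so they survive the next batch instead of depending on the prompt.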

The graduation path: keep the pipeline, add the layer

Graduating from AI-only does not mean hiring an agency or abandoning automation. Concretely, with Locize it looks like this; every piece is incremental:

  1. Keep your model, give it context. Automatic translation runs on your own OpenAI, Gemini or Mistral key (or the built-in service). Your glossary and styleguide are injected into every prompt, so the terminology and register decisions you have already made are enforced on every future string, automatically.
  2. Score everything. Quality Estimation rates each AI translation from 0 to 1 and flags concrete issues. You choose the threshold (0.7 by default).
  3. Review only what needs eyes. Confident translations save directly; low-confidence ones route to the review workflow, where a reviewer sees them in context on the running app. Accept/decline decisions are recorded with history and can be exported as provenance evidence.
  4. Ship without redeploys. Approved translations publish over a global CDN. Your CI pipeline keeps running; the "commit the JSON and redeploy for a typo" step disappears.

The net effect: the same LLM does the same work, but terminology stops drifting, risky strings get human eyes, and every translation in production has an answer to "who approved this, and how confident were we?"
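The routing step can be sketched in a few lines. This is not the Locize API; it is a hypothetical illustration of the gate, assuming a score between 0 and 1 is already available for each translation. Treating a score exactly at the threshold as passing, and the `always_review` key prefixes for risky strings, are assumptions of this sketch:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

REVIEW_THRESHOLD = 0.7  # the default threshold mentioned in step 2


@dataclass
class Decision:
    key: str
    locale: str
    score: float
    route: str       # "auto-saved" or "needs-review"
    decided_at: str  # ISO 8601 timestamp, kept for the audit trail


def route_translation(key, locale, score,
                      threshold=REVIEW_THRESHOLD, always_review=()):
    """Decide whether a scored translation saves directly or goes to review."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of range: {score}")
    # Risky namespaces (legal, pricing, ...) go to a human regardless of score.
    risky = any(key.startswith(prefix) for prefix in always_review)
    route = "needs-review" if risky or score < threshold else "auto-saved"
    return Decision(key, locale, score, route,
                    datetime.now(timezone.utc).isoformat())


confident = route_translation("nav.home", "it", 0.95)
uncertain = route_translation("nav.home", "it", 0.40)
legal = route_translation("legal.terms", "it", 0.95, always_review=("legal.",))
# confident.route == "auto-saved"
# uncertain.route == "needs-review"
# legal.route == "needs-review"
```

Every call returns a record, including the auto-saved ones, so "how confident were we?" has an answer even for strings no human looked at.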

When you should not graduate

Honesty cuts both ways. If you run a side project, translate into two or three languages you can personally sanity-check, and no revenue or compliance depends on the copy, an LLM in CI plus JSON in git is a perfectly sound setup, and cheaper than any tool. The graduation signals are concrete: a language nobody on the team reads, a translator or reviewer who is not a developer, a term that must never vary, a string a lawyer cares about, or a user-facing quality complaint you could not trace. The week one of those appears is the week the management layer starts paying for itself.

A zero-commitment way to see where you stand: drop your locale files into the free i18n health check. It runs entirely in your browser (nothing is uploaded) and shows the missing keys, duplicate source values and interpolation mismatches your current pipeline has already produced.
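For a rough idea of what two of those checks involve, here is a sketch for missing keys and duplicate source values over nested locale JSON. It is an illustration under simple assumptions (nested dictionaries with string leaves), not the implementation of the health check itself, and the helper names are made up:

```python
from collections import defaultdict


def flatten(tree, prefix=""):
    """Flatten nested locale JSON into {"a.b.c": "text"}."""
    flat = {}
    for name, value in tree.items():
        path = f"{prefix}.{name}" if prefix else name
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def missing_keys(source, target):
    """Keys present in the source locale but absent from the target."""
    return sorted(set(flatten(source)) - set(flatten(target)))


def duplicate_values(source):
    """Source texts that appear under more than one key."""
    by_value = defaultdict(list)
    for key, value in flatten(source).items():
        by_value[value].append(key)
    return {value: keys for value, keys in by_value.items() if len(keys) > 1}


# Hypothetical locale files.
en = {"nav": {"home": "Home", "save": "Save"}, "form": {"submit": "Save"}}
de = {"nav": {"home": "Start"}}

missing = missing_keys(en, de)
duplicates = duplicate_values(en)
# missing == ["form.submit", "nav.save"]
# duplicates == {"Save": ["nav.save", "form.submit"]}
```

Duplicate source values are worth a look because each copy is translated separately, which is one of the ways the same concept ends up with several renderings.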

If several of those signals sound familiar, start free with Locize, connect your existing pipeline (your key, your model), and turn on Quality Estimation for the next batch: you will see the score distribution of your current AI output before changing anything else. The why not just use AI page has the honest version of the whole trade-off.
