You ran your locale files through an LLM. The output looks plausible. Your German colleague glances at the German translations and says yeah, looks fine.
But you have 2,000 keys across 8 languages, and your German colleague doesn't speak Japanese.
How do you actually know the translations are good?
This is the question every dev team hits once they move past the initial excitement of using the OpenAI API to translate an app. The API call is the easy part. Knowing whether the output is production-ready is where it gets interesting.
Have You Heard About BLEU?
If you've looked into machine translation evaluation, you've probably come across BLEU. It's been the standard metric since 2002 and works by comparing word overlap between a translation and a reference. The problem: it was built for statistical MT systems that produced formulaic output. LLMs don't do formulaic.
"Delete" in Swedish can be "Radera" (technically correct, sounds harsh) or "Ta bort" (natural, what Swedish apps actually use). Both are valid. BLEU penalizes one of them. A translation can convey the exact same meaning with completely different words, and BLEU sees that as a failure.
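To make that concrete, here's a toy sketch of surface-overlap scoring. It is not real BLEU (which uses clipped n-gram precision and a brevity penalty), but it's the same core idea, and it shows why a perfectly good synonym gets punished:

```typescript
// Toy illustration of surface-overlap scoring, the idea behind BLEU.
// (Real BLEU uses clipped n-gram precision and a brevity penalty.)
function overlapScore(candidate: string, reference: string): number {
  const candidateWords = candidate.toLowerCase().split(/\s+/);
  const referenceWords = new Set(reference.toLowerCase().split(/\s+/));
  const matches = candidateWords.filter((word) => referenceWords.has(word)).length;
  return matches / candidateWords.length;
}

// If the human reference says "Radera", the more natural "Ta bort" scores zero.
console.log(overlapScore("Radera", "Radera"));  // 1
console.log(overlapScore("Ta bort", "Radera")); // 0
```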
This isn't a fringe opinion. The WMT22 Metrics Shared Task paper was titled "Stop Using BLEU."
At WMT24, BLEU scored 0.589 in correlation with human judgments while neural metrics hit 0.719-0.725. That's a significant gap.
COMET, the most common neural alternative, does better. It uses cross-lingual models trained on human quality judgments and correlates more closely with what people actually think of translations. But it has its own blind spots. Research from 2024 found that COMET gives empty translations a score of 0.335 (should be zero), can't detect wrong-language output in reference-free mode, and misses the subtle contextual improvements that LLMs actually do well.
For software localization, you need evaluation that cares about things BLEU and COMET weren't designed for: did the translation preserve the {{variable}}? Is the formality right for how the translations will be used? Does the brand voice stay consistent across 2,000 keys?
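The variable check, at least, is easy to automate yourself. A minimal sketch, assuming your locale files use {{double-brace}} placeholders:

```typescript
// Pull out every {{placeholder}} and compare source vs. translation.
function extractPlaceholders(text: string): string[] {
  return (text.match(/\{\{\s*\w+\s*\}\}/g) ?? []).map((p) => p.replace(/\s+/g, ""));
}

function placeholdersPreserved(source: string, translation: string): boolean {
  const expected = extractPlaceholders(source).sort();
  const actual = extractPlaceholders(translation).sort();
  return expected.length === actual.length && expected.every((v, i) => v === actual[i]);
}

placeholdersPreserved(
  "Please wait {{seconds}} seconds before trying again.",
  "Bitte warten Sie {{seconds}} Sekunden, bevor Sie es erneut versuchen."
); // true
```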
Using an LLM to Judge Translations
The approach that's worked best for us at Localhero.ai is using an LLM as a judge. Microsoft's GEMBA framework showed this can outperform every traditional metric, hitting 89.8% agreement with human evaluators compared to COMET's 83.9%.
We use Anthropic's Claude to evaluate translations generated by our translation engine (which runs on the latest OpenAI models). The evaluator scores each translation on accuracy, fluency, consistency, brand voice, and technical correctness. These are similar to the dimensions in the MQM framework, which is basically a structured way to do translation quality analysis.
To be able to trust automated testing, we use a different model provider for evaluation than for translation. Research on self-preference bias shows that LLMs rate their own output more favorably. One example from that research: GPT-4 correctly identifies its own translations as better 94.5% of the time when humans agree, but only catches its own worse translations 42.5% of the time. Different families, different biases, more honest results.
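Here's a rough sketch of what that cross-provider setup can look like with the official Node SDKs. The model names, prompt wording, and score schema are illustrative choices, not our exact production setup:

```typescript
import OpenAI from "openai";
import Anthropic from "@anthropic-ai/sdk";

// Cross-provider setup: OpenAI translates, Claude judges, so neither model is
// grading its own homework. Model names and score fields are just examples.
const openai = new OpenAI();       // reads OPENAI_API_KEY
const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY

async function translate(source: string, targetLanguage: string): Promise<string> {
  const completion = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [{
      role: "user",
      content: `Translate this UI string into ${targetLanguage}. Keep {{variables}} unchanged. Reply with the translation only.\n\n${source}`,
    }],
  });
  return completion.choices[0].message.content ?? "";
}

async function judge(source: string, translation: string, targetLanguage: string) {
  const response = await anthropic.messages.create({
    model: "claude-sonnet-4-20250514",
    max_tokens: 512,
    messages: [{
      role: "user",
      content:
        `Score this ${targetLanguage} translation of a UI string from 0-100 on accuracy, ` +
        `fluency, consistency, brand voice, and technical correctness. Reply with JSON only, ` +
        `e.g. {"accuracy": 95, "fluency": 90, "consistency": 100, "brand_voice": 90, "technical": 100, "notes": "..."}.\n\n` +
        `Source: ${source}\nTranslation: ${translation}`,
    }],
  });
  const block = response.content[0];
  return JSON.parse(block.type === "text" ? block.text : "{}");
}

const source = "Rate limit exceeded. Please wait {{seconds}} seconds before trying again.";
const german = await translate(source, "German");
console.log(german, await judge(source, german, "German"));
```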
Beyond the LLM judge, we also use confidence scoring (0-100% per translation, useful for finding which strings need human review), reference comparison against human translations (done semantically, not word-matching), and manual spot-checks to make sure automated assessments match reality.
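The reference comparison can be as simple as embedding both strings and measuring cosine similarity. A minimal sketch; the embedding model here is one arbitrary choice, and any sentence-embedding model would do:

```typescript
import OpenAI from "openai";

const openai = new OpenAI();

function cosineSimilarity(a: number[], b: number[]): number {
  const dot = a.reduce((sum, value, i) => sum + value * b[i], 0);
  const normA = Math.sqrt(a.reduce((sum, value) => sum + value * value, 0));
  const normB = Math.sqrt(b.reduce((sum, value) => sum + value * value, 0));
  return dot / (normA * normB);
}

// Score a translation against a human reference by meaning rather than word overlap.
async function semanticSimilarity(translation: string, reference: string): Promise<number> {
  const response = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: [translation, reference],
  });
  return cosineSimilarity(response.data[0].embedding, response.data[1].embedding);
}

// "Ta bort" vs "Radera": zero word overlap, but a synonym pair should score high here.
console.log(await semanticSimilarity("Ta bort", "Radera"));
```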
Our Numbers
For one case, we ran our evaluation suite on 50 content types across 9 languages (Spanish, French, Portuguese, Italian, German, Swedish, Norwegian, Dutch, Finnish). Same pipeline as production.
47 out of 50 passed our quality thresholds. That's a 94% pass rate. UI elements (buttons, labels) hit 98.5%. Marketing and error messages come in around 94%. Variable preservation is 100%: we've never seen a {{placeholder}} get mangled.
What does 94% actually mean though? On its own, not that much. The 6% that didn't pass were mostly style choices: a slightly formal tone, or a technically correct but less natural word, like "Radera" instead of "Ta bort" for "Delete" in Swedish. None were meaning errors or broken variables, which is worth noting.
Some Examples
The subtle stuff:
"Delete" → Swedish. A good translation system picks "Ta bort" (take away, what Swedish apps use) over "Radera" (erase, technically correct but sounds harsh). This is exactly the kind of distinction BLEU can't capture.
Technical content with variables:
Rate limit exceeded. Please wait {{seconds}} seconds before trying again.
→ German: Rate-Limit überschritten. Bitte warten Sie {{seconds}} Sekunden, bevor Sie es erneut versuchen.
Formal "Sie" form (right for error messages in German), variable preserved correctly. 95/100.
Where it gets tricky:
Keep your API key secure! Never share it publicly or commit it to version control.
→ German: The translation renders "commit it to version control" as "speichern Sie ihn in der Versionskontrolle" (literally "save it in version control"). It's understandable, but a developer would say "committen". That's the kind of edge case that lands in our 6%: functional, but not perfect for a technical audience.
Setting Up Evaluation for Your Project
If you want to evaluate translation quality yourself, here's what I'd suggest:
Build a test set. Collect representative strings from your app: the more the better, but 30 good ones beat 200 that don't hold up. UI elements, error messages, marketing copy, strings with variables, content that needs cultural adaptation. Get a native speaker to write reference translations.
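There's no required format for the test set. Something as simple as the following works; the field names here are just an example:

```typescript
// A minimal shape for an evaluation test set: real strings from the app, tagged
// by content type, with native-speaker references per language.
type TestCase = {
  key: string;
  source: string;
  contentType: "ui" | "error" | "marketing";
  references: Record<string, string>; // language code -> human reference translation
};

const testSet: TestCase[] = [
  {
    key: "actions.delete",
    source: "Delete",
    contentType: "ui",
    references: { sv: "Ta bort", de: "Löschen" },
  },
  {
    key: "errors.rate_limit",
    source: "Rate limit exceeded. Please wait {{seconds}} seconds before trying again.",
    contentType: "error",
    references: {
      de: "Rate-Limit überschritten. Bitte warten Sie {{seconds}} Sekunden, bevor Sie es erneut versuchen.",
    },
  },
];
```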
Start simple. For small projects (under 100 keys), native speaker review works fine. For larger projects, use an LLM to score translations for accuracy, fluency, and technical correctness. Even a simple prompt works well.
Use a different model family for evaluation. If you translated with OpenAI, evaluate with Claude (or vice versa). This avoids the self-preference bias problem. It's a small change that makes a real difference in evaluation quality.
Define what "good enough" means for your content. A B2B dashboard can tolerate minor style variations. Medical or legal content can't tolerate any meaning errors. MQM's severity weighting (minor = 1 point, major = 5, critical = 25) is a useful starting point.
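If you go the MQM route, the arithmetic is simple: sum the weighted penalty points and normalize by source length. A sketch, assuming the common convention of normalizing penalty points by word count:

```typescript
// MQM-style severity weighting: minor = 1, major = 5, critical = 25 penalty points.
type Severity = "minor" | "major" | "critical";

const MQM_WEIGHTS: Record<Severity, number> = { minor: 1, major: 5, critical: 25 };

function mqmScore(errors: Severity[], sourceWordCount: number): number {
  const penalty = errors.reduce((sum, severity) => sum + MQM_WEIGHTS[severity], 0);
  return Math.max(0, 100 * (1 - penalty / sourceWordCount)); // 100 = no errors found
}

// 500 source words, two minor style issues and one major accuracy error:
console.log(mqmScore(["minor", "minor", "major"], 500)); // 98.6
```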
Run it regularly. Model updates, new content types, new languages: these all affect quality. If you're automating translations with GitHub Actions, evaluation can be part of the same CI workflow.
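In CI, that can be as simple as a script that fails the job when the pass rate drops below a threshold. A sketch; runEvaluation is a stand-in for whatever scoring you use (LLM judge, MQM, or both):

```typescript
// Quality gate for CI: fail the job if the pass rate drops below a threshold.
type EvaluationResult = { key: string; language: string; passed: boolean };

async function runEvaluation(): Promise<EvaluationResult[]> {
  // ...load the test set, translate, judge, and return one result per (key, language)...
  return [];
}

const PASS_RATE_THRESHOLD = 0.9;

const results = await runEvaluation();
const passRate = results.filter((r) => r.passed).length / Math.max(results.length, 1);

console.log(`Pass rate: ${(passRate * 100).toFixed(1)}%`);
if (passRate < PASS_RATE_THRESHOLD) {
  console.error("Translation quality gate failed");
  process.exit(1);
}
```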
Translation quality evaluation doesn't need to be complicated. Start with an LLM judge, use a different provider than your translator, build a small test set with real strings from your app, and iterate from there. You don't need a perfect setup on day one. You just need to know your translations work.
This is based on how we do things at Localhero.ai. If you want to test it out, sign up for a trial or just ping us.
Further reading
- How Do You Check Translation Quality at Scale? - practical approaches to maintaining quality across thousands of translations
- How to Automate i18n Translations with GitHub Actions - set up automated translation checks in your CI pipeline
- Translation Workflow for Product Teams - compare in-house, agency, and AI automation approaches