The OpenAI API Can Translate Your App. The Hard Part Is Everything Else.

Any developer can call the OpenAI API with a prompt like "translate this JSON to Swedish" and get a reasonable result back. Modern LLMs are genuinely impressive at translation. The hard part is turning that into a well-oiled workflow for your team: getting context-aware translations that fit your product, keeping terminology consistent, letting non-technical teammates review without touching code, fitting translations into your development process, and catching quality issues before your users do. And doing all of that on every change in your product.

Localhero.ai started out with the same idea: a simple script should be enough. I've documented a few different ways to get it working with GitHub Actions if you want to explore that path. But when you break down the translation problem, you start to realize how much has to work around that script before shipping a localized product is nearly as easy as shipping a single-language one. This post covers some of the things that come up when weighing the build-it-yourself vs. service approach to AI translation automation.

I've spent years building internationalized products, both as a developer and as a founder. Translation workflows are one of those things that quietly break product velocity: developers don't want to think about localization, product people want it to just work, and the internal processes that fill the gap (spreadsheets, Slack threads, manual copy-paste) are slow and no fun.

Where the script stops being enough

You already support a couple of languages and localization has always been a manual process. Someone updates a spreadsheet, someone else copies it into the codebase, things get missed. A script calling the API feels like the obvious fix. It works at first, but the same kinds of problems keep showing up.

Slack message: Hey @channel, why are all the Norwegian translations missing on the new address book feature?

Your product starts speaking with two voices

You translate your settings page. Dashboard becomes Instrumentpanel in Swedish. Weeks later, someone else translates the navigation. The LLM picks Kontrollpanel for the same concept. Both are valid translations. Both are completely inconsistent with each other.

This is the glossary problem. Every product has terms that should be translated the same way every time. Feature names, UI concepts, domain-specific words. The LLM doesn't remember what it said last time. It doesn't know your team decided Dashboard is always Instrumentpanel.

And it's not just a list of terms. Some terms should never be translated. Others need different translations per language: Dashboard stays as Dashboard in German but becomes Instrumentpanel in Swedish. Some need context about how they should be used, like "this is our brand name for the feature, keep it friendly and informal". A glossary without that kind of nuance doesn't actually solve the problem.

So now you're building glossary management and style guides. Storing terms with per-language overrides and usage notes, injecting them into prompts, keeping them in sync. You're not working on your product anymore. You're maintaining a terminology database.
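To make that concrete, here is a minimal sketch of what "terms with per-language overrides and usage notes, injected into prompts" might look like. Everything in it is an assumption for illustration: the `GlossaryTerm` shape, the `glossary_prompt` helper, and the "Acme Boards" brand name are hypothetical, not how any particular tool stores its glossary.

```python
from dataclasses import dataclass, field


@dataclass
class GlossaryTerm:
    """One glossary entry (hypothetical shape, for illustration only)."""
    source: str
    translations: dict[str, str] = field(default_factory=dict)  # per-language overrides
    do_not_translate: bool = False
    note: str = ""  # usage note passed along to the model


def glossary_prompt(terms: list[GlossaryTerm], lang: str) -> str:
    """Render the glossary rules that apply to one target language."""
    lines = []
    for term in terms:
        if term.do_not_translate:
            rule = f'Keep "{term.source}" untranslated.'
        elif lang in term.translations:
            rule = f'Translate "{term.source}" as "{term.translations[lang]}".'
        else:
            continue  # no rule recorded for this language
        if term.note:
            rule += f" Note: {term.note}"
        lines.append(f"- {rule}")
    return "\n".join(lines)


GLOSSARY = [
    GlossaryTerm("Dashboard", {"sv": "Instrumentpanel", "de": "Dashboard"}),
    # "Acme Boards" is a made-up brand name used only for this example.
    GlossaryTerm("Acme Boards", do_not_translate=True,
                 note="Brand name for the feature; keep it friendly and informal."),
]

# The rendered rules get prepended to whatever translation prompt you send.
print(glossary_prompt(GLOSSARY, "sv"))
```

The rendering is the easy part. Keeping this list in sync across projects and making sure every translation call actually uses it is where the maintenance work shows up.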

Your team can't see what changed

Translations land in a JSON file in a pull request. Your product manager wants to review and tweak a thing before shipping. You point them at the diff on GitHub. They see key-value pairs in a language they may not speak, with no indication of what the source string was. If they're reviewing five languages, they're looking at five diffs per key.

What they actually need is a way to see the source string next to each translation, tweak the ones that don't sound right, and push the changes back to the pull request. Even for languages they don't speak, they need to be able to quickly make adjustments that fit the brand. And when someone does make an adjustment, how does that get back into the PR without a developer manually copying strings?

So now you're building a review interface with sync back to GitHub. You're not working on your product anymore. You're building a translation review tool for people who don't write code.
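The data side of that review view is simple to sketch. The example below assumes flat key-value locale files that are already loaded into dictionaries; the `review_rows` helper and the sample keys are hypothetical. It lines up each source string with every translation, which also makes missing strings visible.

```python
def review_rows(
    source: dict[str, str],
    translations: dict[str, dict[str, str]],
) -> list[dict[str, str]]:
    """Build one row per key: the source string next to each language."""
    rows = []
    for key, text in source.items():
        row = {"key": key, "source": text}
        for lang, strings in translations.items():
            # An empty cell means the translation is missing for this language.
            row[lang] = strings.get(key, "")
        rows.append(row)
    return rows


# Sample data, made up for this example.
source = {"nav.dashboard": "Dashboard", "address_book.title": "Address book"}
translations = {
    "sv": {"nav.dashboard": "Instrumentpanel", "address_book.title": "Adressbok"},
    "nb": {"nav.dashboard": "Kontrollpanel"},  # address book string not translated yet
}

for row in review_rows(source, translations):
    print(row)
```

Producing the rows is the small part. The real work is the editing UI on top of them and writing the edited values back to the pull request.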

Reviewing and updating a translation on a pull request, then syncing the change back to GitHub with one click.

Quality degrades silently

This one is easy to miss. Modern LLMs do a good job at basic translations. The problem is consistency: maybe 75% of the time it gets it right, but the other 25% of the time you get an interpolation variable translated ({errorMessage} becomes {felmeddelande}), formal and informal tone mixed in the same product, your brand name inflected, or phrasing that just sounds like AI wrote it: em dashes everywhere, overly polite wording, unnaturally long sentences. None of these crash your app. They just erode trust.

So now you're building quality validation and an eval pipeline to catch regressions. Checking variables, format strings, plural forms, tone consistency, string length. You're not working on your product anymore. You're building a translation QA system.
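The variable check is the most mechanical of these, so here is a rough sketch of it. It assumes simple `{name}` placeholders; nested ICU plural syntax would need a real parser. The `check_translation` helper and the length ratio of 2.0 are arbitrary choices for illustration.

```python
import re
from collections import Counter

# Matches simple {name} placeholders only (an assumption, see above).
PLACEHOLDER = re.compile(r"\{[^{}]+\}")


def check_translation(source: str, translation: str,
                      max_length_ratio: float = 2.0) -> list[str]:
    """Return a list of human-readable issues; an empty list means no issues found."""
    issues = []
    expected = Counter(PLACEHOLDER.findall(source))
    found = Counter(PLACEHOLDER.findall(translation))

    missing = sorted((expected - found).elements())
    unexpected = sorted((found - expected).elements())
    if missing:
        issues.append(f"missing placeholders: {missing}")
    if unexpected:
        issues.append(f"unexpected placeholders: {unexpected}")

    # Crude string-length check: flags translations likely to overflow the UI.
    if source and len(translation) > max_length_ratio * len(source):
        issues.append("translation is much longer than the source")
    return issues


print(check_translation("Error: {errorMessage}", "Fel: {felmeddelande}"))
print(check_translation("Error: {errorMessage}", "Fel: {errorMessage}"))
```

The first call reports the translated variable as both a missing and an unexpected placeholder; the second passes. Tone, brand-name inflection, and "sounds like AI" phrasing can't be caught with a regex, which is why the eval pipeline ends up being the bigger job.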

Quality Insights: catching drift, inconsistencies, and errors in both AI and human translations.

Agents multiply the problem

AI coding agents are writing more software, faster. In a recent interview, Anthropic CEO Dario Amodei confirmed that AI now writes the vast majority of code at Anthropic, a prediction he'd made months earlier that has since played out across many teams. When agents generate features, they generate user-facing strings. More strings, more frequently, across more pull requests.

These agents have zero localization context. They don't know your glossary. They don't know your style conventions. They don't know that Dashboard should be Instrumentpanel and not Kontrollpanel. And they produce strings at machine speed, which means terminology drift happens at machine speed too.

We built an agent skill for Claude Code that gives coding agents access to your project's glossary, style guide, and LLM translation workflow. When an agent writes a new feature, it already knows how your product speaks.

Using the Localhero.ai skill in Claude Code to manage translations while building a feature.

A process that works when one developer remembers to run a script won't survive a workflow where agents generate features continuously. You're not working on your product anymore. You're building agent-aware localization infrastructure.

The pattern

You started by calling an API. Then you needed:

Glossary management to keep terminology consistent
Style guides to maintain the right tone per language
A review interface for non-technical team members
Quality validation to catch silent failures
Git integration to fit your development workflow
Agent-aware localization to keep your AI coding tools consistent with your translation conventions

Each one seems small enough to justify on its own. Together they're a localization workflow, one that pulls your attention from the thing your team actually ships. And unlike your product, nobody's excited about maintaining the translation pipeline. It's the kind of internal tooling that works until it doesn't, and then it's your problem on a Friday afternoon.

Focus on what makes your beer taste better

Building Localhero.ai taught us something about this. The raw AI translation capability is converging toward commodity. Anyone can call the API and get a decent result back. But decent and production-ready in your project's context are very different things. Getting consistently high-quality results takes careful prompt engineering, injecting the right context per project, and a lot of evaluation work. At Localhero.ai, every translation call gets language-specific prompt rules (formal Sie-form in German, appropriate politeness levels in Japanese, natural phrasing in Scandinavian languages), your project's glossary and style guide injected into context, and a confidence score that flags anything below threshold for review.

Jeff Bezos made this point at Y Combinator back in 2008. Early European breweries spent enormous effort building their own electricity generation. The ones that won were the breweries that came later, when power was a utility, and could focus entirely on what made their beer taste better. His point was: focus on what makes your beer taste better. Basic translation is heading the same way.

What's not a commodity is the operational layer that makes translations actually work well inside a product team:

A glossary that grows with your project. Terms your team decided on over time, enforced automatically on every translation, across every language and every project.
Style guides and language-aware defaults. You set your project's tone and brand voice once. The system layers on language-specific conventions automatically: formal Sie-form in German, appropriate politeness levels in Japanese, natural phrasing in Scandinavian languages. You don't configure each language individually.
A review workflow people actually use. Your product manager, content lead, or native-speaking colleague reviews in a UI built for them. Not a JSON diff. Not a Slack thread asking "what does this key mean?"
Git integration that stays out of the way. Translations appear in the PR, ready to review, without developers thinking about it.
Quality that improves with use. Every translation gets a confidence score. High-scoring translations become reference examples for future ones, matched by semantic similarity to the content being translated. The more you translate, the better context the system draws from.

In our experience, this is the layer that makes localization actually work. The system around the API call that makes things just work.

When to build, when to buy

Every team's situation is different. But if you're a product team shipping multiple languages, translation plumbing is not what makes your beer taste better. You want a solid process that handles terminology, review, quality, and deployment without your team thinking about it.

That's what we built Localhero.ai to do.