We Ran a Real Legal Document Through 22 AI Translation Models: Here Is Exactly What Happened

Most articles about AI translation ask you to trust benchmarks. This one shows you what happens when you stop trusting them and start running a real document instead.

Earlier this year, our team ran a 12-page service agreement, the kind a growing SaaS startup would use before entering a new market, through 22 AI models, translating it from English into German, French, and Spanish. We tracked each output, noted where the models disagreed, and documented how those disagreements were resolved. The results were more instructive than any benchmark table we have published.

For the startup and technology readers at Gplus.to who are weighing AI tools for global expansion, here is the full step-by-step account of what we did and what we found.

Contents

Step 1: Choosing the right document and setting up the test
Step 2: Running the models and reading where they diverged
Step 3: How the majority-vote layer resolved the disagreements
What the results showed
What this means for startup and tech teams going global
Conclusion: The step-by-step is the strategy

Step 1: Choosing the right document and setting up the test

We chose a legal service agreement deliberately. It is a document type that is dense with conditional language, defined terms, and jurisdiction-specific phrasing. It is also a document type where a single mistranslation carries real business risk: an ambiguous liability clause in a localized contract is not an inconvenience, it is a liability.

The document was 12 pages, in English, and translated into German, French, and Spanish. Three languages. Three distinct legal cultures. Three sets of grammatical conventions that handle formal legal register differently.

Before running anything, we stripped the document of formatting and ran it through each model in clean plaintext. This is important: most accuracy problems in AI translation are format-related, not linguistic. Removing formatting noise isolates the linguistic signal. We also logged the exact version of each model used, because AI translation trends in 2026 confirm that model performance changes meaningfully across versions and updates.
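The setup step can be sketched in a few lines. This is a minimal illustration, not the tooling we used: the function `to_plaintext`, the `RunRecord` fields, and the model name and version string are all hypothetical, and the regular expressions only cover simple HTML-style tags and common markdown markers.

```python
import hashlib
import re
from dataclasses import dataclass


def to_plaintext(text: str) -> str:
    """Strip simple markup and collapse whitespace so every model sees the same input."""
    text = re.sub(r"<[^>]+>", " ", text)   # drop HTML-style tags
    text = re.sub(r"[*_#`]+", "", text)    # drop common markdown markers (crude)
    return re.sub(r"\s+", " ", text).strip()


@dataclass(frozen=True)
class RunRecord:
    """One logged translation run (all fields are illustrative)."""
    model_name: str
    model_version: str
    target_language: str
    source_sha256: str  # fingerprint of the exact plaintext that was submitted


def make_record(model_name: str, model_version: str,
                target_language: str, source_text: str) -> RunRecord:
    digest = hashlib.sha256(source_text.encode("utf-8")).hexdigest()
    return RunRecord(model_name, model_version, target_language, digest)


clean = to_plaintext("<p>The **Supplier** shall   indemnify\nthe Customer.</p>")
record = make_record("example-model", "2026-01", "de", clean)
```

Hashing the cleaned source makes it easy to confirm later that two runs really received identical input. Note that stripping underscores would also remove blank fill-in lines, so a real contract pipeline would need a more careful cleaner.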

Step 2: Running the models and reading where they diverged

We ran the document through 22 AI models, including the leading large language models and dedicated translation engines. Every output was captured and compared at the sentence level.
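A sentence-level comparison of this kind can be approximated as follows. The sketch assumes every model returns the same number of sentences in the same order, which real outputs do not always guarantee; the model names and the French sample sentences are made up for illustration.

```python
from collections import Counter


def normalize(sentence: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not count as variants."""
    return " ".join(sentence.lower().split())


def variants_per_sentence(outputs: dict[str, list[str]]) -> list[Counter]:
    """outputs maps model name -> translated sentences, aligned by index."""
    lengths = {len(sentences) for sentences in outputs.values()}
    if len(lengths) != 1:
        raise ValueError("all models must return the same number of sentences")
    n = lengths.pop()
    return [
        Counter(normalize(sentences[i]) for sentences in outputs.values())
        for i in range(n)
    ]


outputs = {
    "model_a": ["Clause d'indemnisation.", "Sauf accord contraire."],
    "model_b": ["Clause d'indemnisation.", "Sauf convention contraire."],
    "model_c": ["Clause de garantie.", "Sauf accord contraire."],
}
counts = variants_per_sentence(outputs)
# Indices of sentences where at least two distinct renderings appeared.
divergent = [i for i, c in enumerate(counts) if len(c) > 1]
```

Each `Counter` records how many models produced each rendering of a sentence, which is the raw material for spotting the divergence categories described next.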

Here is what we expected: broad agreement with occasional outliers. Here is what we found: systematic divergence in three recurring categories.

Category A: Defined terms

The models did not agree on how to render the English defined terms. In French, the contractual term “indemnification” was rendered as indemnisation by some models and as garantie by others. These are not interchangeable in French commercial law. One is compensation for a loss that has occurred; the other is a protection against a future one. A lawyer reading the contract would catch it. A founder in a growth sprint probably would not.

Category B: Conditional phrasing

German legal text handles conditionality differently from English. Constructions like “unless otherwise agreed” have specific German equivalents that carry different legal weight depending on the jurisdiction. Three of the 22 models defaulted to a literal rendering that would not hold up in a German court. Four others used the correct idiomatic form. The remaining models split across variants in between.

Category C: Hallucinated specificity

In two instances, models introduced specificity that was not in the source text. One model inserted a 30-day notice period into a clause that was deliberately left open. Another replaced a neutral reference to “applicable regulations” with a specific German statute. According to data synthesized from industry reporting, individual top-tier large language models fabricate or hallucinate content between 10% and 18% of the time during translation tasks. In legal translation, a 10% hallucination rate is not a quality issue. It is a legal risk.

Step 3: How the majority-vote layer resolved the disagreements

This is the step that changed what we understood about AI translation at scale.

Rather than picking a winner from the 22 outputs manually, we used MachineTranslation.com, which applies a structural filter: it compares the outputs of 22 AI models, identifies the translation that most of them agree on, weighted by contextual confidence, and surfaces that as the output.

In practice, this resolved the Category A and B divergences automatically. In the French indemnisation vs. garantie split, 16 of 22 models produced indemnisation in context. The majority output was unambiguous. In the German conditional phrasing split, the idiomatic form was the majority output across all three language variants.

Category C hallucinations were also caught. Because the 30-day notice insertion and the statute reference were produced by only one or two models, they were mathematical outliers. The majority filter discarded them without manual review.
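The general idea of a majority filter can be sketched as below. This is not MachineTranslation.com's implementation, which we have not seen: the `consensus` function and the optional `weights` argument (a hypothetical stand-in for "contextual confidence") are assumptions made for illustration, and the sample data simply mirrors the 16-of-22 French split.

```python
from collections import defaultdict


def consensus(candidates, weights=None):
    """Pick the translation backed by the largest (optionally weighted) share of models.

    candidates: dict of model name -> translated text for one segment.
    weights: optional dict of model name -> weight; defaults to 1.0 per model.
    Returns (winning_text, share_of_total_weight, dissenting_models).
    """
    if not candidates:
        raise ValueError("need at least one candidate translation")
    weights = weights or {}
    totals = defaultdict(float)
    for model, text in candidates.items():
        totals[text] += weights.get(model, 1.0)
    winner = max(totals, key=totals.get)  # ties go to the first variant seen
    share = totals[winner] / sum(totals.values())
    dissenters = sorted(m for m, t in candidates.items() if t != winner)
    return winner, share, dissenters


# Illustrative data: 16 models produce one rendering, 6 produce the other.
candidates = {f"model_{i:02d}": "indemnisation" for i in range(16)}
candidates.update({f"model_{i:02d}": "garantie" for i in range(16, 22)})
winner, share, dissenters = consensus(candidates)
```

Strictly speaking this picks the plurality winner; a caller that wants a true majority can check that `share` exceeds 0.5 and send the segment to human review otherwise. An insertion produced by a single model ends up in `dissenters` and never reaches the output.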

What the results showed

Here is the summary of what the test produced across all three languages:

Metric	German	French	Spanish
Models producing majority output	18 of 22	16 of 22	19 of 22
Defined-term divergences resolved automatically	3 of 3	2 of 2	3 of 3
Hallucinated insertions caught by filter	1	1	0
Single-model quality score (top model)	93.8/100	94.2/100	93.1/100
Multi-model consensus quality score	98.5/100	98.5/100	98.5/100
Critical error rate (post-consensus)	< 2%	< 2%	< 2%

Source: MachineTranslation.com internal benchmarks; WMT24 General Machine Translation Findings.
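For readers who prefer percentages, the agreement counts and score gaps in the table reduce to simple arithmetic. The snippet below only restates the table's figures; the variable names are ours.

```python
TOTAL_MODELS = 22

majority_counts = {"German": 18, "French": 16, "Spanish": 19}
top_single_score = {"German": 93.8, "French": 94.2, "Spanish": 93.1}
CONSENSUS_SCORE = 98.5

# Share of models that produced the majority output, as a percentage.
agreement_pct = {
    lang: round(100 * n / TOTAL_MODELS, 1) for lang, n in majority_counts.items()
}

# Points gained by consensus over the best single model.
score_gain = {
    lang: round(CONSENSUS_SCORE - score, 1) for lang, score in top_single_score.items()
}

for lang in majority_counts:
    print(f"{lang}: {agreement_pct[lang]}% agreement, +{score_gain[lang]} points")
```

On these figures, French shows the lowest agreement (about 73% of models) and Spanish the largest score gain over the best single model.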

What this means for startup and tech teams going global

The findings of this test point to three practical conclusions that apply to any company translating documents at scale.

1. Single-model confidence is a risk posture, not a quality strategy

If your team is copying text into one AI model and treating the output as final, you are absorbing the variance of that model’s training data. For internal content, the risk may be acceptable. For contracts, investor materials, regulatory filings, or customer-facing legal disclosures in a new market, it is not.

2. Model disagreement is diagnostic information

Where models diverged on the legal agreement, the divergence told us exactly where the source text was ambiguous or where there was genuine linguistic complexity. Disagreement is signal. A system that shows you where the AI models disagree is more valuable than one that hides the disagreement behind a single confident output.
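One way to turn disagreement into a number is normalized Shannon entropy over the variants produced for a segment. This is our own illustrative choice of metric, not something the platform is known to use, and the function name `disagreement` is hypothetical.

```python
import math
from collections import Counter


def disagreement(variants: list[str]) -> float:
    """Normalized Shannon entropy of the variants for one segment.

    Returns 0.0 when every model agrees and approaches 1.0 when every
    model produces a different rendering.
    """
    n = len(variants)
    if n < 2:
        return 0.0
    counts = Counter(variants)
    entropy = sum(-(c / n) * math.log2(c / n) for c in counts.values())
    return entropy / math.log2(n)


unanimous = disagreement(["indemnisation"] * 22)
one_outlier = disagreement(["indemnisation"] * 21 + ["garantie"])
split_16_6 = disagreement(["indemnisation"] * 16 + ["garantie"] * 6)
```

Sorting segments by this score puts the most contested passages at the top of a reviewer's queue. A 16-to-6 split scores higher than a single outlier, which matches the intuition that the defined-term split deserves more attention than a lone hallucination that the filter already discards.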

3. The verification burden shrinks when errors are caught structurally

Post-translation review is expensive. Professional legal translators charge significantly more than standard translation rates. Our internal data show that users who switched to a multi-model output spent on average 27% less time fixing errors than those who relied on a single AI output. When errors are eliminated structurally, the human review step becomes a quality check rather than an error hunt.

Conclusion: The step-by-step is the strategy

What this experiment confirmed is that the process matters as much as the output. Running a document through a single AI model and accepting the result skips the most valuable step: understanding where the models disagree and why.

For startups preparing to enter a new market in 2026, the practical recommendation is straightforward: before locking in any AI translation workflow for legal or compliance content, run your most complex document through as many models as you can access, compare the outputs manually, and note the divergences. Then decide whether you want to do that comparison by hand every time or whether a platform that does it structurally makes more sense.

MachineTranslation.com was built to do that structural comparison automatically, across 22 models, on every translation. The step-by-step experiment above is what it runs beneath the surface on every document it processes.