Local translation: when small dedicated models beat Goliath

Turns out the privacy angle isn’t theoretical. Cloud translation services handle inputs under widely varying terms of service, and free tiers don’t offer the same declared protections as the paid APIs. Offline translation, on the other hand, eliminates entire categories of compliance burden: no data processing agreements, no audits, no cross-border transfer concerns.

“Bigger” is not always the answer

The landscape

We’re not the only ones thinking this way:

import { loadModel, translate, unloadModel, BERGAMOT_EN_FR } from "@qvac/sdk"

const modelId = await loadModel({
  modelSrc: BERGAMOT_EN_FR,
  modelType: "nmt",
  modelConfig: {
    engine: "Bergamot",
    from: "en",
    to: "fr",
    beamsize: 1,
    temperature: 0.2,
  },
  onProgress: (progress) => {
    console.log(progress)
  },
})

const result = translate({
  modelId,
  text: "This is a test of the Bergamot translation model.",
  modelType: "nmt",
  stream: false,
})

const translatedText = await result.text
console.log(translatedText)
// → "C'est un test du modèle de traduction Bergamot."

await unloadModel({ modelId })

That’s it. The model downloads once, caches locally, and every subsequent call is fully offline. No network request, no third party ever sees your text.

“Yes but I want to translate a batch of sentences at once”

Worry not, just pass an array instead:

const result = translate({
  modelId,
  text: [
    "Hello world",
    "How are you today?",
    "The weather is nice",
  ],
  modelType: "nmt",
  stream: false,
})

const translatedText = await result.text
// → "Bonjour le monde"
// → "Comment allez-vous aujourd'hui?"
// → "Le temps est beau"

What we learned solving this problem

We didn’t start with Bergamot. Our first translation models were Opus-MT (Marian-based), and they worked – but they were larger and slower than what we wanted to deliver on mobile devices. Opus also had a coverage problem: if we needed a language pair that didn’t exist, we’d have to train the model ourselves. This quickly became a huge investment for each new pair. When we found Bergamot, which is smaller, faster, and has broader language coverage out of the box, the switch was obvious.

That said, QVAC also supports LLM-based translation – and it’s not just a fallback: it’s a strategy. When we need to add a new language pair fast, we can ship it immediately using a larger LLM while training a smaller dedicated NMT model in parallel. Users get support on day one, and the experience gets faster and leaner over time as the dedicated model replaces the LLM.

One thing we didn’t prioritize early enough was batch translation. Translating sentences one at a time is fine for a demo, but in production you’re often handling paragraphs, chat histories, or entire documents. When we added batch support, the speed improvement was dramatic – 2.5x faster throughput at 100 sentences, with per-sentence latency dropping from 26ms to 10ms. If you’re building anything beyond a single-input text box, batch is a must-have.
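One way to feed a long document through a batch call is to split the sentence list into fixed-size chunks and translate each chunk as one array. The sketch below is a minimal illustration, not the SDK's implementation: `translateBatch` is a hypothetical callback standing in for whatever batch call you use, and the default batch size of 100 is an assumption to tune on your target device.

```typescript
// Hypothetical signature: takes an array of sentences and resolves to their
// translations in the same order. In a real app this would wrap a batch call.
type TranslateBatch = (sentences: string[]) => Promise<string[]>;

// Split a list into consecutive chunks of at most `size` items.
function chunk<T>(items: T[], size: number): T[][] {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError("size must be a positive integer");
  }
  const chunks: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    chunks.push(items.slice(i, i + size));
  }
  return chunks;
}

// Translate all sentences one batch at a time, preserving input order.
async function translateInBatches(
  sentences: string[],
  translateBatch: TranslateBatch,
  batchSize = 100, // assumption: tune per device and model
): Promise<string[]> {
  const output: string[] = [];
  for (const batch of chunk(sentences, batchSize)) {
    output.push(...(await translateBatch(batch)));
  }
  return output;
}
```

Keeping the batch call behind a callback means the same chunking logic works whether the backend is a dedicated NMT model or an LLM.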

If we were starting over today, the first thing we’d do is benchmark accuracy and speed on actual target devices before committing to any model architecture. We got lucky that Bergamot turned out to be the right call, but we could have saved time by validating that assumption up front.
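A device benchmark does not need much code. The sketch below times any translation function over a fixed sentence set and reports the total time and the average milliseconds per sentence. `translateAll` is a hypothetical stand-in for the code path you want to measure, and the clock is injectable so the logic can be checked deterministically. It does not measure tokens/sec, since that depends on the model's tokenizer.

```typescript
interface BenchmarkResult {
  sentences: number;
  totalMs: number;
  msPerSentence: number;
}

// Time a single run of `translateAll` over `sentences`.
// `now` defaults to Date.now; pass a finer-grained clock if one is available.
async function benchmark(
  sentences: string[],
  translateAll: (sentences: string[]) => Promise<unknown>,
  now: () => number = Date.now,
): Promise<BenchmarkResult> {
  if (sentences.length === 0) {
    throw new Error("need at least one sentence");
  }
  const start = now();
  await translateAll(sentences);
  const totalMs = now() - start;
  return {
    sentences: sentences.length,
    totalMs,
    msPerSentence: totalMs / sentences.length,
  };
}
```

Running it once with a sequential code path and once with a batch code path, on the same device and the same sentences, gives directly comparable numbers.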

Solving for all languages

Now, let’s assume we want to support 26 languages. Each model is unidirectional, so covering English and French in both directions takes two models: EN→FR and FR→EN. So how many language-pair models would an app need for full coverage? The naive answer is 650 (26² − 26, or 26 × 25). But that would be absurd – no one’s gonna ship 650 models to a phone!

The solution is pivoting through English. Want Spanish to Italian? Load ES→EN and EN→IT and chain them together. If you’re using the QVAC SDK, a single API call and two models cover any language pair, as long as both ends have an English bridge.
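The routing rule behind pivoting is small enough to sketch. The function below is a hypothetical helper, not part of the QVAC SDK: given a source and a target language code, it returns the chain of unidirectional models needed, using one model when English is on either end and two otherwise.

```typescript
const PIVOT = "en";

// One unidirectional model, e.g. { from: "es", to: "en" }.
interface ModelPair {
  from: string;
  to: string;
}

// Return the chain of models needed to translate `from` -> `to`.
function planRoute(from: string, to: string): ModelPair[] {
  if (from === to) return []; // nothing to translate
  if (from === PIVOT || to === PIVOT) return [{ from, to }]; // direct pair
  return [
    { from, to: PIVOT }, // source -> English
    { from: PIVOT, to }, // English -> target
  ];
}

console.log(planRoute("es", "it"));
// [ { from: 'es', to: 'en' }, { from: 'en', to: 'it' } ]
```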

With pivot translation, 26 languages need roughly 50 models instead of 650. Each model is ~20 MB. That’s about 1 GB total for full coverage, much less than a single general-purpose LLM and much faster.
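The arithmetic behind those numbers, as a quick check: with n languages there are n × (n − 1) ordered pairs, while with an English pivot each of the other n − 1 languages needs one model into English and one out of it. The 20 MB per model used here is the approximate figure quoted above, so the total is an estimate.

```typescript
// Models needed to cover every ordered language pair directly.
function directModelCount(languages: number): number {
  return languages * (languages - 1);
}

// Models needed when every translation pivots through one hub language.
function pivotModelCount(languages: number): number {
  return 2 * (languages - 1);
}

const languageCount = 26;
const mbPerModel = 20; // approximate size from the text

console.log(directModelCount(languageCount)); // 650
console.log(pivotModelCount(languageCount)); // 50
console.log(pivotModelCount(languageCount) * mbPerModel); // 1000 (MB, about 1 GB)
```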

import { BERGAMOT_ES_EN, BERGAMOT_EN_IT, loadModel, translate, unloadModel } from "@qvac/sdk"

const modelId = await loadModel({
  modelSrc: BERGAMOT_ES_EN,
  modelType: "nmt",
  modelConfig: {
    engine: "Bergamot",
    from: "es",
    to: "it",
    beamsize: 4,
    temperature: 0.3,
    pivotModel: {
      modelSrc: BERGAMOT_EN_IT,
      beamsize: 4,
      temperature: 0.3,
    },
  },
  onProgress: (progress) => {
    console.log(progress)
  },
})

const spanishText = "Era una mañana soleada cuando María decidió visitar el mercado local."

const result = translate({
  modelId,
  text: spanishText,
  modelType: "nmt",
  stream: false,
})

const italianText = await result.text
console.log(italianText)
// → "Era una mattina di sole quando Maria decise di visitare il mercato locale."

await unloadModel({ modelId })

Summary

| Approach | Model Size | Speed | Consistency |
|---|---|---|---|
| Cloud API (Google, DeepL) | N/A (server-side) | 50-600ms + latency | High |
| Salamandrata 2B LLM (Q4) | 1.5 GB | ~3.6s per sentence | Inconsistent |
| Opus-MT (per pair) | ~300 MB | ~100ms | High |
| Bergamot (per pair) | 21-35 MB | ~46ms/sentence | High |
| Bergamot pivot (2 pairs) | ~56 MB | ~80-100ms | High |

On a Linux laptop, Bergamot EN→IT loads in only 105ms, scores 74.25 BLEU against reference translations, and hits 633 tokens/sec in batch mode (roughly 100 sentences in just over a second).

But the real test is a phone. We ran the same benchmark on a Pixel 10 Pro XL (Tensor G5, ARM64) running Bare runtime directly on-device. The model (35 MB) loaded in just 78ms.

| Mode | ms/sentence | tokens/sec | Total for 100 sentences (s) |
|---|---|---|---|
| Batch | 26.1 | 870 | 2.6 |
| Sequential | 163.4 | 139 | 16.3 |

Batch mode on the phone is 6.3x faster than sequential. At 870 tokens/sec, you can translate a full page of text in under 3 sec, entirely on-device, with zero network calls – on mobile hardware this is the difference between a responsive UI and a loading spinner.
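The headline ratio can be recomputed from the per-sentence timings in the table. This is only a sanity check on the published figures, with the 100-sentence run size taken from the table header:

```typescript
const sentenceCount = 100;
const batchMsPerSentence = 26.1;
const sequentialMsPerSentence = 163.4;

// Total wall-clock time in seconds for the whole run.
function totalSeconds(msPerSentence: number, count: number): number {
  return (msPerSentence * count) / 1000;
}

// How many times faster batch mode is than sequential mode.
const speedup = sequentialMsPerSentence / batchMsPerSentence;

console.log(totalSeconds(batchMsPerSentence, sentenceCount).toFixed(1)); // 2.6
console.log(totalSeconds(sequentialMsPerSentence, sentenceCount).toFixed(1)); // 16.3
console.log(speedup.toFixed(1)); // 6.3
```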

What’s next

We’re expanding language coverage to Indic languages through IndicTrans (which currently supports 26 languages) and to African languages through AfriqueGemma (which currently supports 24 languages). We’re doing this first through the specialised LLM approach, until dedicated NMT models exist; those are currently unavailable for most pairs.

We’re also looking at supporting streaming translation for real-time use cases like live chat and subtitle generation.

Check out the QVAC repo at github.com/tetherto/qvac. Try it. Break it. Tell us everything: feedback, complaints, and language pair requests are all welcome in our Discord community. Also, let us know if you’re building something where local AI plays a role – we’d love to hear about it!