Article by Olya Sirkin

Imagine you’re translating a medical document for a family member. Or maybe it’s a contract, a private message, a journal entry. You paste it into a cloud translation API and magically, something happens. But… where did your input go? Who read it? For how long is it stored?

Turns out the privacy angle isn’t theoretical. Cloud translation services handle your input under a wide variety of terms, and free tiers often don’t offer the same stated protections as the paid APIs. Offline translation, on the other hand, eliminates entire categories of compliance burden: no data processing agreements, no audits, no cross-border transfer concerns.

But translation that works anywhere, at any time, with your data under your control, is not as straightforward as one might think. Still, we decided this was a non-negotiable premise – everything must happen entirely on the device: phones, laptops, even embedded hardware. This is how we went about it.

“Bigger” is not always the answer

The AI world is obsessed with scale right now. The instinct when you need translation is to reach for a large language model – a 7B or 13B parameter general-purpose model that can translate as one of a hundred things it does. It works. Sometimes well, sometimes less well.

But on edge devices, hardware is the real constraint. Even a small 2B translation LLM like SalamandraTA takes 1.5GB of storage, 12s to load, and 3.6s per sentence on a laptop CPU. Upgrade to even a modest 7B model and you immediately jump to 4-14GB, with slower results still. On a phone, that UX is not acceptable.

Now compare that with a smaller, dedicated translation model for a specific language pair: it weighs in at 20-35MB (!). It loads in 105ms and translates in about 10ms per sentence in batch mode – that’s 344x faster than the LLM. And it’s far less prone to hallucination, because it’s not generating creative text – it’s doing one thing, and doing it well.

The math is simple: if you need English-to-French translation, you don’t need a model that also writes poetry, summarises articles, and role-plays as a pirate.

The landscape

We’re not the only ones thinking this way:

Mozilla, together with European university partners, built Bergamot with EU funding to bring translation directly into Firefox – no need for extensions or cloud, just WebAssembly running in your browser
Google’s offline translation packs are 35-45MB each and serve billions of users
Apple’s on-device translation works across their native apps without sending text to a server

The trend is clear: translation is moving to the edge. The models are small enough. The hardware is fast enough. The only question is how to get there as a developer without building everything from scratch.

That’s where we want QVAC to help! For translation specifically, QVAC wraps dedicated neural machine translation (NMT) engines (small, fast models built for this job) and exposes them through a clean SDK. You pick a language pair, load the model, and translate. Everything stays on your device.

Here’s a simple example using English to French with the Bergamot engine:

import { loadModel, translate, unloadModel, BERGAMOT_EN_FR } from "@qvac/sdk"

const modelId = await loadModel({
  modelSrc: BERGAMOT_EN_FR,
  modelType: "nmt",
  modelConfig: {
    engine: "Bergamot",
    from: "en",
    to: "fr",
    beamsize: 1,
    temperature: 0.2,
  },
  onProgress: (progress) => {
    console.log(progress)
  },
})

const result = translate({
  modelId,
  text: "This is a test of the Bergamot translation model.",
  modelType: "nmt",
  stream: false,
})

const translatedText = await result.text
console.log(translatedText)
// → "C'est un test du modèle de traduction Bergamot."

await unloadModel({ modelId })

That’s it. The model downloads once, caches locally, and every subsequent call is fully offline. No network request, no third party ever sees your text.

“Yes but I want to translate a batch of sentences at once”

Worry not, just pass an array instead:

const result = translate({
  modelId,
  text: [
    "Hello world",
    "How are you today?",
    "The weather is nice",
  ],
  modelType: "nmt",
  stream: false,
})

const translatedText = await result.text
// → "Bonjour le monde"
// → "Comment allez-vous aujourd'hui?"
// → "Le temps est beau"

What we learned solving this problem

We didn’t start with Bergamot. Our first translation models were Opus-MT (Marian-based), and they worked – but they were larger and slower than what we wanted to deliver on mobile devices. Opus also had a coverage problem: if we needed a language pair that didn’t exist, we’d have to train the model ourselves. This quickly became a huge investment for each new pair. When we found Bergamot, which is smaller, faster, and has broader language coverage out of the box, the switch was obvious.

That said, QVAC also supports LLM-based translation – and it’s not just a fallback: it’s a strategy. When we need to add a new language pair fast, we can ship it immediately using a larger LLM while training a smaller dedicated NMT model in parallel. Users get support on day one, and the experience gets faster and leaner over time as the dedicated model replaces the LLM.
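The "LLM first, dedicated model later" strategy can be sketched as a small lookup. This is a minimal illustration, not part of the QVAC SDK: the names chooseEngine, pairKey and nmtPairs are hypothetical, and so is the idea of keeping the list of trained pairs in a plain set.

```typescript
// Hypothetical sketch: none of these names come from the QVAC SDK.
type Engine = "nmt" | "llm";

interface PairSupport {
  from: string;
  to: string;
  engine: Engine;
}

// Build a lookup key for a unidirectional language pair.
function pairKey(from: string, to: string): string {
  return `${from}->${to}`;
}

// Use the dedicated NMT model when one exists for this pair,
// otherwise serve the pair with the larger LLM for now.
function chooseEngine(from: string, to: string, nmtPairs: Set<string>): PairSupport {
  const engine: Engine = nmtPairs.has(pairKey(from, to)) ? "nmt" : "llm";
  return { from, to, engine };
}

// Assumed state: EN<->FR already has dedicated models, EN->HI does not yet.
const nmtPairs = new Set<string>([pairKey("en", "fr"), pairKey("fr", "en")]);
const enFr = chooseEngine("en", "fr", nmtPairs); // engine: "nmt"
const enHi = chooseEngine("en", "hi", nmtPairs); // engine: "llm"
```

Once a dedicated model for a pair is trained, adding its key to the set is enough to move that pair off the LLM without touching the calling code.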

One thing we didn’t prioritize early enough was batch translation. Translating sentences one at a time is fine for a demo, but in production you’re often handling paragraphs, chat histories, or entire documents. When we added batch support, the speed improvement was dramatic – 2.5x faster throughput at 100 sentences, with per-sentence latency dropping from 26ms to 10ms. If you’re building anything beyond a single-input text box, batch is a must-have.
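To make the batch advice concrete, here is a rough sketch of turning a document into sentence batches before translating. It assumes nothing about the SDK: translateBatch is a stand-in for whatever batch call your engine exposes, the sentence splitter is deliberately naive, and the default batch size of 100 simply mirrors the figure quoted above.

```typescript
// Stand-in for an engine's batch call; not an SDK signature.
type TranslateBatch = (sentences: string[]) => Promise<string[]>;

// Naive splitter: a sentence is a run of text ending in ., ! or ?.
// Real documents (abbreviations, decimals) need something smarter.
function splitSentences(text: string): string[] {
  const parts: string[] = text.match(/[^.!?]+[.!?]*/g) || [];
  return parts.map((s) => s.trim()).filter((s) => s.length > 0);
}

// Group items into consecutive batches of at most `size` elements.
function chunk<T>(items: T[], size: number): T[][] {
  if (size < 1) throw new Error("size must be at least 1");
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

// Translate a whole document batch by batch, preserving sentence order.
async function translateDocument(
  text: string,
  translateBatch: TranslateBatch,
  batchSize = 100,
): Promise<string[]> {
  const results: string[] = [];
  for (const batch of chunk(splitSentences(text), batchSize)) {
    results.push(...(await translateBatch(batch)));
  }
  return results;
}
```

Keeping the batch size as a parameter makes it easy to tune per device, since the best value depends on available memory and how long a user is willing to wait for the first result.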

If we were starting over today, the first thing we’d do is benchmark accuracy and speed on actual target devices before committing to any model architecture. We got lucky that Bergamot turned out to be the right call, but we could have saved time by validating that assumption up front.
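A device benchmark of this kind does not need much machinery. The sketch below is one possible harness, under the assumption that you can wrap your engine in a one-sentence function and a batch function; TranslateOne, TranslateMany, timeIt and compareModes are illustrative names, and Date.now is used only because it is available everywhere (a higher-resolution clock can be passed in instead).

```typescript
// Illustrative wrappers around whatever engine is being evaluated.
type TranslateOne = (sentence: string) => Promise<string>;
type TranslateMany = (sentences: string[]) => Promise<string[]>;

interface Timing {
  totalMs: number;
  msPerSentence: number;
}

// Time a single async run and derive the per-sentence cost.
async function timeIt(
  run: () => Promise<unknown>,
  count: number,
  now: () => number = Date.now,
): Promise<Timing> {
  const start = now();
  await run();
  const totalMs = now() - start;
  return { totalMs, msPerSentence: totalMs / count };
}

// Run the same sentences sequentially and as one batch, then compare.
async function compareModes(
  sentences: string[],
  one: TranslateOne,
  many: TranslateMany,
  now: () => number = Date.now,
): Promise<{ sequential: Timing; batch: Timing; speedup: number }> {
  const sequential = await timeIt(async () => {
    for (const s of sentences) await one(s);
  }, sentences.length, now);
  const batch = await timeIt(() => many(sentences), sentences.length, now);
  // Note: speedup is not meaningful if the batch run is too fast for the clock.
  return { sequential, batch, speedup: sequential.totalMs / batch.totalMs };
}
```

Running this on the actual target phone, with sentences that resemble real user input, gives numbers that can be compared across candidate models before committing to one.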

Solving for all languages

Now, let’s assume we want to support 26 languages. Each model is unidirectional, which means that to cover English and French in both directions you’d need both EN→FR and FR→EN. So how many language-pair models does an app need for full coverage? The naive answer is 650 (26² – 26, one per ordered pair). But that would be absurd – no one’s gonna ship 650 models to a phone!

The solution is pivoting through English. Want Spanish to Italian? Load ES→EN and EN→IT and chain them together. With the QVAC SDK, that takes a single API call and two models, and it covers any language pair as long as both ends have an English bridge.

With pivot translation, 26 languages (English plus 25 others) need 50 models instead of 650: one into English and one out of English for each of the other 25. At ~20 MB per model, that’s about 1 GB total for full coverage, less than a single general-purpose LLM and much faster.
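The arithmetic behind those numbers is easy to check. The helper names below are made up for illustration, and the 20 MB per model is the approximate figure quoted above.

```typescript
// Direct coverage: one model per ordered pair of distinct languages.
function directModelCount(languages: number): number {
  return languages * (languages - 1);
}

// Pivot coverage: every non-pivot language needs one model into the pivot
// and one model out of it.
function pivotModelCount(languages: number): number {
  return 2 * (languages - 1);
}

// Total storage for a given number of models of roughly equal size.
function totalSizeMB(models: number, sizePerModelMB: number): number {
  return models * sizePerModelMB;
}

const direct = directModelCount(26); // 650
const pivot = pivotModelCount(26); // 50
const pivotStorageMB = totalSizeMB(pivot, 20); // 1000 MB, about 1 GB
```

The direct count grows quadratically with the number of languages while the pivot count grows linearly, which is why the gap widens with every language added.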

Here’s the Spanish-to-Italian case with the SDK, where pivotModel supplies the second hop:

import { BERGAMOT_ES_EN, BERGAMOT_EN_IT, loadModel, translate, unloadModel } from "@qvac/sdk"

const modelId = await loadModel({
  modelSrc: BERGAMOT_ES_EN,
  modelType: "nmt",
  modelConfig: {
    engine: "Bergamot",
    from: "es",
    to: "it",
    beamsize: 4,
    temperature: 0.3,
    pivotModel: {
      modelSrc: BERGAMOT_EN_IT,
      beamsize: 4,
      temperature: 0.3,
    },
  },
  onProgress: (progress) => {
    console.log(progress)
  },
})

const spanishText = "Era una mañana soleada cuando María decidió visitar el mercado local."

const result = translate({
  modelId,
  text: spanishText,
  modelType: "nmt",
  stream: false,
})

const italianText = await result.text
console.log(italianText)
// → "Era una mattina di sole quando Maria decise di visitare il mercato locale."

await unloadModel({ modelId })
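If you want to decide ahead of time whether a pair needs one model, two, or cannot be served at all, the pivot rule can be sketched as a small route planner. This is a hypothetical helper, not something the SDK provides: planRoute, Hop and the "es-en" key format are assumptions made for the example, as is the choice to prefer a direct model when one happens to exist.

```typescript
// One translation step handled by a single unidirectional model.
interface Hop {
  from: string;
  to: string;
}

// Return the hops needed to translate `from` into `to`, given the set of
// available unidirectional models (keys like "es-en"). Returns null when
// neither a direct model nor a bridge through the pivot exists.
function planRoute(
  from: string,
  to: string,
  available: Set<string>,
  pivot = "en",
): Hop[] | null {
  const has = (a: string, b: string): boolean => available.has(`${a}-${b}`);
  if (from === to) return []; // nothing to translate
  if (has(from, to)) return [{ from, to }]; // direct model
  if (from !== pivot && to !== pivot && has(from, pivot) && has(pivot, to)) {
    return [{ from, to: pivot }, { from: pivot, to }]; // two hops via the pivot
  }
  return null;
}

const models = new Set<string>(["es-en", "en-it"]);
const esIt = planRoute("es", "it", models); // two hops: es->en, en->it
const esDe = planRoute("es", "de", models); // null: no en->de model loaded
```

A null result is the signal to download the missing model or to fall back to another engine for that pair.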

Summary

Approach	Model Size	Speed	Consistency
Cloud API (Google, DeepL)	N/A (server-side)	50-600ms + network latency	High
SalamandraTA 2B LLM (Q4)	1.5 GB	~3.6s per sentence	Inconsistent
Opus-MT (per pair)	~300 MB	~100ms	High
Bergamot (per pair)	21-35 MB	~46ms/sentence	High
Bergamot pivot (2 pairs)	~56 MB	~80-100ms	High

On a Linux laptop, Bergamot EN→IT loads in only 105ms, scores 74.25 BLEU against reference translations, and hits 633 tokens/sec in batch mode (roughly 100 sentences in just over a second).

But the real test is a phone. We ran the same benchmark on a Pixel 10 Pro XL (Tensor G5, ARM64) running Bare runtime directly on-device. The model (35 MB) loaded in just 78ms.

Mode	ms/sentence	tokens/sec	Total for 100 sentences (s)
Batch	26.1	870	2.6
Sequential	163.4	139	16.3

Batch mode on the phone is 6.3x faster than sequential. At 870 tokens/sec, you can translate a full page of text in under 3 sec, entirely on-device, with zero network calls – on mobile hardware this is the difference between a responsive UI and a loading spinner.
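The figures in the phone table can be cross-checked against each other with a few lines of arithmetic. The values below are copied from the table; the type and function names are illustrative only.

```typescript
interface ModeResult {
  msPerSentence: number;
  tokensPerSec: number;
}

// Values from the Pixel 10 Pro XL table above.
const batchMode: ModeResult = { msPerSentence: 26.1, tokensPerSec: 870 };
const sequentialMode: ModeResult = { msPerSentence: 163.4, tokensPerSec: 139 };

// Total wall-clock time in seconds for a given number of sentences.
function totalSeconds(mode: ModeResult, sentences: number): number {
  return (mode.msPerSentence * sentences) / 1000;
}

// How many times faster the first mode is than the second.
function speedup(fast: ModeResult, slow: ModeResult): number {
  return slow.msPerSentence / fast.msPerSentence;
}

// Average tokens per sentence implied by a mode's two measurements.
function tokensPerSentence(mode: ModeResult): number {
  return (mode.tokensPerSec * mode.msPerSentence) / 1000;
}

const batchTotal = totalSeconds(batchMode, 100); // about 2.6 s
const sequentialTotal = totalSeconds(sequentialMode, 100); // about 16.3 s
const ratio = speedup(batchMode, sequentialMode); // about 6.3
const impliedTokens = tokensPerSentence(batchMode); // about 22.7
```

Both modes imply roughly 22.7 tokens per sentence, which is a useful sanity check that the two rows were measured on the same input.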

What’s next

We’re expanding language coverage to Indic languages through IndicTrans (which currently supports 26 languages) and to African languages through AfriqueGemma (which currently supports 24 languages). We’re doing this first through the specialised LLM approach until we have dedicated NMT models, which are currently unavailable for most pairs.

We’re also looking at supporting streaming translation for real-time use cases like live chat and subtitle generation.

Check out the QVAC repo at github.com/tetherto/qvac. Try it. Break it. Tell us everything: feedback, complaints, and language pair requests are all welcome in our Discord community. Also, let us know if you’re building something where local AI plays a role – we’d love to hear about it!