It aims to offer a simple and easy way to run local, offline machine translation via large language models, while offering a LibreTranslate API compatible server. It's currently powered by llama.cpp (via Rust bindings) running a variety of quantized Gemma3 models. The largest model (gemma3-27b) can fit on a consumer RTX 3090 with 24 GB of VRAM, whereas the smaller models can still run at decent speeds on a CPU only.
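As a quick illustration of the LibreTranslate API compatibility, here is a minimal client sketch. It assumes a local LTEngine instance listening on http://localhost:5000 (adjust the host/port to your setup) and uses the standard LibreTranslate /translate endpoint:

```rust
// Minimal sketch of a client for the LibreTranslate-compatible API.
// Assumption: the server is reachable at http://localhost:5000; change
// the URL to match your instance.
// Cargo.toml: reqwest = { version = "0.12", features = ["blocking", "json"] }
//             serde_json = "1"

use serde_json::{json, Value};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = reqwest::blocking::Client::new();

    // POST /translate with the usual LibreTranslate request body.
    let resp: Value = client
        .post("http://localhost:5000/translate")
        .json(&json!({
            "q": "Hello, world!",
            "source": "en",
            "target": "it",
            "format": "text"
        }))
        .send()?
        .json()?;

    // The LibreTranslate API returns the result in "translatedText".
    println!("{}", resp["translatedText"]);
    Ok(())
}
```

Since the request and response shapes are unchanged, existing LibreTranslate client libraries or curl scripts should work the same way against LTEngine.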
The software compiles to a single, cross-platform, statically linked binary which includes everything. I've tested the software on macOS and Windows so far, and will probably check Linux off the list in the coming days.
In my preliminary testing for English <=> Italian (which I can evaluate as a native speaker), the 12B and 27B models perform just as well as, or outperform, DeepL on a variety of inputs. Obviously this is not conclusive, and I'm releasing this first version early to encourage feedback and testing.
The main drawbacks of this project compared to the current implementation of LibreTranslate are speed and memory usage. Since the models are much larger than the lightweight transformer models used by argos-translate, inference takes longer and memory requirements are much higher. I don't think this will replace LibreTranslate, but rather offer a different tradeoff between speed and quality. I think it will mostly be deployed in local, closed environments rather than being offered publicly on internet-facing servers.
The project uses the Gemma3 family of LLMs, but people can experiment with other language models like Llama or Qwen; as long as they work with llama.cpp, they will work with LTEngine.
I've been testing LTEngine for a couple of days now, translating from English to Ukrainian. Gemma3-12b gives a good translation, but worse than DeepL. Gemma3-27b (I tested it online, because my video card is too small for this model) is on par with DeepL, and sometimes better. How do I achieve maximum speed to translate thousands of texts per day? Should I install a new 3090 24 GB or 4090 24 GB video card?
24 GB will let you run Gemma3-27b locally. I think the biggest bottleneck is currently the mutex lock set here: LTEngine/ltengine/src/llm.rs at main · LibreTranslate/LTEngine · GitHub. I haven't had time to dig into the llama implementation to understand why parallel contexts cannot be run simultaneously, but I would think that removing the lock would allow us to run more translations per unit of time.
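To make the idea concrete, here is a rough sketch (not LTEngine's actual code) of what replacing the single global lock with a small pool of contexts could look like. It assumes llama.cpp contexts can safely be used concurrently from separate threads, which as noted above is still an open question; `LlmContext` and `translate` are hypothetical stand-ins for the real binding types:

```rust
// Illustrative sketch only: instead of one global lock around a single
// llama context, keep a small pool of contexts, each behind its own lock,
// so several translations can run at the same time.

use std::sync::{Arc, Mutex};
use std::thread;

struct LlmContext { /* would wrap a llama.cpp context in the real code */ }

impl LlmContext {
    fn translate(&mut self, text: &str) -> String {
        // Placeholder for the actual inference call.
        format!("<translated: {text}>")
    }
}

fn main() {
    // One lock per context instead of one lock for everything.
    let pool: Vec<Arc<Mutex<LlmContext>>> = (0..4)
        .map(|_| Arc::new(Mutex::new(LlmContext {})))
        .collect();

    let inputs = ["Hello", "Good morning", "Thank you", "Goodbye"];
    let mut handles = Vec::new();

    for (i, text) in inputs.iter().enumerate() {
        // Round-robin requests across the pool; only requests that land
        // on the same context serialize, the rest run in parallel.
        let ctx = Arc::clone(&pool[i % pool.len()]);
        let text = text.to_string();
        handles.push(thread::spawn(move || {
            let mut guard = ctx.lock().unwrap();
            guard.translate(&text)
        }));
    }

    for h in handles {
        println!("{}", h.join().unwrap());
    }
}
```

The tradeoff is memory: each context keeps its own KV cache, so a pool of four contexts roughly quadruples that part of the footprint in exchange for higher throughput.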
A new gemma3n model has been released. I checked its translations (aistudio.google.com) and it's really good; in my case gemma3n:e4b is better than gemma3:12b, but gemma3:27b is still the best.
The gemma3n model is also multimodal, so you could potentially use it for audio/video translation: street signs, menus, live conversations translated on device, and more.
Audio understanding: Introducing speech to text and translation
Gemma 3n uses an advanced audio encoder based on the Universal Speech Model (USM). The encoder generates a token for every 160ms of audio (about 6 tokens per second), which are then integrated as input to the language model, providing a granular representation of the sound context.
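As a rough worked example of that token rate, the 30-second clips supported at launch (mentioned below) come out to about 30 s / 0.16 s ≈ 188 audio tokens fed to the language model.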
This integrated audio capability unlocks key features for on-device development, including:
- Automatic Speech Recognition (ASR): Enable high-quality speech-to-text transcription directly on the device.
- Automatic Speech Translation (AST): Translate spoken language into text in another language.
We’ve observed particularly strong AST results for translation between English and Spanish, French, Italian, and Portuguese, offering great potential for developers targeting applications in these languages. For tasks like speech translation, leveraging Chain-of-Thought prompting can significantly enhance results. Here’s an example:
```
<bos><start_of_turn>user
Transcribe the following speech segment in Spanish, then translate it into English:
<start_of_audio><end_of_turn>
<start_of_turn>model
```
At launch time, the Gemma 3n encoder is implemented to process audio clips up to 30 seconds. However, this is not a fundamental limitation: the underlying audio encoder is a streaming encoder, capable of processing arbitrarily long audio with additional long-form audio training. Follow-up implementations will unlock low-latency, long streaming applications.
I've run gemma-3n-E4B-it-Q4_0.gguf. Perhaps some testing could be useful to find the most performant model, and then we could include it in the list of supported models?