It aims to offer a simple and easy way to run local, offline machine translation via large language models, while offering a LibreTranslate API compatible server. It's currently powered by llama.cpp (via Rust bindings) running a variety of quantized Gemma3 models. The largest model (gemma3-27b) can fit on a consumer RTX 3090 with 24 GB of VRAM, whereas the smaller models can still run at decent speeds on a CPU only.
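As a quick illustration of the LibreTranslate API compatibility, here is a minimal client sketch. It assumes a local LTEngine instance listening on http://localhost:5000 (adjust the host/port to your setup) and uses the standard LibreTranslate /translate endpoint:

```rust
// Minimal sketch of a client for the LibreTranslate-compatible API.
// Assumption: the server is reachable at http://localhost:5000; change
// the URL to match your instance.
// Cargo.toml: reqwest = { version = "0.12", features = ["blocking", "json"] }
//             serde_json = "1"

use serde_json::{json, Value};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = reqwest::blocking::Client::new();

    // POST /translate with the usual LibreTranslate request body.
    let resp: Value = client
        .post("http://localhost:5000/translate")
        .json(&json!({
            "q": "Hello, world!",
            "source": "en",
            "target": "it",
            "format": "text"
        }))
        .send()?
        .json()?;

    // The LibreTranslate API returns the result in "translatedText".
    println!("{}", resp["translatedText"]);
    Ok(())
}
```

Since the request and response shapes are unchanged, existing LibreTranslate client libraries or curl scripts should work the same way against LTEngine.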
The software compiles to a single, cross-platform, statically linked binary which includes everything. I've tested the software on macOS and Windows so far, and will probably check Linux off the list in the coming days.
In my preliminary testing for English <=> Italian (which I can evaluate as a native speaker), the 12B and 27B models perform just as well as, or outperform, DeepL on a variety of inputs. Obviously this is not conclusive, and I'm releasing this first version early to encourage feedback and testing.
The main drawbacks of this project compared to the current implementation of LibreTranslate are speed and memory usage. Since the models are much larger than the lightweight transformer models used by argos-translate, inference takes longer and memory requirements are much higher. I don't think this will replace LibreTranslate, but rather offer a different tradeoff between speed and quality. I think it will mostly be deployed in local, closed environments rather than being offered publicly on internet-facing servers.
The project uses the Gemma3 family of LLMs, but people can experiment with other language models like Llama or Qwen; as long as they work with llama.cpp, they will work with LTEngine.
I've been testing LTEngine for a couple of days now, translating from English to Ukrainian. Gemma3-12b gives a good translation, but worse than DeepL. Gemma3-27b (I tested it online, because my video card is too small for this model) is on par with DeepL, and sometimes better. How do I achieve maximum speed to translate thousands of texts per day? Should I install a new 3090 24 GB or 4090 24 GB video card?
24 GB will let you run Gemma3-27b locally. I think the biggest bottleneck is currently the mutex lock set here: LTEngine/ltengine/src/llm.rs at main · LibreTranslate/LTEngine · GitHub. I haven't had time to dig into the llama implementation to understand why parallel contexts cannot be run simultaneously, but I would think that removing the lock would allow us to run more translations per unit of time.
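To make the idea concrete, here is a rough sketch (not LTEngine's actual code) of what replacing the single global lock with a small pool of contexts could look like. It assumes llama.cpp contexts can safely be used concurrently from separate threads, which as noted above is still an open question; `LlmContext` and `translate` are hypothetical stand-ins for the real binding types:

```rust
// Illustrative sketch only: instead of one global lock around a single
// llama context, keep a small pool of contexts, each behind its own lock,
// so several translations can run at the same time.

use std::sync::{Arc, Mutex};
use std::thread;

struct LlmContext { /* would wrap a llama.cpp context in the real code */ }

impl LlmContext {
    fn translate(&mut self, text: &str) -> String {
        // Placeholder for the actual inference call.
        format!("<translated: {text}>")
    }
}

fn main() {
    // One lock per context instead of one lock for everything.
    let pool: Vec<Arc<Mutex<LlmContext>>> = (0..4)
        .map(|_| Arc::new(Mutex::new(LlmContext {})))
        .collect();

    let inputs = ["Hello", "Good morning", "Thank you", "Goodbye"];
    let mut handles = Vec::new();

    for (i, text) in inputs.iter().enumerate() {
        // Round-robin requests across the pool; only requests that land
        // on the same context serialize, the rest run in parallel.
        let ctx = Arc::clone(&pool[i % pool.len()]);
        let text = text.to_string();
        handles.push(thread::spawn(move || {
            let mut guard = ctx.lock().unwrap();
            guard.translate(&text)
        }));
    }

    for h in handles {
        println!("{}", h.join().unwrap());
    }
}
```

The tradeoff is memory: each context keeps its own KV cache, so a pool of four contexts roughly quadruples that part of the footprint in exchange for higher throughput.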
A new gemma3n model has been released. I checked its translations (aistudio.google.com) and it's really good; in my case gemma3n:e4b is better than gemma3:12b, but gemma3:27b is still the best.
The gemma3n model is also multimodal, so you could potentially use it for audio/video translation: street signs, menus, live conversations translated on device, and more.
Audio understanding: Introducing speech to text and translation
Gemma 3n uses an advanced audio encoder based on the Universal Speech Model (USM). The encoder generates a token for every 160ms of audio (about 6 tokens per second), which are then integrated as input to the language model, providing a granular representation of the sound context.
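As a rough worked example of that token rate, the 30-second clips supported at launch (mentioned below) come out to about 30 s / 0.16 s ≈ 188 audio tokens fed to the language model.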
This integrated audio capability unlocks key features for on-device development, including:
- Automatic Speech Recognition (ASR): Enable high-quality speech-to-text transcription directly on the device.
- Automatic Speech Translation (AST): Translate spoken language into text in another language.
We’ve observed particularly strong AST results for translation between English and Spanish, French, Italian, and Portuguese, offering great potential for developers targeting applications in these languages. For tasks like speech translation, leveraging Chain-of-Thought prompting can significantly enhance results. Here’s an example:
```
<bos><start_of_turn>user
Transcribe the following speech segment in Spanish, then translate it into English:
<start_of_audio><end_of_turn>
<start_of_turn>model
```
At launch time, the Gemma 3n encoder is implemented to process audio clips up to 30 seconds. However, this is not a fundamental limitation: the underlying audio encoder is a streaming encoder, capable of processing arbitrarily long audio with additional long-form audio training. Follow-up implementations will unlock low-latency, long streaming applications.
I've run gemma-3n-E4B-it-Q4_0.gguf. Perhaps some testing could be useful to find the most performant model, and then we could include it in the list of supported models?