LTEngine - LLM-powered local machine translation

Hey all :waving_hand:

I’m announcing a new project I’ve been working on called LTEngine.

It aims to offer a simple, easy way to run local, offline machine translation via large language models, while exposing a LibreTranslate API compatible server. It's currently powered by llama.cpp (via Rust bindings) running a variety of quantized Gemma3 models. The largest model (gemma3-27b) can fit on a consumer RTX 3090 with 24 GB of VRAM, whereas smaller models can still run at decent speeds on a CPU only.
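Because the server is LibreTranslate API compatible, existing LibreTranslate clients should work against it unchanged. As a rough sketch (assuming the standard LibreTranslate `/translate` endpoint and that the server listens on LibreTranslate's usual port 5000; adjust host/port to your setup):

```python
import json
import urllib.request

def build_translate_request(text, source="en", target="it", host="http://localhost:5000"):
    """Build a POST request for a LibreTranslate-style /translate endpoint."""
    payload = {"q": text, "source": source, "target": target, "format": "text"}
    return urllib.request.Request(
        f"{host}/translate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_translate_request("Hello, world!", source="en", target="it")
# response = urllib.request.urlopen(req)  # server replies with {"translatedText": "..."}
```

The commented-out `urlopen` call is where the actual network request would happen against a running LTEngine instance.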

The software compiles to a single, portable, statically linked binary which includes everything. I've currently tested the software on macOS and Windows, and will probably check Linux off the list in the upcoming days.

To build and run:

Requirements:

  • Rust
  • clang
  • CMake
  • A C++ compiler (g++, MSVC) for building the llama.cpp bindings

Steps:

git clone https://github.com/LibreTranslate/LTEngine --recursive
cd LTEngine
cargo build --release   # optionally add --features cuda,vulkan,metal (pick the backend matching your GPU)
./target/release/ltengine

In my preliminary testing for English <=> Italian (which I can evaluate as a native speaker), the 12B and 27B models perform as well as or outperform DeepL on a variety of inputs, but obviously this is not conclusive, and I'm releasing this first version early to encourage feedback and testing.

The main drawback of this project compared to the current implementation of LibreTranslate is speed and memory usage. Since the models are much larger than the lightweight transformer models of argos-translate, inference takes longer and memory requirements are much higher. I don't think this will replace LibreTranslate, but rather offer a tradeoff between speed and quality. I think it will mostly be deployed in local, closed environments rather than offered publicly on internet-facing servers.

The project uses the Gemma3 family of LLM models, but people can experiment with other language models like Llama or Qwen; as long as they work with llama.cpp, they will work with LTEngine.

Looking forward to hearing your feedback :pray:

2 Likes

I wish more software was just a statically linked executable that you can run.

2 Likes

Hi, I installed LTEngine and am testing it. It's really cool. Please tell me, will format: “html” work in the API?

Currently not. It’s on the TODO list.

1 Like

I’ve been testing LTEngine for a couple of days now, translating from English to Ukrainian. Gemma3-12b gives a good translation, but worse than DeepL. Gemma3-27b (I tested it online, because my video card is too small for this model) is on par with DeepL, and sometimes better. How can I achieve maximum speed to translate thousands of texts per day? Should I install a new 3090 24GB or 4090 24GB video card?

2 Likes

24GB will let you run Gemma3-27b locally. I think the biggest bottleneck is currently the mutex lock set here: LTEngine/ltengine/src/llm.rs at main · LibreTranslate/LTEngine · GitHub. I haven't had time to dig into the llama implementation to understand why parallel contexts cannot run simultaneously, but I would think that removing the lock would allow us to run more translations per unit of time.
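To illustrate why a single lock caps throughput (this is a toy Python model of the idea, not the actual Rust code): with one shared lock around the "context", concurrent requests cannot overlap, so total wall time is roughly the sum of the individual inference times rather than their maximum.

```python
import threading
import time

lock = threading.Lock()  # stands in for the single mutex guarding the LLM context

def translate_with_lock(duration, results):
    """Simulated inference call: only one thread may hold the 'context' at a time."""
    with lock:
        time.sleep(duration)  # pretend this is token generation
        results.append(duration)

results = []
start = time.monotonic()
threads = [threading.Thread(target=translate_with_lock, args=(0.1, results)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.monotonic() - start
# With the lock, four 0.1s jobs take ~0.4s of wall time; if contexts could run
# in parallel (e.g. one lock per context), they could finish in ~0.1s.
print(f"{elapsed:.2f}s for {len(results)} jobs")
```

Removing the lock safely would of course require per-context state in the llama bindings, which is exactly the part that needs digging into.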

1 Like

That would be great. You are creating the best translator ever. Do you have a donation page for the project?

2 Likes

You can support the project financially by getting yourself an API key at https://portal.libretranslate.com and/or support upstream libraries like argos-translate by visiting Sponsor @argosopentech on GitHub Sponsors · GitHub

2 Likes

A new gemma3n model has been released. I checked the translation (aistudio.google.com) and it's really cool; in my case gemma3n:e4b is better than gemma3:12b. But gemma3:27b is still the best.

UPDATE:
Is it possible to use it in LTEngine?

2 Likes

The gemma3n model is multimodal too, so you could potentially use it for audio/image translation: street signs, menus, live conversations translated on device, and more.

Audio understanding: Introducing speech to text and translation

Gemma 3n uses an advanced audio encoder based on the Universal Speech Model (USM). The encoder generates a token for every 160ms of audio (about 6 tokens per second), which are then integrated as input to the language model, providing a granular representation of the sound context.

This integrated audio capability unlocks key features for on-device development, including:

  • Automatic Speech Recognition (ASR): Enable high-quality speech-to-text transcription directly on the device.

  • Automatic Speech Translation (AST): Translate spoken language into text in another language.

We’ve observed particularly strong AST results for translation between English and Spanish, French, Italian, and Portuguese, offering great potential for developers targeting applications in these languages. For tasks like speech translation, leveraging Chain-of-Thought prompting can significantly enhance results. Here’s an example:

<bos><start_of_turn>user
Transcribe the following speech segment in Spanish, then translate it into English: 
<start_of_audio><end_of_turn>
<start_of_turn>model


At launch time, the Gemma 3n encoder is implemented to process audio clips up to 30 seconds. However, this is not a fundamental limitation. The underlying audio encoder is a streaming encoder, capable of processing arbitrarily long audios with additional long form audio training. Follow-up implementations will unlock low-latency, long streaming applications.

1 Like

If it works with ollama, it should work with LTEngine.

You can use the --model-file /path/to/model.gguf flag to choose a custom model. We could also add the model to the list here: LTEngine/ltengine/src/models.rs at main · LibreTranslate/LTEngine · GitHub

2 Likes

It’s also been proposed to allow LTEngine to use ollama directly via API, which is an interesting idea: External model support · Issue #3 · LibreTranslate/LTEngine · GitHub

1 Like

I tried ./target/release/ltengine --model-file C:/Users/.../gemma-3n-E4B-it-Q4_0.gguf

And got the error: Failed to initialize LLM: Unable to load model

A bit more output:

llama_model_loader: - type  f32:  422 tensors
llama_model_loader: - type  f16:  108 tensors
llama_model_loader: - type q4_0:  316 tensors
llama_model_loader: - type q6_K:    1 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_0
print_info: file size   = 3.80 GiB (4.75 BPW)
llama_model_load: error loading model: error loading model architecture: unknown model architecture: 'gemma3n'
llama_model_load_from_file_impl: failed to load model
Failed to initialize LLM: Unable to load model

I was able to run it after updating the llama.cpp library and Rust bindings (Update llama bindings to run gemma3n · LibreTranslate/LTEngine@be84c78 · GitHub).

I've run the gemma-3n-E4B-it-Q4_0.gguf model. Perhaps some testing could be useful to find the most performant model; then we can include it in the list of supported models?

To update, just re-clone the repo and rebuild.

2 Likes

I really would love to try this out.

However, it does not build, look:


root@ltengine:/ltengine/LTEngine # cargo build [--features cuda,vulkan,metal] --release
error: unexpected argument '[--features' found

Usage: cargo build [OPTIONS]

For more information, try '--help'.
root@ltengine:/ltengine/LTEngine #

When I try

root@ltengine:/ltengine/LTEngine # cargo build --release
   Compiling llama-cpp-sys-2 v0.1.109 (/ltengine/LTEngine/llama-cpp-rs/llama-cpp-sys-2)
   Compiling actix-web v4.10.2
   Compiling ureq v2.12.1
error: failed to run custom build command for `llama-cpp-sys-2 v0.1.109 (/ltengine/LTEngine/llama-cpp-rs/llama-cpp-sys-2)`

Caused by:
  process didn't exit successfully: `/ltengine/LTEngine/target/release/build/llama-cpp-sys-2-a8a2cd18eb0f9860/build-script-build` (exit status: 101)
  --- stdout
  cargo:rerun-if-changed=build.rs

  --- stderr

  thread 'main' panicked at llama-cpp-rs/llama-cpp-sys-2/build.rs:176:46:
  Failed to parse target os x86_64-unknown-freebsd
  note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
warning: build failed, waiting for other jobs to finish...
root@ltengine:/ltengine/LTEngine #

this also fails.
What could I be doing wrong?

Edit: Maybe it is system-dependent? I have no idea about Rust… and I have neither Windows nor Macs.
Any experiences running LTEngine on Linux? And maybe which distro to try?

Edit 2: A Docker image would be awesome… Rust seems not much easier to manage than Python… your LibreTranslate Docker image converted the Python pain into mere fire-and-forget :slight_smile: Thanks again for all that!

Edit 3: Now I am asking myself whether llama-cpp-sys-2 actually supports FreeBSD. See here. Looks like I have to set up another Linux box.

Edit 4: I have now set up an Ubuntu server. Only thing is, I am stuck.
When attempting to do the cargo build thing, Rust complains that it wants pkg-config.
After doing cargo add pkg-config, Rust complains that it wants openssl.
So, after running apt install openssl, Rust ignores all this and continues complaining.

Any idea how to get LTEngine installed on an Ubuntu server machine?

IDK if this is your issue, but you normally install pkg-config as a system package, not via cargo. I haven't done any Rust development, but pkg-config is a common Linux package.

sudo apt-get install pkg-config openssl

I don’t know if anyone has gotten LTEngine to run on FreeBSD but it would be a cool project to figure it out.

I’ve only tested LTEngine on macOS and Windows, so there might be a few things to fix with Linux. Do contribute a PR if you end up making changes to compile on Linux or FreeBSD? :folded_hands:

1 Like

It turned out that to let Rust link against the OpenSSL library (at least on Ubuntu), it is also necessary to install the libssl-dev package: apt install libssl-dev

After doing this, cargo build --release worked just fine.

I had to leave out “[--features cuda,vulkan,metal]” as this caused cargo to croak.
Looks like I misinterpreted this feature thing: the brackets mean it's optional, and the features should only be applied according to the GPU hardware present, if any?

At least without a GPU, the translation speed is muuuuuuuuuch slower than with LibreTranslate. But I like the way it translates!

Does anybody know how much faster such a GPU is, in comparison to CPU-only LTEngine translating?

Because, for me, investment cost as well as operating cost is relevant. My test server has only a CPU, 192 GB RAM, and draws 71 watts, and it cost only a fraction of a 3090 card. No idea how much energy a 3090 draws while waiting for its next task. Energy prices are sky-high here, not good for low-budget startups.

So I am still trying to figure out which CPU/GPU combination may be the optimum for a given translation volume/capacity need, economically as well as ecologically.
The important thing is just that the translation machine does not get congested, so that even after high-volume days the translation queue is empty before the maximum acceptable wait time expires.
Anything more than this would be a waste, economically as well as ecologically, in my use case.

This all, LibreTranslate for the “insta-translate”, and LTEngine for the overnight “fine-print-translate”, this combines marvellously!

Will now have to check out whether it is possible on Linux to nice LibreTranslate up to maximum priority for near-real-time translations while nicing LTEngine down to minimum priority, so it never causes LibreTranslate to wait; then I can run them both on one physical server.
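One low-tech way to do that on Linux is to start LTEngine through nice(1), so it only gets CPU time that nothing else is using. A minimal sketch (the paths and niceness values are illustrative, not from the project):

```python
import subprocess

def run_niced(cmd, niceness=19):
    """Launch cmd through nice(1); 19 is the lowest CPU priority on Linux."""
    return subprocess.Popen(["nice", "-n", str(niceness)] + cmd)

# Hypothetical usage; adjust the binary path to your own setup:
# lt = run_niced(["./target/release/ltengine"], niceness=19)
# LibreTranslate itself would be started normally (default niceness 0), so the
# kernel scheduler always favors it over the niced LTEngine process.
```

Note that lowering a process's priority (raising its nice value) needs no root; raising LibreTranslate above the default priority would require root or a CAP_SYS_NICE grant.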

BIG THANKS to you all, particularly @pierotofy and @argosopentech !
The stuff you made, it is just awesome!

Edit:
It was mentioned that LTEngine can run only one task at a time… I don't think this is an issue at all if it is being used as a sort of “fine translator” that runs in the background, digesting the queue of incoming data.
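That background "fine translator" pattern can be sketched as a single worker draining a queue (toy Python sketch; translate() here is a placeholder for a call to the LTEngine API, not real code from the project):

```python
import queue
import threading

jobs = queue.Queue()
done = []

def translate(text):
    """Stand-in for a call to LTEngine's /translate endpoint."""
    return text.upper()  # placeholder "translation"

def worker():
    # A single worker matches LTEngine processing one request at a time.
    while True:
        text = jobs.get()
        if text is None:  # sentinel: no more work
            break
        done.append(translate(text))

t = threading.Thread(target=worker)
t.start()
for doc in ["hello", "world"]:
    jobs.put(doc)
jobs.put(None)
t.join()
print(done)  # → ['HELLO', 'WORLD']
```

Since the queue is FIFO and there is only one worker, results come out in submission order, and the one-task-at-a-time limit simply shows up as queue length, not failures.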

1 Like

This all, LibreTranslate for the “insta-translate”, and LTEngine for the overnight “fine-print-translate”, this combines marvellously!

It was mentioned that LTEngine can run only one task at a time… I don't think this is an issue at all if it is being used as a sort of “fine translator” that runs in the background, digesting the queue of incoming data.

LTEngine uses LLMs for translation, an entirely different system from LibreTranslate, which uses the Argos Translate translation engine. Argos Translate relies on smaller, specialized neural networks for each language, so it's faster. The LLMs used by LTEngine require a lot more compute but can potentially provide more intelligent translations.

From the LTEngine README:

The LLMs in LTEngine are much larger than the lightweight transformer models in LibreTranslate. Thus memory usage and speed are traded off for quality of outputs, which for some languages has been reported as being on par or better than DeepL.

1 Like

LibreTranslate doesn't actually benefit that much from GPUs in a lot of cases, so for someone in your position I would recommend against a GPU for LibreTranslate, since they're very expensive and power-hungry.

LTEngine would probably benefit more from a GPU if you're doing a lot of translations. Maybe see how far you can get with a CPU server (it sounds like you have a pretty capable server already) and then upgrade later if you need to? If you're using LTEngine to process a queue of translations in the background, like you said, then latency isn't critical as long as you have enough throughput to meet your requirements.
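For sizing, a back-of-envelope calculation helps: measure how long one typical text takes on your CPU, then see how many texts fit in your processing window. The numbers below are made up purely for illustration:

```python
def daily_capacity(seconds_per_text, hours_available=24):
    """How many texts fit in the available window at a given per-text latency."""
    return int(hours_available * 3600 / seconds_per_text)

# If one text takes ~20 s on CPU and you process overnight for 8 hours:
print(daily_capacity(20, hours_available=8))  # → 1440 texts
```

If the result comfortably exceeds your daily volume, the CPU-only box is enough; if not, that gap tells you roughly how much speedup a GPU would need to deliver.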

1 Like