LTEngine - LLM-powered local machine translation

You’ll get a noticeable speed-up using a GPU for LTEngine (in the 10x range, but it depends on the card, the LLM model, etc.).

2 Likes

Great! It sounds like you got LTEngine compiling on Linux.

2 Likes

Yes, it works just fine. On Linux distributions other than Ubuntu Server there is a good chance that more than just a minimum of libraries is preinstalled, so it might run right away without having to investigate what is missing.
I still need to write a systemd unit to start the servers after boot, and then the installation is finished. I will post it here for documentation… :slight_smile:
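
Something along these lines, probably (a minimal sketch; the service user, the paths, and the start command are assumptions and have to be adapted to the actual install):

    [Unit]
    Description=LTEngine translation server
    After=network-online.target
    Wants=network-online.target

    [Service]
    Type=simple
    # Hypothetical service user and install location; the ExecStart line
    # must be whatever command actually starts your LTEngine server.
    User=ltengine
    WorkingDirectory=/opt/ltengine
    ExecStart=/opt/ltengine/ltengine
    Restart=on-failure

    [Install]
    WantedBy=multi-user.target

Then systemctl enable --now ltengine.service, plus a second copy of the unit for the LibreTranslate server.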

2 Likes

If you’re a low-budget startup looking for inexpensive infrastructure, I suggest loading the LibreTranslate models on CPU (a single core can handle 4 simultaneous requests with almost no delay) and the LLM for LTEngine on the smallest GPU available: an LLM (except perhaps a 7B and maybe a 14B model) will not run well on a CPU, whatever the amount of RAM you’ve got.

You can rent a GPU server on a monthly basis for about one thousand euros (L40S, 90 GB RAM, 46 GB VRAM) or two thousand (H100, 180 GB RAM, 80 GB VRAM), and from my experience a CPU-only server with 192 GB RAM is not much cheaper than a thousand euros per month. Those machines have enough CPU cores for LibreTranslate to run well up to a few thousand regular users.
As for buying, servers with Ada cards are pretty inexpensive (less than 100k for 4 GPUs); I don’t know about Hopper cards yet (awaiting a quote for H200 servers).

For testing and development, there is also a workstation GPU that replicates the L40S (or whichever Ada card is current): the higher-end RTX 6000 Ada (7000 now?). It’s about 10k€, so it amortizes very quickly, and anything you do with it will be consistent with what will happen on an L40S.
However, you cannot use it for prototyping something that will run on an H100 or H200, since the Hopper architecture has different features from the Ada architecture (we hit a snag on this one last month).

3 Likes

Yes, it depends on the traffic volume and on the translation-speed requirements.

I am still thinking about the most practical way to measure throughput and throughput reserves, and to calculate/estimate the current and future CPU/GPU needs.
Word and character counts plus elapsed time in microseconds, plus what else?
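
For a first pass I am thinking of something simple like this (a rough Perl sketch against the standard LibreTranslate-style /translate endpoint; host, port, and payload are assumptions to adapt):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Time::HiRes qw(gettimeofday tv_interval);
    use LWP::UserAgent;
    use JSON::PP qw(encode_json);

    # Assumed local LibreTranslate/LTEngine endpoint; adjust host and port.
    my $url  = 'http://127.0.0.1:5000/translate';
    my $text = 'Die Sonne scheint rot, es ist warm und ich bin zufrieden.';

    my $ua = LWP::UserAgent->new;
    my $t0 = [gettimeofday];
    my $res = $ua->post(
        $url,
        Content_Type => 'application/json',
        Content      => encode_json({ q => $text, source => 'de', target => 'en' }),
    );
    my $elapsed = tv_interval($t0);    # seconds, with microsecond resolution

    my @words = split /\s+/, $text;
    my $chars = length $text;
    printf "chars=%d words=%d time=%.6fs chars/s=%.1f\n",
        $chars, scalar @words, $elapsed, $chars / $elapsed;

Logged over a day or two of real traffic, response time per request and characters per second should give a usable baseline for estimating reserves.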

I just started the test runs to make sure everything runs smoothly.
There were some small issues with the UTF-8 encoding (I don’t know why I need to utf8::decode() the result twice with LibreTranslate, but only once with LTEngine, etc.). But now it runs smoothly.

And I really need to look into the LTEngine source to see how to add multi-language output, so that a complete recalculation is not needed for every language combination. Too bad I have no clue about Rust; it’s a bit different…
For those who need translations into multiple languages this could be a great performance boost as well as an energy saving.

Timely response to requests is the most sensible metric: up to 1.6 s per one-thousand-character translation, requests flow through LibreTranslate; above that value, the pipeline clogs and unanswered requests start piling up.

I don’t know the corresponding threshold for LTEngine, but it’s probably in the same range (rapid-firing your typical LLM with one three-thousand-word prompt per second for half an hour is fine on an Ada GPU).

Thanks, everyone, for sharing your experiences. This is really exciting.

Is there a way to get an API key somewhere to test and pay for translation with the larger models? Given the significant cost of running this, I would like to do a comparison before investing in it. And like others here, we’re a small non-profit startup; we’d need to apply for a grant to do this at scale.

This all is awesome.

To ease my experimenting, I modded LTEngine so that my Perl script can feed it the prompt for Gemma directly and take the output back directly, without JSON and all that, so there is no need to do a cargo build for testing every variation of the prompt string.

If it is asked to translate one source into one target language, time usage is 100%; translating one source into 5 different target languages in one run takes 325%.
Less of a saving than I expected, but still better than the full 500% for 5 single-language translations.

1 Like

I could not sleep much last night, got up early, and continued tailoring the instruction string for Gemma.
It turned out this “AI” is very dumb when it comes to structured work. It is like normal programming, except that there is no specification of the programming “language”. It has a grasp of some popular things like HTML tags, but if you need more, you have to find a natural-language way of defining something like a regular expression. This is very yucky, as the slightest variation in the instruction “code” can turn sensible output into trash, albeit sometimes very funny trash.

Anyway, it now

  • takes a list of languages to translate to, and soon some other optional parameters, like formal/informal style, whether or not to translate quotes in other languages, etc.
  • keeps not only HTML tags but also user-specified tags __<somestring>__ untranslated (this is just my first idea of how to format the tags, not finally decided yet; I am not sure whether it is possible to pass Perl regexps to Gemma, I haven’t had success in my experiments yet); see the sketch after this list for the masking/restoring mechanics
  • produces cleanly formatted output that can easily be processed with Perl regexps
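
On the Perl side, the tag mechanics are roughly like this (a simplified sketch, not my actual script; the pattern and the placeholder format are just examples):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Replace substrings that must not be translated with numeric __N__
    # placeholders before building the prompt, and put the originals back
    # into the translated output afterwards (Gemma is instructed to keep
    # the placeholders untouched).
    my %protected;
    my $n = 0;

    sub protect {
        my ($text, @patterns) = @_;
        for my $re (@patterns) {
            $text =~ s/($re)/my $k = '__' . ++$n . '__'; $protected{$k} = $1; $k/ge;
        }
        return $text;
    }

    sub restore {
        my ($text) = @_;
        $text =~ s/(__\d+__)/exists $protected{$1} ? $protected{$1} : $1/ge;
        return $text;
    }

    my $src    = 'Die Sonne scheint <b>rot</b>, es ist warm.';
    my $masked = protect($src, qr/<[^>]+>/);   # protect HTML tags, for example
    print "$masked\n";                         # Die Sonne scheint __1__rot__2__, es ist warm.
    print restore($masked), "\n";              # back to the original text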

I am sure this will end up as a super, super nice translation API, at least for my exotic needs.

However, the actual natural-text “programming” is very sensitive to even the slightest variations, not only in the result but also in terms of computing time/cost.
I think that is where the biggest optimization potential lies; I will have to read and learn a lot…

Thank you so much @pierotofy @argosopentech! You are my de facto teachers on how to deal with LLMs. This stuff is just exciting :smiley:

2 Likes
Your role: You are an expert linguist, specializing in translation. You are able to capture the nuances of the languages you translate. You pay attention to masculine/feminine/plural, proper use of articles and grammar, bias, attitude, style and tone from informal to formal. You always provide natural sounding translations that fully preserve the meaning of the original text. In doubt you give accuracy preference over elegance. You never provide explanations for your work. You must preserve all HTML tags and elements in the translation. You always answer with the translations ordered, and nothing else.
Your instructions:
At the end of this document, there is a German text, from directly after the line consisting of 40 equal signs (=) to the end of the document.
It begins with a dot, a whitespace and a tag delimited by double underscores (__).
Keep this and all other similar tags unmodified where they are in the original text, and when reproducing the text, print the tags unmodified where they belong at.
Iterate through this comma-separated list of languages enclosed in the brackets here [English, Russian, Spanish, Polish,Turkish,Arabic], doing this list of tasks enclosed in the brackets here [ Print translation ] in each iteration.
========================================
. __123_22__ Die Sonne scheint __38233__, es ist __566__ und ich bin __234534__. __3776__

Output:


"[English]\n. __123_22__ The sun is shining __38233__, it is
__566__ and I am __234534__. __3776__\n\n[Russian]\n. __123_22__ Солнце светит __38233
__, день __566__ и я __234534__. __3776__\n\n[Spanish]\n. __123_22__ El sol brilla __3
8233__, es __566__ y yo __234534__. __3776__\n\n[Polish]\n. __123_22__ Słońce świeci _
_38233__, jest __566__ i ja __234534__. __3776__\n\n[Turkish]\n. __123_22__ Güneş parl
ıyor __38233__, hava __566__ ve ben __234534__. __3776__\n\n[Arabic]\n. __123_22__ الش
مس تشرق __38233__، إنه __566__ وأنا __234534__. __3776__\n"

Arabic is strange… but whatever.
So far, so good.
Attempting alphanumeric tags instead of numeric-only ones results in START and END going missing. I have not succeeded there yet.
Performance is not so great: with 6 target languages it consumes as much time as 5 normal LTEngine calls, thus only about a one-sixth (≈17%) time saving.
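
Splitting that output back into per-language strings is easy on the Perl side, though (a rough sketch, assuming the [Language] headers come back exactly in this form):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Read the raw model output (as shown above) from STDIN.
    my $out = do { local $/; <STDIN> };

    # Split on the [Language] headers; one entry per target language.
    my %translation;
    while ($out =~ /\[([^\]\n]+)\]\n(.*?)(?=\n\n\[|\z)/sg) {
        my ($lang, $text) = ($1, $2);
        $text =~ s/\s+\z//;            # trim trailing whitespace/newlines
        $translation{$lang} = $text;
    }

    print "$_:\n$translation{$_}\n\n" for sort keys %translation;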

Without tag handling:
Your instructions:

At the end of this document, there is a German text, from directly after the line consisting of 40 equal signs (=) to the end of the document.
Iterate through this comma-separated list of languages enclosed in the brackets here [English, Russian, Spanish, Polish, Turkish, Arabic], doing this list of tasks enclosed in the brackets here [ Print translation ] in each iteration.
========================================
Die Sonne scheint rot, es ist warm und ich bin zufrieden.

This only costs 220% instead of the 600% for 6 single-language LTEngine calls… almost a 3x performance boost.

I use Vast.ai (my referral code) to rent GPUs hourly for model training. The prices are affordable and it’s worked well for me.

2 Likes

I’ve used vast.ai in the past too, lowest prices.

2 Likes

So LTEngine is much more than a “bare” Gemma?
They say Gemma covers >140 languages, LTEngine about 40… so LTEngine is a sort of “trained Gemma”?

Is there a recommended procedure for switching to another .gguf, e.g. from the default 4B model to a larger one, from Gemma 3 to 4, or to Mistral, etc.?

Just curious: can I train LTEngine by, for example, feeding it all the texts and data I can find about a particular topic?

Sorry for my stupid questions… I am completely new to this, and it is so exciting… thank you for your patience with me.

Edit: Maybe you know some good primers on how to do this kind of thing with LLMs via an API?

LTEngine enables a subset of languages (but you can enable/test more by adding the language you need to LTEngine/ltengine/src/languages.rs at main · LibreTranslate/LTEngine · GitHub). It’s a matter of editing the prompt and verifying that the model can actually translate (don’t assume that it can, despite the claims).

1 Like