LTEngine reliably crashes if input > 4k

Right now I have hit a new roadblock.
Without CUDA, translation was slow even on a 10-core Xeon, so I got myself an RTX 5060 Ti 16GB. Speed is significantly better, so I could do more testing with larger texts and more languages.

But I found there is some hard limit (using Gemma3 12B from the default HF):

- If translating to one target language, the maximum text size that LTEngine takes without instantly crashing is between 3.5k and 4.5k characters, depending on the text and the languages involved.
- If translating to six target languages, it is about 2k.
- If translating to nine target languages, it is about 1.5k.

If the text is larger than that, LTEngine crashes.
This is way less than I had expected with a context size of 128k.

/llama/LTEngine/llama-cpp-rs/llama-cpp-sys-2/llama.cpp/src/llama-context.cpp:919: GGML_ASSERT(n_tokens_all <= cparams.n_batch) failed

Not enough buffer?
Looking through the llama-cpp-2 documentation, I haven’t yet found any settings for buffer dimensioning etc. like those in the unofficial Google bindings, for example max_out_length documented here: gm.text.ChatSampler — gemma
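
If I read the assert right, it is the per-call batch size (cparams.n_batch) that is exceeded, not the 128k context window. Maybe the context parameters are the place to raise it; here is a minimal sketch of what I imagine that would look like with llama-cpp-2, assuming builder methods named with_n_ctx / with_n_batch exist in the pinned version (I haven't verified the names, and I don't know whether LTEngine passes these through to the context):

```rust
use std::num::NonZeroU32;
use llama_cpp_2::context::params::LlamaContextParams;

fn main() {
    // n_ctx is the whole context window; n_batch is the maximum number of
    // tokens that may be submitted in a single decode call, which is the
    // value the failing GGML_ASSERT compares against (llama.cpp defaults
    // it to 2048).
    let _ctx_params = LlamaContextParams::default()
        .with_n_ctx(NonZeroU32::new(16_384)) // context window, in tokens
        .with_n_batch(8_192);                // max tokens per decode call

    // The context would then be created from these params, roughly:
    // let ctx = model.new_context(&backend, _ctx_params)?;
}
```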

Maybe I missed some error information that appeared? Llama::Error - llama main

And what does causal attention actually do?
Causal attention seems to cap the context size?! Could that be detrimental to contextual translation quality?
So there must be a way to set the llama-cpp parameter cparams.causal_attn to false when creating the context.
I haven’t found out how yet; I would like to try what effect it has on translation.
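
If nothing is exposed in the safe API, maybe the raw binding would do. llama_set_causal_attn() is part of llama.cpp's C API, so the generated llama-cpp-sys-2 bindings should contain it; this is only a sketch though, and how to get the raw *mut llama_context out of llama-cpp-2's context wrapper is exactly the part I haven't figured out:

```rust
use llama_cpp_sys_2::{llama_context, llama_set_causal_attn};

/// Sketch: switch causal attention off on an existing context through the
/// raw C binding. The caller has to supply the raw pointer; obtaining it
/// from llama-cpp-2's safe context type is the open question.
unsafe fn disable_causal_attn(ctx: *mut llama_context) {
    // SAFETY: ctx must be a valid, live llama_context pointer.
    llama_set_causal_attn(ctx, false);
}
```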

But that small limit of 2-4k text size: is that really the maximum achievable with a “128k context window”? Am I expecting too much?
I thought it would be possible to translate text chunks of, say, 40-50k characters with 128k tokens?

Any ideas?

Mm, strange; perhaps it is a bug in llama-cpp.

The thing is that it is a quick crash, usually less than 0.5 sec after launch… maybe not all the tokens got fed to the model?
I am only slowly starting to understand all that Rust stuff.
And in the llama sources there is some logging visible; unfortunately I haven’t yet found where the logfiles are kept.
The best workaround is probably to partition the text into chunks that can be safely fed to the server and recombine the translated results afterwards.
So no big issue.
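
Something like this is roughly what I have in mind (just a sketch: the translate closure stands in for the real request to the server, and a single paragraph longer than the budget would still need extra handling):

```rust
/// Sketch of the workaround: split the input on blank lines, pack the
/// paragraphs into chunks that stay under a safe size, translate each
/// chunk, and join the results back together.
fn translate_in_chunks(text: &str, max_chars: usize, translate: impl Fn(&str) -> String) -> String {
    let mut chunks: Vec<String> = Vec::new();
    let mut current = String::new();

    for para in text.split("\n\n") {
        // Start a new chunk if adding this paragraph would exceed the budget.
        if !current.is_empty() && current.len() + para.len() + 2 > max_chars {
            chunks.push(std::mem::take(&mut current));
        }
        if !current.is_empty() {
            current.push_str("\n\n");
        }
        current.push_str(para);
    }
    if !current.is_empty() {
        chunks.push(current);
    }

    // Translate chunk by chunk and recombine the returned pieces.
    chunks
        .iter()
        .map(|chunk| translate(chunk))
        .collect::<Vec<_>>()
        .join("\n\n")
}
```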


The LLM context is the raw parameter for the whole conversation thread, meaning both questions and answers. So a 128k context is actually around 60k of text.

Then other parameters apply too because of the multiuser configuration. Any request that overflows said parameters will trigger an exception and crash the app if that exception is not handled.

We launched a multi-function VLLM-based service for LLM translation/OCR/synthesis/…, so I cannot help you specifically with llama.cpp, but what we do is a two-step process: 1. cut on paragraphs (the \n\n string or an equivalent XML tag), or on pages when OCR is involved, and then 2. feed the translation service a (short) list of strings that usually max out between a few hundred characters (short paragraphs) and a couple of thousand (an OCRised page).
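
For what it's worth, step 1 on its own could look roughly like this (a sketch only, splitting on the \n\n string; a page or XML-tag based splitter would just change the pattern):

```rust
/// Cut the input into paragraphs on blank lines, drop empty fragments, and
/// return the (short) list of strings to hand to the translation service.
fn split_into_paragraphs(text: &str) -> Vec<String> {
    text.split("\n\n")
        .map(|p| p.trim())
        .filter(|p| !p.is_empty())
        .map(|p| p.to_string())
        .collect()
}
```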
