Right now I have reached a new roadblock.
Without CUDA, translation was slow even on a 10-core Xeon, so I got myself an RTX 5060 Ti 16 GB. Speed is significantly better now, so I could do more testing with larger texts and more languages.
But I found there is a hard limit (using Gemma 3 12B from the default HF):
If translating into one target language, the maximum text size that LTEngine takes without instantly crashing is between 3.5k and 4.5k characters, depending on the text and the languages involved.
If translating into six target languages, it is about 2k characters.
If translating into nine target languages, it is about 1.5k characters.
If the text is larger than that, LTEngine crashes.
This is way less than I had expected with a context size of 128k.
/llama/LTEngine/llama-cpp-rs/llama-cpp-sys-2/llama.cpp/src/llama-context.cpp:919: GGML_ASSERT(n_tokens_all <= cparams.n_batch) failed
Not enough buffer?
Looking through the llama-cpp-2 documentation, I haven't yet found any settings for buffer dimensioning etc. like those in the unofficial Google bindings, for example max_out_length documented here: gm.text.ChatSampler — gemma
Or maybe some error information appeared but got missed somewhere? Llama::Error - llama main
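For what it's worth, the assertion above reads as if the whole prompt is handed to a single decode call with more tokens than cparams.n_batch allows, so the limit I am hitting looks like the batch size rather than the 128k context. If I read the llama-cpp-2 docs correctly, there are builder methods on LlamaContextParams for this; a minimal sketch of what I would try (the exact method signatures, and the model/backend variables, are my assumptions about the API, not verified):

```rust
use std::num::NonZeroU32;
use llama_cpp_2::context::params::LlamaContextParams;

// Sketch only: raise n_batch so one decode call can hold the whole prompt.
// n_ctx is the KV-cache / context window, n_batch is the max tokens per
// decode call -- the GGML_ASSERT above fires on the latter.
let ctx_params = LlamaContextParams::default()
    .with_n_ctx(NonZeroU32::new(32_768))   // context window (tokens)
    .with_n_batch(8_192);                  // max tokens per single decode call
// let ctx = model.new_context(&backend, ctx_params)?;
```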
And what does causal attention do? It seems to cap the context size?! Could that be detrimental to contextual translation quality?
So there must be a way to set the llama-cpp parameter cparams.causal_attn to false when creating the context.
I haven't found out how yet; I would like to try it and see what effect it has on translation (see the sketch below for what I have in mind).
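From the llama.cpp headers there is a C-level function llama_set_causal_attn(ctx, bool) that toggles exactly this flag on an existing context. Whether llama-cpp-2 re-exports it, or gives access to the raw context pointer at all, I don't know; the sketch below assumes the bindgen symbol in llama-cpp-sys-2 is reachable:

```rust
// Assumption: llama-cpp-sys-2's bindgen output exposes the raw C function
// llama_set_causal_attn from llama.h. Getting the *mut llama_context out of
// llama-cpp-2's safe wrapper is the part I have not found yet.
use llama_cpp_sys_2::{llama_context, llama_set_causal_attn};

unsafe fn disable_causal_attn(ctx: *mut llama_context) {
    // Turns off the causal mask for subsequent decode calls on this context.
    llama_set_causal_attn(ctx, false);
}
```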
But that small limit of 2-4k characters of text, is that really the maximum achievable with a "128k context window"? Am I expecting too much?
I thought it would be possible to translate text chunks of, say, 40-50k characters within 128k tokens?? (At roughly 3-4 characters per token, that should only be on the order of 10-15k tokens.)
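If n_batch cannot be raised far enough, another idea would be to feed the prompt to the context in slices of at most n_batch tokens, one decode call per slice, so no single call trips the assert. A rough sketch of what I mean, with the API names taken from llama-cpp-2's simple example (treat them as assumptions, not verified against LTEngine's code):

```rust
use llama_cpp_2::llama_batch::LlamaBatch;
use llama_cpp_2::model::AddBos;

// Sketch: process the prompt in slices of at most n_batch tokens so each
// decode call stays within cparams.n_batch.
let n_batch = 2048usize;
let tokens = model.str_to_token(&prompt, AddBos::Always)?;
let mut batch = LlamaBatch::new(n_batch, 1);

for (chunk_idx, chunk) in tokens.chunks(n_batch).enumerate() {
    batch.clear();
    for (j, &token) in chunk.iter().enumerate() {
        let pos = (chunk_idx * n_batch + j) as i32;
        let want_logits = pos as usize == tokens.len() - 1; // logits only for the final token
        batch.add(token, pos, &[0], want_logits)?;
    }
    ctx.decode(&mut batch)?; // each call stays below the n_batch limit
}
```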
Any ideas?