Which kind of help is needed?
It would also be great if there were links here to some resources explaining how to train a usable model for a new language, and what the greatest difficulties are with this language.
For instructions, check GitHub - LibreTranslate/Locomotive: Toolkit for training/converting LibreTranslate compatible language models 🚂
Funny fact: Romans did not even have a word for translation. Either you spoke Latin, or you were a Barbarian (which meant “singing bird” in Greek).
I’ve checked the corpora on OPUS: CCMatrix is actually a mix of Latin and Romance languages (I found Italian and Spanish, sometimes French).
Once filtered with langdetect (assuming this works…) it may yield half a million sentence pairs, and some more if you add the Bible (which is already not Classical Latin) and XLEnt.
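For what it’s worth, here is a rough sketch of what that filtering step could look like. As far as I can tell, langdetect has no Latin profile, so the sketch does not try to detect Latin. It only drops pairs whose “Latin” side is detected as Italian, Spanish or French. The function name, the `reject` set and the pair layout are my own assumptions, not something Locomotive provides.

```python
from typing import Callable, Iterable, Iterator, Tuple

Pair = Tuple[str, str]  # (latin_side, english_side): assumed layout


def filter_latin_side(
    pairs: Iterable[Pair],
    detect: Callable[[str], str],
    reject: frozenset = frozenset({"it", "es", "fr"}),
) -> Iterator[Pair]:
    """Yield only the pairs whose Latin side is not detected as a rejected language."""
    for latin, english in pairs:
        try:
            lang = detect(latin)
        except Exception:
            # Detectors such as langdetect raise on empty or symbol-only text;
            # such lines are dropped rather than guessed at.
            continue
        if lang not in reject:
            yield latin, english
```

With langdetect installed, you would pass `langdetect.detect` as the `detect` argument. Its results are not deterministic by default, so setting `langdetect.DetectorFactory.seed = 0` beforehand makes runs repeatable. Short sentences are often misclassified, so it is worth checking a sample of what gets thrown away.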
Assuming you get half a million sentences, you should then tread carefully: training may stop at any moment, and before 20k steps (with vanilla settings; 6k with DEEP, because it also processes more data in a single step) the models are meaningless.
Try Locomotive’s vanilla settings first. If that works, skip the next step. If not, try changing the learning rate schedule (decay_method: noam, learning_rate: 1, warmup_steps: 1000) to slow things down, then lower the LR until training runs for at least 32k steps before stopping.
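To get a feel for what those values do, here is a small sketch of the Noam schedule as I understand it from the original Transformer paper: linear warmup, then inverse square root decay, scaled by the model dimension. The `d_model=512` default is an assumption on my part, so check your own config, and the function name is just for illustration.

```python
def noam_lr(step: int, learning_rate: float = 1.0,
            warmup_steps: int = 1000, d_model: int = 512) -> float:
    """Effective learning rate at a given training step (step >= 1)."""
    if step < 1:
        raise ValueError("step must be >= 1")
    # Rises linearly until warmup_steps, then decays as step ** -0.5.
    scale = min(step ** -0.5, step * warmup_steps ** -1.5)
    return learning_rate * d_model ** -0.5 * scale


if __name__ == "__main__":
    for s in (100, 500, 1000, 4000, 16000, 32000):
        print(s, f"{noam_lr(s):.6f}")
```

The peak is reached at the end of warmup, and `learning_rate` simply scales the whole curve, which is why lowering it is the easy knob to turn once the shape of the schedule is set.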
When you have a working learning schedule, switch to more robust hyperparameters such as the “DEEP” ones (look through other topics on the site; there are several possibilities). Training may then stop around 10k steps. Never mind: it will have read and learnt the same number of sentences as before.
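The step counts compare like this, as a back-of-the-envelope sketch. The tokens-per-step figures below are made-up placeholders, chosen only to give the roughly 3.2x ratio implied by “32k steps vs 10k steps”. The real values depend on the batch size and gradient accumulation in your config.

```python
def equivalent_steps(steps: int, tokens_per_step_from: int,
                     tokens_per_step_to: int) -> int:
    """Steps needed under the second setting to see the same amount of data."""
    total_tokens = steps * tokens_per_step_from
    return round(total_tokens / tokens_per_step_to)


# Hypothetical figures, NOT the actual vanilla/DEEP settings.
VANILLA_TOKENS_PER_STEP = 5_000
DEEP_TOKENS_PER_STEP = 16_000

if __name__ == "__main__":
    print(equivalent_steps(32_000, VANILLA_TOKENS_PER_STEP, DEEP_TOKENS_PER_STEP))
```

So a run that stops “early” in steps has not necessarily seen less data. Compare the data consumed, not the step counter.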
Otherwise, you’ll have to train a multilingual model (like lynxpda did for Veps) on Italian+Sardinian+Latin/English, or maybe roughly tune for Italian, then reset the optimizer and update the vocabs to Latin after 1,000 or 2,000 steps, using as little Italian as possible (2M) and avoiding corpora that feature modern concepts. I don’t know, I’ve never tried it, and it’s probably a pipe dream.
Why Sardinian? It has a clean NLLB corpus of 1M sentences and is the Romance language closest to ancient Latin (the people lived on an island with little to plunder and mountains to hide in, so they missed most of the invasions of the Middle Ages), so it should keep the model from overfitting to Italian and losing Latin’s specific grammatical and syntactic features.