The 1.9 French translation model is terrific. Thank you very much for your hard work on it.
Nevertheless, it is quite generic, and I would like to specialize some translations to match the specific vocabulary used in my company. Would it be possible to extend an existing trained model with some specialized tokens, or would I need to train the model from scratch using an argosdata file that contains my specific tokens?
Argos Train is designed to train each model from scratch on a single GPU in 24 hours and doesn’t support fine tuning. I also normally don’t save the model checkpoints after I quantize them and package them as a .argosmodel file.
If you want to train with custom data, you should make your own argosdata file and then train from scratch. Since Argos Train can access all of the existing data automatically, this should be pretty straightforward.
The 1.9 French model is actually an Opus-MT model and isn’t trained using Argos Train. So if you want to modify it you would have to find a way to modify their model and then re-convert it for Argos Translate.
Just to play and learn, I downloaded the NLLB corpus from OPUS (40 GB of French sentences, 34 GB of English sentences) and ran argos-train against it (French to English).
After consuming 100 GB of RAM and 60 GB of swap, with a lot of disk I/O, it generated a 28 GB argosdata file.
Then, I read the “Sampled 1000000 sentences from 657187427 sentences” log line. So I believe that argos-train limits the training to 1 million sentences.
If this is the case, then I need to limit the number of general-purpose sentences to ensure my specific sentences are taken into account.
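To make the concern concrete, here is a rough Python sketch of what I have in mind. It is not part of Argos Train: the function name `mix_corpora` and its parameters are hypothetical, and it assumes the 1,000,000 sentences are drawn uniformly at random, which is only my reading of the log line. The idea is to downsample the general-purpose pairs myself so that every company-specific pair fits inside the sampling budget.

```python
import random

# Figures taken from the log line; uniform sampling is an assumption.
SAMPLED = 1_000_000
TOTAL = 657_187_427
# Chance that any single sentence pair is kept (roughly 0.15 %).
sampling_fraction = SAMPLED / TOTAL


def mix_corpora(general_pairs, custom_pairs, budget, custom_copies=1, seed=0):
    """Hypothetical helper: keep every custom pair and fill the rest of
    the budget with a random subset of the general-purpose pairs.

    Each pair is a (source_sentence, target_sentence) tuple.
    """
    rng = random.Random(seed)
    # Optionally repeat custom pairs to give them more weight.
    kept_custom = list(custom_pairs) * custom_copies
    if len(kept_custom) > budget:
        raise ValueError("custom data alone exceeds the sentence budget")
    room = budget - len(kept_custom)
    general = list(general_pairs)
    # Uniform sample without replacement from the general corpus.
    sampled_general = rng.sample(general, min(room, len(general)))
    mixed = kept_custom + sampled_general
    rng.shuffle(mixed)
    return mixed


if __name__ == "__main__":
    general = [(f"phrase générale {i}", f"general sentence {i}") for i in range(1000)]
    custom = [("terme métier", "business term"), ("fiche produit", "product sheet")]
    mixed = mix_corpora(general, custom, budget=100)
    print(len(mixed), f"{sampling_fraction:.4%}")
```

This loads everything into memory, so it only works for a corpus that has already been cut down; for the full 657 million pairs a streaming approach would be needed instead.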
I will also have a look at the Opus-MT models and how to convert them for Argos Translate.