Language model training for argos-translate/LT: Locomotive

argosopentech · September 23, 2023, 1:07am

Nice! This should help to increase the production rate for Argos Translate models. I read through the code and here are some comments:

The docs are great!
I like this syntax: "file://D:\\path\\to\\mydataset-en_es"
The “.txt” extensions in the data packages work well. I don’t currently use any file extensions on the source and target files in “.argosdata” packages because I thought I might want to support other types of data in the future (images, audio, who knows). If the data is explicitly text I think the “.txt” extension is better.
Parallel file downloads are neat.
There’s an OpenNMT-py/tools/average_models.py script that averages neural network checkpoints. Is there a reason you averaged the checkpoints manually instead?
The functionality for automatically calculating BLEU scores is nice to have despite the limitations of BLEU scores we’ve found.
The option to run in toy mode is a great feature. I’ve found Argos Train difficult to test in large part because a complete training run takes so long.
I’ve found that int8 quantization, like you’re using, works well, but this is something you could experiment with.
I think the OpenNMT devs have been working a lot on OpenNMT data transforms, which could be useful if you want to filter or clean datasets. OpenNMT-py also has functionality for dataset weighting; for example, you could draw training examples from one dataset at double the rate of another.
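
On the checkpoint averaging question, here is roughly what I understand averaging to mean: an element-wise mean of each parameter across the saved checkpoints. This is only a minimal sketch in plain Python, not the code from average_models.py or from Locomotive. It assumes each checkpoint is a dict mapping a parameter name to a flat list of floats; real OpenNMT-py checkpoints hold PyTorch tensors, and the function name `average_checkpoints` is made up for illustration.

```python
from typing import Dict, List

# Hypothetical stand-in for a checkpoint: parameter name -> flat list of weights.
Checkpoint = Dict[str, List[float]]


def average_checkpoints(checkpoints: List[Checkpoint]) -> Checkpoint:
    """Return the element-wise mean of every parameter across checkpoints."""
    if not checkpoints:
        raise ValueError("need at least one checkpoint")

    count = len(checkpoints)
    averaged: Checkpoint = {}

    for name, first_values in checkpoints[0].items():
        # Start the running sum from the first checkpoint's values.
        total = list(first_values)
        for checkpoint in checkpoints[1:]:
            values = checkpoint[name]
            if len(values) != len(total):
                raise ValueError(f"shape mismatch for parameter {name!r}")
            total = [a + b for a, b in zip(total, values)]
        averaged[name] = [value / count for value in total]

    return averaged


if __name__ == "__main__":
    early = {"encoder.weight": [1.0, 2.0], "decoder.bias": [0.0]}
    late = {"encoder.weight": [3.0, 4.0], "decoder.bias": [1.0]}
    print(average_checkpoints([early, late]))
```

With real checkpoints the same loop would run over tensors instead of lists, which is why a shared script is convenient if it already fits the checkpoint format being saved.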