Nice! This should help to increase the production rate for Argos Translate models. I read through the code and here are some comments:
- The docs are great!
- I like this syntax:
"file://D:\\path\\to\\mydataset-en_es"
- The “.txt” extensions in the data packages work well. I don’t currently use any file extensions on the source and target files in “.argosdata” packages because I thought I might want to support other types of data in the future (images, audio, who knows). If the data is explicitly text, I think the “.txt” extension is better.
- Parallel file downloads are neat.
- There’s an OpenNMT-py/tools/average_models.py script that averages neural network checkpoints. Is there a reason you averaged the checkpoints manually instead?
- The functionality for automatically calculating BLEU scores is nice to have despite the limitations of BLEU scores we’ve found.
- The option to run in toy mode is a great feature. I’ve found Argos Train difficult to test in large part because a complete training run takes so long.
- I’ve found that int8 quantization, like you’re using, works well, but it’s something you could experiment with.
- I think the OpenNMT devs have been working a lot on OpenNMT data transforms, which could be useful if you want to filter or clean datasets. OpenNMT-py also has functionality for dataset weighting; for example, you could sample from one dataset at double the rate of another during training.
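
To make the dataset weighting idea concrete, here is a rough sketch of what "double the rate" means in practice. This is not OpenNMT-py code: `weighted_interleave` is a hypothetical helper I made up for illustration, and the example sentences are placeholders. As I understand it, OpenNMT-py handles this for you through a per-corpus `weight` setting in the training config, so you wouldn't write this yourself.

```python
from itertools import cycle, islice


def weighted_interleave(datasets, weights):
    """Yield examples forever, taking `weight` examples from each dataset in turn.

    Hypothetical illustration only; not part of OpenNMT-py or Argos Train.
    """
    if len(datasets) != len(weights):
        raise ValueError("need exactly one weight per dataset")
    if any(len(d) == 0 for d in datasets):
        raise ValueError("datasets must not be empty")
    # cycle() restarts a dataset when it runs out, so smaller datasets repeat.
    iterators = [cycle(d) for d in datasets]
    while True:
        for iterator, weight in zip(iterators, weights):
            for _ in range(weight):
                yield next(iterator)


# Placeholder data: draw from `main` at twice the rate of `extra`.
main = ["m1", "m2", "m3", "m4"]
extra = ["e1", "e2"]
sample = list(islice(weighted_interleave([main, extra], [2, 1]), 6))
print(sample)  # ['m1', 'm2', 'e1', 'm3', 'm4', 'e2']
```

With weights of 2 and 1, two out of every three training examples come from the first dataset, which is handy when you want a small, high-quality dataset to count for more than its size would suggest.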