Language model training for argos-translate/LT: Locomotive

I’ve been working on a set of scripts inspired by argos-train, mostly to improve on the following:

  • Support for running on Windows (my NVIDIA card is on a Windows machine)
  • Support for specifying local data sources as well as remote URLs
  • Automatic BLEU/interactive evaluation
  • Easier versioning for running multiple experiments

Install

git clone https://github.com/LibreTranslate/Locomotive --recurse-submodules
cd Locomotive
pip install -r requirements.txt

Usage

Local data sources are folders containing a source.txt and a target.txt with aligned sentences:

mydataset-en_es/
├── source.txt
└── target.txt
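For example, the two files hold one sentence per line, with line N of target.txt being the translation of line N of source.txt (the contents below are illustrative):

# source.txt
Hello!
How are you?

# target.txt
¡Hola!
¿Cómo estás?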

Create a config.json file specifying your sources:

{
    "from": {
        "name": "English",
        "code": "en"
    },
    "to": {
        "name": "Spanish",
        "code": "es"
    },
    "version": "1.0",
    "sources": [
        "file://D:\\path\\to\\mydataset-en_es",
        "http://data.argosopentech.com/data-ccaligned-en_es.argosdata",
    ]   
}

Train

python train.py --config config.json
==> run\en_es-1.0\translate-en_es-1_0.argosmodel

Evaluate

python eval.py --config config.json
Starting interactive mode
(en)> Hello!
(es)> ¡Hola!
(en)>
python eval.py --config config.json --bleu
BLEU score: 45.12354
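The --bleu mode computes this automatically; for a manual sanity check, a comparable corpus-level score can be computed with sacrebleu (a sketch with toy data, not Locomotive’s exact evaluation code):

import sacrebleu

# Toy data: model outputs and reference translations, one sentence per entry
hypotheses = ["¡Hola! Esto es una prueba."]
references = ["¡Hola! Esta es una prueba."]

# corpus_bleu takes the hypotheses and a list of reference streams
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU score: {bleu.score:.5f}")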

Hope this can be useful to others! Some quirks might still be present; I haven’t tested on Linux/macOS, but testing and adding support for those platforms is next. I’m also looking at the possibility of having an easy “install.py” command to add the model to the local argos-package directory so that it can be used in argos-translate/LT. :tada:
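In the meantime, the produced .argosmodel can be installed by hand with the argostranslate Python API (a sketch; the model path is the one printed by train.py above, written with forward slashes):

from argostranslate import package, translate

# Install the packaged model produced by train.py
package.install_from_path("run/en_es-1.0/translate-en_es-1_0.argosmodel")

# Look up the newly installed language pair and translate a test sentence
languages = translate.get_installed_languages()
en = next(l for l in languages if l.code == "en")
es = next(l for l in languages if l.code == "es")
print(en.get_translation(es).translate("Hello!"))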


I also need to add --reverse, which trains a reverse model.


Done!

python train.py --config config.json --reverse

Nice! This should help to increase the production rate for Argos Translate models. I read through the code and here are some comments:

  1. The docs are great!
  2. I like this syntax: "file://D:\\path\\to\\mydataset-en_es"
  3. The “.txt” extensions in the data packages work well. I don’t currently use any file extensions on the source and target files in “.argosdata” packages because I thought I might want to do other types of data in the future (images, audio, who knows). If the data is explicitly text I think the “.txt” extension is better.
  4. Parallel file downloads are neat.
  5. There’s an OpenNMT-py/tools/average_models.py script that averages neural network checkpoints. Is there a reason you averaged the checkpoints manually instead?
  6. The functionality for automatically calculating BLEU scores is nice to have despite the limitations of BLEU scores we’ve found.
  7. The option to run in toy mode is a great feature. I’ve found Argos Train difficult to test in large part because a complete training run takes so long.
  8. I’ve found that int8 quantization, like you’re using, works well, but this is something you could experiment with.
  9. I think the OpenNMT devs have been working a lot on OpenNMT data transforms, which could be useful if you want to filter or clean datasets. OpenNMT-py also has functionality for dataset weighting; for example, you could use data from one dataset at double the rate during training.

It’s actually the same code, just edited for use as a module rather than as a script. I found that running Python within Python (e.g. launching a Python subprocess from a Python process) was giving me some issues with module imports, so rather than troubleshooting those I decided to just edit the average_models.py script and use it as a module.
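For reference, the upstream script can also be run standalone from the command line (the checkpoint file names below are placeholders; -models and -output are the flags the OpenNMT-py tool documents):

python OpenNMT-py/tools/average_models.py -models model_step_40000.pt model_step_45000.pt model_step_50000.pt -output averaged.pt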

Yep, I was looking at the weight options for different corpora. I figured we can experiment with those once I have a simple pipeline working with a single corpus, just like argos-train uses, but it should be relatively easy to expand the program to have that ability.
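For reference, OpenNMT-py expresses per-corpus weighting in its YAML data config roughly like this (corpus names and paths are illustrative; Locomotive doesn’t expose this yet):

data:
    tilde:
        path_src: data/tilde.en
        path_tgt: data/tilde.es
        weight: 2   # sample this corpus twice as often
    ccaligned:
        path_src: data/ccaligned.en
        path_tgt: data/ccaligned.es
        weight: 1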


I was able to run Locomotive on a Vast.ai Linux server with an RTX 4090 GPU and the demo mostly worked out of the box.

python3 train.py --config model-config.json

I did get this error from CTranslate2 failing to use CUDA, but I fixed it by setting the CTranslate2 device to “cpu”:

python3 eval.py --config model-config.json
Starting interactive mode
(en)> Hello this is a test
Traceback (most recent call last):
  File "/root/Locomotive/eval.py", line 239, in <module>
    translation_obj = data["model"].translate_batch(
RuntimeError: Library libcublas.so.11 is not found or cannot be loaded
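For reference, the fix was just forcing CTranslate2 onto the CPU; its Translator constructor takes a device argument (the model directory below is illustrative, and the real variable names in eval.py may differ):

import ctranslate2

# device="cpu" avoids loading libcublas.so.11, which this machine's CUDA
# install couldn't provide; device="cuda" would use the GPU when it works.
translator = ctranslate2.Translator("run/en_es-1.0/model", device="cpu")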

I also decreased the number of train steps to make the training run shorter:

# train.py
-    'valid_steps': 5000, 
-    'train_steps': 50000, 
+    'valid_steps': 1000, 
+    'train_steps': 5000, 

Despite the short training time and using only the TildeMODEL dataset, the model seems to generate decent-quality translations.

Interactive mode

python3 eval.py --config model-config.json
Starting interactive mode
(en)> Hope this can be useful to others! Some quirks might still be present, I haven’t tested on Linux/macOS, but that testing/adding support for those platform is next. I’m also looking at the possibility of having an easy “install.py” command to add the model to the local argos-package directory so that it can be used in argos-translate/LT. :tada:
(it)> E' possibile che ciò possa essere utile per altri!, che potrebbero ancora essere presenti, hontoto sulla Linux/macOS, ma che le prove/astendono il sostegno a questa piattaforma sono prossime. I'm cerca anche di avere un semplice "impianto di comando" per aggiungere il modello al repertorio locale argos-pacchetto, affinché possa essere usato in unrgos-trans/LTda:
(en)> ^C

BLEU

python3 eval.py --config model-config.json --bleu
Downloading flores200 dataset...
Tokenizer 'spm' has been changed to 'flores101', and may be removed in the future.
BLEU score: 52.79772


Note: to override valid_steps and train_steps, you can place them directly into config.json; there’s no need to edit the .py files. Strange about the CTranslate2 error, I’ll need to investigate that!
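For example, adding these keys to the existing config.json (values matching the shortened run above):

{
    "valid_steps": 1000,
    "train_steps": 5000
}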
