Summary
So far, I have trained my own model twice, but both attempts yielded poor results. I am seeking advice on best practices for setting parameters such as the learning rate and warmup steps, and on understanding how these parameters actually behave.
First Attempt
During my first attempt, I set the learning rate to 0.005, with a total of 100,000 steps and approximately 16,000 warmup steps.
I discovered that changes to `warmup_steps` in `config.json` do not take effect if you resume training from a checkpoint after making the change; they only apply when training starts from scratch. Therefore, I believe there was virtually no warmup in practice during the first attempt.
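Here is a toy sketch of why I think this happens (the class and names below are hypothetical, not the toolkit's actual code): the checkpoint serializes the optimizer with its schedule settings baked in, so the edited config is never consulted when resuming.

```python
import pickle

class ToyOptimizer:
    """Stand-in for a real optimizer: schedule settings are fixed at construction."""
    def __init__(self, base_lr, warmup_steps):
        self.base_lr = base_lr
        self.warmup_steps = warmup_steps
        self.step_num = 0

# Training starts with warmup_steps=16000 and checkpoints the optimizer:
optim = ToyOptimizer(base_lr=0.005, warmup_steps=16000)
checkpoint = pickle.dumps(optim)

# Later, config.json is edited to a smaller warmup, but resuming
# deserializes the old object, so the edit never takes effect:
optim = pickle.loads(checkpoint)
print(optim.warmup_steps)  # still 16000
```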
The batch size was 3,072, but I don't think it had a significant impact given how long the training ran.
Ultimately, the model performed poorly. It was learning, but at a very slow pace. Specifically, the training accuracy reached only 18.5 after 100,000 steps.
The learning rate progressively decreased, ending around 5e-4.
The validation BLEU score was only 0.08. I would also like to know whether BLEU is reported on a [0, 1] scale or a [0, 100] scale.
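For calibration, here is how one common scorer behaves; this is sacrebleu, which may differ from whatever my training framework logs:

```python
import sacrebleu  # pip install sacrebleu

hypotheses = ["the cat sat on the mat"]
references = [["the cat sat on the mat"]]  # one stream of references
bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(bleu.score)  # 100.0 for a perfect match, i.e. a 0-100 scale
```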
There are several other metrics; I will list their final values here in case they are helpful. Generally, each moved in its better direction (higher or lower, depending on the metric) over the 100,000 training steps, but the improvement was quite slow.
- train/ppl: 751.6
- train/tgtper: 1.06e+4
- train/xent: 6.622
- valid/acc: 20.5
- valid/tgtper: 359.8
- valid/ppl: 564.1
- valid/xent: 6.331
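One consistency check I ran on these numbers, assuming perplexity is computed as exp(cross-entropy), which appears to hold here:

```python
import math

print(math.exp(6.622))  # ~751.4, matches train/ppl = 751.6
print(math.exp(6.331))  # ~561.7, close to valid/ppl = 564.1 (xent is rounded)
```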
Second Attempt
After reflecting on the first attempt, I suspected that the learning rate was too low. Thus, I configured the parameters as follows:
(Irrelevant configurations, such as datasets, have been omitted)
```json
{
  "vocab_size": 80000,
  "batch_size": 3072,
  "train_steps": 20000,
  "valid_steps": 500,
  "save_checkpoint_steps": 1000,
  "warmup_steps": 1400,
  "valid_batch_size": 3072,
  "learning_rate": 0.17,
  "max_grad_norm": 1.0,
  "accum_count": 12,
  "num_worker": 6,
  "model_dtype": "fp32",
  "keep_checkpoint": 12,
  "early_stopping": 5
}
```
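One detail worth noting about this config, assuming `batch_size` is counted in tokens and `accum_count` multiplies the effective batch (as gradient accumulation typically does):

```python
batch_size = 3072   # tokens per forward pass (assuming a token-based batch type)
accum_count = 12    # gradients accumulated before each optimizer update
print(batch_size * accum_count)  # 36864 tokens per parameter update
```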
However, the learning rate did not follow the schedule I expected.
As I understand it, based on this configuration, the learning rate should start from a small value and linearly increase to the target learning rate (0.17 in my case).
Instead, the learning rate remained fixed at 0.004543 during the warmup steps and only decreased after the warmup period.
Learning was indeed faster than in the first attempt, but I still believe this configuration is not optimal.
I would like to know what common values for these parameters are. By default, the learning rate stays well below the target learning rate during warmup and never reaches, or even approaches, it. Is this behavior normal?
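For reference, here is a sketch of the inverse-square-root ("Noam") schedule from the original Transformer paper, which I assume my toolkit implements when `warmup_steps` is set; under it, the effective learning rate peaks at `learning_rate * model_dim**-0.5 * warmup_steps**-0.5`, far below the nominal value (the `model_dim=512` below is an assumption about my model, not taken from the config):

```python
def noam_lr(step, base_lr=0.17, model_dim=512, warmup_steps=1400):
    """Noam schedule: linear warmup, then decay proportional to step**-0.5."""
    step = max(step, 1)  # avoid division by zero at step 0
    return base_lr * model_dim ** -0.5 * min(step ** -0.5,
                                             step * warmup_steps ** -1.5)

# Peak effective LR with my second-attempt settings:
print(noam_lr(1400))  # ~2.0e-4, nowhere near the nominal 0.17
```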