Testing and benchmarking machine translation models

Article comparing ChatGPT vs. specialized translation products.

There’s an old, old joke about machine translation. Supposedly, in the early 1960s, IBM unveiled a computer program that could translate between English and Russian. A general from the Pentagon asked if he could try it out. “Give me a phrase in English,” the IBM technician told him. “Out of sight, out of mind,” the general replied. The technician typed it in, and a few seconds later the computer printed out a phrase in Russian. But the general didn’t speak Russian. “Have the machine translate it back into English,” he suggested. The technician complied. A moment later, out came the result: “Invisible idiot.”

Well, all I can say is that the technology has improved a great deal. Below, to start, are four passages. The first is from the recent, excellent French translation of my book Men on Horseback. The second is a translation of that passage back into English by Google Translate. The third is a translation of the passage back into English by ChatGPT. The fourth is the original passage in English.



This is how I test most of the new language models for Argos Translate for languages I don’t speak.

Yes, the back-translation approach is a great way to test accuracy, though it doesn’t beat a linguist’s evaluation.
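The round-trip idea is easy to prototype. Below is a minimal, stdlib-only sketch: `round_trip_score` is a hypothetical helper that uses `difflib` character similarity as a crude stand-in for a proper metric (a real setup would score the round trip with BLEU/chrF via sacrebleu, or with COMET):

```python
from difflib import SequenceMatcher

def round_trip_score(original: str, back_translated: str) -> float:
    """Crude similarity proxy between the original text and its
    round-trip (source -> target -> source) translation.
    A real setup would use BLEU/chrF (sacrebleu) or COMET instead."""
    a, b = original.lower(), back_translated.lower()
    return SequenceMatcher(None, a, b).ratio()

# A faithful round trip should score close to 1.0 ...
good = round_trip_score("Out of sight, out of mind.",
                        "Out of sight, out of mind.")
# ... while a degraded one ("Invisible idiot") scores much lower.
bad = round_trip_score("Out of sight, out of mind.",
                       "Invisible idiot.")
print(f"{good:.2f} vs {bad:.2f}")
```

The score only tells you that *something* was lost, not what; a low score still needs a human (or a better metric) to diagnose.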

I’ve been thinking of launching an initiative to create a community group of native speakers that could help evaluate models. I just have to think of the right incentive model because I doubt people will just volunteer time to do this (and if they do, evaluations might not be done in a timely manner).


This is a neat idea! LibreTranslate has been able to attract many open source contributors to translate the interface on Weblate and I think people would volunteer to do this too. I agree that the volunteer contributions probably won’t be as timely though. To get people to rate the translations quickly and reliably you probably need to pay them.


Right, perhaps via a bounty/reward system.


Have you considered a neural network metric such as COMET (COMET-22 in particular)?
When I started solving the same problem of evaluating the resulting models, COMET turned out to be the best I had found so far.
Especially if you believe:


The program itself:

The model weights, as well as the program itself, are released under an open license (Apache-2.0).

It’s also quite convenient that quality is measured from 0 to 1 (where 0 is a bad translation and 1 is perfect).

Based on subjective tests of 6 models, I also saw a high correlation with my own assessment.

Perhaps the only negative is that their reference-free model is released under an NC license.


For example:

ru_lt1.7.txt(s)      score: 0.8480 (current in index)
ru_lt2.1.txt(s)      score: 0.8608 (my last Transformer Base model)
ru_lt2.2.txt(b)      score: 0.8835 (my last Transformer BIG model)
ru_gt.txt            score: 0.9060 (Google Translate)

en_lt1.0.txt(s)     score: 0.8038 (current in index)
en_lt1.2.txt(s)     score: 0.8545 (my last Transformer Base model)
en_lt1.3.txt(b)     score: 0.8598 (my last Transformer BIG model)
en_gt.txt           score: 0.8752 (Google Translate)

I think BLEU has a lot of shortcomings; I haven’t played with COMET, but I think it would make a good improvement over existing evaluation metrics. It would be interesting to see if it can be integrated into OpenNMT.


And one more thing: I noticed something strange. In Locomotive, the BLEU metric is still calculated automatically during the training process, but previously I calculated it through eval.py.
The metric computed during training correlates better with subjective assessment and has the expected magnitude, for example 35-37 for the RU-EN model.
But the BLEU estimate via eval.py is 58-61 and changes in a strange way, as if it is not being calculated quite correctly.

I also tried calculating BLEU through the website (where you can easily analyze it for individual sentences and different n-grams) and received a comparable estimate of 35-37 for the models.

I also noticed a difference in BLEU between OpenNMT eval and eval.py in Locomotive; it might be because OpenNMT uses the default tokenizer, whereas eval.py used the “flores200” tokenizer.

I just changed the tokenizer in Locomotive’s eval.py to match the logic in OpenNMT, see if the numbers match more closely after Improve eval.py, use same BLEU tokenizer as opennmt · LibreTranslate/Locomotive@a9c2aa1 · GitHub
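The reason the tokenizer matters so much is that BLEU’s n-gram counts are computed over tokens, so the same hypothesis/reference pair can score very differently under different tokenizations. A toy stdlib-only illustration (clipped unigram precision only, not full BLEU; `unigram_precision` is a made-up helper, and word- vs character-level tokenization stands in for the 13a vs flores200 difference):

```python
from collections import Counter

def unigram_precision(hyp_tokens, ref_tokens):
    """Clipped unigram precision, the first ingredient of BLEU."""
    hyp, ref = Counter(hyp_tokens), Counter(ref_tokens)
    overlap = sum(min(c, ref[t]) for t, c in hyp.items())
    return overlap / max(sum(hyp.values()), 1)

hyp = "Das ist ein Beispielsatz."
ref = "Das ist ein Beispiel-Satz."

# Word-level tokenization: "Beispielsatz." != "Beispiel-Satz." -> no credit.
word_p = unigram_precision(hyp.split(), ref.split())

# Character-level tokenization: most characters still match -> higher score.
char_p = unigram_precision(list(hyp), list(ref))

print(word_p, char_p)
```

Finer-grained tokenization gives partial credit for near-matches, which is why a subword-based tokenizer like flores200 can inflate the absolute BLEU number relative to the default; the scores are only comparable when the tokenizer is held fixed.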


Hello everybody,
@lynxpda Could you please tell how you implemented the COMET evaluation?
It needs a source, a reference, and a translation: the source and reference are available from WMT; however, the translation is a local file.

comet-score -d wmt22:en-de -t PATH/TO/TRANSLATIONS

To produce the file, did you translate the whole source at once or scripted it and how?

To be clear, BLEU is unreliable for assessing models, and valid BLEU is way too sensitive (a 0.2 variation translates into twice as many errors in a translated text), so I need a third tool to compare models, and COMET does just that.
Thanks in advance,



I used COMET-22 (this is an important point). It is a reference-based neural network metric.
I translated a file from the FLORES dataset using a script and evaluated it using this bash script:

# create the installer env
python3 -m venv venv

# activate installer env
source venv/bin/activate

# install requirements
python -m pip install unbabel-comet

# start
comet-score -s fl200-en.txt -t lt-et.txt -r fl200-et.txt
sacrebleu fl200-et.txt -i lt-et.txt -m bleu chrf ter

fl200-en.txt - source text for translation (FLORES200)
lt-et.txt - text translated from EN to ET using the model being checked
fl200-et.txt - reference translation (FLORES200)

In theory, this can be completely automated, but I haven’t gotten around to it yet.


So far, I have managed to edit eval.py to add a “--translate_flores” argument and automate translating the flores-200 dataset to a text file in the running directory, which allows running the above-mentioned script and getting a score (using the --quiet and --only_system arguments is less verbose).

I have also scripted loading the default COMET model, but haven’t gotten around to the scoring script yet.

When I am done, I will put the modifications in a pull request for Locomotive.

One thing: it requires adding “unbabel-comet” to the requirements, which is licensed under Apache 2.0; nothing wrong with that?
Another: the “protobuf” dependency in COMET is a v4, while the one in Locomotive is a v5. I am not sure what the difference is, since the library is not documented on PyPI.


So far, training has not changed with the “protobuf” dependency at version 4, so I guess it is not that important.

Also, I ran COMET-22 by mistake on an untranslated German flores200 and obtained 0.75 — that is, the relative linguistic distance between the two languages. That gave me food for thought, and I decided not to bother anymore about scripting the COMET score into Locomotive’s eval.py.


I was writing the pull requests for the translate_flores function and to add an argument to switch to flores devtest instead of dev on demand… and since it was intertwined with COMET score automation, I finished scripting it nonetheless.

Still have to test it thoroughly though: because of the protobuf discrepancy, verification will require installing Locomotive with unbabel-comet from scratch, so it’ll be a few weeks before I get back to you.

COMET is actually helpful because it is devoid of the biases that hamper BLEU evaluation, even though it is far less sensitive.

An example:
BLEU-wise, BASE models are indistinguishable from models with “accum_count” = 25 and “transformer_ff” = 4096.

Upon manual evaluation, scores are somewhat better with the latter: BASE models are unable to translate any term above a certain number of tokens, rendering it as gibberish, whereas the “GRASP” models, as I call them, translate better but use poor syntax and some semantic shortcuts that lower the BLEU score disproportionately.

Although the score difference is only 1%, COMET is able to highlight better translations quite consistently: one has to be aware that 1% more in COMET is actually worth a great deal to final users.

Actually, I now use both metrics: when BLEU shoots up on either the dev or devtest FLORES set, it also means something. The best German→English model I have trained so far has a BLEU(dev) of 69, way above the others, and actually gives better translations, even though the COMET score is quite similar.

I’ll tell you how I came by it when I’m finished delving into hyperparameters for good. I have already come to some quite interesting conclusions that you might like to read.
For now, I have established that:

  • the optimal “transformer_ff” value is 8*word_vec_size (or hidden_size; the two must be equal)
  • optimal “enc_layers” values are multiples of “dec_layers”… which means 18 encoder layers is better than 20 if you leave the decoder at its default. Actually 12 layers is not bad either; depending on the quality of your training dataset it can be even better than 20…
  • there is a second-order effect between “vocab_size” and “enc_layers”: the more layers, the blurrier the classes, so at the default (6) encoder layers the optimal “vocab_size” is 50k, but not for 12 or 18 layers (I have used 32k and am testing 40k)
  • the more layers, the blurrier the classes, so I have dabbled with a known antidote: “label_smoothing”. After fumbling quite a few times, I found 0.125 to marginally improve the synonym issue on an 18-layer model
  • I also noticed that models synthesized retrospectively with the “inflight” argument from one of the two checkpoints with the highest “val.BLEU” value are often better than end-of-training ones…
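Pulled together, those rules of thumb might look like the following OpenNMT-py-style YAML fragment. The values are purely illustrative assumptions based on the bullets above, not a validated recipe:

```yaml
# Sketch of the settings discussed above (OpenNMT-py-style options);
# values are illustrative, not a validated recipe.
word_vec_size: 512
hidden_size: 512          # must equal word_vec_size
transformer_ff: 4096      # 8 * hidden_size
enc_layers: 12            # a multiple of dec_layers
dec_layers: 6
src_vocab_size: 32000     # 50k at 6 enc layers, ~32k at 12-18
tgt_vocab_size: 32000
label_smoothing: 0.125    # counteracts "blurry classes" in deep encoders
```

Any such fragment would of course be merged into the rest of a full training config (data paths, optimizer, steps) before use.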

Of course, all of this is highly dependent on the quality of your training data, which indeed accounts for 80% of your model’s ability. But the last 20% helps beat many of the publicly available systems (except Google for exactitude and DeepL for fluidity).

Google’s job is data, so there is no way they won’t always have the best-aligned sources to train their models, hence stay in front regarding semantic exactitude.

With DeepL, an OpenNMT model can actually score at par word-for-word. Their advantage lies in fluidity. How they obtain such verisimilitude in translation is anyone’s guess; mine is that they likely use Levenshtein-like decoders, probably in combination with some conformer features.

What I have come to take as a certainty is that standard transformers are simply incapable of such fluidity. It takes in-depth architecture modification to recognize and rearrange not only individual tokens but also to process them within syntagmatic blocks of tokens, probably defined as a class. I’m not even sure the attention mechanism is suited to this kind of task; it looks more like a convolutional mechanism or placeholder substitution at work, hence my guess.


Great research! I fully support it. I also came to similar results with COMET-22: somewhere above 0.85, each additional 1% increase corresponds to an exponential growth in model quality. And in my opinion, COMET-22 reflects the real quality of the translation more reliably.

I also wanted to make an announcement about my research: I have managed to train a model that is ahead of Google on the BLEU metric and only slightly behind on COMET-22 (0.9000 vs 0.9060). I’ll post more details later in a separate topic, but in general:

  • Increased the number of decoder layers (now 20+20)
  • Changed the activation function to GeGLU
  • Changed positional encoding to relative positional encoding (RoPE works even better; a PR is now open in CTranslate2 to support model conversion)
  • For now, I changed my version of Locomotive locally, adding support for prefixes and suffixes, updating the vocabulary and resetting the optimizer parameters, plus support for vocabs with the byte_fallback option (there are no more unknown tokens: the translation handles everything — smileys, hieroglyphs, Arabic script; anything missing from the vocabulary is now also processed!) (later I will also open a PR with all the changes)

The next thing I’m working on is pre-trained models. This opens up the possibility of training a new language pair on a 450M-parameter model in just 24 hours on a modest RTX 3060.
If possible, the goal will be to choose an optimal base model that provides maximum performance at a size of up to 200M.

Actually, another advantage of integrating COMET into Locomotive is that you can then use the comet-compare command (although I would not dabble in scripting it; eval.py is designed to evaluate one model, not to compare several) to compare hyperparameter effects individually, and after the last PR we did yesterday, it is quite easy to get both flores200 datasets translated.

That is how I plan to confirm what I told earlier…

And actually, yes, dabbling with non-obvious parameters is likely to yield results, although not consistently. My trials with label_smoothing and transformer_ff are a notable example of that.

@lynxpda : on which language pair have you come to par with Google?

For the EN_to_RU model:

Model              BLEU    COMET-22   Note
ru_lt1.7.txt(s)    27.60   0.8480
ru_lt2.1.txt(s)    28.80   0.8608     (PPL12ST200kBS0.1m)
ru_lt2.2.txt(b)    32.00   0.8835     (PPL9.6ST140kBS0.1m) (model in index v1.9)
ru_lt2.5.txt(md)   33.40   0.8941     (PPL9.02ST131kBS0.4m)
ru_lt2.8.txt       34.40   0.9000     (PPL8.27)
ru_gt.txt          34.30   0.9060     Google Translate

Each subsequent improvement becomes more and more difficult, despite the huge number of optimizations. Subjectively, yes, the quality is on par with Google: in some places a little better, in others a little worse. As far as quality goes, I now consider my task complete.

Can you explain the difference between versions 2.1 to 2.8 more extensively, please? I work on German→English and my perplexities (PPL) are around 8.6 currently (see the corresponding post for further info about data curation).

Actually, regarding the encoder/decoder layers, my opinion is that you gain by using 20+20 instead of 20+6 because you restore congruence between the encoder and decoder.
Using comet-compare three by three, with models built from the same source, target, and SentencePiece model, these rank as follows: 18/6 > 20/6 >= 12/6 > 12/12 > 6/6 > 8/6.
20/6 and 12/6 are tied on the flores200-devtest dataset, with a fair advantage for 12/6; on manual evals the advantage is reversed but it is still a tie, and 20/6 wins on the flores200-dev dataset. That is my reason for not discarding the dev dataset: it gives an interesting second opinion.

You might have a try at 20/10 or 20/5 models and gain some more… I think decoder depth has to be regular and allow consistent insertion of the encoder layers’ output into the decoder’s attention mechanism, but it does not have to be symmetrical.
As for deep decoders performing badly at inference, I did some load assessments using “Locust” on a self-hosted instance and noticed a small difference, but nothing terrible.