Playing Around with Machine Translation

Article comparing ChatGPT vs. specialized translation products.

There’s an old, old joke about machine translation. Supposedly, in the early 1960’s, IBM unveiled a computer program that could translate between English and Russian. A general from the Pentagon asked if he could try it out. “Give me a phrase in English,” the IBM technician told him. “Out of sight, out of mind,” the general replied. The technician typed it in, and a few seconds later the computer printed out a phrase in Russian. But the general didn’t speak Russian. “Have the machine translate it back into English,” he suggested. The technician complied. A moment later, out came the result: “Invisible idiot.”

Well, all I can say is that the technology has improved a great deal. Below, to start, are four passages. The first is from the recent, excellent French translation of my book Men on Horseback. The second is a translation of that passage back into English by Google Translate. The third is a translation of the passage back into English by ChatGPT. The fourth is the original passage in English.

This is how I test most of the new language models for Argos Translate when I don’t speak the language.
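
A minimal sketch of such a round-trip (back-translation) check using the argostranslate Python API; the language pair and sample sentence here are just placeholders:

import argostranslate.package
import argostranslate.translate

def install_package(from_code, to_code):
    # Download and install the Argos Translate package for one direction
    argostranslate.package.update_package_index()
    available = argostranslate.package.get_available_packages()
    pkg = next(p for p in available
               if p.from_code == from_code and p.to_code == to_code)
    argostranslate.package.install_from_path(pkg.download())

src, tgt = "en", "fr"  # placeholder pair; use the language being tested
install_package(src, tgt)
install_package(tgt, src)

text = "Out of sight, out of mind."
forward = argostranslate.translate.translate(text, src, tgt)
back = argostranslate.translate.translate(forward, tgt, src)

print("original:        ", text)
print("forward:         ", forward)
print("back-translation:", back)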

Yes, the back-translation approach is a great way to test accuracy, but it doesn’t beat a linguist’s evaluation.

I’ve been thinking of launching an initiative to create a community group of native speakers that could help evaluate models. I just have to think of the right incentive model because I doubt people will just volunteer time to do this (and if they do, evaluations might not be done in a timely manner).

This is a neat idea! LibreTranslate has been able to attract many open source contributors to translate the interface on Weblate and I think people would volunteer to do this too. I agree that the volunteer contributions probably won’t be as timely though. To get people to rate the translations quickly and reliably you probably need to pay them.

Right, perhaps via a bounty/reward system.

Have you considered a neural network metric such as COMET (COMET-22 in particular)?
When I started tackling the same problem of evaluating the resulting models, COMET turned out to be the best option I had found so far.
Especially if you believe their published results.

The program itself:
https://pypi.org/project/unbabel-comet/

The model weights, as well as the program itself, are released under an open license (Apache-2.0).

It’s also quite convenient that quality is measured from 0 to 1 (where 0 is a bad translation and 1 is perfect).

Based on subjective tests on 6 models, I also saw a high correlation with my assessment.

Perhaps the only downside is that their reference-free model is released under an NC (non-commercial) license.
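
Besides the comet-score command line tool, the metric can also be called from Python. A minimal sketch with the documented unbabel-comet API (the example segments are placeholders; gpus=0 keeps it on the CPU):

from comet import download_model, load_from_checkpoint

# Download and load the reference-based COMET-22 checkpoint
model_path = download_model("Unbabel/wmt22-comet-da")
model = load_from_checkpoint(model_path)

# Each item needs the source, the machine translation and the reference
data = [
    {"src": "Hors de vue, hors de l'esprit.",
     "mt": "Invisible idiot.",
     "ref": "Out of sight, out of mind."},
]

output = model.predict(data, batch_size=8, gpus=0)
print(output.scores)        # per-segment scores (0 = bad, 1 = perfect)
print(output.system_score)  # corpus-level average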

For example:

EN_to_RU
ru_lt1.7.txt(s)      score: 0.8480 (current model in the index)
ru_lt2.1.txt(s)      score: 0.8608 (my latest Transformer Base model)
ru_lt2.2.txt(b)      score: 0.8835 (my latest Transformer BIG model)
ru_gt.txt            score: 0.9060 (Google Translate, for comparison)

RU_to_EN
en_lt1.0.txt(s)     score: 0.8038 (current model in the index)
en_lt1.2.txt(s)     score: 0.8545 (my latest Transformer Base model)
en_lt1.3.txt(b)     score: 0.8598 (my latest Transformer BIG model)
en_gt.txt           score: 0.8752 (Google Translate, for comparison)

I think BLEU has a lot of shortcomings; I haven’t played with COMET, but I think it would be a good improvement over current evaluation metrics. It would be interesting to see if it can be integrated into OpenNMT.

One more thing: I noticed something strange. In Locomotive the BLEU metric is still calculated automatically during training, but previously I also calculated it through eval.py.
The metric computed during training correlates better with my subjective assessment and is in the expected range, for example 35-37 for the RU-EN model.
The BLEU estimate from eval.py, however, is 58-61 and changes in a strange way, as if it is not being calculated quite correctly.

I also tried calculating BLEU through the website (there you can easily analyze individual sentences and different n-grams) and got a comparable estimate for the models, 35-37.

I also noticed a difference in BLEU between OpenNMT’s evaluation and eval.py in Locomotive; it might be because OpenNMT uses the default tokenizer, whereas eval.py used the “flores200” tokenizer.

I just changed the tokenizer in Locomotive’s eval.py to match the logic in OpenNMT; see if the numbers match more closely after the commit “Improve eval.py, use same BLEU tokenizer as opennmt” (https://github.com/LibreTranslate/Locomotive/commit/a9c2aa1).
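
To see how much the tokenizer alone can move the number, here is a small sketch with the sacrebleu Python API (the sentences are placeholders, and the “flores200” tokenizer requires a recent sacrebleu release):

from sacrebleu.metrics import BLEU

hypotheses = ["The cat sits on the mat."]          # model output (placeholder)
references = [["The cat is sitting on the mat."]]  # one reference set

# Compare the default-style "13a" tokenizer with the flores200 SPM tokenizer
for tok in ("13a", "flores200"):
    bleu = BLEU(tokenize=tok)
    result = bleu.corpus_score(hypotheses, references)
    print(f"tokenize={tok}: BLEU = {result.score:.2f}")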

Hello everybody,
@lynxpda Could you please tell me how you implemented the COMET evaluation?
It needs a source, a reference, and a translation: the source and reference are available from WMT, but the translation is a local file.

comet-score -d wmt22:en-de -t PATH/TO/TRANSLATIONS

To produce that file, did you translate the whole source at once, or did you script it? If so, how?

To be clear, BLEU is unreliable for assessing models, and validation BLEU is way too sensitive (a 0.2 variation translates into twice as many errors in the translated text), so I need a third tool to compare models, and COMET does just that.
Thanks beforehand,

Hi!

I used COMET-22 (this is an important point); it is a reference-based neural network metric.
I translated a file from the FLORES dataset with a script and evaluated it using this bash script:

#!/bin/bash
# create a virtual environment
python3 -m venv venv

# activate the environment
source venv/bin/activate

# install requirements (sacrebleu is needed for the second command below)
python -m pip install unbabel-comet sacrebleu

# score the translation with COMET (-s source, -t translation, -r reference)
comet-score -s fl200-en.txt -t lt-et.txt -r fl200-et.txt
# report BLEU, chrF and TER for comparison
sacrebleu fl200-et.txt -i lt-et.txt -m bleu chrf ter

Where:
fl200-en.txt - source text for translation (FLORES200)
lt-et.txt - text translated from EN to ET using the model being checked
fl200-et.txt - reference translation (FLORES200)

In theory, this can be completely automated, but I haven’t gotten around to it yet.
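
A rough sketch of what that automation could look like, combining the argostranslate and unbabel-comet Python APIs (file names follow the example above; the en-to-et package is assumed to be installed already, and this is untested glue code rather than the actual pipeline):

import argostranslate.translate
from comet import download_model, load_from_checkpoint

SRC_FILE = "fl200-en.txt"   # FLORES200 source (English)
REF_FILE = "fl200-et.txt"   # FLORES200 reference (Estonian)
OUT_FILE = "lt-et.txt"      # translations produced by the model under test

# Translate the source line by line with the installed en->et model
with open(SRC_FILE, encoding="utf-8") as f:
    sources = [line.strip() for line in f]
translations = [argostranslate.translate.translate(s, "en", "et") for s in sources]
with open(OUT_FILE, "w", encoding="utf-8") as f:
    f.write("\n".join(translations) + "\n")

# Score the output against the reference with COMET-22
with open(REF_FILE, encoding="utf-8") as f:
    references = [line.strip() for line in f]
data = [{"src": s, "mt": m, "ref": r}
        for s, m, r in zip(sources, translations, references)]
model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
output = model.predict(data, batch_size=8, gpus=0)  # gpus=1 if a GPU is available
print(f"COMET-22 system score: {output.system_score:.4f}")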
