Playing Around with Machine Translation

An article comparing ChatGPT with specialized translation products:

There’s an old, old joke about machine translation. Supposedly, in the early 1960s, IBM unveiled a computer program that could translate between English and Russian. A general from the Pentagon asked if he could try it out. “Give me a phrase in English,” the IBM technician told him. “Out of sight, out of mind,” the general replied. The technician typed it in, and a few seconds later the computer printed out a phrase in Russian. But the general didn’t speak Russian. “Have the machine translate it back into English,” he suggested. The technician complied. A moment later, out came the result: “Invisible idiot.”

Well, all I can say is that the technology has improved a great deal. Below, to start, are four passages. The first is from the recent, excellent French translation of my book Men on Horseback. The second is a translation of that passage back into English by Google Translate. The third is a translation of the passage back into English by ChatGPT. The fourth is the original passage in English.



This is how I test most new language models for Argos Translate when I don’t speak the language.
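
Concretely, a minimal round-trip check with the Argos Translate Python API looks roughly like this (the en/fr pair and the sample sentence are just placeholders):

import argostranslate.package
import argostranslate.translate

# Install the English<->French packages from the package index
argostranslate.package.update_package_index()
for pkg in argostranslate.package.get_available_packages():
    if (pkg.from_code, pkg.to_code) in {("en", "fr"), ("fr", "en")}:
        argostranslate.package.install_from_path(pkg.download())

original = "Out of sight, out of mind."
forward = argostranslate.translate.translate(original, "en", "fr")
back = argostranslate.translate.translate(forward, "fr", "en")

print("EN:      ", original)
print("EN -> FR:", forward)
print("FR -> EN:", back)  # compare against the original by eye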

Yes, back-translation is a great method for testing accuracy, though it doesn’t beat a linguist’s evaluation.

I’ve been thinking of launching an initiative to create a community group of native speakers who could help evaluate models. I just have to figure out the right incentive model, because I doubt people will simply volunteer their time for this (and if they do, evaluations might not be done in a timely manner).


This is a neat idea! LibreTranslate has been able to attract many open source contributors to translate the interface on Weblate and I think people would volunteer to do this too. I agree that the volunteer contributions probably won’t be as timely though. To get people to rate the translations quickly and reliably you probably need to pay them.


Right, perhaps via a bounty/reward system.


Have you considered a neural network metric such as COMET (COMET-22 in particular)?
When I started tackling the same problem of evaluating the resulting models, COMET turned out to be the best option I had found so far.
Especially if you believe their reported results.


The program itself:
https://pypi.org/project/unbabel-comet/

The model weights, as well as the program itself, are released under an open license (Apache-2.0).

It’s also quite convenient that quality is measured from 0 to 1 (where 0 is a bad translation and 1 is perfect).

Based on subjective tests of six models, I also saw a high correlation with my own assessments.

Perhaps the only downside is that their reference-free model is released under a non-commercial (NC) license.
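
For reference, a minimal scoring run with the unbabel-comet package looks roughly like this (the sentences are placeholders; "Unbabel/wmt22-comet-da" is the reference-based COMET-22 checkpoint):

from comet import download_model, load_from_checkpoint

# Download and load the reference-based COMET-22 model
model_path = download_model("Unbabel/wmt22-comet-da")
model = load_from_checkpoint(model_path)

# Each sample needs the source, the machine translation, and a reference
data = [
    {
        "src": "Loin des yeux, loin du coeur.",
        "mt": "Out of sight, out of mind.",
        "ref": "Out of sight, out of mind.",
    },
]

output = model.predict(data, batch_size=8, gpus=0)  # gpus=1 if a GPU is available
print(output.scores)        # per-sentence scores, roughly 0..1
print(output.system_score)  # corpus-level average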


For example:

EN_to_RU
ru_lt1.7.txt(s)      score: 0.8480 (current in index)
ru_lt2.1.txt(s)      score: 0.8608 (my last Transformer Base model)
ru_lt2.2.txt(b)      score: 0.8835 (my last Transformer BIG model)
ru_gt.txt            score: 0.9060 (Google Translate)

RU_to_EN
en_lt1.0.txt(s)      score: 0.8038 (current in index)
en_lt1.2.txt(s)      score: 0.8545 (my last Transformer Base model)
en_lt1.3.txt(b)      score: 0.8598 (my last Transformer BIG model)
en_gt.txt            score: 0.8752 (Google Translate)

I think BLEU has a lot of shortcomings; I haven’t played with COMET, but I think it would be a good improvement over our current evaluation metrics. It would be interesting to see if it can be integrated into OpenNMT.


One more thing: I noticed something strange. In Locomotive, BLEU is still calculated automatically during the training process, and previously I also calculated it through eval.py.
The metric computed during training correlates better with my subjective assessment and has the expected magnitude, e.g. 35-37 for the RU-EN model.
The BLEU estimate from eval.py, however, is 58-61 and changes in a strange way, as if it is not being calculated quite correctly.

I also tried calculating BLEU through the website (where you can easily analyze individual sentences and different n-grams) and got a comparable estimate for the models, 35-37.

I also noticed a difference in BLEU between OpenNMT’s eval and Locomotive’s eval.py; it might be because OpenNMT uses the default tokenizer, whereas eval.py used the “flores200” tokenizer.
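
Assuming both numbers ultimately come from sacrebleu (which I haven’t verified), the tokenizer choice alone can shift the score noticeably; a quick way to check:

import sacrebleu

hyps = ["The cat sits on the mat."]          # system output (placeholder)
refs = [["The cat is sitting on the mat."]]  # one inner list per reference set

# sacrebleu's default "13a" tokenizer
bleu_default = sacrebleu.corpus_bleu(hyps, refs)

# SentencePiece-based "flores200" tokenizer, as eval.py had been using
bleu_flores = sacrebleu.corpus_bleu(hyps, refs, tokenize="flores200")

print(f"13a:       {bleu_default.score:.2f}")
print(f"flores200: {bleu_flores.score:.2f}")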

I just changed the tokenizer in Locomotive’s eval.py to match the logic in OpenNMT; see if the numbers match more closely after Improve eval.py, use same BLEU tokenizer as opennmt · LibreTranslate/Locomotive@a9c2aa1 · GitHub
