Article comparing ChatGPT vs. specialized translation products.
There’s an old, old joke about machine translation. Supposedly, in the early 1960s, IBM unveiled a computer program that could translate between English and Russian. A general from the Pentagon asked if he could try it out. “Give me a phrase in English,” the IBM technician told him. “Out of sight, out of mind,” the general replied. The technician typed it in, and a few seconds later the computer printed out a phrase in Russian. But the general didn’t speak Russian. “Have the machine translate it back into English,” he suggested. The technician complied. A moment later, out came the result: “Invisible idiot.”
Well, all I can say is that the technology has improved a great deal. Below, to start, are four passages. The first is from the recent, excellent French translation of my book Men on Horseback. The second is a translation of that passage back into English by Google Translate. The third is a translation of the passage back into English by ChatGPT. The fourth is the original passage in English.
Yes, back-translation is a great method for testing accuracy, though it doesn’t beat a linguist’s evaluation.
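If anyone wants to run this kind of round-trip test against their own models, here is a minimal sketch using LibreTranslate’s /translate endpoint. The URL, language pair, and sample text are just placeholders, and a hosted instance would also need an api_key in the request:

```python
import requests

# Placeholder endpoint; point this at your own LibreTranslate instance.
BASE_URL = "http://localhost:5000/translate"

def translate(text, source, target):
    # LibreTranslate's /translate endpoint takes the text in "q".
    resp = requests.post(BASE_URL, json={
        "q": text,
        "source": source,
        "target": target,
        "format": "text",
    })
    resp.raise_for_status()
    return resp.json()["translatedText"]

original = "Out of sight, out of mind."
french = translate(original, "en", "fr")   # English -> French
back = translate(french, "fr", "en")       # French -> English (round trip)

print("Original:", original)
print("French:  ", french)
print("Back:    ", back)
```

Comparing the round-tripped text with the original gives a quick sanity check, but as noted above it’s no substitute for a native speaker’s judgement.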
I’ve been thinking of launching an initiative to create a community group of native speakers who could help evaluate models. I just have to figure out the right incentive model, because I doubt people will volunteer their time for this (and if they do, evaluations might not be done in a timely manner).
This is a neat idea! LibreTranslate has been able to attract many open-source contributors to translate the interface on Weblate, and I think people would volunteer to do this too. I agree that the volunteer contributions probably won’t be as timely, though. To get people to rate the translations quickly and reliably, you probably need to pay them.
Have you considered a neural-network-based metric such as COMET (COMET-22 in particular)?
When I started tackling the same problem of evaluating the resulting models, COMET turned out to be the best option I have found so far.
Especially if you trust numbers like these:
ru_lt1.7.txt(s) score: 0.8480 (current in index)
ru_lt2.1.txt(s) score: 0.8608 (my last Transformer Base model)
ru_lt2.2.txt(b) score: 0.8835 (my last Transformer BIG model)
ru_gt.txt score: 0.9060 (Google Translate, for comparison)
en_lt1.0.txt(s) score: 0.8038 (current in index)
en_lt1.2.txt(s) score: 0.8545 (my last Transformer Base model)
en_lt1.3.txt(b) score: 0.8598 (my last Transformer BIG model)
en_gt.txt score: 0.8752 (Google Translate, for comparison)
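For reference, here is a minimal sketch of how COMET-22 scores like these can be computed with the unbabel-comet package (pip install unbabel-comet). The file names are placeholders, not the actual files listed above; each file is assumed to have one sentence per line, aligned across files:

```python
from comet import download_model, load_from_checkpoint

# Download the reference-based COMET-22 checkpoint and load it.
model_path = download_model("Unbabel/wmt22-comet-da")
model = load_from_checkpoint(model_path)

def read_lines(path):
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f]

# Placeholder file names: source sentences, MT output, human references.
sources = read_lines("src.ru.txt")
hypotheses = read_lines("mt.en.txt")
references = read_lines("ref.en.txt")

data = [
    {"src": s, "mt": m, "ref": r}
    for s, m, r in zip(sources, hypotheses, references)
]

# The system-level score is the mean of the segment-level scores (roughly 0..1).
output = model.predict(data, batch_size=8, gpus=0)
print("COMET-22 system score:", output.system_score)
```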
I think BLEU has a lot of shortcomings; I haven’t played with COMET, but I think it would be a real improvement over the current evaluation metrics. It would be interesting to see if it can be integrated into OpenNMT.
One more thing: I noticed something strange. In Locomotive, the BLEU metric is now calculated automatically during training, but previously I calculated it through eval.py.
The metric computed during training correlates better with my subjective assessment and has the expected magnitude; for the RU-EN model, for example, it is 35-37.
The BLEU estimate from eval.py, on the other hand, is 58-61 and changes in strange ways, as if it is not being calculated quite correctly.
I also tried calculating BLEU through a website (where you can easily analyze individual sentences and different n-grams) and got a comparable estimate for the models, 35-37.
I also noticed a difference in BLEU between the OpenNMT evaluation and eval.py in Locomotive; it might be because OpenNMT uses the default tokenizer, whereas eval.py used the “flores200” tokenizer.
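If the tokenizer is the cause, it should be easy to confirm with sacrebleu directly, since the tokenizer is just a parameter. A sketch, assuming hypothesis and reference files with one sentence per line (file names are placeholders; the “flores200” tokenizer needs a recent sacrebleu plus the sentencepiece package):

```python
import sacrebleu

def read_lines(path):
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f]

# Placeholder file names.
hyps = read_lines("hyp.en.txt")
refs = read_lines("ref.en.txt")

# Default "13a" tokenizer (what sacrebleu uses when none is specified)...
bleu_default = sacrebleu.corpus_bleu(hyps, [refs])
# ...versus the SentencePiece-based "flores200" tokenizer that eval.py reportedly uses.
bleu_flores = sacrebleu.corpus_bleu(hyps, [refs], tokenize="flores200")

print("BLEU (13a):      ", bleu_default.score)
print("BLEU (flores200):", bleu_flores.score)
```

Running both on the same hypothesis/reference pair would show how much of the 35-37 vs. 58-61 gap is explained by tokenization alone.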