BLEU score of LibreTranslate models

BLEU scores (FLORES-200)

en-ru 21.29
en-es 18.14
ca-en 35.45
en-ca 27.11
en-cs 13.6
pl-en 11.22
ga-en 25.19
fr-en 33.5
en-he 18.1
en-tr 14.22
en-id 26.57
sv-en 37.6
pt-en 38.12
en-uk 10.31
en-ko 5.02
ko-en 10.34
en-el 17.19
en-hi 24.51
id-en 21.92
nl-en 16.4
he-en 27.35
en-de 25.66
en-sk 14.81
eo-en 25.39
da-en 31.65
fi-en 19.02
en-hu 14.68
es-en 19.41
hu-en 11.01
de-en 30.31
ja-en 11.36
en-da 29.44
cs-en 18.73
it-en 22.23
ru-en 19.21
en-pt 38.16
uk-en 21.03
sk-en 18.97
en-ga 20.19
en-nl 14.81
en-ja 0.13
en-it 19.44
hi-en 21.75
en-sv 35.63
en-eo 18.77
en-fr 37.09
en-zh 0.07
en-fi 13.74
tr-en 17.7
en-pl 9.09
el-en 13.13
zh-en 11.29

Languages such as ko or zh were not tokenized correctly, so they have wrong BLEU scores.
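For illustration, here is a minimal sacrebleu sketch of the effect (an assumption for demonstration only; the follow-up post used jieba for zh rather than sacrebleu's built-in "zh" tokenizer):

import sacrebleu

refs = ["这是一个非常简单的测试。"]
hyps = ["这是一个简单的测试。"]

# The default "13a" tokenizer only splits on whitespace and ASCII
# punctuation, so an unsegmented Chinese sentence stays one long token
# and a near-miss hypothesis gets almost no n-gram credit.
print(sacrebleu.corpus_bleu(hyps, [refs], tokenize="13a").score)  # near 0

# The built-in "zh" tokenizer segments Chinese into characters, so
# partial overlap is scored sensibly.
print(sacrebleu.corpus_bleu(hyps, [refs], tokenize="zh").score)   # much higher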

With jieba tokenization, the FLORES-200 dataset, and this config for CTranslate2:

# self.translator is a ctranslate2.Translator; source_tokenized is a
# list of token lists, one per source sentence.
output = self.translator.translate_batch(
    source_tokenized,
    replace_unknowns=True,
    max_batch_size=32,
    beam_size=2,
    num_hypotheses=1,
    length_penalty=0.2,
    return_scores=False,
    return_alternatives=False,
    target_prefix=None,
)
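For context, a minimal end-to-end sketch of how a call like this fits into the scoring loop, assuming a SentencePiece model sits next to the CTranslate2 model directory (paths and example sentences are illustrative, not the actual script):

import ctranslate2
import sacrebleu
import sentencepiece as spm

translator = ctranslate2.Translator("model", device="cpu")  # illustrative path
sp = spm.SentencePieceProcessor(model_file="sentencepiece.model")

sources = ["The cat sleeps."]   # FLORES-200 devtest, source side
references = ["Le chat dort."]  # FLORES-200 devtest, reference side

# Tokenize, translate with the config above, then detokenize.
source_tokenized = [sp.encode(s, out_type=str) for s in sources]
results = translator.translate_batch(
    source_tokenized,
    replace_unknowns=True,
    max_batch_size=32,
    beam_size=2,
    num_hypotheses=1,
    length_penalty=0.2,
)
hypotheses = [sp.decode_pieces(r.hypotheses[0]) for r in results]
print(sacrebleu.corpus_bleu(hypotheses, [references]).score)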

en-ru 61.23
en-es 44.03
ca-en 59.65
en-ca 51.58
en-cs 41.87
pl-en 32.58
ga-en 50.27
fr-en 57.68
en-he 62.01
en-tr 43.57
en-id 52.46
sv-en 61.38
pt-en 61.61
en-uk 50.85
en-ko 32.84
ko-en 31.5
en-el 56.57
en-hi 62.86
id-en 47.26
nl-en 41.0
he-en 51.91
en-de 50.06
en-sk 42.61
eo-en 51.42
da-en 56.46
fi-en 43.04
en-hu 45.13
es-en 44.78
hu-en 32.61
de-en 55.6
ja-en 33.73
en-da 54.21
cs-en 43.54
it-en 48.1
ru-en 44.06
en-pt 62.19
uk-en 45.99
sk-en 44.31
en-ga 48.9
en-nl 38.64
en-ja 30.68
en-it 45.28
hi-en 47.15
en-sv 60.38
en-eo 44.95
en-fr 59.26
en-zh 12.0
en-fi 37.67
tr-en 43.21
en-pl 32.02
el-en 35.3
zh-en 33.23

Thanks for doing these tests!

In general I expect the translations to be higher quality for larger, more widely spoken languages. These scores are very good and higher than I was expecting.

en-ru 61.23
en-es 44.03
ca-en 59.65
en-ca 51.58
en-cs 41.87
pl-en 32.58
ga-en 50.27
fr-en 57.68

If there are language pairs with lower BLEU scores, or that users report as poor, I could try retraining those models.

Edit: I fixed the scores; I had initially used the ones with broken tokenization.

Could you share your script to run the tests on all the models? It could be useful.

Yes, the code is very messy, so I will work on a cleaner version and publish it^^
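In the meantime, the rough shape of such a loop, assuming the models are installed as Argos Translate packages (load_flores_sentences is a hypothetical helper, and this is not the actual script):

import argostranslate.translate
import sacrebleu

installed = argostranslate.translate.get_installed_languages()
for src in installed:
    for tgt in installed:
        if src.code == tgt.code:
            continue
        translation = src.get_translation(tgt)
        if translation is None:
            continue  # no installed model for this pair
        # load_flores_sentences: hypothetical helper returning the
        # FLORES-200 devtest source/reference sentences for a pair.
        sources, references = load_flores_sentences(src.code, tgt.code)
        hypotheses = [translation.translate(s) for s in sources]
        bleu = sacrebleu.corpus_bleu(hypotheses, [references])
        print(f"{src.code}-{tgt.code} {bleu.score:.2f}")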

Yes, but use this BLEU score: BLEU score of LibreTranslate models - #3 by Jourdelune; the other score is wrong.

I look forward to this! It would be awesome for testing improvements to the models. I wouldn't worry about making it perfect; it could be helpful for others even as-is.

Any updates, @Jourdelune?

Sorry, I haven't worked on that; I currently have a lot of things to do, so it's not my priority^^. If you want, I can publish the code in April.
