BLEU Scoring, COMET, chrF, chrF++ on Locomotive

BoFFire · June 19, 2025, 3:48pm

Hi,

Every time I get to BLEU scoring, I never get a good score for English to Kabyle when training. So I think BLEU does not support the Kabyle language at all, and that it is not only a matter of Kabyle being a low-resource language.

I trained a small model locally: the BLEU score was very, very bad, but the translations themselves were better. The worse the BLEU score, the better the translation into Kabyle seems to be.

Can anyone explain this to me?
Thank you,

NicoLe · June 21, 2025, 7:58am

Hi,

BLEU is scored against a reference using n-grams. If your reference is not accurate, or even if it is accurate but features a different language register or synonyms, BLEU may swing to a huge extent.
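To make the n-gram point concrete, here is a minimal sketch of a simplified sentence-level BLEU (clipped n-gram precision, geometric mean, brevity penalty, no smoothing). It is not the scorer Locomotive uses; the function names and the example sentences are made up for illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count every n-gram (as a tuple) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(hypothesis, reference, max_n=4):
    """Simplified BLEU on whitespace tokens, scaled to 0-100.

    Illustration only: one reference, no smoothing, so any n-gram order
    with zero matches makes the whole score 0.
    """
    hyp = hypothesis.split()
    ref = reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hyp, n)
        ref_counts = ngrams(ref, n)
        # Clip each hypothesis n-gram count by its count in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        if total == 0 or overlap == 0:
            return 0.0
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: penalize hypotheses shorter than the reference.
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return 100 * bp * math.exp(sum(log_precisions) / max_n)


reference = "the committee approved the new budget on monday"
exact = "the committee approved the new budget on monday"
synonym = "the committee accepted the new budget on monday"

print(round(simple_bleu(exact, reference), 1))
print(round(simple_bleu(synonym, reference), 1))
```

In this toy case, a single acceptable synonym out of eight words takes the score from 100 down to roughly 59, because the changed word breaks every bigram, trigram and 4-gram that contains it.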

When evaluating against the flores200 devtest dataset, those swings are particularly unpredictable, since this dataset features specific terminology.

They are more explainable when scoring against the dev dataset, which features more straightforward terminology. As I have noticed personally (and confirmed with several language experts on Italian, Turkish, Hindi and Swahili), BLEUdev correlates well with contextually accurate vocabulary when translating news and routine documents.

On DE-EN, BLEUdevtest can swing between the 15-18 and 40-45 ranges without any noticeable quality change in the translations. On FR-EN, I even noticed reverse swings between 47 and 60 on flores200 devtest, and 71 and 47 on some translation memories I use for further evaluation of professional terminology (and our translators are top-level, so I’d rather trust their translations than Meta’s).

Needless to say, I have almost completely discarded BLEU scores on devtest as a valuable metric for my project. Instead, I use a composite of BLEUdev/COMETdev, and COMETdevtest.

I sometimes prioritize BLEU on the translation memories, but only provided the COMET scores are similar. I routinely check translations using COMET-compare, so if two translations of a memory show a 10% difference in BLEU, but COMET-compare on that memory tells me that 11% of the translation is better with the one having the worse BLEU, I go with that one: I’d rather have the translators get 10% alternative translations to what they usually write than have them correct 11% more syntax or grammar mistakes, which they’ll have a harder time finding.

Also, BLEU does not work on some scripts: ideographic and syllabic characters give a consistent 0 score (if you get something different on Chinese, for instance, that means you have some foreign text within the translation). If you use the “Tfng” (Tifinagh) script for Kabyle, that may explain your BLEU inconsistency.
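One plausible mechanism for the zero scores on scripts written without spaces is sketched below, under the assumption that the scorer splits on whitespace: each sentence then becomes a single token, so nothing matches unless the two sentences are identical. Whether this is what happens in a given Locomotive run depends on the tokenizer in use, which is not stated in this thread. The helper names and the Chinese sentences are illustrative only.

```python
from collections import Counter


def whitespace_tokens(text):
    """Tokenize on whitespace, as a word-based scorer would."""
    return text.split()


def char_tokens(text):
    """Tokenize into individual non-space characters."""
    return [ch for ch in text if not ch.isspace()]


def unigram_overlap(hyp_tokens, ref_tokens):
    """Number of hypothesis tokens also found in the reference (clipped)."""
    ref_counts = Counter(ref_tokens)
    return sum(min(c, ref_counts[t]) for t, c in Counter(hyp_tokens).items())


reference = "我喜欢学习语言"
hypothesis = "我喜欢学习外语"

# Whitespace tokenization: one token per sentence, and the two differ.
ws_overlap = unigram_overlap(whitespace_tokens(hypothesis),
                             whitespace_tokens(reference))
# Character tokenization: most characters are shared.
ch_overlap = unigram_overlap(char_tokens(hypothesis), char_tokens(reference))

print(len(whitespace_tokens(reference)), ws_overlap)
print(len(char_tokens(reference)), ch_overlap)
```

Scorers such as sacreBLEU offer script-aware tokenizers (for example `zh` and `char`) for this reason; with one of them, the same pair of sentences would no longer score zero.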

BoFFire · August 19, 2025, 11:32am

Thank you @NicoLe

I’m wondering whether it is possible to keep Flores200 for legacy comparison while injecting, for example, a 10% SMOL split as validation via eval.py on Locomotive.
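A 10% held-out split of a parallel corpus could be sketched as follows. This is a generic, standalone sketch that does not touch Locomotive's eval.py; the function name, the seed and the toy sentences are all hypothetical.

```python
import random


def split_parallel(src_lines, tgt_lines, held_out=0.10, seed=42):
    """Split aligned source/target lines into (validation, remainder).

    The same shuffled indices are used for both sides, so each pair
    stays aligned. A fixed seed keeps the split reproducible.
    """
    if len(src_lines) != len(tgt_lines):
        raise ValueError("source and target must have the same number of lines")
    indices = list(range(len(src_lines)))
    random.Random(seed).shuffle(indices)
    n_val = max(1, round(len(indices) * held_out))
    val_idx = set(indices[:n_val])
    validation = [(src_lines[i], tgt_lines[i]) for i in sorted(val_idx)]
    remainder = [(src_lines[i], tgt_lines[i])
                 for i in range(len(src_lines)) if i not in val_idx]
    return validation, remainder


# Toy aligned corpus: placeholders, not real SMOL data.
english = [f"english sentence {i}" for i in range(50)]
kabyle = [f"kabyle sentence {i}" for i in range(50)]

validation, remainder = split_parallel(english, kabyle)
print(len(validation), len(remainder))
```

Keeping the seed fixed matters here: if the held-out 10% changes between runs, scores from different trainings are no longer comparable.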

I think the only thing I have to do is map ber-Latn to the standard kab code we use. I don’t know whether that is a good idea or not. I checked the SMOL English-Kabyle translations and they seem okay, maybe better than some of FLORES. Maybe this can push the BLEU score up?

Later we will try to review all the SMOL translations to make them better for eval.
What do you think?

NicoLe · August 20, 2025, 6:41pm

Usually, you should keep evaluation datasets as is, whatever the inconsistencies.

Remember that Kabyle has several dialectal variations; maybe that explains some of the differences you’ve noticed.

I have published a lot of additional code for Locomotive this year, and Piero hasn’t had the time to integrate it, so I advise you to go to github.com for it (in the “pull requests” tab).

With one of those features, you can build a “smol” directory alongside dev and devtest in utils/flores200… and save the English and Kabyle versions of smol under names that follow the flores200 file naming. Then evaluate with the argument “--dataset smol” and get BLEU and COMET scores.
Afterwards, you can always design a composite score that you compute yourself; that’s how many people (myself included) benchmark.
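A composite of that kind could look like the sketch below. The exact formula used in this thread is not given, so the weights, the function name and the assumption that COMET scores fall roughly in the 0-1 range (while BLEU is on 0-100) are all illustrative choices, not a recommended recipe.

```python
def composite_score(bleu_dev, comet_dev, comet_devtest,
                    weights=(0.2, 0.4, 0.4)):
    """Weighted average of BLEU (rescaled to 0-1) and two COMET scores.

    Assumptions: BLEU is on a 0-100 scale, COMET scores are roughly 0-1,
    and the default weights are arbitrary placeholders.
    """
    if not 0 <= bleu_dev <= 100:
        raise ValueError("bleu_dev is expected on a 0-100 scale")
    if sum(weights) <= 0:
        raise ValueError("weights must sum to a positive value")
    parts = (bleu_dev / 100.0, comet_dev, comet_devtest)
    return sum(w * p for w, p in zip(weights, parts)) / sum(weights)


# Made-up scores for two hypothetical models.
model_a = composite_score(bleu_dev=30.0, comet_dev=0.80, comet_devtest=0.78)
model_b = composite_score(bleu_dev=18.0, comet_dev=0.84, comet_devtest=0.82)
print(round(model_a, 3), round(model_b, 3))
```

With these placeholder weights, the model with the lower BLEU still ranks higher, which mirrors the idea of not letting BLEU alone decide.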

To map language codes between your configuration and flores200, you may use the list I updated in the data.py file, editing it locally if needed.

Rewriting the list took some time: I’ve been very thoroughly mapping the opus.nlpl.eu reference codes and language names, the sources and the flores200 files, looking for any discrepancies (check the comments). So you may edit locally for the evaluation, but try to follow the existing codes as much as possible when publishing models, otherwise the community will not be able to work out what you did.