Hi,
BLEU is scored against a reference using n-gram overlap. If your reference is inaccurate, or if it is accurate but uses a different register or different synonyms, BLEU can swing widely.
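To make the synonym effect concrete, here is a minimal sketch of the clipped n-gram precision that BLEU is built on. It is not a full BLEU implementation (no brevity penalty, no geometric mean over n), and the function names and example sentences are made up for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(hypothesis, reference, n):
    """Clipped n-gram precision of a hypothesis against one reference.

    Both inputs are plain strings split on whitespace (an assumption made
    for this sketch; real tools apply their own tokenizers).
    """
    hyp_counts = ngrams(hypothesis.split(), n)
    ref_counts = ngrams(reference.split(), n)
    total = sum(hyp_counts.values())
    if total == 0:
        return 0.0
    # Each hypothesis n-gram is credited at most as often as it occurs
    # in the reference.
    overlap = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    return overlap / total


# Illustrative sentences only, not taken from any dataset.
reference = "the physician prescribed a new medication"
close = "the physician prescribed a new drug"
paraphrase = "the doctor gave a new medicine"

for name, hyp in (("close", close), ("paraphrase", paraphrase)):
    p1 = modified_precision(hyp, reference, 1)
    p2 = modified_precision(hyp, reference, 2)
    print(f"{name}: unigram precision {p1:.2f}, bigram precision {p2:.2f}")
```

Both hypotheses could be acceptable translations, yet the paraphrase keeps only half of the unigrams and one bigram in five, which is the kind of drop that a reference with different word choices produces.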
When evaluating against the flores200 devtest dataset, those swings are particularly unpredictable, since this dataset features specialized terminology.
They are easier to explain when scoring against the dev dataset, which features more straightforward terminology. As I have noticed personally (and confirmed with several language experts for Italian, Turkish, Hindi and Swahili), BLEUdev correlates well with contextually accurate word choices when translating news and routine documents.
On DE-EN, BLEUdevtest can swing between the 15-18 and 40-45 ranges without any noticeable change in translation quality. On FR-EN, I even noticed reverse swings: between 47 and 60 on flores200 devtest, against 71 and 47 on some translation memories I use for further evaluation of professional terminology (and our translators are top-level, so I’d rather trust their translations than Meta’s).
Needless to say, I have almost completely discarded BLEU on devtest as a useful metric for my project. I use a composite of BLEUdev/COMETdev, and COMETdevtest.
I sometimes prioritize BLEU on the memories, but only provided the COMET scores are similar. I routinely check translations using comet-compare: if two translations of a memory show a 10% difference in BLEU, but comet-compare on that memory tells me that 11% of the translation is better with the one that has the worse BLEU, I go with the latter. I’d rather have the translators get 10% alternative wording to what they usually write than have them correct 11% more syntax or grammar mistakes, which they’ll have a harder time finding.
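One possible reading of that rule, as a rough sketch: compare the relative BLEU gap with the share of segments on which the lower-BLEU system gets the better segment-level COMET score. The helper names (`win_fraction`, `pick_system`), the threshold logic and the scores are all hypothetical. They are not the output or the API of comet-compare, only a way to spell out the trade-off.

```python
def win_fraction(scores_base, scores_other):
    """Share of segments where `scores_other` beats `scores_base`."""
    if len(scores_base) != len(scores_other):
        raise ValueError("both systems must be scored on the same segments")
    if not scores_base:
        return 0.0
    wins = sum(1 for base, other in zip(scores_base, scores_other) if other > base)
    return wins / len(scores_base)


def pick_system(bleu_a, bleu_b, comet_a, comet_b):
    """Return "A" or "B" following the rule of thumb sketched above.

    The higher-BLEU system wins by default. The lower-BLEU system is
    preferred when it is better on a share of segments at least as large
    as the relative BLEU gap (an assumed interpretation of the rule).
    """
    if bleu_a >= bleu_b:
        high, low = "A", "B"
        low_wins = win_fraction(comet_a, comet_b)
    else:
        high, low = "B", "A"
        low_wins = win_fraction(comet_b, comet_a)
    top = max(bleu_a, bleu_b)
    bleu_gap = abs(bleu_a - bleu_b) / top if top > 0 else 0.0
    if low_wins > 0 and low_wins >= bleu_gap:
        return low
    return high


# Made-up numbers for illustration: B has 10% lower BLEU than A,
# but scores higher on 2 of the 10 segments.
comet_a = [0.80, 0.82, 0.79, 0.85, 0.81, 0.78, 0.83, 0.80, 0.84, 0.82]
comet_b = [0.80, 0.82, 0.86, 0.85, 0.81, 0.78, 0.83, 0.88, 0.84, 0.82]
print(pick_system(50.0, 45.0, comet_a, comet_b))
```

A stricter variant would count only wins above some margin, or net wins minus losses. The point is only that the BLEU gap and the segment-level comparison are weighed against each other instead of BLEU deciding alone.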
Also, BLEU does not work on some scripts: ideographic and syllabic characters give a consistent score of 0 (if you get something different on Chinese, for instance, that means you have some foreign text within the translation). If you use the “Tnfq” script for Kabyle, that may explain your BLEU inconsistency.
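A likely cause of those zero scores, assuming the scorer splits on whitespace, is that an unsegmented sentence becomes a single token, so no n-gram can match unless the whole sentence is identical. The sketch below contrasts whitespace tokens with character tokens. The helper names and the two Chinese sentences are illustrative assumptions, not part of any scoring tool.

```python
from collections import Counter


def clipped_precision(hyp_tokens, ref_tokens, n):
    """Clipped n-gram precision over two token lists."""
    hyp = Counter(tuple(hyp_tokens[i:i + n]) for i in range(len(hyp_tokens) - n + 1))
    ref = Counter(tuple(ref_tokens[i:i + n]) for i in range(len(ref_tokens) - n + 1))
    total = sum(hyp.values())
    if total == 0:
        return 0.0
    return sum(min(count, ref[gram]) for gram, count in hyp.items()) / total


def whitespace_tokens(text):
    """Split on whitespace only, as a naive scorer would."""
    return text.split()


def char_tokens(text):
    """Treat every non-space character as its own token."""
    return [ch for ch in text if not ch.isspace()]


# Illustrative pair: the sentences differ in two characters out of seven.
reference = "我喜欢机器翻译"
hypothesis = "我喜欢自动翻译"

for name, tokenize in (("whitespace", whitespace_tokens), ("character", char_tokens)):
    ref_tokens = tokenize(reference)
    hyp_tokens = tokenize(hypothesis)
    p1 = clipped_precision(hyp_tokens, ref_tokens, 1)
    p2 = clipped_precision(hyp_tokens, ref_tokens, 2)
    print(f"{name}: {len(hyp_tokens)} token(s), unigram {p1:.2f}, bigram {p2:.2f}")
```

If you score with sacreBLEU, its `tokenize` option (for example `zh` or `char`) is worth checking before concluding that BLEU cannot be used for a given script.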