Thought this was a good summary: 12 Critical Flaws of BLEU - Benjamin Marie's Blog
I’ve found BLEU scores difficult to use for comparing machine translation systems because scores computed under different BLEU benchmark setups are often inconsistent with each other. With a single benchmark methodology and dataset you might get a useful comparison, but even then BLEU scores only loosely correlate with ratings from human translators.
For training the Argos Translate models I haven’t used BLEU scores extensively. Instead I’ve focused on selecting high-quality data and evaluating the trained models manually. Since English is the only language I speak well, I test the x->en models by translating a Wikipedia article from language x to English, and I test the en->x models by translating English text to language x and then translating back to English with my own models or Google Translate.
In my experience, when model training goes poorly or overfits somehow, it’s normally pretty obvious even if you don’t speak the language. For example, the model will always return an empty string as the translation, or it will always return the source text unchanged.
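Those degenerate failure modes are easy to screen for automatically. Here's a rough sketch of such a check; the `translate` callable and its signature are placeholders for whatever system you're testing, not the Argos Translate API:

```python
def looks_degenerate(source_texts, translate):
    """Flag the obvious failure modes: the model returns empty or
    constant output, or just copies its input. `translate` is any
    callable mapping source text -> translated text."""
    outputs = [translate(text).strip() for text in source_texts]
    if all(out == "" for out in outputs):
        return "always returns an empty translation"
    if all(out == src for out, src in zip(outputs, source_texts)):
        return "always copies the source text"
    if len(outputs) > 1 and len(set(outputs)) == 1:
        return "always returns the same output"
    return None

# Usage with a broken "model" that copies its input:
print(looks_degenerate(["hola", "adiós"], lambda s: s))
# always copies the source text
```

A check like this obviously can't tell you a translation is *good*, but it catches the total failures without needing a speaker of the target language.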
I’ve also been finding that the reference translations in benchmark datasets like flores200 are sometimes not great, because of missing context or sometimes just translator preference. And as the article points out, the smallest errors can cause the biggest mistakes: one translation can deviate significantly in wording from the reference and still be completely correct, while another can closely match the reference except for one important word that totally derails the quality.
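That last point is easy to demonstrate with a minimal, unsmoothed sentence-level BLEU-4 (my own toy sketch, not how flores200 evaluations or sacrebleu compute it): a faithful paraphrase scores zero, while a near-copy with one meaning-reversing word scores high.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams of a token list, as tuples."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypothesis, reference, max_n=4):
    """Sentence-level BLEU: geometric mean of clipped n-gram precisions
    times a brevity penalty. Unsmoothed, so a single zero precision
    sends the whole score to zero."""
    hyp, ref = hypothesis.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        overlap = sum((hyp_counts & ref_counts).values())  # clipped matches
        precisions.append(overlap / max(sum(hyp_counts.values()), 1))
    if min(precisions) == 0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return bp * geo_mean

reference  = "the ceasefire agreement was signed yesterday"
paraphrase = "a truce deal was concluded the day before"      # correct meaning
near_copy  = "the ceasefire agreement was rejected yesterday" # meaning reversed

print(round(bleu(paraphrase, reference), 2))  # 0.0
print(round(bleu(near_copy, reference), 2))   # 0.54
```

The paraphrase shares no bigrams with the reference, so its score collapses to zero, while the near-copy keeps almost all its n-grams intact and scores 0.54 despite saying the opposite of the reference.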