BLEU score of LibreTranslate models

BLEU score (FLORES-200)

en-ru 21.29
en-es 18.14
ca-en 35.45
en-ca 27.11
en-cs 13.6
pl-en 11.22
ga-en 25.19
fr-en 33.5
en-he 18.1
en-tr 14.22
en-id 26.57
sv-en 37.6
pt-en 38.12
en-uk 10.31
en-ko 5.02
ko-en 10.34
en-el 17.19
en-hi 24.51
id-en 21.92
nl-en 16.4
he-en 27.35
en-de 25.66
en-sk 14.81
eo-en 25.39
da-en 31.65
fi-en 19.02
en-hu 14.68
es-en 19.41
hu-en 11.01
de-en 30.31
ja-en 11.36
en-da 29.44
cs-en 18.73
it-en 22.23
ru-en 19.21
en-pt 38.16
uk-en 21.03
sk-en 18.97
en-ga 20.19
en-nl 14.81
en-ja 0.13
en-it 19.44
hi-en 21.75
en-sv 35.63
en-eo 18.77
en-fr 37.09
en-zh 0.07
en-fi 13.74
tr-en 17.7
en-pl 9.09
el-en 13.13
zh-en 11.29
2 Likes

Languages such as ko or zh were not tokenized correctly, so they have wrong BLEU scores.

1 Like

With jieba tokenization, the FLORES-200 dataset, and this config for CTranslate2 (a sketch of the jieba segmentation step is shown after the scores below):

output = self.translator.translate_batch(
    source_tokenized,
    replace_unknowns=True,
    max_batch_size=32,
    beam_size=2,
    num_hypotheses=1,
    length_penalty=0.2,
    return_scores=False,
    return_alternatives=False,
    target_prefix=None
)

en-ru 61.23
en-es 44.03
ca-en 59.65
en-ca 51.58
en-cs 41.87
pl-en 32.58
ga-en 50.27
fr-en 57.68
en-he 62.01
en-tr 43.57
en-id 52.46
sv-en 61.38
pt-en 61.61
en-uk 50.85
en-ko 32.84
ko-en 31.5
en-el 56.57
en-hi 62.86
id-en 47.26
nl-en 41.0
he-en 51.91
en-de 50.06
en-sk 42.61
eo-en 51.42
da-en 56.46
fi-en 43.04
en-hu 45.13
es-en 44.78
hu-en 32.61
de-en 55.6
ja-en 33.73
en-da 54.21
cs-en 43.54
it-en 48.1
ru-en 44.06
en-pt 62.19
uk-en 45.99
sk-en 44.31
en-ga 48.9
en-nl 38.64
en-ja 30.68
en-it 45.28
hi-en 47.15
en-sv 60.38
en-eo 44.95
en-fr 59.26
en-zh 12.0
en-fi 37.67
tr-en 43.21
en-pl 32.02
el-en 35.3
zh-en 33.23
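
For anyone unfamiliar with the jieba step, here is a minimal sketch of how segmentation could be applied before scoring; the example sentences, variable names, and the use of sacrebleu with tokenize="none" are illustrative assumptions, not the exact evaluation code behind the numbers above.

import jieba
from sacrebleu import corpus_bleu

# made-up example data: Chinese system outputs and their single references
hypotheses = ["今天天气很好。"]
references = ["今天的天气很好。"]

# segment with jieba and re-join with spaces so BLEU works on word-level tokens
hyp_tok = [" ".join(jieba.cut(h)) for h in hypotheses]
ref_tok = [" ".join(jieba.cut(r)) for r in references]

# tokenize="none" because the text has already been segmented above
print(corpus_bleu(hyp_tok, [ref_tok], tokenize="none").score)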
2 Likes

Thanks for doing these tests!

In general I expect the translations to be of higher quality for larger, more widely spoken languages. These are very good and higher than I was expecting.

en-ru 61.23
en-es 44.03
ca-en 59.65
en-ca 51.58
en-cs 41.87
pl-en 32.58
ga-en 50.27
fr-en 57.68

If there are language pairs with lower BLEU scores, or that users report as poor, I could try retraining those models.

Edit: I fixed the scores; I had initially used the ones with broken tokenization.

1 Like

Could you share your script to run the tests of all the models? It could be useful

1 Like

Yes, the code is quite messy, so I will work on a cleaner version and publish it ^^

2 Likes

Yes, but use this BLEU score: Bleu score of libre translate models - #3 by Jourdelune; the other scores are wrong.

3 Likes

I look forward to this! It would be awesome for testing improvements to the models. I wouldn’t worry about making it perfect; it could be helpful for others even as-is.

2 Likes

Any updates, @Jourdelune?

Sorry, I haven’t worked on that; I currently have a lot of things to do, so it’s not my priority ^^. If you want, I can publish the code in April.

2 Likes

For the fr-en model I found a score of 36.1 with the WMT15 dataset.
I didn’t manage to run sacrebleu with FLORES-200; can you share your script @Jourdelune?
I will try to run tests for all models to check.

1 Like

I can try to find some results for this tomorrow. The BLEU scores above seem weirdly high, which I suspect is due to the tokenization used. I use the “flores200” SentencePiece tokenizer (you tell sacrebleu’s corpus_bleu which tokenizer to use), which supports all FLORES languages.

Jieba tokenization is meant specifically for Chinese text segmentation; you can also see some weirdness in how en-ja is 0.09 BLEU (that would mean there are only one or two decently translated sentences in a corpus of 1000 sentences).

The tokenizer you use really affects how your output BLEU is scaled, so be careful which one you use.
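
As a small, made-up illustration of that point, sacrebleu lets you pick the tokenizer per call, and the same hypothesis/reference pair can come out with noticeably different scores depending on it:

from sacrebleu import corpus_bleu

# made-up sentence pair; a real run would use the full FLORES-200 dev sets
hypotheses = ["The cat sat on the mat."]
references = ["The cat is sitting on the mat."]

# sacrebleu takes the references as a list of reference streams,
# so a single reference per segment is passed as [references]
for tok in ("13a", "flores200"):
    score = corpus_bleu(hypotheses, [references], tokenize=tok)
    print(tok, round(score.score, 2))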

3 Likes

Results and script below; I used CTranslate2 to implement batching and speed up the process significantly.

{
   "de-en": 60.8226,
   "en-de": 53.13953,
   "en-ar": 1.02682,
   "en-es": 39.97683,
   "ar-en": 50.60414,
   "en-hi": 68.08493,
   "en-ga": 65.7327,
   "en-fr": 41.41509,
   "en-hu": 49.62286,
   "en-fi": 37.09572,
   "en-ja": 44.10758,
   "en-id": 36.87964,
   "en-it": 44.2624,
   "en-ko": 19.88853,
   "en-nl": 27.15469,
   "en-pt": 71.85321,
   "en-pl": 23.35478,
   "en-sv": 67.25956,
   "en-uk": 38.07533,
   "es-en": 36.0669,
   "fi-en": 33.93363,
   "fr-en": 20.05843,
   "hi-en": 61.38313,
   "id-en": 33.37866,
   "ga-en": 59.74705,
   "it-en": 49.87469,
   "hu-en": 26.40585,
   "pt-en": 89.59558,
   "pl-en": 27.50059,
   "ko-en": 37.24427,
   "ja-en": 35.97562,
   "sv-en": 56.56506,
   "ru-en": 50.61911,
   "nl-en": 27.47269,
   "ca-en": 42.22077,
   "cs-en": 52.70498,
   "da-en": 35.89213,
   "el-en": 38.56146,
   "az-en": 9.62307,
   "en-cs": 58.30419,
   "en-az": 5.73751,
   "en-ca": 63.21218,
   "en-da": 47.056,
   "en-eo": 26.75908,
   "en-el": 44.01217,
   "en-fa": 39.42022,
   "en-he": 47.65576,
   "en-ru": 50.98799,
   "en-sk": 43.0692,
   "en-zh": 50.45857,
   "en-tr": 18.698,
   "en-th": 35.72085,
   "eo-en": 24.34548,
   "he-en": 40.87901,
   "sk-en": 32.57252,
   "th-en": 22.66277,
   "fa-en": 40.90661
}
import time
import os
from argostranslate import package as packageManager
from sacrebleu import corpus_bleu
import sentencepiece
import ctranslate2
import threading
floresLoc = "E:\\TranslationData\\flores200_dataset\\dev\\"  # path to the downloaded FLORES-200 dev files

nllb_langs = {
    "af":"afr_Latn",
    "ak":"aka_Latn",
    "am":"amh_Ethi",
    "ar":"arb_Arab",
    "as":"asm_Beng",
    "ay":"ayr_Latn",
    "az":"azj_Latn",
    "bm":"bam_Latn",
    "be":"bel_Cyrl",
    "bn":"ben_Beng",
    "bho":"bho_Deva",
    "bs":"bos_Latn",
    "bg":"bul_Cyrl",
    "ca":"cat_Latn",
    "ceb":"ceb_Latn",
    "cs":"ces_Latn",
    "ckb":"ckb_Arab",
    "tt":"crh_Latn",
    "cy":"cym_Latn",
    "da":"dan_Latn",
    "de":"deu_Latn",
    "el":"ell_Grek",
    "en":"eng_Latn",
    "eo":"epo_Latn",
    "et":"est_Latn",
    "eu":"eus_Latn",
    "ee":"ewe_Latn",
    "fa":"pes_Arab",
    "fi":"fin_Latn",
    "fr":"fra_Latn",
    "gd":"gla_Latn",
    "ga":"gle_Latn",
    "gl":"glg_Latn",
    "gn":"grn_Latn",
    "gu":"guj_Gujr",
    "ht":"hat_Latn",
    "ha":"hau_Latn",
    "he":"heb_Hebr",
    "hi":"hin_Deva",
    "hr":"hrv_Latn",
    "hu":"hun_Latn",
    "hy":"hye_Armn",
    "nl":"nld_Latn",
    "ig":"ibo_Latn",
    "ilo":"ilo_Latn",
    "id":"ind_Latn",
    "is":"isl_Latn",
    "it":"ita_Latn",
    "jv":"jav_Latn",
    "ja":"jpn_Jpan",
    "kn":"kan_Knda",
    "ka":"kat_Geor",
    "kk":"kaz_Cyrl",
    "km":"khm_Khmr",
    "rw":"kin_Latn",
    "ko":"kor_Hang",
    "ku":"kmr_Latn",
    "lo":"lao_Laoo",
    "lv":"lvs_Latn",
    "ln":"lin_Latn",
    "lt":"lit_Latn",
    "lb":"ltz_Latn",
    "lg":"lug_Latn",
    "lus":"lus_Latn",
    "mai":"mai_Deva",
    "ml":"mal_Mlym",
    "mr":"mar_Deva",
    "mk":"mkd_Cyrl",
    "mg":"plt_Latn",
    "mt":"mlt_Latn",
    "mni-Mtei":"mni_Beng",
    "mni":"mni_Beng",
    "mn":"khk_Cyrl",
    "mi":"mri_Latn",
    "ms":"zsm_Latn",
    "my":"mya_Mymr",
    "no":"nno_Latn",
    "ne":"npi_Deva",
    "ny":"nya_Latn",
    "om":"gaz_Latn",
    "or":"ory_Orya",
    "pl":"pol_Latn",
    "pt":"por_Latn",
    "ps":"pbt_Arab",
    "qu":"quy_Latn",
    "ro":"ron_Latn",
    "ru":"rus_Cyrl",
    "sa":"san_Deva",
    "si":"sin_Sinh",
    "sk":"slk_Latn",
    "sl":"slv_Latn",
    "sm":"smo_Latn",
    "sn":"sna_Latn",
    "sd":"snd_Arab",
    "so":"som_Latn",
    "es":"spa_Latn",
    "sq":"als_Latn",
    "sr":"srp_Cyrl",
    "su":"sun_Latn",
    "sv":"swe_Latn",
    "sw":"swh_Latn",
    "ta":"tam_Taml",
    "te":"tel_Telu",
    "tg":"tgk_Cyrl",
    "tl":"tgl_Latn",
    "th":"tha_Thai",
    "ti":"tir_Ethi",
    "ts":"tso_Latn",
    "tk":"tuk_Latn",
    "tr":"tur_Latn",
    "ug":"uig_Arab",
    "uk":"ukr_Cyrl",
    "ur":"urd_Arab",
    "uz":"uzn_Latn",
    "vi":"vie_Latn",
    "xh":"xho_Latn",
    "yi":"ydd_Hebr",
    "yo":"yor_Latn",
    "zh-CN":"zho_Hans",
    "zh":"zho_Hans",
    "zh-TW":"zho_Hant",
    "zu":"zul_Latn",
    "pa":"pan_Guru"
}
bleu_scores = {}

def returnTranslator(file_loc) -> dict:
    # load the CTranslate2 model and SentencePiece tokenizer from an installed Argos package
    model = ctranslate2.Translator(f"{file_loc}/model", device="cuda", compute_type="auto")
    tokenizer = sentencepiece.SentencePieceProcessor(
        f"{file_loc}/sentencepiece.model"
    )

    return {"model": model, "tokenizer": tokenizer}

def encode(text, tokenizer: sentencepiece.SentencePieceProcessor):
    return tokenizer.Encode(text, out_type=str)

def decode(tokens, tokenizer: sentencepiece.SentencePieceProcessor):
    return tokenizer.Decode(tokens)

def processFlores(pkg):
    # translate the FLORES-200 dev set for this package and record its BLEU score
    data = returnTranslator(pkg.package_path)

    # FLORES-200 dev files for the package's source and target languages
    src_path = floresLoc + nllb_langs[pkg.from_code] + ".dev"
    tgt_path = floresLoc + nllb_langs[pkg.to_code] + ".dev"

    src_text = [line.rstrip('\n') for line in open(src_path, encoding="utf-8")]
    tgt_text = [line.rstrip('\n') for line in open(tgt_path, encoding="utf-8")]

    # translate the whole dev set in one batched call
    translation_obj = data["model"].translate_batch(
        encode(src_text, data["tokenizer"]),
        beam_size=2,
        return_scores=False,  # speed up
    )

    translated_text = [
        decode(tokens.hypotheses[0], data["tokenizer"])
        for tokens in translation_obj
    ]

    # score with sacrebleu using the flores200 tokenizer
    bleu_scores[f"{pkg.from_code}-{pkg.to_code}"] = round(corpus_bleu(
        translated_text, [[x] for x in tgt_text], tokenize="flores200"
    ).score, 5)

    print(f"{pkg.from_code}-{pkg.to_code}: {bleu_scores[f'{pkg.from_code}-{pkg.to_code}']}")

promises = []
for package in packageManager.get_installed_packages():
    # one worker thread per installed language-pair package
    THREAD = threading.Thread(target=processFlores, args=[package])
    promises.append(THREAD)

for x in promises:
    # throttle to at most 5 packages being evaluated at once
    while sum(1 for t in promises if t.is_alive()) >= 5:
        time.sleep(1)
    x.start()

# wait for every thread to finish before writing the results
for x in promises:
    x.join()

# write dict as json to file
import json

with open("bleu_scores.json", "w") as outfile:
    outfile.write(json.dumps(bleu_scores, indent=4))
3 Likes

Just noticed that beam_size in Argos Translate is 4, whereas this script used 2. I wonder if that influenced the results.

Edit:

Using beam_size 4 I got 44.2624 for English → Italian (same as the posted results), so it doesn’t seem to.
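
For anyone who wants to repeat this check on another pair, here is a rough sketch of how the beam size could be varied directly through CTranslate2; the model directory and sample sentence are placeholders, not the exact setup used above.

import ctranslate2
import sentencepiece

# hypothetical path to an installed Argos package directory
model_dir = "path/to/translate-en_it"

translator = ctranslate2.Translator(f"{model_dir}/model", device="cpu", compute_type="auto")
sp = sentencepiece.SentencePieceProcessor(f"{model_dir}/sentencepiece.model")

tokens = sp.Encode("Hello, how are you?", out_type=str)

# translate the same input with beam sizes 2 and 4 and compare the outputs
for beam in (2, 4):
    result = translator.translate_batch([tokens], beam_size=beam)
    print(beam, sp.Decode(result[0].hypotheses[0]))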

2 Likes