Good afternoon,
I should publish a list of the corpora I’ve tried at some point, but here is what I can tell you so far:
0/ Before using TED2020 and QED, please review their terms of use: QED is for research only, and TED2020 has a very specific fair-use license.
1/ Your data is unfiltered and contains a lot of garbage.
Try the following filters (I also filter the wiki corpora, but that is a two-step process with langdetect, see the sketch after the config below; to start with, the “char_length” filter will do part of that job):
"filters": [
"duplicates",
{"source_target_ratio": {"min": 0.5, "max": 2}},
{"nonalphanum_ratio": {"max": 0.25}},
{"digits_ratio": {"max": 0.25}}
],
"sources": [
"opus://XLEnt",
"opus://wikimedia",
{
"source": "opus://GlobalVoices",
"filters": [
{"char_length": {"min": 20, "max":500}}
]
},
{
"source": "opus://NeuLab-TedTalks",
"filters": [
{"contains":
{"words": ["(Applause)", "(Laughter)", "(Piano)", "(Beep)", "Thanks."
]}
}
]
},
{
"source": "opus://QED",
"filters": [
{"contains":
{"words": ["(Applause)", "(Laughter)", "(Piano)", "(Beep)", "Thanks."
]}
},
{"char_length": {"min": 20, "max":500}}
]
},
{
"source": "opus://TED2020",
"filters": [
{"contains":
{"words": ["(Applause)", "(Laughter)", "(Piano)", "(Beep)", "Thanks."
]}
},
{"char_length": {"min": 20, "max":500}}
]
},
"opus://WikiMatrix",
2/ Some of the corpora you mention are basically useless as is (OpenSubtitles), but you can use the “dic” file (downloaded manually from the OPUS website) as a dictionary, with a little script that takes columns 2 and 3 (the first two columns are an index and probabilities, so they cannot be used for training with the current tools):
# Import necessary libraries
import os

# Define the directory and file name
directory = os.path.dirname(__file__)
dictionary_file = "en-fa.dic"

# Define the punctuation marks to be used (not needed for the extraction below)
punctuation_marks = ["!", ";", ".", "?", " "]

def create_dic():
    # Column 2 holds the English entry and column 3 the Farsi entry;
    # columns 0 and 1 (index and probabilities) are skipped.
    with (open(os.path.join(directory, dictionary_file), 'r', encoding="utf-8") as source_file,
          open(os.path.join(directory, 'Open_Subtitles_dic.en'), 'w', encoding="utf-8") as two_file,
          open(os.path.join(directory, 'Open_Subtitles_dic.fa'), 'w', encoding="utf-8") as three_file):
        for line in source_file:
            columns = line.strip().split('\t')
            if len(columns) < 4:
                # Skip empty or malformed lines
                continue
            two_file.write(columns[2] + '\n')
            three_file.write(columns[3] + '\n')

create_dic()
3/ You have too little data (6M sentences of subtitles will only help you with more of the same): under 10M sentences, and with thematic corpora only, your model cannot generalize. That’s what NLLB is there for: use the whole of it and see what comes out.
    {
        "source": "opus://CCMatrix",
        "filters": [
            {"char_length": {"min": 20, "max": 500}}
        ]
    }
],
4/ Then, Farsi grammar and world representation are quite different from English: you want a bigger model to encode them correctly, a larger feed_forward to preserve structure throughout the data processing, and a slower gradient descent to keep training on track:
"vocab_size": 32000,
"save_checkpoint_steps": 500,
"valid_steps": 500,
"train_steps": 40000,
"batch_size": 8192,
"accum_count": 25,
"enc_layers": 18,
"dec_layers": 6,
"transformer_ff": 3072,
"heads": 8,
"keep_checkpoint": 16,
"pos_ffn_activation_fn": "gated-gelu",
"position_encoding": "False",
"max_relative_positions": 20
For the “gated-gelu” parameter you have to tweak a dependency (look for the post on German and Russian), so you may prefer this instead: it works almost as well with no under-the-hood hassle:
"transformer_ff": 4096,
"max_relative_positions": -1
5/ In case of memory issues with these parameters, reduce batch_size to 4096 and compensate by raising accum_count to 50. The 40k steps of this training are equivalent to 125k+ steps of the vanilla configuration. If you experience gradient explosion (“nan” values), lower “warmup_steps” to 5000 and it should stabilize training. I use the basic “warmup_steps” myself, but I have developed extra tools that refine the data and let me sweep through it like a snowplough, then go back to the best gradient minimum I encountered on the way and plant the flag.
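To see where those numbers come from, here is the back-of-the-envelope arithmetic (a quick sketch; the vanilla accum_count of 8 below is my assumption for the baseline, so plug in whatever your reference config actually uses):

# Effective batch = batch_size tokens accumulated over accum_count micro-batches
# per optimizer update.
batch_size, accum_count = 8192, 25          # settings from point 4
vanilla_batch, vanilla_accum = 8192, 8      # assumed vanilla baseline, adjust to yours

effective = batch_size * accum_count                # 204800 tokens per update
vanilla_effective = vanilla_batch * vanilla_accum   # 65536 tokens per update

train_steps = 40000
print(train_steps * effective / vanilla_effective)  # 125000.0 "equivalent" vanilla steps

# The memory-saving variant keeps the same effective batch:
assert 4096 * 50 == effective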
Good luck!