Good afternoon,
I should publish a list of the corpora I’ve tried at some point, but here is what I can tell you so far:
0/ Before using TED2020 and QED, please review their terms of use: QED is for research only, and TED2020 has a very specific fair-use license.
1/ Your data is unfiltered and contains a lot of garbage.
Try the following filters (I also filter the wiki corpora, but that is a two-step process with langdetect, see the sketch after the config below; to start with, the “char_length” filter will do part of that job):
"filters": [
"duplicates",
{"source_target_ratio": {"min": 0.5, "max": 2}},
{"nonalphanum_ratio": {"max": 0.25}},
{"digits_ratio": {"max": 0.25}}
],
"sources": [
"opus://XLEnt",
"opus://wikimedia",
{
"source": "opus://GlobalVoices",
"filters": [
{"char_length": {"min": 20, "max":500}}
]
},
{
"source": "opus://NeuLab-TedTalks",
"filters": [
{"contains":
{"words": ["(Applause)", "(Laughter)", "(Piano)", "(Beep)", "Thanks."
]}
}
]
},
{
"source": "opus://QED",
"filters": [
{"contains":
{"words": ["(Applause)", "(Laughter)", "(Piano)", "(Beep)", "Thanks."
]}
},
{"char_length": {"min": 20, "max":500}}
]
},
{
"source": "opus://TED2020",
"filters": [
{"contains":
{"words": ["(Applause)", "(Laughter)", "(Piano)", "(Beep)", "Thanks."
]}
},
{"char_length": {"min": 20, "max":500}}
]
},
"opus://WikiMatrix",
2/ Some of the corpora you mention are basically useless as is (OpenSubtitles), but you can use the “dic” file (downloaded manually from the OPUS website) as a dictionary, with a little script that takes columns 2 and 3 (the first two columns are an index and probabilities, so they cannot be used for training with the current tools):
# Import necessary libraries
import os

# Define the directory and file name
directory = os.path.dirname(__file__)
dictionary_file = "en-fa.dic"

# Define the punctuation marks to be used (not needed for the extraction below)
punctuation_marks = ["!", ";", ".", "?", " "]

def create_dic():
    # Column 2 holds the English entry and column 3 the Farsi entry;
    # columns 0 and 1 (index and probabilities) are skipped.
    with (open(os.path.join(directory, dictionary_file), 'r', encoding="utf-8") as source_file,
          open(os.path.join(directory, 'Open_Subtitles_dic.en'), 'w', encoding="utf-8") as two_file,
          open(os.path.join(directory, 'Open_Subtitles_dic.fa'), 'w', encoding="utf-8") as three_file):
        for line in source_file:
            columns = line.strip().split('\t')
            if len(columns) < 4:
                # Skip empty or malformed lines
                continue
            two_file.write(columns[2] + '\n')
            three_file.write(columns[3] + '\n')

create_dic()
3/ You have too little data (6M sentences of subtitles will only help you with more of the same): under 10M sentences, and with thematic corpora only, your model cannot generalize. That’s what NLLB is there for: use the whole of it and see what comes out.
    {
        "source": "opus://CCMatrix",
        "filters": [
            {"char_length": {"min": 20, "max": 500}}
        ]
    }
],
4/ Then, Farsi grammar and world representation are quite different from English: you want a bigger model to encode them correctly, a larger feed_forward to preserve structure throughout the data processing, and a slower gradient descent to keep training on track:
"vocab_size": 32000,
"save_checkpoint_steps": 500,
"valid_steps": 500,
"train_steps": 40000,
"batch_size": 8192,
"accum_count": 25,
"enc_layers": 18,
"dec_layers": 6,
"transformer_ff": 3072,
"heads": 8,
"keep_checkpoint": 16,
"pos_ffn_activation_fn": "gated-gelu",
"position_encoding": "False",
"max_relative_positions": 20
For the “gated-gelu” parameter you have to tweak a dependency (look for the post on German and Russian), so you may prefer this instead: it works almost as well with no under-the-hood hassle:
"transformer_ff": 4096,
"max_relative_positions": -1
5/ In case of memory issues with these parameters, reduce batch_size to 4096 and compensate by raising accum_count to 50. The 40k steps of this training are equivalent to 125k+ steps of the vanilla configuration. If you experience gradient explosion (“nan” values), lower “warmup_steps” to 5000 and it should stabilize training. I use the basic “warmup_steps” myself, but I have developed extra tools that refine the data and let me sweep through it like a snowplough, then go back to the best gradient minimum I encountered on the way and plant the flag.
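To see where those numbers come from, here is the back-of-the-envelope arithmetic (a quick sketch; the vanilla accum_count of 8 below is my assumption for the baseline, so plug in whatever your reference config actually uses):

# Effective batch = batch_size tokens accumulated over accum_count micro-batches
# per optimizer update.
batch_size, accum_count = 8192, 25          # settings from point 4
vanilla_batch, vanilla_accum = 8192, 8      # assumed vanilla baseline, adjust to yours

effective = batch_size * accum_count                # 204800 tokens per update
vanilla_effective = vanilla_batch * vanilla_accum   # 65536 tokens per update

train_steps = 40000
print(train_steps * effective / vanilla_effective)  # 125000.0 "equivalent" vanilla steps

# The memory-saving variant keeps the same effective batch:
assert 4096 * 50 == effective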
Good luck!