Help Wanted: Improve en-de translation

NicoLe · June 4, 2024, 2:40pm

I hope you have not tried updating OpenNMT-py and CTranslate2 yet: there is a dependency conflict with torch. I tried to resolve it, but I cannot be sure it is fixed (torch has to be upgraded to at least 2.1).

For the tokens, you had the question right.

Since my script prefixes the translated sentences, the prefix should get added to the vocabulary and SPM by itself… My question is whether that is enough for the model to grasp the different nature of BT sentences, or whether I really have to prefix using the “src_prefix” option and “control_symbols.extend” in train.py?

lynxpda · June 4, 2024, 3:04pm

Theoretically it could work.
If I understood correctly, you added a certain prefix to the beginning of each sentence, e.g. .
If you train the model from scratch, it is likely that the prefix will become a single token and tag these sentences.
If, on the other hand, the model is already trained and you are just adding a new dataset, then the prefix will most likely split into multiple tokens. This will probably work too, but it has its drawbacks…

If we are talking about the most correct way, it is better to add service tokens to the SPM model in advance and add these same src_prefix tokens on the fly during training to the back-translated sentences.
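
As a rough illustration of the on-the-fly tagging described above, here is a minimal sketch in plain Python. The tag string `<BT>` and the function name are hypothetical, and the example assumes an en→de model whose back-translated sources are English. In a real setup the same string would also have to be reserved in the SPM model (the SentencePiece trainer has `user_defined_symbols` and `control_symbols` options for that) so that it stays a single token.

```python
from typing import Iterable, Iterator, Tuple

BT_TAG = "<BT>"  # hypothetical service token, not taken from the thread


def tag_sources(
    pairs: Iterable[Tuple[str, str]], is_back_translated: bool, tag: str = BT_TAG
) -> Iterator[Tuple[str, str]]:
    """Yield (source, target) pairs, prefixing the source of BT pairs only."""
    for src, tgt in pairs:
        if is_back_translated and not src.startswith(tag):
            src = f"{tag} {src}"
        yield src, tgt


# Authentic parallel data is left untouched; synthetic sources get the tag.
authentic = [("Good morning.", "Guten Morgen.")]
synthetic = [("This is a test.", "Das ist ein Test.")]

tagged = list(tag_sources(authentic, False)) + list(tag_sources(synthetic, True))
for src, tgt in tagged:
    print(src, "|||", tgt)
```

Tagging at load time rather than in the files keeps the back-translated corpus reusable with a different tag or with no tag at all.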

NicoLe · June 4, 2024, 3:39pm

I will train the model from scratch, but you make a good point about adding a service token instead of a basic one, which could pop up as a placeholder in the translations.

About your version of train.py, the more I study the differences, the more I think you should introduce it as a different script, call it “expert-train” or “fine-train”. Only a few features are directly compatible with the current train.py.

Byte_fallback is one of them… does the argument --byte_fallback_off add OOV capacity, or the contrary? I am not familiar with the “store_false” action.

lynxpda · June 6, 2024, 7:19am

By default, the SPM byte_fallback function is enabled in my edition.
Adding --byte_fallback_off disables this function.
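
For readers unfamiliar with argparse's `store_false` action, here is a minimal sketch of how such a flag behaves. The destination name `byte_fallback` and the help text are assumptions; the actual train.py may name things differently.

```python
import argparse

parser = argparse.ArgumentParser()
# With action="store_false", the destination defaults to True
# and passing the flag flips it to False.
parser.add_argument(
    "--byte_fallback_off",
    dest="byte_fallback",  # hypothetical destination name
    action="store_false",
    help="disable SentencePiece byte fallback",
)

default_args = parser.parse_args([])
disabled_args = parser.parse_args(["--byte_fallback_off"])

print(default_args.byte_fallback)   # flag absent
print(disabled_args.byte_fallback)  # flag present
```

So the flag removes the fallback: without it, characters missing from the vocabulary can still be encoded as bytes.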

NicoLe · June 6, 2024, 10:30am

That’s what I figured out after researching a little bit.

Training Shaw20 GeGLU right now: it overfits slightly compared to SwiGLU, but the val.BLEU values look improved.

As for train.py, I’ve added the extra config options and byte fallback to the script. I’ll open a PR soon.

I use an “advanced_train.py” script for the other options (prefixes and weighting as round-robin) with revised packaging at the end. I will use it this weekend to train with BT; if it works as planned, I’ll send you both scripts early next week.

It may be possible to use an “--advanced” argument and fuse these functions into “train.py”, but that would require quite a few “if” branches which I do not have time to implement at the moment. So I went for the easy way.

I am also thinking about a stable way to implement RoPE. The only kink is that Locomotive currently uses the last torch version with a CUDA runtime, and later torch versions require installing the whole CUDA package. For people using Locomotive on a gaming PC, that may be a problem.

NicoLe · June 9, 2024, 2:27pm

Hello,

From the Melbourne Uni article @lynxpda mentioned earlier, I thought I had to duplicate the dataset and back-translate the source to use as a source in the other direction, but it hurts training a lot (although val.BLEU against flores devtest doesn’t look too bad)…

So I re-read it (and several papers it quotes), and I am now wondering whether they really duplicated their original dataset or amplified it with a backtranslated dataset of more or less the same size.

Should I really duplicate the data, or amplify it? I have some German data to amplify it too, but I’m quite slim on English data, so I’d have to collect and preprocess some before the return step.

My BT script as of now is designed for duplication; it is fast (~1M sentences per hour) and stable (25+M sentences translated successfully with no hitch). But for dataset amplification, it needs rewriting (depending on how useful it turns out, it could then be submitted to GitHub actually).

Since I said I’d share the script today, here’s a link to the current version, along with an advanced_train script developed by @lynxpda that I modified with the legacy package method (deploys a 200M model well to an LT instance, didn’t try a really big model though).

lynxpda · June 10, 2024, 8:09am

I may have misunderstood the question. I will just describe what I did myself, using the example of my EN_RU model (last configuration).
In the screenshot below, the BT sentences are highlighted: 38M pairs of back-translated sentences out of a total of 158M pairs (I adjusted the ratios using weights during training).

In the back-translated sentences I included texts from topics of interest to me (news, IT technology, and wikipedia for general erudition).

I collected data both by self-parsing and from OPUS (The “raw” column). I also took dictionaries for translating words from there (column “dic”):

Most importantly, the data should be from the domain you are most interested in. In terms of volume and ratio, well they should be enough (10-50% of the total can be a reasonable size, depending on what you want to get in the end).
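
To make the ratio arithmetic concrete, here is a small sketch. The first part uses the 38M and 158M figures quoted above. The second part assumes that examples are drawn from each corpus in proportion to its weight, so that the weights rather than the corpus sizes set the mix seen during training; the corpus names and weight values are made up for illustration.

```python
from typing import Dict


def stream_share(weights: Dict[str, float]) -> Dict[str, float]:
    """Share of each corpus in the training stream, assuming sampling
    proportional to the corpus weights (an assumption, see above)."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return {name: weight / total for name, weight in weights.items()}


# Raw share of back-translated pairs in the EN_RU data described above.
bt_pairs = 38_000_000
total_pairs = 158_000_000
raw_bt_share = bt_pairs / total_pairs
print(f"raw BT share: {raw_bt_share:.1%}")

# Hypothetical weights: three authentic examples for each BT example.
shares = stream_share({"parallel": 3, "back_translated": 1})
print(shares)
```

By raw count the back-translated data is a little under a quarter of the total, which sits inside the 10-50% range mentioned above.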

lynxpda · June 10, 2024, 8:14am

Regarding the translation script, just in case: I don’t see a sampling setting for ctranslate2, probably defaults to greedy search. It’s fast, but the result may be a bit worse than beam search or its combination with random sampling.
Also detokenization without byte_fallback support (OOV characters will be eaten during translation).
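
For reference, a sketch of how the decoding strategy can be chosen explicitly when calling CTranslate2's `Translator.translate_batch`, which accepts `beam_size`, `sampling_topk` and `sampling_temperature`. The helper names, the mode labels and the numeric values are illustrative assumptions, not settings taken from the script under discussion.

```python
from typing import Any, Dict, List


def decoding_options(mode: str) -> Dict[str, Any]:
    """Return keyword arguments for ctranslate2.Translator.translate_batch."""
    if mode == "greedy":
        return {"beam_size": 1}
    if mode == "beam":
        return {"beam_size": 4}  # illustrative beam width
    if mode == "sampling":
        # Random sampling among the 10 most likely tokens at each step.
        return {"beam_size": 1, "sampling_topk": 10, "sampling_temperature": 1.0}
    raise ValueError(f"unknown decoding mode: {mode}")


def translate_tokens(model_dir: str, batch: List[List[str]], mode: str = "beam") -> List[List[str]]:
    """Translate already-tokenized sentences and return the best hypothesis of each."""
    import ctranslate2  # imported lazily so decoding_options() works without it

    translator = ctranslate2.Translator(model_dir, device="cpu")
    results = translator.translate_batch(batch, **decoding_options(mode))
    return [result.hypotheses[0] for result in results]


print(decoding_options("sampling"))
```

Stating the mode explicitly in the back-translation script removes any doubt about which search is actually used.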

NicoLe · June 10, 2024, 8:17am

OK, so it’s amplification…
I’ll let the duplicated set train and see where it leads though: it’s been training for a day now, and I knew presenting two versions of the same texts would hurt training, but yesterday I was even wondering whether the model would converge at all. Now it looks like it does converge and val.BLEU is acceptable for English->German.
In the meantime, I’ll modify the BT script accordingly. I also have to look into upgrading CT2 to use RoPE; this training with RPE20 is bound to last 3 to 4 days.
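
A minimal sketch of what “amplification” means here, assuming an en→de model: monolingual German text is translated back to English by a reverse model, and each synthetic English sentence is paired with its authentic German original, on top of the existing parallel data. The function names are hypothetical and `fake_de_to_en` is only a stand-in for the reverse model.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (source, target)


def amplify(
    parallel: List[Pair],
    mono_target: List[str],
    back_translate: Callable[[str], str],
) -> List[Pair]:
    """Return the authentic pairs plus synthetic pairs built from monolingual target text."""
    synthetic = [(back_translate(sentence), sentence) for sentence in mono_target]
    return parallel + synthetic


def fake_de_to_en(sentence: str) -> str:
    # Placeholder for the reverse (de->en) model; it only marks the text as synthetic.
    return f"[en] {sentence}"


parallel = [("Good morning.", "Guten Morgen.")]
mono_de = ["Das ist ein Test."]

corpus = amplify(parallel, mono_de, fake_de_to_en)
for src, tgt in corpus:
    print(src, "|||", tgt)
```

The target side of every added pair is authentic text, and none of the original pairs appears twice.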

lynxpda · June 10, 2024, 9:07am

Regarding RoPE, yesterday I trained and successfully converted a model taking into account PR.

NicoLe · June 10, 2024, 12:19pm

Regarding the dic files, how do you include them into the training?

lynxpda · June 10, 2024, 12:38pm

I use a simple script:

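import os

# Define the directory and file name
directory = "./done"
dictionary_file = "et-en.dic"

def create_dic():
    """Split a tab-separated .dic file into parallel source and target files."""
    # Explicit UTF-8 avoids depending on the platform's default encoding
    with (open(os.path.join(directory, dictionary_file), 'r', encoding='utf-8') as source_file,
          open(os.path.join(directory, 'source.txt'), 'w', encoding='utf-8') as s_file,
          open(os.path.join(directory, 'target.txt'), 'w', encoding='utf-8') as t_file):
        for line in source_file:
            columns = line.strip().split('\t')
            # The third and fourth columns hold the source and target entries
            if len(columns) >= 4:
                s_file.write(columns[2] + '\n')
                t_file.write(columns[3] + '\n')

create_dic()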

NicoLe · June 10, 2024, 2:20pm

Sure works, thanks for the script. I thought I had missed an option somewhere in the existing scripts.

After upgrading for RoPE, I finally set the torch version to 2.2.2+cu121 and put it in the requirements.txt file to make for a clean upgrade.
At the first training step though, I get this warning about flash attention. Do you have it too? If not, could you please check your torch version and CUDA runtime?

C:\Program Files\Python39\lib\site-packages\onmt\modules\multi_headed_attn.py:656: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:263.)
  attn_output = scaled_dot_product_attention(
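
One way to collect the torch version and CUDA runtime asked about above is the short sketch below. It only reads attributes that torch exposes (`torch.__version__`, `torch.version.cuda`, `torch.cuda.is_available()`); the function name and the dictionary keys are made up. Note that `flash_sdp_enabled()` reports whether the flash kernel is allowed, which is not the same as the build having been compiled with it.

```python
def torch_report() -> dict:
    """Collect the torch details relevant to the flash attention warning."""
    try:
        import torch
    except ImportError:
        return {"torch": None}

    report = {
        "torch": torch.__version__,
        "cuda_runtime": torch.version.cuda,  # None on CPU-only builds
        "cuda_available": torch.cuda.is_available(),
    }
    # Present in torch 2.x; guarded in case an older build lacks it.
    flash_flag = getattr(torch.backends.cuda, "flash_sdp_enabled", None)
    report["flash_sdp_allowed"] = flash_flag() if callable(flash_flag) else None
    return report


report = torch_report()
for key, value in report.items():
    print(f"{key}: {value}")
```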

NicoLe · June 10, 2024, 2:31pm

Found the answer, it seems that after torch 2.1.2 on Windows, the flash attention package is absent. Downgraded to torch==2.1.2+cu121, training goes without warning (checked twice).

NicoLe · June 11, 2024, 11:06am

Well, I really wonder how @lynxpda can train with an updated CTranslate2.

I tried every possible combination of torch+cuda, onmt-py and ctranslate2 between the ones used in Locomotive and the most recent ones. In the best case, training breaks at the validation step with the error cascade below. In the worst case (most cases), training starts with an error cascade, or CUDA is available but does not work.

It doesn’t matter what script I use, what data, from scratch or not… this may be caused by the “filtertoolong” transform inserted in the onmt configuration, but I am unsure what to do about it.

[2024-06-11 09:42:02,000 INFO] Start training loop and validate every 100 steps...
[2024-06-11 09:42:02,015 INFO] Scoring with: ['sentencepiece', 'filtertoolong', 'prefix']
[2024-06-11 09:48:24,332 INFO] Step 50/32000; acc: 0.4; ppl: 31126.4; xent: 10.3; lr: 0.00000; sents:  295284; bsz: 6628/7296/236; 21670/23854 tok/s;    382 sec;
[2024-06-11 09:53:54,287 INFO] Step 100/32000; acc: 4.7; ppl: 25540.5; xent: 10.1; lr: 0.00000; sents:  292443; bsz: 6604/7244/234; 25018/27441 tok/s;    712 sec;
[2024-06-11 09:54:36,382 INFO] valid stats calculation
                           took: 42.094335317611694 s.
Traceback (most recent call last):
  File "C:\Program Files\Python39\lib\runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Program Files\Python39\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Program Files\Python39\Scripts\onmt_train.exe\__main__.py", line 7, in <module>
  File "C:\Program Files\Python39\lib\site-packages\onmt\bin\train.py", line 67, in main
    train(opt)
  File "C:\Program Files\Python39\lib\site-packages\onmt\bin\train.py", line 52, in train
    train_process(opt, device_id=0)
  File "C:\Program Files\Python39\lib\site-packages\onmt\train_single.py", line 238, in main
    trainer.train(
  File "C:\Program Files\Python39\lib\site-packages\onmt\trainer.py", line 332, in train
    valid_stats = self.validate(
  File "C:\Program Files\Python39\lib\site-packages\onmt\trainer.py", line 420, in validate
    preds, texts_ref = self.scoring_preparator.translate(
  File "C:\Program Files\Python39\lib\site-packages\onmt\utils\scoring_utils.py", line 111, in translate
    _, preds = translator._translate(
  File "C:\Program Files\Python39\lib\site-packages\onmt\translate\translator.py", line 494, in _translate
    for batch, bucket_idx in infer_iter:
  File "C:\Program Files\Python39\lib\site-packages\onmt\inputters\dynamic_iterator.py", line 341, in __iter__
    for bucket, bucket_idx in self._bucketing():
  File "C:\Program Files\Python39\lib\site-packages\onmt\inputters\dynamic_iterator.py", line 286, in _bucketing
    yield (self._tuple_to_json_with_tokIDs(bucket), self.bucket_idx)
  File "C:\Program Files\Python39\lib\site-packages\onmt\inputters\dynamic_iterator.py", line 247, in _tuple_to_json_with_tokIDs
    tuple_bucket = process(self.task, tuple_bucket)
  File "C:\Program Files\Python39\lib\site-packages\onmt\inputters\text_utils.py", line 95, in process
    transf_bucket = transform.batch_apply(
  File "C:\Program Files\Python39\lib\site-packages\onmt\transforms\transform.py", line 232, in batch_apply
    batch = transform.batch_apply(
  File "C:\Program Files\Python39\lib\site-packages\onmt\transforms\transform.py", line 70, in batch_apply
    example = self.apply(example, is_train=is_train, **kwargs)
  File "C:\Program Files\Python39\lib\site-packages\onmt\transforms\misc.py", line 56, in apply
    or len(example["tgt"]) > self.tgt_seq_length - 2
TypeError: object of type 'NoneType' has no len()
Total checkpoints: 0
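
The last frame of the traceback shows `len()` being called on `example["tgt"]` while it is `None`. The sketch below is a standalone illustration of a length filter that tolerates a missing target; it is not OpenNMT-py's code, the function name and default lengths are hypothetical, and whether patching the library is the right fix is left open.

```python
from typing import Optional


def too_long(example: dict, src_seq_length: int = 150, tgt_seq_length: int = 150) -> bool:
    """Return True if the example should be dropped by a length filter."""
    src = example.get("src") or []
    tgt: Optional[list] = example.get("tgt")
    if len(src) > src_seq_length:
        return True
    # The target may be absent at scoring time; only check it when present.
    return tgt is not None and len(tgt) > tgt_seq_length - 2


print(too_long({"src": ["tok"] * 10, "tgt": None}))         # target missing
print(too_long({"src": ["tok"] * 10, "tgt": ["tok"] * 149}))  # target over the limit
```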

lynxpda · June 11, 2024, 12:39pm

I don’t see how the errors are related to Ctranslate2 yet.
I use the following versions and get no errors:

ctranslate2-4.2.1.dist-info
torch-2.1.1+cu121.dist-info
OpenNMT_py-3.4.3.dist-info

Here is the full list from my venv:

absl						 google_auth-2.23.4.dist-info		     pkg_resources				spacy
absl_py-2.0.0.dist-info				 google_auth_oauthlib			     portalocker				spacy-3.7.2.dist-info
ahocorasick.cpython-311-x86_64-linux-gnu.so	 google_auth_oauthlib-1.0.0.dist-info	     portalocker-2.8.2.dist-info		spacy_legacy
annotated_types					 grpc					     preshed					spacy_legacy-3.0.12.dist-info
annotated_types-0.6.0.dist-info			 grpcio-1.59.3.dist-info		     preshed-3.0.9.dist-info			spacy_loggers
bitsandbytes-0.41.2-py3.11.egg			 idna					     protobuf-4.25.1.dist-info			spacy_loggers-1.0.5.dist-info
blinker						 idna-3.4.dist-info			     pyahocorasick-2.0.0.dist-info		srsly
blinker-1.7.0.dist-info				 iso639					     pyasn1					srsly-2.4.8.dist-info
blis						 iso639-0.1.4.egg-info			     pyasn1-0.5.1.dist-info			stanza
blis-0.7.11.dist-info				 isympy.py				     pyasn1_modules				stanza-1.1.1.dist-info
cachetools					 itsdangerous				     pyasn1_modules-0.3.0.dist-info		subword_nmt
cachetools-5.3.2.dist-info			 itsdangerous-2.1.2.dist-info		     pybind11					subword_nmt-0.3.8.dist-info
catalogue					 jinja2					     pybind11-2.11.1.dist-info			sympy
catalogue-2.0.10.dist-info			 Jinja2-3.1.2.dist-info			     __pycache__				sympy-1.12.dist-info
certifi						 joblib					     pydantic					tabulate
certifi-2023.11.17.dist-info			 joblib-1.3.2.dist-info			     pydantic-2.5.2.dist-info			tabulate-0.9.0.dist-info
charset_normalizer				 langcodes				     pydantic_core				tensorboard
charset_normalizer-3.3.2.dist-info		 langcodes-3.3.0.dist-info		     pydantic_core-2.14.5.dist-info		tensorboard-2.14.0.dist-info
click						 lxml					     pyonmttok					tensorboard_data_server
click-8.1.7.dist-info				 lxml-4.9.3.dist-info			     pyonmttok-1.37.1.dist-info			tensorboard_data_server-0.7.2.dist-info
cloudpathlib					 markdown				     pyonmttok.libs				tests
cloudpathlib-0.16.0.dist-info			 Markdown-3.5.1.dist-info		     PyYAML-6.0.1.dist-info			thinc
colorama					 markupsafe				     rapidfuzz					thinc-8.2.1.dist-info
colorama-0.4.6.dist-info			 MarkupSafe-2.1.3.dist-info		     rapidfuzz-3.5.2.dist-info			torch
confection					 mock					     regex					torch-2.1.1+cu121.dist-info
confection-0.1.4.dist-info			 mock-5.1.0.dist-info			     regex-2023.10.3.dist-info			torchgen
ConfigArgParse-1.7.dist-info			 mpmath					     removedup-1.0.6.dist-info			tqdm
configargparse.py				 mpmath-1.3.0.dist-info			     removedup.cpython-311-x86_64-linux-gnu.so	tqdm-4.66.1.dist-info
ctranslate2					 murmurhash				     requests					triton
ctranslate2-4.2.1.dist-info			 murmurhash-1.0.10.dist-info		     requests-2.31.0.dist-info			triton-2.1.0.dist-info
ctranslate2.libs				 networkx				     requests_oauthlib				typer
cymem						 networkx-3.2.1.dist-info		     requests_oauthlib-1.3.1.dist-info		typer-0.9.0.dist-info
cymem-2.0.8.dist-info				 numpy					     rsa					typing_extensions-4.8.0.dist-info
_distutils_hack					 numpy-1.26.2.dist-info			     rsa-4.9.dist-info				typing_extensions.py
distutils-precedence.pth			 numpy.libs				     sacrebleu					urllib3
easy-install.pth				 nvfuser				     sacrebleu-2.3.1.dist-info			urllib3-2.1.0.dist-info
fastshuffle-1.0.1.dist-info			 nvidia					     sacremoses					waitress
fastshuffle.cpython-311-x86_64-linux-gnu.so	 nvidia_cublas_cu11-11.11.3.6.dist-info      sacremoses-0.0.53.egg-info			waitress-2.1.2.dist-info
fasttext					 nvidia_cuda_nvrtc_cu11-11.8.89.dist-info    scipy					wasabi
fasttext_pybind.cpython-311-x86_64-linux-gnu.so  nvidia_cuda_runtime_cu11-11.8.89.dist-info  scipy-1.11.4.dist-info			wasabi-1.1.2.dist-info
fasttext_wheel-0.9.2.dist-info			 nvidia_cudnn_cu11-8.9.6.50.dist-info	     scipy.libs					weasel
filelock					 oauthlib				     sentencepiece				weasel-0.3.4.dist-info
filelock-3.13.1.dist-info			 oauthlib-3.2.2.dist-info		     sentencepiece-0.1.99.dist-info		werkzeug
flask						 onmt					     setuptools					werkzeug-3.0.1.dist-info
flask-3.0.0.dist-info				 OpenNMT_py-3.4.3.dist-info		     setuptools-68.1.2.dist-info		wheel
fsspec						 packaging				     six-1.16.0.dist-info			wheel-0.41.3.dist-info
fsspec-2023.10.0.dist-info			 packaging-23.2.dist-info		     six.py					_yaml
functorch					 pip					     smart_open					yaml
google						 pip-23.2.dist-info			     smart_open-6.4.0.dist-info
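
To compare another environment against the three versions quoted above, a small check with the standard library's `importlib.metadata` could look like this. The `PINNED` table simply restates those versions; the helper names are made up, and local build tags such as `+cu121` are ignored in the comparison.

```python
from importlib import metadata
from typing import Dict, Optional, Tuple

# Versions reported as working earlier in this post.
PINNED = {"torch": "2.1.1", "OpenNMT-py": "3.4.3", "ctranslate2": "4.2.1"}


def installed_version(name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def compare(pinned: Dict[str, str]) -> Dict[str, Tuple[str, Optional[str], bool]]:
    """Map each package to (wanted, found, matches)."""
    rows = {}
    for name, wanted in pinned.items():
        found = installed_version(name)
        # Drop a local build tag such as "+cu121" before comparing.
        matches = found is not None and found.split("+")[0] == wanted
        rows[name] = (wanted, found, matches)
    return rows


rows = compare(PINNED)
for name, (wanted, found, matches) in rows.items():
    print(f"{name}: wanted {wanted}, found {found}, match={matches}")
```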

NicoLe · June 11, 2024, 2:48pm

In this case… well it works. It does normal PE, probably RPE as well.
But this is not something you can do with “pip install --upgrade -r requirements.txt”.

It should be installed in this order (not something the Ops team will condone, I am afraid):

torch==2.1.1 from pytorch.org/whl/cuda121 (for versions 2.1.0 and later, only this works),

onmt 3.4.3

ctranslate2==4.2.1 (an error message about the version conflict appears, but it installs nonetheless)

Then, when using RoPE, converting the checkpoint to a CT2 model fails with an “unexpected argument” error, then a second one; both are solved by commenting out said arguments at lines 343-344 of CT2’s transformer_spec.py.

All in all, I think I should ask someone working on the CT2 project about this.

NicoLe · June 11, 2024, 4:25pm

Actually someone well versed in the language told me yesterday it was not so bad at all, and better than the DeepL alternative… they would have put a present participle instead of a gerundive though.

NicoLe · June 25, 2024, 9:27am

Hello,

I came back to work on backtranslation, and have questions about your remarks from two weeks ago.

Regarding the translation script, just in case: I don’t see a sampling setting for ctranslate2, probably defaults to greedy search. It’s fast, but the result may be a bit worse than beam search or its combination with random sampling.

I think there is a beam search applied at line 123 while encoding, should I script something for decoding too?

Also detokenization without byte_fallback support (OOV characters will be eaten during translation).

If the model has been trained with byte_fallback support, the tokenizer vocabulary features byte-fallback tokens, is it not enough for OOV characters to be processed?

As for rotary encoding, I had hoped it would solve the syntax issues between English and German, but so far the results have been underwhelming. But I don’t throw in the towel so easily.

Also, I experience vanishing gradients (ppl : nan) on some of the trainings, did you see this too? They appear well into learning decay, do not always disrupt training significantly, but still.
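
To pin down when ppl turns to nan and what the learning rate was at that point, the training log can be scanned with a short script. This sketch assumes the step-line format shown in the log earlier in the thread; the nan line used for the demonstration is not a real log line, it is derived from the real one by substitution.

```python
import math
import re
from typing import Iterable, Optional, Tuple

# Matches the "Step N/M; acc: ...; ppl: ...; xent: ...; lr: ...;" part of a log line.
STEP_LINE = re.compile(
    r"Step\s+(?P<step>\d+)/\s*\d+;\s*acc:\s*(?P<acc>[^;\s]+);"
    r"\s*ppl:\s*(?P<ppl>[^;\s]+);\s*xent:\s*(?P<xent>[^;\s]+);"
    r"\s*lr:\s*(?P<lr>[^;\s]+);"
)


def first_nan_step(log_lines: Iterable[str]) -> Optional[Tuple[int, float]]:
    """Return (step, lr) of the first step whose ppl is nan, or None."""
    for line in log_lines:
        match = STEP_LINE.search(line)
        if match and math.isnan(float(match.group("ppl"))):
            return int(match.group("step")), float(match.group("lr"))
    return None


real_line = (
    "[2024-06-11 09:48:24,332 INFO] Step 50/32000; acc: 0.4; ppl: 31126.4; "
    "xent: 10.3; lr: 0.00000; sents:  295284; bsz: 6628/7296/236; "
    "21670/23854 tok/s;    382 sec;"
)
# Hypothetical line for the demo, built from the real one.
nan_line = real_line.replace("31126.4", "nan")

print(first_nan_step([real_line]))
print(first_nan_step([real_line, nan_line]))
```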

lynxpda · June 29, 2024, 4:50pm

I apologize, that’s correct, didn’t notice.

No, it’s not enough. When decoding tokens in a sentence, someone has to put the bytes back together into characters. Therefore, the SPM must do the detokenizing.
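
To show what “putting the bytes back together” involves, here is a plain-Python sketch. SentencePiece writes byte-fallback pieces as `<0xNN>`; a detokenizer that simply joins pieces leaves those markers in the text, or loses the character if they are filtered out. The function name is hypothetical, and in practice the SentencePiece processor's own decoding does this job.

```python
import re
from typing import List

BYTE_PIECE = re.compile(r"^<0x([0-9A-Fa-f]{2})>$")


def merge_pieces(pieces: List[str]) -> str:
    """Join SentencePiece pieces, decoding runs of byte pieces as UTF-8."""
    out: List[str] = []
    pending = bytearray()

    def flush() -> None:
        # Turn the accumulated bytes into characters.
        if pending:
            out.append(pending.decode("utf-8", errors="replace"))
            pending.clear()

    for piece in pieces:
        match = BYTE_PIECE.match(piece)
        if match:
            pending.append(int(match.group(1), 16))
        else:
            flush()
            out.append(piece)
    flush()
    # "▁" (U+2581) marks a word boundary in SentencePiece output.
    return "".join(out).replace("\u2581", " ").strip()


print(merge_pieces(["\u2581caf", "<0xC3>", "<0xA9>"]))
```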

I don’t quite understand how this happens. I have only once encountered vanishing gradients and ppl:nan, and it was a conscious experiment with the learning rate scheduler and LR size; in my mind this is the main influencing factor.

Could you tell me what scheduler and LR size you are using?