Transfer Learning

pierotofy · June 28, 2023, 6:47pm

I’ve been thinking about ways to further improve translation quality in argos-translate and LT. Wondering if anyone has tried to perform transfer learning from existing open-source models that perform well but are released under restrictive licenses.

In short:

Pick a list of source-language sentences from an existing corpus (e.g. OPUS)
Translate that list into the target language using an existing model - OR - validate/filter the corpus's existing translations to discard bad ones (this might be tricky to automate)
Train a new model using argos-train
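
As a rough illustration of the "validate/filter" option in the second step, here is a minimal sketch of some cheap heuristic checks on existing sentence pairs. The function names, the length-ratio threshold, and the checks themselves are all assumptions made for illustration; they are not part of argos-train, and a real filter would likely need something stronger (language ID, a scoring model, etc.).

```python
def keep_pair(source: str, target: str, max_ratio: float = 2.5) -> bool:
    """Return True if a sentence pair passes a few cheap sanity checks.

    Hypothetical helper: the checks and the default threshold are guesses,
    not tuned values.
    """
    src, tgt = source.strip(), target.strip()
    if not src or not tgt:
        return False  # one side is empty
    if src == tgt:
        return False  # probably an untranslated copy (may also drop names/numbers)
    longer = max(len(src), len(tgt))
    shorter = min(len(src), len(tgt))
    # Very lopsided character lengths often indicate misaligned pairs.
    return longer / shorter <= max_ratio


def filter_pairs(pairs, max_ratio: float = 2.5):
    """Keep only the (source, target) pairs that pass keep_pair."""
    return [(s, t) for s, t in pairs if keep_pair(s, t, max_ratio)]
```

Calling `filter_pairs` on a list of `(source, target)` tuples returns the subset that survives the checks. This only catches obvious junk, which is part of why automating the validation is tricky.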

Licensing-wise this should be allowed (the output of a model is not subject to the licensing terms of the model itself).

Thoughts?

argosopentech · June 28, 2023, 9:55pm

I think transfer learning is a very promising approach. We want to find ways to use the improvements in general-purpose LLMs to improve our translation-specific language models. Fine-tuning large models could also work well.

Current Argos Translate

Find parallel translation data only
Train a translation-specific model for a single language pair.

Transfer learning

Find single language data (Wikipedia, web scraping)
Translate the text data with an existing language model like ChatGPT
Train a new translation model on the synthetic data
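
A minimal sketch of the first two transfer-learning steps, assuming the existing model is wrapped in a plain Python callable. The `teacher` parameter, the function name, and the `source.txt` / `target.txt` file names are hypothetical choices for illustration; the exact layout the training tooling expects may differ.

```python
from pathlib import Path
from typing import Callable, Iterable


def build_synthetic_corpus(
    sentences: Iterable[str],
    teacher: Callable[[str], str],
    out_dir: str,
) -> int:
    """Translate monolingual sentences with `teacher` and write two
    line-aligned files: line N of target.txt translates line N of source.txt.

    `teacher` is a stand-in for whatever existing model is used.
    Returns the number of pairs written.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = 0
    with open(out / "source.txt", "w", encoding="utf-8") as src_f, \
         open(out / "target.txt", "w", encoding="utf-8") as tgt_f:
        for sentence in sentences:
            # Collapse whitespace so embedded newlines cannot break alignment.
            src = " ".join(sentence.split())
            if not src:
                continue
            tgt = " ".join(teacher(src).split())
            if not tgt:
                continue  # skip sentences the teacher failed to translate
            src_f.write(src + "\n")
            tgt_f.write(tgt + "\n")
            written += 1
    return written
```

Keeping the teacher as a plain callable means the same loop works whether the translations come from a local model or a remote API; only the wrapper changes.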

Fine tuning

Create a custom dataset for the translation task (this can be as small as 1k sentences!)
Fine tune an existing model

pierotofy · June 28, 2023, 11:00pm

I also haven’t researched this too much, but it’s my understanding that with fine-tuning the tuned model inherits the license of the original model, so this can be used on permissively licensed models, but not on those that might have restrictions (e.g. non-commercial use).