I think now is a great time to pretrain your own language model from scratch. This may seem a strange claim when all the best performing models today have hundreds of billions of parameters and are trained on gigantic datasets using large, expensive clusters of GPU machines for very long periods of time. However, even with hundreds of millions of parameters it’s possible to train models that were state of the art 4 years ago, and that are very useful for fine-tuning on specific tasks and running in production. Training from scratch gives you the ability to modify the tokenizer in ways more suitable for a task, to carefully select corpora appropriate to the task, and to use different model types (e.g. sparse attention), and it builds better intuition for domain adaptation. Moreover, scaling laws mean we can experiment with small pre-training runs to find the best model before scaling up to many more tokens.
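To make the tokenizer point concrete, here is a minimal sketch of byte-pair-encoding training, the kind of customisation you get when you own the whole pipeline. The function names are my own illustration (not from any particular library); in practice you would likely use a library such as Hugging Face’s tokenizers, but the core algorithm fits in a few lines:

```python
from collections import Counter

def train_bpe(text: str, num_merges: int):
    """Learn byte-pair merges from a corpus string (illustrative sketch)."""
    ids = list(text.encode("utf-8"))  # start from raw bytes
    merges = {}                       # (id_a, id_b) -> new token id
    next_id = 256                     # first id after the 256 byte values
    for _ in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break
        pair = pairs.most_common(1)[0][0]  # most frequent adjacent pair
        merges[pair] = next_id
        # replace every occurrence of the pair with the new token id
        out, i = [], 0
        while i < len(ids):
            if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
                out.append(next_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
        next_id += 1
    return merges, ids

# Train on a toy "domain" corpus: frequent domain terms become single tokens.
corpus = "the model, the tokenizer, the corpus " * 100
merges, encoded = train_bpe(corpus, num_merges=20)
print(len(corpus.encode("utf-8")), "bytes ->", len(encoded), "tokens")
```

Training this on your own domain text means frequent domain terms get their own tokens, shortening sequences for exactly the data you care about.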
Decoder models can be trained from scratch as well; Andrej Karpathy has released nanoGPT, which he claims can match the base GPT-2 model with 1 day of training on 8 x A100 40GB GPUs (roughly $250 USD on a second-tier cloud provider). The model and training loop are each about 300 lines of code, and Karpathy has released a detailed video tutorial, which makes it easy to try ways to improve it.
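A back-of-envelope check makes that cost plausible. Using the common approximation that training takes about 6 FLOPs per parameter per token, we can estimate how many tokens fit in a day; the peak-throughput and utilisation figures below are illustrative assumptions of mine, not numbers from nanoGPT itself:

```python
def trainable_tokens(n_params, n_gpus, peak_flops, mfu, hours):
    """Tokens trainable under the common ~6 * N * D training-FLOPs rule."""
    total_flops = n_gpus * peak_flops * mfu * hours * 3600
    return total_flops / (6 * n_params)

# Illustrative assumptions: GPT-2 small (~124M params), A100 peak of
# ~312 TFLOP/s in bf16, ~30% model FLOPs utilisation, 24 hours.
tokens = trainable_tokens(124e6, 8, 312e12, 0.30, 24)
print(f"{tokens / 1e9:.0f}B tokens")  # tens of billions of tokens in a day
```

Under these assumptions a day of 8 x A100 buys on the order of tens of billions of tokens through a GPT-2-sized model, which is the regime where small pre-training runs become affordable experiments.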
Most people don’t have the resources to train a very large language model from scratch (and it wouldn’t be good for the environment if they did). While Google, OpenAI, and Microsoft are increasingly keeping their large language models private behind paid APIs, there are other initiatives like EleutherAI and BigScience releasing very large language models openly. But I think smaller language models still have a lot of value, and they are much easier to use and adapt in a production setting; even keeping them up to date, as in the Online Language Modelling initiative, is worthwhile. They aren’t as powerful as the very large language models, but by combining them with search or logic I suspect they will be very effective in production.