Generating synthetic data for training large language models

As more compute becomes available for training large language models, I think methods to generate synthetic data will become increasingly important.

The Chinchilla paper showed that compute-optimal training of language models requires scaling the amount of training data along with the model size, which means enormous datasets. I suspect that the reason many of the most impressive machine learning models of the last few years have been language models is that there was an easy way to get large amounts of language data: scraping the internet.

Synthetic data could be generated in several ways: using language models themselves to produce text, or generating syllogistic or arithmetic expressions programmatically. There are many ways to generate valid arithmetic expression strings, for example 2+7-4=5, and you could just as easily generate code along with its output. Another option is reinforcement learning, where neural networks play games against themselves to generate data, as DeepMind has done for StarCraft.
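As a minimal sketch of the programmatic approach, here is one way to generate valid arithmetic expression strings like the 2+7-4=5 example above (the function name and parameters are my own illustration, not from any particular library):

```python
import random

def generate_arithmetic_example(num_operands=3, max_value=9):
    """Build a valid arithmetic training string like '2+7-4=5'."""
    operands = [random.randint(0, max_value) for _ in range(num_operands)]
    operators = [random.choice("+-") for _ in range(num_operands - 1)]
    # Interleave operands and operators into an expression string.
    expression = str(operands[0])
    for op, value in zip(operators, operands[1:]):
        expression += op + str(value)
    # eval is safe here because we constructed the string ourselves
    # from digits and +/- only.
    return f"{expression}={eval(expression)}"

examples = [generate_arithmetic_example() for _ in range(3)]
```

Every string produced this way is correct by construction, so the generator can emit unlimited clean training data with no human labeling.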

I’m not sure what this is likely to mean for Argos Translate and LibreTranslate. Currently, Argos Translate is trained primarily on data generated by other groups and published on OPUS, so if synthetic data becomes necessary there will hopefully be public datasets we can use. It’s also possible we would want to generate the data ourselves if, for example, it’s easier to generate data locally than to transfer it over the internet.