Improving Language Model Behavior by Training on a Curated Dataset

Appropriate or desirable language model behavior, like appropriate human behavior, cannot be reduced to one universal standard; desirable behavior differs by application and social context. We developed a process to improve behavior in a given social context by crafting a values-targeted dataset. Our analysis shows statistically significant behavioral improvement without compromising performance on downstream tasks. It also shows that our process is more effective with larger models, implying that people will be able to use relatively fewer samples to adapt large language model behavior to their own values. Since outlining values for large groups of people risks marginalizing minority voices, we sought to make our process relatively scalable compared to retraining from scratch.

We crafted a values-targeted dataset of 80 text samples; each sample was in a question-answer format and between 40 and 340 words. (For a sense of scale, our dataset was about 120KB, about 0.000000211% of GPT-3 training data[2].)
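
The size constraints described above can be checked mechanically. Here is a minimal sketch, not the authors' tooling. It assumes each sample is a dict with hypothetical `question` and `answer` keys, and that the 40 to 340 word range applies to the question and answer combined, counted by whitespace-separated words. The source specifies neither.

```python
def word_count(sample):
    """Count whitespace-separated words across question and answer (assumed convention)."""
    return len(f"{sample['question']} {sample['answer']}".split())


def validate_dataset(samples, expected_count=80, min_words=40, max_words=340):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if len(samples) != expected_count:
        problems.append(f"expected {expected_count} samples, found {len(samples)}")
    for i, sample in enumerate(samples):
        n = word_count(sample)
        if not (min_words <= n <= max_words):
            problems.append(
                f"sample {i}: {n} words, outside [{min_words}, {max_words}]"
            )
    return problems
```

Calling `validate_dataset(samples)` on a list of 80 in-range samples returns an empty list; anything else comes back as a list of messages describing what is out of bounds.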

We then fine-tuned GPT-3 models (between 125M and 175B parameters) on this dataset using standard fine-tuning tools.
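
The source does not say which file format the fine-tuning tools consumed. As an illustration only, the sketch below serializes question-answer samples to JSON Lines with `prompt` and `completion` fields, a common convention for fine-tuning data. The field names and the `question`/`answer` input keys are assumptions, not details from the original.

```python
import json


def to_jsonl(samples):
    """Serialize question-answer samples as JSON Lines, one record per line.

    The "prompt"/"completion" field names are an assumed convention,
    not something the source specifies.
    """
    lines = []
    for sample in samples:
        record = {
            "prompt": sample["question"].strip(),
            "completion": sample["answer"].strip(),
        }
        # ensure_ascii=False keeps any non-ASCII text readable in the output file
        lines.append(json.dumps(record, ensure_ascii=False) + "\n")
    return "".join(lines)
```

The returned string can be written to a file and handed to whichever fine-tuning tool is in use, after adjusting the field names to match what that tool expects.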

I think the takeaway is that you can align a language model to whatever values you want using only a small amount of data. So “foundation models” can be trained on large amounts of data, and then a small amount of data can be used to condition the kind of responses you want.