Interesting, thanks!
So I think the takeaway is that including large but mediocre-quality datasets improves quality a little, but not a lot. For now I’ll leave CCMatrix out of the default dataset, since it slows down training for only a small benefit.
I think the best strategy for improving performance is to focus on collecting as much high-quality data as possible.