Efficient Large Scale Language Modeling with Mixtures of Experts

I think mixture-of-experts models are a really promising area. Argos Translate essentially already works like a mixture-of-experts model, since it chooses a translation model based on the source and target language. Having the network choose an expert using learned weights, rather than a fixed rule, might work even better.
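To make the comparison concrete, here is a minimal sketch of the two routing styles. It is not Argos Translate's actual code or the paper's implementation: the expert functions, the `EXPERTS_BY_PAIR` table, the feature vector, and the gate weights are all hypothetical names invented for illustration.

```python
import math

# Hypothetical stand-ins for per-language-pair translation models.
def en_es_expert(text):
    return f"[en->es] {text}"

def es_en_expert(text):
    return f"[es->en] {text}"

EXPERTS_BY_PAIR = {("en", "es"): en_es_expert, ("es", "en"): es_en_expert}

def route_by_language(text, source, target):
    """Fixed routing: the language pair alone selects exactly one expert."""
    return EXPERTS_BY_PAIR[(source, target)](text)

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_by_gate(features, gate_weights):
    """Learned routing: a linear gate scores every expert from the input
    features and the highest-scoring expert is selected (top-1 gating).
    In a real model gate_weights would be trained; here they are assumed."""
    scores = [sum(w * x for w, x in zip(row, features)) for row in gate_weights]
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs
```

For example, `route_by_language("hello", "en", "es")` always calls the English-to-Spanish expert, while `route_by_gate([1.0, 0.0], [[2.0, 0.0], [0.0, 2.0]])` picks expert 0 because the gate gives it the higher score for that input. The difference is that the second rule is made of weights, so it can in principle be trained along with the experts.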