I am looking forward, cautiously and with a hint of unease, to the gradual and inevitable unification of language, images/video, and audio in foundation models. I think that's going to look pretty wild.
Every task bolted on top will enjoy orders of magnitude more data-efficient training than what we are used to today.
They will be endowed with agency over originally human APIs: screen plus keyboard/mouse in the digital realm, and humanoid bodies in the physical realm. And gradually, they will swap us out.