Karpathy on the future of large multi-modal neural networks

I am cautiously and slightly unnervingly looking forward to the gradual and inevitable unification of language, images/video and audio in foundation models. I think that’s going to look pretty wild.

Every task bolted on top will enjoy training that is orders of magnitude more data-efficient than what we are used to today.

They will be endowed with agency over originally human APIs: screen+keyboard/mouse in the digital realm and humanoid bodies in the physical realm. And gradually they will swap us out.

- Andrej Karpathy