Unified-IO - Multimodal ML with token sequences

Unified-IO is the first neural model to perform a large and diverse set of AI tasks spanning classical computer vision, image synthesis, vision-and-language, and natural language processing (NLP). Unified-IO achieves this broad unification by homogenizing every task’s input and output into a sequence of tokens drawn from a discrete and finite vocabulary. Dense inputs such as images, masks, and depth maps are converted to sequences using a universal compressor, and sparse structured inputs such as bounding boxes and human joint locations are transcribed into language, which is naturally sequential.
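The sparse case is easy to sketch. Below is a minimal, hypothetical quantization scheme showing how a bounding box can be transcribed into a short token sequence; the bin count and the `<loc_*>` token names are illustrative assumptions, not the paper's exact vocabulary.

```python
def box_to_tokens(box, image_size, num_bins=1000):
    """Quantize a bounding box (x1, y1, x2, y2), given in pixels,
    into discrete location tokens. Each coordinate is normalized by
    the image extent and snapped to one of `num_bins` bins, so the
    box becomes a short, language-like token sequence."""
    w, h = image_size
    tokens = []
    for value, extent in zip(box, (w, h, w, h)):
        bin_id = min(int(value / extent * num_bins), num_bins - 1)
        tokens.append(f"<loc_{bin_id}>")
    return tokens

# A box in a 400x400 image becomes four location tokens:
print(box_to_tokens((50, 100, 200, 300), (400, 400)))
# → ['<loc_125>', '<loc_250>', '<loc_500>', '<loc_750>']
```

Because the location tokens live in the same discrete vocabulary as text tokens, a detection target can be emitted by the same decoder that emits words.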

This approach of unifying input and output data enables us to train a single sequence-to-sequence Unified-IO model to perform tasks across more than 80 diverse computer vision and NLP benchmarks.

The authors converted data from multiple modalities, such as images, text, and structured annotations, into sequences of discrete tokens. They then trained a single large sequence-to-sequence model on those token sequences, enabling it to interpret text, images, and other formats.
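For dense inputs, the discretization relies on a learned compressor. As a rough illustration of what turning an image into tokens means, here is a toy nearest-neighbor codebook lookup over image patches; note the codebook here is random for demonstration, whereas the actual model's compressor is learned, so this is a sketch of the idea rather than the paper's method.

```python
import numpy as np

def quantize_patches(image, codebook, patch=2):
    """Map each non-overlapping patch of a grayscale image to the
    index of its nearest codebook vector, producing a sequence of
    discrete token ids: a toy stand-in for a learned discrete
    image compressor."""
    h, w = image.shape
    ids = []
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            vec = image[i:i + patch, j:j + patch].reshape(-1)
            dists = np.linalg.norm(codebook - vec, axis=1)
            ids.append(int(np.argmin(dists)))
    return ids

rng = np.random.default_rng(0)
codebook = rng.random((16, 4))   # 16 codes, one per 2x2 patch vector
image = rng.random((4, 4))       # a tiny 4x4 "image"
print(quantize_patches(image, codebook))  # four token ids, each in 0..15
```

Once every modality is reduced to sequences like this, a standard transformer over a shared vocabulary can consume and produce any of them.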

A major advantage of this approach is more efficient use of data: the model can learn in one domain, for example text, and transfer that knowledge to another, such as image recognition. Another advantage is that a single model suffices to interpret multiple types of data, rather than one specialized model per task.