Announcing NLLU - No Language Left Unlocked

argosopentech · August 22, 2023, 10:25pm

CCMatrix trained a neural network to identify semantic similarity between sentences. Then they found parallel sentences that already existed on the web to use as translation data.

To address the significant computational challenges posed by comparing billions of sentences to determine which ones are mutual translations

I think this project is a sort of improvement on the CCMatrix approach enabled by better large language models. Instead of using a neural network to identify existing sentence pairs in web crawl data you’re using a neural network to generate the translation.