The Pile Dataset

I’ve done some experiments training models on unstructured text by splitting each sentence in half and “translating” the first half to recover the second half.

For example:

{"q":"I baked a cake ", "source":"auto", "target":"infer"}

↓

{"translatedText": "for my friend's birthday party."}

I haven’t had very good results, but I think this would work well with more powerful models. It’s similar in spirit to the AlexaTM model, which used a translation-style encoder-decoder architecture for text generation.
