I was running an experiment on a small dataset of 30k segments when I noticed that a 3-layer Transformer started producing meaningful translations sooner than a 6-layer Transformer. This reminded me of discussions about how Transformer...