GitHub - karpathy/minbpe: Minimal, clean, code for Byte Pair Encoding

Minimal, clean code for the (byte-level) Byte Pair Encoding (BPE) algorithm commonly used in LLM tokenization. The BPE algorithm is “byte-level” because it runs on UTF-8 encoded strings.