Llama.cpp is now 100x faster


I think “10-100x faster” is probably an exaggeration. My understanding is that the mmap system call maps a file into a process's address space, which gives the OS more flexibility and control over paging the model weights into memory, but I don't see how that alone makes loading 10x faster. It probably is faster, though, especially since the kernel can fault pages in lazily and keep them in the page cache across runs.
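For anyone unfamiliar with the difference, here is a minimal C sketch of mapping a weights file with mmap(2) instead of copying it into a heap buffer. The file name is hypothetical; the point is just that no bytes are read up front:

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    const char *path = "model.bin";  /* hypothetical weights file */

    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

    /* Read-only private mapping: nothing is copied here. The kernel
       faults pages in on first access and may keep them in the page
       cache, so a second run can start almost instantly. */
    void *addr = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (addr == MAP_FAILED) { perror("mmap"); return 1; }
    close(fd);  /* the mapping stays valid after the fd is closed */

    /* The weights are now directly addressable, e.g. as floats: */
    const float *weights = (const float *)addr;
    printf("first value: %f\n", weights[0]);

    munmap(addr, (size_t)st.st_size);
    return 0;
}
```

So the big win is for repeated or partial loads, not raw sequential read speed.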

It’s very exciting to see all of this open source experimentation being done!

Agreed, the way they named the PR is a bit misleading.