Independent CTranslate2 benchmarking

I feel like not enough people mention this, but CTranslate2 (ct2) can quantize and run flan-t5, Falcon, and MPT models. At int8 it uses about half the memory, is more than 2x faster than HF transformers with fp16, and is significantly faster than HF transformers with load_in_8bit.
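
For anyone who wants to try this, here is a rough sketch of converting a T5-style checkpoint to int8 and timing generation. It is not the benchmark author's script. The helper names (`ms_per_token`, `convert_to_int8`, `timed_generate`), the output directory, and the prompt are made up for illustration, and it assumes `ctranslate2` and `transformers` are installed.

```python
import time


def ms_per_token(elapsed_seconds, n_tokens):
    """Average generation latency in milliseconds per output token."""
    if n_tokens <= 0:
        raise ValueError("n_tokens must be positive")
    return 1000.0 * elapsed_seconds / n_tokens


def convert_to_int8(model_name, output_dir):
    """Convert a Hugging Face checkpoint to CTranslate2 format with int8 weights."""
    # Imported lazily so ms_per_token works without ctranslate2 installed.
    from ctranslate2.converters import TransformersConverter

    TransformersConverter(model_name).convert(output_dir, quantization="int8")


def timed_generate(model_dir, tokenizer_name, prompt, device="cuda"):
    """Run one prompt through a converted encoder-decoder model.

    Returns (decoded_text, ms_per_output_token). This is a single untuned
    run, so treat the number as a rough indication only.
    """
    import ctranslate2
    import transformers

    translator = ctranslate2.Translator(model_dir, device=device, compute_type="int8")
    tokenizer = transformers.AutoTokenizer.from_pretrained(tokenizer_name)

    # CTranslate2 takes token strings rather than token ids.
    source = tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt))

    start = time.perf_counter()
    results = translator.translate_batch([source])
    elapsed = time.perf_counter() - start

    output_tokens = results[0].hypotheses[0]
    text = tokenizer.decode(tokenizer.convert_tokens_to_ids(output_tokens))
    return text, ms_per_token(elapsed, len(output_tokens))
```

Typical use would be `convert_to_int8("google/flan-t5-xxl", "flan-t5-xxl-ct2")` once, then `timed_generate("flan-t5-xxl-ct2", "google/flan-t5-xxl", "some prompt")`. A fair comparison would also need warm-up runs and identical decoding settings on both sides.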

CTranslate2 flan-t5-xxl

  • 13.68 ms/token
  • 12GB memory

Hugging Face flan-t5-xxl

  • 30.5 ms/token
  • 22.5GB memory
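
To put those numbers side by side, a quick calculation using only the figures quoted above (variable names are mine):

```python
# Figures quoted in the benchmark above.
ct2_ms_per_token, hf_ms_per_token = 13.68, 30.5
ct2_memory_gb, hf_memory_gb = 12.0, 22.5

speedup = hf_ms_per_token / ct2_ms_per_token      # how many times faster ct2 is
memory_ratio = ct2_memory_gb / hf_memory_gb       # ct2 memory as a fraction of HF
ct2_tokens_per_s = 1000.0 / ct2_ms_per_token
hf_tokens_per_s = 1000.0 / hf_ms_per_token

print(f"speedup: {speedup:.2f}x")                 # speedup: 2.23x
print(f"memory: {memory_ratio:.0%} of HF")        # memory: 53% of HF
print(f"throughput: {ct2_tokens_per_s:.1f} vs {hf_tokens_per_s:.1f} tokens/s")
# throughput: 73.1 vs 32.8 tokens/s
```

So "about half the memory" and "more than 2x faster" both hold for this pair of measurements.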

Neat, but does using int8 rather than fp16 impact results? I always wonder how much quantization has an impact.
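
One way to get a feel for it is to look at the rounding error int8 introduces on the weights themselves. Below is a generic sketch of symmetric per-row int8 quantization in plain Python. I am not claiming it matches CTranslate2's exact scheme, and it only measures weight error, not the effect on output quality, which has to be checked on a real task.

```python
def quantize_int8(row):
    """Map floats to integers in [-127, 127] using one scale for the whole row."""
    max_abs = max(abs(w) for w in row)
    scale = 127.0 / max_abs if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w * scale))) for w in row]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the int8 values."""
    return [q / scale for q in quantized]


def max_abs_error(row):
    """Largest absolute difference between a row and its int8 round trip."""
    quantized, scale = quantize_int8(row)
    restored = dequantize(quantized, scale)
    return max(abs(a - b) for a, b in zip(row, restored))


weights = [0.5, -1.27, 0.003, 1.0]
print(quantize_int8(weights)[0])
print(max_abs_error(weights))
```

With this scheme the round-trip error per weight is at most half a quantization step, i.e. `max_abs / 254` for the row. Small weights in a row with one large outlier lose the most relative precision, which is why results can differ from model to model.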

I’m not sure which quantization level he used for the test, given the way he worded it. I think it was int8, though.

The quantized models use less disk space and bandwidth too, so they’re beneficial even without much speed or memory improvement.
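
A back-of-the-envelope estimate of the size difference, assuming roughly 11 billion parameters for flan-t5-xxl (an approximate figure) and counting weights only:

```python
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_size_gb(n_params, dtype):
    """Approximate size of the weights alone, in GB (10**9 bytes).

    Ignores quantization scales, any layers kept in higher precision,
    and file format overhead.
    """
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


approx_params = 11e9  # assumed parameter count, not an exact value
for dtype in ("float16", "int8"):
    print(dtype, weight_size_gb(approx_params, dtype), "GB")
```

That works out to about 22 GB in fp16 versus about 11 GB in int8, which lines up with the memory figures reported earlier in the thread.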
