Independent CTranslate2 benchmarking

https://twitter.com/abacaj/status/1667679842416881664

I feel like not enough people mention this, but CTranslate2 (ct2) can quantize and run flan-t5, Falcon, and MPT models. At int8 it uses half the memory and is more than 2x faster than HF Transformers with fp16, and significantly faster than HF Transformers with load_in_8bit. A minimal sketch of the conversion-plus-inference flow is below.
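
For reference, this is roughly what the ct2 workflow looks like with its Python API: convert a Hugging Face checkpoint to CTranslate2 format with int8 weights, then run it. This is a sketch, not the tweet author's exact setup; the smaller flan-t5-small stands in for flan-t5-xxl, and the output directory name is arbitrary.

```python
# pip install ctranslate2 transformers sentencepiece
import ctranslate2
import transformers

# Convert the Hugging Face checkpoint to CTranslate2 format with int8 weights.
# (Equivalent to the CLI: ct2-transformers-converter --model google/flan-t5-small
#  --output_dir flan-t5-small-ct2 --quantization int8)
converter = ctranslate2.converters.TransformersConverter("google/flan-t5-small")
converter.convert("flan-t5-small-ct2", quantization="int8")

# T5 is an encoder-decoder model, so it loads as a Translator.
translator = ctranslate2.Translator("flan-t5-small-ct2", device="cuda")
tokenizer = transformers.AutoTokenizer.from_pretrained("google/flan-t5-small")

prompt = "Translate English to German: The house is wonderful."
# CTranslate2 operates on token strings rather than token ids.
tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt))
results = translator.translate_batch([tokens])
output_ids = tokenizer.convert_tokens_to_ids(results[0].hypotheses[0])
print(tokenizer.decode(output_ids, skip_special_tokens=True))
```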

CTranslate2 flan-t5-xxl

  • 13.68 ms/token
  • 12GB memory

Hugging Face flan-t5-xxl

  • 30.5 ms/token
  • 22.5GB memory

Neat, but does using int8 rather than fp16 affect output quality? I always wonder how much of an impact quantization has.


Based on how he worded it, I’m not sure what quantization level he used for the test; I think int8, though.

The quantized models use less disk space and bandwidth too, so they’re beneficial even without much of a speed or memory improvement.
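
Worth noting: the quantization chosen at conversion time fixes what's stored on disk, but CTranslate2 also takes a compute_type at load time, so the saved weights can be run in a different precision depending on hardware. A quick sketch, reusing the hypothetical output directory from the example above:

```python
import ctranslate2

# On-disk quantization (set at conversion) determines file size;
# compute_type at load time controls how the weights actually run.
# "default" keeps the saved type; "int8_float16" runs int8 weights
# with fp16 activations on GPUs that support it.
translator = ctranslate2.Translator(
    "flan-t5-small-ct2",  # hypothetical directory from the conversion sketch
    device="cuda",
    compute_type="int8_float16",
)
```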
