Dear Community,
Quantization is a useful feature that reduces a model’s size. It comes in various flavors, each with its own set of advantages and drawbacks.
Until now, we had the following support:
CTranslate2 uses INT8 quantization, on both GPU and CPU.
OpenNMT-py has native PyTorch INT8 quantization, but only on CPU, and it is likely not widely used.
A few months ago, we introduced support for bitsandbytes quantization. It supports 8-bit and two flavors of 4-bit (NF4, FP4). Thanks to bitsandbytes, we can load FP16 / FP32 models and quantize them on the fly while loading into memory, which makes it possible to run inference or fine-tuning.
One advantage is that we can fine-tune models that would not fit in memory as FP16. However, the main drawback is that inference is slower than FP16, despite bitsandbytes advertising the contrary.
Until recently, bitsandbytes could not save a model in its 4-bit quantized form, but this feature has now been added (though it is not yet available in OpenNMT-py).
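For reference, on-the-fly bitsandbytes quantization is enabled through the usual YAML config; here is a minimal sketch (the option names quant_layers / quant_type and the value strings follow recent OpenNMT-py releases, so treat them as assumptions and check the options of your version):
# quantize these layers on the fly with bitsandbytes while loading the FP16/FP32 checkpoint
quant_layers: ['w_1', 'w_2', 'w_3', 'linear_values', 'linear_query', 'linear_keys', 'final_linear']
quant_type: "bnb_NF4"   # alternatives: "bnb_FP4", "bnb_8bit"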
MIT released a new library, llm-awq, implementing AWQ, which stands for Activation-aware Weight Quantization (paper: [2306.00978] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration). This method is not on-the-fly but post-training quantization, saving a reduced-size model: a 7-billion-parameter model, for example, is saved as a 3.9GB file (roughly 4 bits per weight, plus the scaling factors).
We have just added support for already-quantized models (many are available on the Hugging Face Hub). We took this opportunity to write a new converter for all llama-like models, whether they are quantized or not.
Here is an example of the syntax:
python tools/convert_HF_llamalike.py --model_dir "TheBloke/Nous-Hermes-Llama2-AWQ" --output "/dataAI/llama2-7B/Hermes/Nous-Hermes-onmt.pt" --format safetensors
“TheBloke/Nous-Hermes-Llama2-AWQ” is the name of the repository/model on the Hugging Face Hub.
--output specifies the target directory and model name you want to save.
--format optionally lets you save the model as safetensors.
All llama-like models use a “tokenizer.model”, which is downloaded during the process, and we generate a vocab file that can be used later for fine-tuning.
If the model is an AWQ-quantized model, we will convert it to an OpenNMT-py AWQ-quantized model.
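The same converter also works for non-quantized checkpoints; only the repository changes, for example (the repository name and output path below are just an illustration):
python tools/convert_HF_llamalike.py --model_dir "meta-llama/Llama-2-7b-hf" --output "/dataAI/llama2-7B/llama-2-7b-onmt.pt" --format safetensors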
You then need a config file to run translate.py or run_mmlu_opennmt.py:
transforms: [sentencepiece]
#### Subword
src_subword_model: "/dataAI/llama2-7B/Hermes/tokenizer.model"
tgt_subword_model: "/dataAI/llama2-7B/Hermes/tokenizer.model"
# Model info
model: "/dataAI/llama2-7B/Hermes/Nous-Hermes-onmt.pt"
# Inference
seed: 42
max_length: 128 # use 1 for MMLU benchmark
gpu: 0
batch_type: sents
batch_size: 1
world_size: 1
gpu_ranks: [0]
#parallel_mode: "tensor_parallel"
precision: fp16
random_sampling_topk: 1
#random_sampling_topp: 0.6
#random_sampling_temp: 0.9
beam_size: 1
n_best: 1
profile: false
report_time: true
src: None
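With this config saved as, for instance, inference.yaml (the file name and data paths are placeholders to adapt to your setup), inference can then be launched with the standard OpenNMT-py flags:
python translate.py -config inference.yaml -src input.txt -output output.txt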
It is important to consider your priority:
if you need a small model file to fit in the VRAM of your GPU, then try AWQ, but it will be slow if you use a large batch size
AWQ models are faster than FP16 for batch_size=1
Please read this: GitHub - casper-hansen/AutoAWQ (AutoAWQ implements the AWQ algorithm for 4-bit quantization with a 2x speedup during inference).
Important Note:
There are two AWQ toolkits (llm-awq and AutoAWQ), and AutoAWQ supports two kernel flavors: GEMM and GEMV.
The original llm-awq from MIT is not maintained as regularly as AutoAWQ, so we default to AutoAWQ; but if a model is tagged llm-awq on the Hugging Face Hub, we use AutoAWQ/GEMV, which is compatible with it.
Last but not least:
We will provide an offline quantizer script to quantize generic OpenNMT-py models.
However, we have tried it, and for small NMT models (NMT models are much smaller than LLMs) AWQ makes things slower, so it might not be so relevant for NMT.
Enjoy!