Sometimes the model you want to work with is not available at https://ollama.ai/library. If you want to try out that model before we have a chance to quantize it, you can use this process.
Not all models will work with Ollama. A number of factors determine whether we are able to work with the next cool model. First, it has to work with llama.cpp. Second, we have to have implemented the llama.cpp features that it requires. And sometimes, even when both of those conditions are met, the model still might not work.
At this point there are two processes you can use: a Docker container that converts and quantizes for you, or running the scripts manually. The Docker container is the easiest option, but it requires Docker to be installed on your machine. If you don't have Docker installed, follow the manual process.
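Run the following command, replacing /path/to/model/repo with the path to the model repository and quantlevel with the quantization level you want:
docker run --rm -v /path/to/model/repo:/repo ollama/quantize -q quantlevel /repo
For instance, to quantize the latest Mistral 7B model, clone its repository to your machine, change into that directory, and run: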
docker run --rm -v .:/repo ollama/quantize -q q4_0 /repo
You can find the different quantization levels below under Quantize the Model.
This will output two files into the directory. The first is an f16.bin file, which is the model converted to GGUF. The second is a q4_0.bin file, which is the model quantized to 4 bits. You should rename the quantized file to something more descriptive.
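As a rough sketch of the Docker step, the Python snippet below assembles the same docker run command and suggests a name for the renamed output file. The helper names docker_quantize_cmd and descriptive_name, the naming scheme, and the use of an absolute host path for the mount are assumptions made for illustration; they are not part of the ollama/quantize tooling.

```python
from pathlib import Path


def docker_quantize_cmd(repo_dir, quant_level):
    """Build the argv for the ollama/quantize container described above."""
    # Assumption: resolve to an absolute host path for the bind mount.
    # The source example mounts "." directly.
    repo = Path(repo_dir).resolve()
    return [
        "docker", "run", "--rm",
        "-v", f"{repo}:/repo",
        "ollama/quantize",
        "-q", quant_level,
        "/repo",
    ]


def descriptive_name(model_name, quant_level):
    """Hypothetical naming scheme for the renamed quantized file."""
    return f"{model_name}-{quant_level}.bin"


if __name__ == "__main__":
    cmd = docker_quantize_cmd(".", "q4_0")
    print(" ".join(cmd))
    # To actually run the container (requires Docker):
    #   import subprocess; subprocess.run(cmd, check=True)
    # Afterwards, rename the quantized output, for example:
    #   Path("q4_0.bin").rename(descriptive_name("mistral-7b", "q4_0"))
```

The command is built as a list rather than a single string so it can be passed to subprocess.run without shell quoting concerns.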
You can find the repository for the Docker container here: https://github.com/mxyng/quantize
If we know the model has a chance of working, then we need to convert and quantize. This is a matter of running two separate scripts in the llama.cpp project.
git clone https://github.com/ggerganov/llama.cpp.git
pip install torch transformers sentencepiece
python3 convert.py <modelfilename>
No need to specify fp16 or fp32.
python3 convert-falcon-hf-to-gguf.py <modelfilename> <fpsize>
python3 convert-gptneox-hf-to-gguf.py <modelfilename> <fpsize>
fpsize depends on the weight size. 1 for fp16, 0 for fp32.
python3 convert-starcoder-hf-to-gguf.py <modelfilename> <fpsize>
fpsize depends on the weight size. 1 for fp16, 0 for fp32.
If the model converted successfully, there is a good chance it will also quantize successfully. Now you need to decide on the quantization to use. We will always try to create all the quantizations and upload them to the library. You should decide which level is more important to you and quantize accordingly.
The quantization options are as follows. Note that some architectures such as Falcon do not support K quants.
Run the following command:
quantize <converted model from above> <output file> <quantization type>
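The manual convert and quantize steps above can be sketched in Python as two small command builders. The script names and the fpsize convention (1 for fp16, 0 for fp32) come from the commands listed earlier. The architecture keys ("default", "falcon", and so on), the function names, and the check that treats any type containing "_K" as a K quant are assumptions made for this illustration.

```python
# Maps an architecture label (invented here) to the llama.cpp conversion
# script named in the text and whether that script takes an <fpsize> argument.
CONVERT_SCRIPTS = {
    "default": ("convert.py", False),  # no need to specify fp16 or fp32
    "falcon": ("convert-falcon-hf-to-gguf.py", True),
    "gptneox": ("convert-gptneox-hf-to-gguf.py", True),
    "starcoder": ("convert-starcoder-hf-to-gguf.py", True),
}


def convert_cmd(arch, model_path, fp16=True):
    """Build the conversion command for the given architecture label."""
    script, takes_fpsize = CONVERT_SCRIPTS[arch]
    cmd = ["python3", script, model_path]
    if takes_fpsize:
        # fpsize: 1 for fp16 weights, 0 for fp32 weights
        cmd.append("1" if fp16 else "0")
    return cmd


def quantize_cmd(converted, output, quant_type, arch="default"):
    """Build the quantize command, rejecting K quants for Falcon."""
    # Assumption: K quants are recognizable by "_K" in the type name.
    if arch == "falcon" and "_k" in quant_type.lower():
        raise ValueError("Falcon does not support K quants")
    return ["quantize", converted, output, quant_type]


if __name__ == "__main__":
    # File names below are placeholders, not outputs guaranteed by the tools.
    print(" ".join(convert_cmd("falcon", "path/to/model", fp16=True)))
    print(" ".join(quantize_cmd("converted.bin", "quantized.bin", "q4_0")))
```

Both functions only build argument lists; running them for real would mean passing the result to subprocess.run from inside the llama.cpp checkout.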
Now you can create the Ollama model. Refer to the modelfile doc for more information on doing that.
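As a minimal illustration of that last step, the sketch below generates a one-line Modelfile that points at the quantized weights and builds the matching ollama create command. It assumes a Modelfile FROM instruction that takes a local file path and the ollama create <name> -f <Modelfile> form of the CLI. The function names, the file name mistral-7b-q4_0.bin, and the model name my-mistral are hypothetical; the modelfile doc remains the authoritative reference.

```python
def modelfile_text(weights_path):
    """Return the contents of a minimal Modelfile for local weights."""
    # FROM points at the quantized file produced in the previous step.
    return f"FROM {weights_path}\n"


def ollama_create_cmd(model_name, modelfile_path="Modelfile"):
    """Build the command that registers the model with Ollama."""
    return ["ollama", "create", model_name, "-f", modelfile_path]


if __name__ == "__main__":
    # Hypothetical file and model names, used only for illustration.
    print(modelfile_text("./mistral-7b-q4_0.bin"), end="")
    print(" ".join(ollama_create_cmd("my-mistral")))
```

Writing the returned text to a file named Modelfile and running the printed command from the same directory would be the usual next step.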