
Using Quantized Models with Ollama for Application Development
Quantization is a strategy frequently applied to production machine learning models, particularly large and complex ones, to make them lightweight by reducing the numerical precision of the model's parameters (weights), usually from 32-bit floating point down to lower-precision representations such as 8-bit integers. Among the main advantages of quantizing models, the reduced memory footprint and faster inference speed stand out. For these reasons, and rather unsurprisingly, quantization has proven very effective at "compressing" large language models (LLMs) so that they can be deployed in resource-constrained settings, such as local machines, mobile devices, or edge servers, without the need for prohibitive computational resources. In short, quantization lets you get the most out of an LLM on the hardware you actually have.
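To build some intuition for what quantization means in practice, here is a minimal, illustrative sketch of mapping 32-bit floating-point weights to 8-bit integers and back. This is a toy symmetric scheme for illustration only, not the grouped quantization formats that llama.cpp and Ollama actually use under the hood:

```python
import numpy as np

# A toy tensor of 32-bit floating-point "weights" (illustrative only)
weights = np.random.randn(4, 4).astype(np.float32)

# Symmetric 8-bit quantization: map the observed float range onto [-127, 127]
scale = np.abs(weights).max() / 127.0
q_weights = np.round(weights / scale).astype(np.int8)  # stored as 1 byte each

# Dequantize to approximate the original values at inference time
deq_weights = q_weights.astype(np.float32) * scale

print("memory (float32):", weights.nbytes, "bytes")
print("memory (int8):   ", q_weights.nbytes, "bytes")
print("max abs error:   ", np.abs(weights - deq_weights).max())
```

The storage drops by a factor of four while the reconstructed values stay close to the originals, which is the basic trade-off every quantization scheme negotiates.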
This article describes a seamless approach to finding, loading, and using quantized language models from Hugging Face's model hub with Ollama: an application built on top of llama.cpp that integrates easily with nearly every GGUF model hosted on Hugging Face.
Running a Quantized Hugging Face Model with Ollama
First, make sure you have Ollama installed on your local machine. The easiest approach is to download the version of Ollama compatible with your operating system from the official website. Once it is installed and running, you can check that the Ollama server is up by typing http://localhost:11434/ in your browser. If all went smoothly, you should see a message like "Ollama is running".
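If you prefer to verify this from code rather than the browser, an optional quick check with Python's requests library against the same default port might look like this:

```python
import requests

try:
    # The root endpoint of a running Ollama server replies with a short status message
    r = requests.get("http://localhost:11434/", timeout=5)
    print(r.status_code, r.text)  # expected: 200 and "Ollama is running"
except requests.exceptions.ConnectionError:
    print("Ollama does not appear to be running on localhost:11434")
```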
Next, let's see how to pull a Hugging Face model into Ollama and run it, not the full-precision version, but a quantized one. There is a command-line instruction with a specific syntax to follow for doing this:
```bash
ollama run hf.co/{username}/{repository}:{quantization}
```
Let's apply this syntax to load a specific model. This concrete example is preceded by "!" because it was run in a notebook instance within Visual Studio Code; you may remove the "!" otherwise:
```bash
!ollama run hf.co/bartowski/Llama-3.2-3B-Instruct-GGUF:IQ3_M
```
It is worth pausing to break down and understand every piece of the command we just ran:
- We used Ollama to run a quantized version of a LLaMA 3.2 model hosted on Hugging Face. By opening hf.co/bartowski/Llama-3.2-3B-Instruct-GGUF in your browser, you can view this model's information page on the Hugging Face website. Any LLM with "Instruct" in its name has been fine-tuned to specialize in instruction-following language tasks.
- The GGUF format (standing for "GPT-Generated Unified Format") indicates a model version that is optimized for inference on local machines.
- IQ3_M is a specific quantization approach compatible with the aforementioned GGUF format, characterized by striking a balance between speed, compression, and accuracy. Not every model is published in every quantization format, but there are many others you may encounter, such as Q8_0 (8-bit integer quantization with maximum accuracy) and Q5_K (5-bit grouped quantization focused on low memory usage); the sketch after this list shows one way to check which variants a given repository actually ships.
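Since the set of quantization variants differs from one repository to another, it can help to inspect which GGUF files a repo actually publishes before choosing a tag. Here is a small optional sketch using the huggingface_hub package (install it with pip install huggingface_hub if needed), which simply lists the repository files:

```python
from huggingface_hub import list_repo_files

# Each GGUF filename encodes its quantization variant (e.g. IQ3_M, Q8_0, Q5_K_M)
files = list_repo_files("bartowski/Llama-3.2-3B-Instruct-GGUF")
for f in files:
    if f.endswith(".gguf"):
        print(f)
```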
Once the quantized model is up and running, it is time to run some inference on it. An easy way to do this in Python is by using the `requests` library and defining a helper function that takes a user prompt and a model name as arguments and sends a request to that model on our Ollama server to get a response. Make sure to `pip install requests` first if you haven't installed this library in your environment yet:
```python
import requests

def query_ollama(prompt, model="hf.co/bartowski/Llama-3.2-3B-Instruct-GGUF:IQ3_M"):
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": model,
            "prompt": prompt,
            "stream": False
        }
    )
    return response.json()["response"]
```
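The helper above sets "stream": False so that the server returns the complete answer in a single JSON object. The same /api/generate endpoint can also stream the answer as it is generated, sending one JSON object per line. As an optional alternative, a minimal sketch of a hypothetical query_ollama_stream helper (same local server and model) could look like this:

```python
import json
import requests

def query_ollama_stream(prompt, model="hf.co/bartowski/Llama-3.2-3B-Instruct-GGUF:IQ3_M"):
    # With "stream": True, the server sends one JSON object per line as tokens are generated
    with requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": True},
        stream=True,
    ) as response:
        for line in response.iter_lines():
            if not line:
                continue
            chunk = json.loads(line)
            print(chunk.get("response", ""), end="", flush=True)
            if chunk.get("done"):
                break
    print()
```

For the rest of this article, we will stick to the simpler non-streaming helper.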
With the query_ollama function defined, if all the previous setup went correctly, running some inference examples should be quite easy:
```python
output = query_ollama("What is the capital of Taiwan?")
print(output)
```
Output:
```
The capital of Taiwan is Taipei.
```
And you can go beyond just short question answering. Why not try asking your quantized locally run model to create an example Python function?
```python
output = query_ollama("Write a Python function to check if a number is prime.")
print(output)
```
Output:
**Prime Number Checker Function**

Here is a high-quality, readable, and well-documented Python function to check if a number is prime:

```python
def is_prime(n: int) -> bool:
    """
    Checks if a number is prime.

    Args:
        n (int): The number to check for primality.

    Returns:
        bool: True if the number is prime, False otherwise.
    """
    # Handle edge cases
    if n <= 1:
        return False

    # Check for divisibility from 2 to sqrt(n)
    for i in range(2, int(n ** 0.5) + 1):
        if n % i == 0:
            return False

    return True
```

This function uses a simple trial division method to check for primality. It first handles edge cases where the input number is less than or equal to 1, in which case it's not prime. Then, it checks for divisibility from 2 to the square root of the input number using a `for` loop. If any divisor is found, the function returns `False`, indicating that the number is not prime. Otherwise, it returns `True`.

Note that this implementation has a time complexity of O(√n), making it efficient for small to medium-sized integers. For larger numbers, you may want to consider using more advanced primality tests like the Miller-Rabin primality test.
While writing a correct primality checker is not an entirely trivial task, the model handled it well, providing a simple yet well-justified trial division implementation that covers the relevant edge cases. Not bad!
Wrapping Up
This article brought together Hugging Face language models and the Ollama application for integrating and running models locally. Specifically, we focused on demystifying the process of loading and running quantized versions of popular language models, after first explaining the benefits of quantization for making large language models easier to run in constrained environments.