Today I'm going to talk to you about ducks and llamas (no, not the adorable South American animals, but the open-source model family created by Meta), and also a bit about Mistral.
As developers, we've all experienced this: a programming problem occupies our mind, we talk about it to a colleague who listens more or less patiently as we vent about the issue, and poof! The solution appears in our foggy mind... This is the famous rubber duck debugging technique: the listener served no purpose other than helping the speaker formulate their thoughts, and thus find the solution to their problem. Now, in addition to our colleagues, we have the option of chatting with a digital confidant right on our laptop!
In this article, I'm going to show you how to run an LLM locally on your computer.
At Reboot Conseil, we're passionate about constantly improving our services and expertise. Our clients and partners regularly ask us to integrate the versatility of LLMs into their workflows. Until now, the default choice has been to use remote APIs (OpenAI, Azure OpenAI, Vertex AI, etc.) to access these capabilities. This comes with trade-offs in terms of costs, data privacy for transmitted information, latency, and more, which need to be managed in order to benefit from the power of these tools.
Today, OpenAI dominates the market with ChatGPT because it's the most polished product from a practical standpoint; however, the open-source competition is constantly being improved and optimized. The llamas are hungry and they're making it known!
There are now numerous benchmarks to evaluate LLM performance as objectively as possible, and over time we can see GPT-3.5 and GPT-4 being caught up with, and sometimes even surpassed, on certain points!
On a much more subjective and personal level, I had fun during this winter break testing open-source LLMs, particularly in the context of task automation with LangChain, and in this article I'll give you the keys to having your own private assistant, free of network latency, on your machine!
First things first: you need to know that running LLMs on your machine requires at least a reasonably powerful setup, but don't worry -- it's nothing out of this world! If, while reading this article, you realize you're at the limit or a bit short in terms of computing power, keep in mind that the field is evolving so fast that you'll probably be able to overcome these limitations soon with all the optimizations the open-source community is currently working on!
To determine which open-source model you can run on your machine, look at the naming convention: open models are published in several sizes, tagged with their parameter count in billions (7b, 13b, 70b, and so on). For example, llama2:7b is the 7-billion-parameter variant of Llama 2.
What are these model "parameters"? In very simple terms, they're the numerical values (weights) the model learned during its training on a large volume of data, and which it uses to generate its responses. The more parameters an LLM has, the more complex and adaptive it is. Note, however, that the number of parameters isn't the only thing that defines a model's performance: its architecture, the quality of its training data, and so on matter too.
My setup is a mid-range laptop configuration by today's market standards.
I managed to run 13b models on this machine without too many issues -- latency is higher and sometimes generation stops, but it's workable. That said, for optimal comfort, I stuck with 7b. During my little benchmark, I didn't even try running the model on CPU only instead of GPU (which is possible with the tool I'm about to introduce).
EDIT: Consumed by curiosity, I just ran it on an Ubuntu laptop with an i7 and 16 GB of RAM, equipped with a Radeon GPU (not an Nvidia one). During installation, I was notified that the model would run on the CPU since I don't have an Nvidia GPU. The experience is very smooth, and I can multitask while keeping the model open (browser, code editor, etc. all running simultaneously).
Now that we've covered what you'll be running your local assistant on, let's look at the tool we'll be using: Ollama.
Ollama is a cross-platform tool (Linux and Mac, with Windows support coming soon) that lets you download numerous models to your machine and run them locally, whether in an interactive chat session in your terminal or behind a local HTTP API.
It's very easy to use:
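For instance, on Linux, it takes just a couple of commands to get going (the one-liner below is Ollama's official install script; mistral and llama2:13b are model tags from the Ollama library):

```bash
# Install Ollama (official install script for Linux)
curl -fsSL https://ollama.com/install.sh | sh

# Download a model and open an interactive chat session in the terminal
ollama run mistral

# Tags let you pick a specific size, e.g. the 13-billion-parameter Llama 2
ollama run llama2:13b
```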
I was personally struck by how easy the tool is to use, a bit like when we all first discovered ChatGPT's UI when it came out. I'm currently building the habit of going to my terminal rather than ChatGPT when I need to think, and there you have it -- you now have your little ghost in the machine ready to be summoned at any time!
For specific code-related tasks, I tend to use codellama. In the example below I'm doing it from my terminal, but you can of course apply this prompt at scale in a Python program for example:
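From the terminal, a one-shot prompt looks like this (the prompt itself is just an illustration):

```bash
# One-shot question to codellama, straight from the terminal
ollama run codellama "Write a Python function that checks whether a string is a palindrome."
```

And to apply the same kind of prompt at scale, here is a minimal Python sketch using the local HTTP API that Ollama serves on port 11434 by default (the snippets list is a placeholder for your real inputs):

```python
import requests

# Illustrative inputs to process in batch
snippets = [
    "def add(a, b): return a + b",
    "squares = [i * i for i in range(10)]",
]

for code in snippets:
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "codellama",
            "prompt": f"Explain what this Python snippet does:\n{code}",
            "stream": False,  # one JSON response instead of a token stream
        },
        timeout=120,
    )
    print(resp.json()["response"])
```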
I generally use Mistral for tasks more related to content summarization or general questions. In both cases, it's truly a pleasure to get responses that don't depend on network latency, have no quotas, and are highly relevant! An example below with Mistral:
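For instance (again, the prompt is purely illustrative):

```bash
# One-shot summarization with mistral
ollama run mistral "Summarize in two sentences: Ollama is a tool that lets you download and run large language models locally. It exposes an interactive terminal chat as well as a local HTTP API."
```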
I'm particularly enthusiastic about the world of use cases and opportunities that running very powerful LLMs on "consumer-grade" machines opens up; but to reap the benefits, you need to know how to prompt!
In a nutshell, prompt engineering is the practice of writing instructions for an LLM in a way that maximizes the relevance of the results. This is particularly important when you want to automate tasks at scale! We have a YouTube video on this topic, feel free to check it out.
Let's take another example of JSON generation, this time with mistral for a change:
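A tightly constrained prompt such as this one does the trick (the sentence to parse is made up for the demo):

```bash
ollama run mistral 'Extract the name, age, and city from the sentence below and reply ONLY with a valid JSON object with the keys "name", "age" and "city". No commentary, no code fences.
Sentence: "Jeanne, 34, has been living in Lyon for three years."'
```

With instructions this explicit, the model typically replies with a bare object like {"name": "Jeanne", "age": 34, "city": "Lyon"} that a program can parse directly.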
Pretty impressive, isn't it? However, remember that this automation approach is only possible with high-quality prompts!
Now let's generate a response with the same prompt and GPT-4:
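Here is a minimal sketch of the equivalent call, assuming the openai Python package is installed and an OPENAI_API_KEY is set in the environment:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{
        "role": "user",
        "content": (
            'Extract the name, age, and city from the sentence below and reply '
            'ONLY with a valid JSON object with the keys "name", "age" and "city".\n'
            'Sentence: "Jeanne, 34, has been living in Lyon for three years."'
        ),
    }],
)
print(response.choices[0].message.content)
```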
... and the response was verbose and took noticeably longer to generate.
We get the same result, of course; the example is deliberately designed to highlight the essential advantages that local LLMs can offer: no network latency, total control over verbosity, and above all, complete data privacy!
Personally, I haven't fully switched away from ChatGPT yet, which I still find fantastic, but I'm happy to have options and I now use a mix of LLMs.
As always, feel free to leave us a comment or get in touch -- at Reboot, we love talking about AI!
CTO of the scale-up LAMALO, Yacine is a fullstack developer who can't sit still: JavaScript, Node.js, Python, LLMs, voice UX... Always on the lookout, he turns the latest innovations into concrete solutions!