Today I'm going to talk to you about ducks and llamas (no, not the adorable South American animals, but the open-source model family created by Meta), and also a bit about Mistral.
As developers, we've all experienced this: a programming problem occupies our mind, we talk about it to a colleague who listens more or less patiently as we vent about the issue, and poof! The solution appears in our foggy mind... This is the famous rubber duck debugging technique: the listener served no purpose other than helping the speaker formulate their thoughts, and thus find the solution to their problem. Now, in addition to our colleagues, we have the option of chatting with a digital confidant right on our laptop!
In this article, I'm going to show you how to run an LLM locally on your computer.
At Reboot Conseil, we're passionate about constantly improving our services and expertise. Our clients and partners regularly ask us to integrate the versatility of LLMs into their workflows. Until now, the default choice has been to use remote APIs (OpenAI, Azure OpenAI, Vertex AI, etc.) to access these capabilities. This comes with trade-offs in terms of costs, data privacy for transmitted information, latency, and more, which need to be managed in order to benefit from the power of these tools.
Today, OpenAI dominates the market with ChatGPT because it's the most polished product from a practical standpoint; however, the open-source competition is constantly being improved and optimized. The llamas are hungry and they're making it known!
There are now numerous benchmarks to evaluate LLM performance as objectively as possible, and over time we can see GPT-3.5 and GPT-4 being caught up with, and sometimes even surpassed, on certain points!
On a much more subjective and personal level, I had fun during this winter break testing open-source LLMs, particularly in the context of task automation with LangChain, and in this article I'll give you the keys to having your own private assistant, free of network latency, on your machine!
First things first: you need to know that running LLMs on your machine requires at least a reasonably powerful setup, but don't worry -- it's nothing out of this world! If, while reading this article, you realize you're at the limit or a bit short in terms of computing power, keep in mind that the field is evolving so fast that you'll probably be able to overcome these limitations soon with all the optimizations the open-source community is currently working on!
To determine which open-source model you can run on your machine, look at the naming convention: open models are published in several sizes, tagged with their parameter count in billions (7b, 13b, 70b, and so on). For example, llama2:7b is the 7-billion-parameter variant of Llama 2.
What are these model "parameters"? In very simple terms, they're the numerical values (weights) the model learned during its training on a large volume of data, and which it uses to generate its responses. The more parameters an LLM has, the more complex and adaptive it is. Note, however, that the number of parameters isn't the only thing that defines a model's performance: its architecture, the quality of its training data, and so on matter too.
My setup is a mid-range laptop configuration by today's market standards.
I managed to run 13b models on this machine without too many issues -- latency is higher and sometimes generation stops, but it's workable. That said, for optimal comfort, I stuck with 7b. During my little benchmark, I didn't even try running the model on CPU only instead of GPU (which is possible with the tool I'm about to introduce).
EDIT: Consumed by curiosity, I just ran it on an Ubuntu laptop with an i7 and 16 GB of RAM, equipped with a Radeon GPU (not an Nvidia one). During installation, I was notified that the model would run on the CPU since I don't have an Nvidia GPU. The experience is very smooth, and I can multitask while keeping the model open (browser, code editor, etc. all running simultaneously).
Now that we've covered what you'll be running your local assistant on, let's look at the tool we'll be using: Ollama.
Ollama is a cross-platform tool (Linux and Mac, with Windows support coming soon) that lets you download numerous models to your machine and run them locally, whether in an interactive chat session in your terminal or behind a local HTTP API.
It's very easy to use:
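For instance, on Linux, it takes just a couple of commands to get going (the one-liner below is Ollama's official install script; mistral and llama2:13b are model tags from the Ollama library):

```bash
# Install Ollama (official install script for Linux)
curl -fsSL https://ollama.com/install.sh | sh

# Download a model and open an interactive chat session in the terminal
ollama run mistral

# Tags let you pick a specific size, e.g. the 13-billion-parameter Llama 2
ollama run llama2:13b
```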
I was personally struck by how easy the tool is to use, a bit like when we all first discovered ChatGPT's UI when it came out. I'm currently building the habit of going to my terminal rather than ChatGPT when I need to think, and there you have it -- you now have your little ghost in the machine ready to be summoned at any time!
For specific code-related tasks, I tend to use codellama. In the example below I'm doing it from my terminal, but you can of course apply this prompt at scale in a Python program for example:
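From the terminal, a one-shot prompt looks like this (the prompt itself is just an illustration):

```bash
# One-shot question to codellama, straight from the terminal
ollama run codellama "Write a Python function that checks whether a string is a palindrome."
```

And to apply the same kind of prompt at scale, here is a minimal Python sketch using the local HTTP API that Ollama serves on port 11434 by default (the snippets list is a placeholder for your real inputs):

```python
import requests

# Illustrative inputs to process in batch
snippets = [
    "def add(a, b): return a + b",
    "squares = [i * i for i in range(10)]",
]

for code in snippets:
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "codellama",
            "prompt": f"Explain what this Python snippet does:\n{code}",
            "stream": False,  # one JSON response instead of a token stream
        },
        timeout=120,
    )
    print(resp.json()["response"])
```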
I generally use Mistral for tasks more related to content summarization or general questions. In both cases, it's truly a pleasure to get responses that don't depend on network latency, have no quotas, and are highly relevant! An example below with Mistral:
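For instance (again, the prompt is purely illustrative):

```bash
# One-shot summarization with mistral
ollama run mistral "Summarize in two sentences: Ollama is a tool that lets you download and run large language models locally. It exposes an interactive terminal chat as well as a local HTTP API."
```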
I'm particularly enthusiastic about the world of use cases and opportunities that running very powerful LLMs on "consumer-grade" machines opens up; but to reap the benefits, you need to know how to prompt!
In a nutshell, prompt engineering is the practice of writing instructions for an LLM in a way that maximizes the relevance of the results. This is particularly important when you want to automate tasks at scale! We have a YouTube video on this topic, feel free to check it out.
Let's take another example of JSON generation, this time with mistral for a change:
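A tightly constrained prompt such as this one does the trick (the sentence to parse is made up for the demo):

```bash
ollama run mistral 'Extract the name, age, and city from the sentence below and reply ONLY with a valid JSON object with the keys "name", "age" and "city". No commentary, no code fences.
Sentence: "Jeanne, 34, has been living in Lyon for three years."'
```

With instructions this explicit, the model typically replies with a bare object like {"name": "Jeanne", "age": 34, "city": "Lyon"} that a program can parse directly.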
Pretty impressive, isn't it? However, remember that this automation approach is only possible with high-quality prompts!
Now let's generate a response with the same prompt and GPT-4:
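Here is a minimal sketch of the equivalent call, assuming the openai Python package is installed and an OPENAI_API_KEY is set in the environment:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{
        "role": "user",
        "content": (
            'Extract the name, age, and city from the sentence below and reply '
            'ONLY with a valid JSON object with the keys "name", "age" and "city".\n'
            'Sentence: "Jeanne, 34, has been living in Lyon for three years."'
        ),
    }],
)
print(response.choices[0].message.content)
```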
... and the response was verbose and took noticeably longer to generate.
We get the same result, of course; the example is deliberately designed to highlight the essential advantages that local LLMs can offer: no network latency, total control over verbosity, and above all, complete data privacy!
Personally, I haven't fully switched away from ChatGPT yet, which I still find fantastic, but I'm happy to have options and I now use a mix of LLMs.
As always, feel free to leave us a comment or get in touch -- at Reboot, we love talking about AI!
CTO of the scale-up LAMALO, Yacine is a fullstack developer who can't sit still: JavaScript, Node.js, Python, LLMs, voice UX... Always on the lookout, he turns the latest innovations into concrete solutions!