Most top LLMs, whether open-source or closed-source, cap out at 128K tokens of context, with the notable exception of Gemini 1.5 and its incredible 2-million-token window.
To give you a sense of scale, 128K tokens is roughly 96,000 English words (a token averages about three-quarters of a word), the length of a 300-odd-page novel...
... that's impressive, but here we're only talking about raw text content. If you were to feed an HTML page as input to an LLM that supports this kind of long context, you'd lose a lot of that capacity. For example:
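Here's a rough illustration of the overhead, a minimal sketch using tiktoken (its cl100k_base encoding stands in as a generic proxy for your model's actual tokenizer):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

raw_text = "Form matters just as much as substance when managing context."
# The same sentence wrapped in typical page markup
html_page = (
    "<html><head><title>Blog</title></head><body>"
    f'<div class="post"><p>{raw_text}</p></div>'
    "</body></html>"
)

print(len(enc.encode(raw_text)))   # roughly a dozen tokens of actual content
print(len(enc.encode(html_page)))  # roughly three times more, most of it spent on markup
```

The markup carries no meaning for the user, yet it eats the same context budget as the content itself.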
The key takeaway here is that form matters just as much as substance when it comes to managing your LLM's context.
So far, we've only considered the theoretical context capacity. However, it's important to know that a model's effective context, the amount it can actually exploit reliably, is often much shorter than its advertised maximum (Gemini is the champion in this area).

As a result, you'll encounter cases where LLMs repeat random content, generate irrelevant responses, simply summarize the context they're given instead of answering the user's question, and so on. You can try this yourself by opening a new conversation with Claude or ChatGPT and letting it run for a very long time! I'm sure your ability to reason over long contexts will be higher than the bot's... => a little nod to the doomers who think we're all going to be replaced: the work of the human mind still has a bright future ahead!
As has always been the case in Data Science and Deep Learning, preprocessing your data is essential: strip the markup, boilerplate, and noise from your documents before they ever reach the model. Python libraries like NLTK make this easy.
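Here's a minimal sketch of that kind of cleanup, assuming beautifulsoup4 and nltk are installed (two illustrative choices among many):

```python
import nltk
from bs4 import BeautifulSoup

nltk.download("punkt", quiet=True)  # NLTK's sentence tokenizer models

def clean_html_for_llm(html: str) -> str:
    """Strip markup and keep only the text worth spending tokens on."""
    soup = BeautifulSoup(html, "html.parser")
    # Drop elements that carry no textual meaning for the model
    for tag in soup(["script", "style", "nav", "footer"]):
        tag.decompose()
    text = soup.get_text(separator=" ", strip=True)
    # Re-split into clean sentences, ready for filtering or chunking
    return " ".join(nltk.sent_tokenize(text))
```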
Similarly, ask yourself about the expected performance of the LLM you're using: am I just doing Q&A on a volume x of data, or do I need my LLM for tasks that involve more reasoning? The answer to these questions will shape how you approach context management, since you'll have a much shorter effective context available for reasoning tasks.
Am I in a situation where I need an immediate response from the LLM (classic conversational use case), or do I need to run long-running pre- or post-processing tasks using LLMs where immediacy isn't a requirement? This question will also help you choose the right model for your needs.
Don't worry, I'm not going to leave you with nothing but context management problems. Let me walk you through one solution among many!
We'll use LangChain and LangGraph to implement the following process: chat normally with the model while monitoring the size of the conversation; once it exceeds a token threshold, summarize the exchange, delete the old messages, and inject the summary into the context in their place.
If you're completely unfamiliar with LangChain and LangGraph, don't panic: there are many resources available online to learn, including the excellent LangChain YouTube channel, which covers numerous use cases for agentic applications.
For this example, I chose to use llama3.1:8b because:
Let's say we want to maintain optimal performance, in terms of user experience, with this model => we'll therefore ensure the context never exceeds ~8K tokens.
Let's start by importing the necessary dependencies and creating the base node of our LangGraph graph: the one that calls the LLM.
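Here's a minimal sketch, assuming llama3.1:8b is served locally through Ollama (via the langchain-ollama package); the state and node names are my own:

```python
from langchain_core.messages import HumanMessage, RemoveMessage, SystemMessage
from langchain_ollama import ChatOllama
from langgraph.graph import END, START, MessagesState, StateGraph

model = ChatOllama(model="llama3.1:8b")

class State(MessagesState):
    summary: str  # running summary of the older part of the conversation

def call_model(state: State):
    summary = state.get("summary", "")
    if summary:
        # Re-inject the summary of older exchanges ahead of the recent messages
        system_message = SystemMessage(content=f"Summary of the conversation so far: {summary}")
        messages = [system_message] + state["messages"]
    else:
        messages = state["messages"]
    response = model.invoke(messages)
    return {"messages": [response]}
```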
This simple node already contains a key piece of logic: if a summary of earlier exchanges exists in the state, it's re-injected as a system message ahead of the recent messages; otherwise, the messages are passed to the model untouched.
Now, let's define the node that will generate the summary (or extend the existing one) and then delete the old messages.
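A possible sketch; the prompt wording and the choice to keep only the two most recent messages are mine:

```python
def summarize_conversation(state: State):
    summary = state.get("summary", "")
    if summary:
        prompt = (
            f"This is the summary of the conversation so far: {summary}\n\n"
            "Extend it by taking the new messages above into account:"
        )
    else:
        prompt = "Create a summary of the conversation above:"
    response = model.invoke(state["messages"] + [HumanMessage(content=prompt)])
    # Delete everything except the two most recent messages
    delete_messages = [RemoveMessage(id=m.id) for m in state["messages"][:-2]]
    return {"summary": response.content, "messages": delete_messages}
```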
Finally, let's set up the pivotal node, the one that triggers summary creation once the conversation exceeds a certain number of tokens (here we use a threshold of 1,024 tokens rather than ~8,000, so we can trigger it without having to build an extremely long conversation).
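A sketch using count_tokens_approximately, langchain_core's heuristic token counter (in production you'd prefer the model's actual tokenizer):

```python
from langchain_core.messages.utils import count_tokens_approximately

def should_summarize(state: State):
    # Route to the summary node once the conversation passes the threshold
    if count_tokens_approximately(state["messages"]) > 1024:
        return "summarize_conversation"
    return END
```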
Now, we're ready to build our graph:
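Something along these lines, reusing the node and function names from the sketches above:

```python
from langgraph.checkpoint.memory import MemorySaver

workflow = StateGraph(State)
workflow.add_node("conversation", call_model)
workflow.add_node("summarize_conversation", summarize_conversation)

workflow.add_edge(START, "conversation")
# After each model call, either route to the summary node or end the run
workflow.add_conditional_edges("conversation", should_summarize)
workflow.add_edge("summarize_conversation", END)

memory = MemorySaver()  # keeps the graph's state in memory between invocations
app = workflow.compile(checkpointer=memory)
```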
... note here the use of a memory checkpointer, which maintains state persistence during program execution.
Here is a visualization of our graph:
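In a notebook, it can be reproduced with LangGraph's Mermaid export (a small sketch; the optional drawing extras must be installed):

```python
from IPython.display import Image

# Renders the compiled graph as a PNG diagram
Image(app.get_graph().draw_mermaid_png())
```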
We can now execute it as follows:
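A sketch of a small driver, where the (arbitrary) thread_id ties every call to the same checkpointed conversation:

```python
config = {"configurable": {"thread_id": "demo-thread"}}

def chat(user_input: str) -> None:
    result = app.invoke({"messages": [HumanMessage(content=user_input)]}, config)
    print(result["messages"][-1].content)

chat("Hi! Can you tell me about Destiny's Child?")
```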
Here, initially, I chat with Llama3.1 and the summary feature isn't triggered on the first three messages, as we can see here:
... but the moment I ask it to generate long content (in this case, a rundown of Destiny's Child's and Beyonce's careers), the following happens:
Here's the enormous response containing the history of Beyonce's career:
... but after that, our summary generation function kicks in and we end up with this much shorter text =>
Here's a summary of our conversation:
We had a fun conversation about Destiny's Child and Beyonce's solo career. I provided a detailed history of the group, including their formation in 1990, early success with albums "Destiny's Child" (1998) and "The Writing's on the Wall" (1999), lineup changes, and eventual disbandment.
I also delved into Beyonce's solo career, highlighting her debut album "Dangerously in Love" (2003), subsequent albums "B'Day" (2006), "I Am... Sasha Fierce" (2008), "4" (2011), and "Beyonce" (2013). We discussed her visual album releases, including "Beyonce" (2013) and "Lemonade" (2016).
We touched on Beyonce's social background, influences, and fans, as well as her numerous awards and accolades. I also mentioned the Knowles family, Destiny's Child's dedicated fan base (the "Destiny's Child Army"), and Beyonce's massive following (the "Beyhive").
Overall, our conversation provided a comprehensive overview of Destiny's Child and Beyonce's solo career, covering their music, legacy, and impact on the entertainment industry.
If I continue my conversation with the bot, even though our four question/answer exchanges consumed 1,677 tokens (I checked), it's the summary that gets injected into the context from this point forward, since the old messages have been deleted.
We now have a fresh conversation of only 256 tokens and we can maintain peak performance with our little Llama3.1 8B!
We've implemented a simple context compression technique this way, but many others exist!
We've written several articles on the fascinating LangChain stack in this blog, feel free to check them out:
See you soon!
CTO of the scale-up LAMALO, Yacine is a fullstack developer who can't sit still: JavaScript, Node.js, Python, LLMs, voice UX... Always keeping watch on the latest developments, he turns new innovations into concrete solutions!