Most top LLMs, whether open-source or closed-source, cap out at 128K tokens of context, with the notable exception of Gemini 1.5 and its incredible 2-million-token window.
To give you a sense of scale, 128K tokens is roughly 96,000 English words (a token averages about three-quarters of a word), the length of a 300-odd-page novel...
... that's impressive, but here we're only talking about raw text content. If you were to feed an HTML page as input to an LLM that supports this kind of long context, you'd lose a lot of that capacity. For example:
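Here's a rough illustration of the overhead, a minimal sketch using tiktoken (its cl100k_base encoding stands in as a generic proxy for your model's actual tokenizer):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

raw_text = "Form matters just as much as substance when managing context."
# The same sentence wrapped in typical page markup
html_page = (
    "<html><head><title>Blog</title></head><body>"
    f'<div class="post"><p>{raw_text}</p></div>'
    "</body></html>"
)

print(len(enc.encode(raw_text)))   # roughly a dozen tokens of actual content
print(len(enc.encode(html_page)))  # roughly three times more, most of it spent on markup
```

The markup carries no meaning for the user, yet it eats the same context budget as the content itself.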
The key takeaway here is that form matters just as much as substance when it comes to managing your LLM's context.
So far, we've only considered the theoretical context capacity. However, it's important to know that a model's effective context, the amount it can actually exploit reliably, is often much shorter than its advertised maximum (Gemini is the champion in this area).

As a result, you'll encounter cases where LLMs repeat random content, generate irrelevant responses, simply summarize the context they're given instead of answering the user's question, and so on. You can try this yourself by opening a new conversation with Claude or ChatGPT and letting it run for a very long time! I'm sure your ability to reason over long contexts will be higher than the bot's... => a little nod to the doomers who think we're all going to be replaced: the work of the human mind still has a bright future ahead!
As has always been the case in Data Science and Deep Learning, preprocessing your data is essential: strip the markup, boilerplate, and noise from your documents before they ever reach the model. Python libraries like NLTK make this easy.
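Here's a minimal sketch of that kind of cleanup, assuming beautifulsoup4 and nltk are installed (two illustrative choices among many):

```python
import nltk
from bs4 import BeautifulSoup

nltk.download("punkt", quiet=True)  # NLTK's sentence tokenizer models

def clean_html_for_llm(html: str) -> str:
    """Strip markup and keep only the text worth spending tokens on."""
    soup = BeautifulSoup(html, "html.parser")
    # Drop elements that carry no textual meaning for the model
    for tag in soup(["script", "style", "nav", "footer"]):
        tag.decompose()
    text = soup.get_text(separator=" ", strip=True)
    # Re-split into clean sentences, ready for filtering or chunking
    return " ".join(nltk.sent_tokenize(text))
```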
Similarly, ask yourself about the expected performance of the LLM you're using: am I just doing Q&A on a volume x of data, or do I need my LLM for tasks that involve more reasoning? The answer to these questions will shape how you approach context management, since you'll have a much shorter effective context available for reasoning tasks.
Am I in a situation where I need an immediate response from the LLM (classic conversational use case), or do I need to run long-running pre- or post-processing tasks using LLMs where immediacy isn't a requirement? This question will also help you choose the right model for your needs.
Don't worry, I'm not going to leave you with nothing but context management problems. Let me walk you through one solution among many!
We'll use LangChain and LangGraph to implement the following process: chat normally with the model while monitoring the size of the conversation; once it exceeds a token threshold, summarize the exchange, delete the old messages, and inject the summary into the context in their place.
If you're completely unfamiliar with LangChain and LangGraph, don't panic: there are many resources available online to learn, including the excellent LangChain YouTube channel, which covers numerous use cases for agentic applications.
For this example, I chose to use llama3.1:8b because:
Let's say we want to maintain optimal performance, in terms of user experience, with this model => we'll therefore ensure the context never exceeds ~8K tokens.
Let's start by importing the necessary dependencies and creating the base node of our LangGraph graph: the one that calls the LLM.
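Here's a minimal sketch, assuming llama3.1:8b is served locally through Ollama (via the langchain-ollama package); the state and node names are my own:

```python
from langchain_core.messages import HumanMessage, RemoveMessage, SystemMessage
from langchain_ollama import ChatOllama
from langgraph.graph import END, START, MessagesState, StateGraph

model = ChatOllama(model="llama3.1:8b")

class State(MessagesState):
    summary: str  # running summary of the older part of the conversation

def call_model(state: State):
    summary = state.get("summary", "")
    if summary:
        # Re-inject the summary of older exchanges ahead of the recent messages
        system_message = SystemMessage(content=f"Summary of the conversation so far: {summary}")
        messages = [system_message] + state["messages"]
    else:
        messages = state["messages"]
    response = model.invoke(messages)
    return {"messages": [response]}
```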
This simple node already contains a key piece of logic: if a summary of earlier exchanges exists in the state, it's re-injected as a system message ahead of the recent messages; otherwise, the messages are passed to the model untouched.
Now, let's define the node that will generate the summary (or extend the existing one) and then delete the old messages.
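A possible sketch; the prompt wording and the choice to keep only the two most recent messages are mine:

```python
def summarize_conversation(state: State):
    summary = state.get("summary", "")
    if summary:
        prompt = (
            f"This is the summary of the conversation so far: {summary}\n\n"
            "Extend it by taking the new messages above into account:"
        )
    else:
        prompt = "Create a summary of the conversation above:"
    response = model.invoke(state["messages"] + [HumanMessage(content=prompt)])
    # Delete everything except the two most recent messages
    delete_messages = [RemoveMessage(id=m.id) for m in state["messages"][:-2]]
    return {"summary": response.content, "messages": delete_messages}
```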
Finally, let's set up the pivotal node, the one that triggers summary creation once the conversation exceeds a certain number of tokens (here we use a threshold of 1,024 tokens rather than ~8,000, so we can trigger it without having to build an extremely long conversation).
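A sketch using count_tokens_approximately, langchain_core's heuristic token counter (in production you'd prefer the model's actual tokenizer):

```python
from langchain_core.messages.utils import count_tokens_approximately

def should_summarize(state: State):
    # Route to the summary node once the conversation passes the threshold
    if count_tokens_approximately(state["messages"]) > 1024:
        return "summarize_conversation"
    return END
```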
Now, we're ready to build our graph:
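Something along these lines, reusing the node and function names from the sketches above:

```python
from langgraph.checkpoint.memory import MemorySaver

workflow = StateGraph(State)
workflow.add_node("conversation", call_model)
workflow.add_node("summarize_conversation", summarize_conversation)

workflow.add_edge(START, "conversation")
# After each model call, either route to the summary node or end the run
workflow.add_conditional_edges("conversation", should_summarize)
workflow.add_edge("summarize_conversation", END)

memory = MemorySaver()  # keeps the graph's state in memory between invocations
app = workflow.compile(checkpointer=memory)
```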
... note here the use of a memory checkpointer, which maintains state persistence during program execution.
Here is a visualization of our graph:
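In a notebook, it can be reproduced with LangGraph's Mermaid export (a small sketch; the optional drawing extras must be installed):

```python
from IPython.display import Image

# Renders the compiled graph as a PNG diagram
Image(app.get_graph().draw_mermaid_png())
```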
We can now execute it as follows:
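A sketch of a small driver, where the (arbitrary) thread_id ties every call to the same checkpointed conversation:

```python
config = {"configurable": {"thread_id": "demo-thread"}}

def chat(user_input: str) -> None:
    result = app.invoke({"messages": [HumanMessage(content=user_input)]}, config)
    print(result["messages"][-1].content)

chat("Hi! Can you tell me about Destiny's Child?")
```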
Here, initially, I chat with Llama3.1 and the summary feature isn't triggered on the first three messages, as we can see here:
... but the moment I ask it to generate long content (in this case, a rundown of Destiny's Child's and Beyonce's careers), the following happens:
Here's the enormous response containing the history of Beyonce's career:
... but after that, our summary generation function kicks in and we end up with this much shorter text =>
Here's a summary of our conversation:
We had a fun conversation about Destiny's Child and Beyonce's solo career. I provided a detailed history of the group, including their formation in 1990, early success with albums "Destiny's Child" (1998) and "The Writing's on the Wall" (1999), lineup changes, and eventual disbandment.
I also delved into Beyonce's solo career, highlighting her debut album "Dangerously in Love" (2003), subsequent albums "B'Day" (2006), "I Am... Sasha Fierce" (2008), "4" (2011), and "Beyonce" (2013). We discussed her visual album releases, including "Beyonce" (2013) and "Lemonade" (2016).
We touched on Beyonce's social background, influences, and fans, as well as her numerous awards and accolades. I also mentioned the Knowles family, Destiny's Child's dedicated fan base (the "Destiny's Child Army"), and Beyonce's massive following (the "Beyhive").
Overall, our conversation provided a comprehensive overview of Destiny's Child and Beyonce's solo career, covering their music, legacy, and impact on the entertainment industry.
If I continue my conversation with the bot, even though our four question/answer exchanges consumed 1,677 tokens (I checked), it's the summary that gets injected into the context from this point forward, since the old messages have been deleted.
We now have a fresh conversation of only 256 tokens and we can maintain peak performance with our little Llama3.1 8B!
We've implemented a simple context compression technique this way, but many others exist!
We've written several articles on the fascinating LangChain stack in this blog, feel free to check them out:
See you soon!
CTO of the scale-up LAMALO, Yacine is a fullstack developer who can't sit still: JavaScript, Node.js, Python, LLMs, voice UX... Always keeping watch on the latest developments, he turns new innovations into concrete solutions!