You have a stateless AI model and you need it to act like it remembers things. Maybe you are building a support bot, a writing assistant, or an internal tool, and the model keeps losing the thread between messages. This guide gives you a concrete, ordered process to fix that, starting from a model that remembers nothing and ending with one that holds both short-term and long-term memory.
We will not theorize. Each step is something you can implement directly, in the order presented, with the reasoning made explicit so you know why you are doing it. You do not need to do all six steps for every project. Stop at the level of memory your use case actually requires; over-engineering memory is its own kind of mistake.
Follow these steps in sequence. Each one builds on the last, and skipping ahead tends to produce systems that work in a demo and fall apart in production.
Step 1: Confirm the model truly forgets
Before building anything, prove the starting condition to yourself. Send a message to the model, then in a separate request ask it to recall what you just said. It will not be able to. This is your baseline.
Doing this firsthand matters because it sets the correct mental model. You are not patching a forgetful AI; you are responsible for every piece of memory it will ever have. Once you accept that the model is a pure function, text in and text out, the rest of the process makes sense.
What you are confirming
- Each API call is independent and shares nothing with prior calls.
- Whatever the model "knows" in a request came entirely from that request's text.
- All memory work happens in your code, not the model.
Step 2: Pass the conversation history with every call
The first real step is the simplest. Keep a list of the messages exchanged, and include that full list in each new request. Most chat APIs accept an array of prior messages for exactly this purpose.
Now the model can answer follow-up questions, because the earlier turns are right there in the input it receives. This alone gets you a coherent multi-turn conversation and covers a surprising number of use cases.
Implementation notes
- Store messages in a simple structure: role plus content, in order.
- Append the model's reply to the list after each turn so the next call includes it.
- Keep a clear system message at the top to anchor behavior.
If you stop here, you have a working chat. The next steps exist only because conversations grow.
Step 3: Measure your token budget and set a limit
Every request you send consumes tokens, and the model has a maximum it can read, the context window. As the conversation grows, you march toward that ceiling. You need to know where you stand.
Count the tokens in your accumulated history before each call. Decide on a threshold, comfortably below the model's hard limit, that triggers your compression logic. Leaving headroom matters because the model also needs room to write its response.
Why set the limit below the maximum
The context window is shared by your system instructions, the conversation, any retrieved data, and the model's answer. If you fill it entirely with history, the model has no room to respond. Treat the window as a budget and reserve a slice for output. Mismanaging this budget is one of the common mistakes we document.
Step 4: Summarize older turns when you hit the threshold
When the conversation crosses your token threshold, do not just truncate blindly. Instead, take the oldest portion of the history and ask the model to compress it into a concise summary. Replace those raw messages with the summary.
The result is a rolling memory: recent turns stay verbatim for fidelity, while older context lives on as a recap. You preserve continuity without blowing the budget. This is the standard pattern for conversations that need to run long.
Keep the summary honest
- Instruct the summarizer to preserve names, decisions, and commitments, not just gist.
- Regenerate the summary periodically rather than summarizing summaries repeatedly, which degrades quality.
- Keep the most recent few turns raw; summaries lose the nuance that fresh text carries.
Step 5: Add long-term memory with retrieval
Summaries handle a single long conversation, but they vanish when the session ends. For memory that survives across sessions, store durable facts externally and pull them in when relevant.
Create a store, often a vector database, where you save discrete facts, documents, or past exchanges. When a new message arrives, search that store for the most relevant items and inject them into the prompt. The model now appears to recall information from days or weeks ago.
The retrieval loop, in order
- Receive the user's message.
- Search your store for content relevant to that message.
- Insert the top matches into the prompt alongside the recent conversation.
- Send to the model and return the answer.
- Optionally write new facts from this exchange back into the store.
This loop is the heart of long-term AI memory. Our reusable framework formalizes how these layers fit together so you can apply them consistently.
Step 6: Test the seams and tune relevance
With all layers in place, the failures move to the boundaries: history that gets trimmed too aggressively, summaries that drop a key fact, retrieval that returns noise instead of signal. Test each seam deliberately.
Run long conversations and check that early commitments survive. Ask cross-session questions and verify the right facts surface. Tune how many retrieved items you include, because more is not better; irrelevant context buries the signal the model needs.
A simple test checklist
- Does the model recall a fact stated 30 turns ago? (Tests summarization.)
- Does it recall a fact from a previous session? (Tests retrieval.)
- Does it stay on-task when irrelevant documents exist in the store? (Tests relevance tuning.)
For a working tool you can run through before shipping, see our memory checklist for 2026.
A worked example: putting the steps together
To make the sequence concrete, walk through how a simple customer-support assistant moves through all six steps. The point is to see how each step earns its place rather than appearing as abstract advice.
You begin by confirming the model forgets (step 1), which sets your expectations. You implement history-passing (step 2), and the assistant can now follow a multi-turn troubleshooting conversation. That alone handles short tickets. But support conversations sometimes run long, so you add token counting (step 3) and discover that around a certain length you are approaching the context limit.
At that point you introduce summarization (step 4), compressing the early diagnostic back-and-forth while keeping the most recent exchanges verbatim, so the assistant never loses the thread on a long ticket. Then you realize returning customers expect the assistant to remember their account and past issues, which a single session cannot provide, so you add retrieval (step 5) backed by a small profile store. Finally, you test the seams (step 6) by simulating a long conversation from a returning customer and confirming both the in-session summary and the cross-session profile behave.
What the example teaches
- Each step was added in response to a real limitation, not preemptively.
- Stopping early would have been fine for a simpler product; the long-ticket and returning-customer needs drove the later steps.
- The seams, where summarization meets retrieval meets fresh input, are exactly where final testing focuses.
This is the intended rhythm: implement the minimum, observe where it breaks, and add the next layer deliberately. Build it this way and you end up with exactly the memory your feature needs and no more.
Frequently Asked Questions
Do I need all six steps for every project?
No. A simple bot that handles short, self-contained questions may only need step 2. Add summarization when conversations run long, and retrieval only when you need memory that persists across sessions. Match the effort to the requirement.
How do I count tokens before sending a request?
Most providers publish a tokenizer library or endpoint that converts text to a token count. Run your accumulated messages through it before each call. Counting matters because tokens, not characters or words, determine whether you fit inside the context window.
What goes in the external memory store?
Durable facts worth recalling later: user preferences, key decisions, reference documents, and summaries of past sessions. Avoid dumping entire raw conversations, which makes retrieval noisy. Store discrete, meaningful units that you can match against future questions.
Why is too much retrieved context a problem?
The model has limited attention and a fixed window. Flooding the prompt with marginally relevant material crowds out the messages that matter and can degrade the answer. Retrieving fewer, higher-quality items usually outperforms retrieving many.
Key Takeaways
- Start by confirming the model is stateless so you accept full ownership of its memory.
- Pass the conversation history each turn for basic multi-turn coherence.
- Track your token budget and trigger compression before you hit the context limit.
- Summarize older turns for long sessions; use external retrieval for memory across sessions.
- Test the seams and tune relevance, because the hardest failures live at the boundaries between layers.