Your agent forgets things it should remember.
Remembers things it should have forgotten. Confuses context from different users. Starts every session cold even when the user has been interacting with your product for months.
The fix most teams reach for: a bigger context window.
The actual fix: designing memory intentionally across four distinct layers that serve completely different purposes.
Why "just use a bigger context window" fails
A larger context window does not solve the memory problem. It delays it.
Stuffing everything into the context window works until it doesn't: until the window fills up, until the model has too much to attend to and recall becomes noisy, until costs become unsustainable at scale, until the model starts losing track of information buried far back in a long prompt.
More importantly, it confuses four different problems that each need their own solution.
The four layers
Layer 1: Working memory
This is your context window — what the agent can actively reason over right now. Fast, flexible, and bounded. Every agent has this layer. Most agents treat it as the only layer.
The design question for working memory is not "how big should it be?" It is "what belongs here, and what should be evicted?"
Working memory should contain what the agent needs to complete the current task: the immediate goal, the last few exchanges, the current tool results, and nothing else. Everything older, larger, or less relevant belongs in one of the other layers.
A working memory that holds everything eventually holds nothing useful. The agent loses the thread. Responses become less coherent. Retrieval becomes noisy. You add more tokens to compensate. The problem compounds.
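One way to make eviction explicit is to give working memory a token budget and drop the oldest items when it is exceeded. The sketch below is a minimal illustration, not a production design: the `WorkingMemory` class, the `token_budget` default, and the four-characters-per-token estimate are all assumptions made for the example.

```python
from collections import deque
from dataclasses import dataclass, field


def estimate_tokens(text: str) -> int:
    # Crude assumption: roughly one token per four characters.
    # Swap in your model's real tokenizer if you have one.
    return max(1, len(text) // 4)


@dataclass
class WorkingMemory:
    goal: str
    token_budget: int = 2000  # hypothetical default, tune per model and task
    items: deque = field(default_factory=deque)

    def _used(self) -> int:
        return estimate_tokens(self.goal) + sum(estimate_tokens(i) for i in self.items)

    def add(self, text: str) -> list:
        """Append an item, then evict the oldest items until the budget fits.

        Returns the evicted items so the caller can hand them to another layer.
        The most recent item is always kept, even if it alone exceeds the budget.
        """
        self.items.append(text)
        evicted = []
        while self._used() > self.token_budget and len(self.items) > 1:
            evicted.append(self.items.popleft())
        return evicted

    def render(self) -> str:
        # The goal is never evicted and always appears first in the prompt.
        return "\n".join([f"GOAL: {self.goal}", *self.items])
```

Returning the evicted items rather than discarding them is the design choice that matters here: eviction from working memory becomes a handoff to another layer, not a loss.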
Layer 2: Episodic memory
What happened in this session and in recent sessions. Not raw conversation history — structured, summarised state.
What did the user ask for? What did the agent try? What was the outcome? What does the user seem to prefer? What should the agent remember for next time?
This is the layer that makes an agent feel like it knows you after the first interaction. Without it, every session is a first meeting. The user repeats context they have already provided. The experience degrades. They stop trusting the system.
Episodic memory lives outside the context window — in a database, structured and queryable. It gets retrieved selectively at the start of a session, not dumped wholesale into the prompt.
The design question: at what granularity do you store episodes? At what point does an episode expire, get archived, or get compressed into a longer-term user profile?
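A minimal sketch of what "structured and queryable" can look like, using Python's built-in `sqlite3`. The table layout, the one-row-per-task granularity, and the 30-day expiry window are assumptions chosen for illustration; they are one possible answer to the design question above, not the answer.

```python
import json
import sqlite3
import time


class EpisodicStore:
    """Hypothetical episodic layer: one structured row per completed task."""

    def __init__(self, path: str = ":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS episodes ("
            "user_id TEXT, created_at REAL, request TEXT, "
            "attempt TEXT, outcome TEXT, preferences TEXT)"
        )

    def write(self, user_id, request, attempt, outcome, preferences=None, now=None):
        created_at = time.time() if now is None else now
        self.db.execute(
            "INSERT INTO episodes VALUES (?, ?, ?, ?, ?, ?)",
            (user_id, created_at, request, attempt, outcome,
             json.dumps(preferences or {})),
        )
        self.db.commit()

    def recent(self, user_id, limit=3, max_age_days=30.0, now=None):
        """Return only this user's newest episodes inside the expiry window."""
        now = time.time() if now is None else now
        cutoff = now - max_age_days * 86400
        rows = self.db.execute(
            "SELECT request, attempt, outcome, preferences FROM episodes "
            "WHERE user_id = ? AND created_at >= ? "
            "ORDER BY created_at DESC LIMIT ?",
            (user_id, cutoff, limit),
        ).fetchall()
        return [
            {"request": r, "attempt": a, "outcome": o, "preferences": json.loads(p)}
            for r, a, o, p in rows
        ]
```

Every read is scoped by `user_id` and capped by `limit`, so context from different users cannot mix and a session start pulls in a handful of summaries rather than the full history.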
Layer 3: Semantic memory
Your retrieved knowledge. Documents, product information, policies, anything the agent needs to answer questions accurately. This is where RAG lives.
Most teams build this layer. Fewer connect it properly to the others.
The critical connection: the planner needs to know when to query semantic memory, what to query it with, and how much retrieved context to bring into working memory for this specific task. That routing decision, which most teams leave implicit or hardcoded, is where many retrieval quality problems originate.
Semantic memory is not a static store. It should be updated as the world changes and as your product evolves. A system that retrieves from a knowledge base that is three months stale is not a reliable system.
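To make the routing decision explicit rather than implicit, it can live in a small function of its own. The sketch below assumes hypothetical task types and per-task `top_k` budgets, and uses keyword overlap as a stand-in for a real vector search. It also skips documents older than a freshness cutoff, to illustrate the staleness point.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical task types and retrieval budgets; 0 means "do not retrieve".
TOP_K_BY_TASK = {"smalltalk": 0, "policy_question": 3, "troubleshooting": 5}
DEFAULT_TOP_K = 2


@dataclass
class RetrievalPlan:
    query: Optional[str]  # None means the planner decided not to retrieve
    top_k: int


def plan_retrieval(task_type: str, user_message: str) -> RetrievalPlan:
    """Decide whether to query, what to query with, and how much to bring back."""
    top_k = TOP_K_BY_TASK.get(task_type, DEFAULT_TOP_K)
    if top_k == 0:
        return RetrievalPlan(query=None, top_k=0)
    return RetrievalPlan(query=user_message.strip(), top_k=top_k)


def retrieve(plan: RetrievalPlan, documents: list, max_age_days: int, today: int) -> list:
    """Rank fresh documents by keyword overlap (a stand-in for vector search)."""
    if plan.query is None:
        return []
    terms = set(plan.query.lower().split())
    fresh = [d for d in documents if today - d["updated_day"] <= max_age_days]
    scored = [(len(terms & set(d["text"].lower().split())), d) for d in fresh]
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: s[0], reverse=True)
    return [d for _, d in ranked[: plan.top_k]]
```

Because the plan is a plain value, you can log it per request. When retrieval quality drops, that log shows whether the problem was the decision to query, the query itself, or the amount of context brought back.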
Layer 4: Procedural memory
How to do things — successful tool call patterns, learned strategies for similar tasks, user preferences for how things get done.
This is the rarest layer and the one that makes an agentic system feel genuinely intelligent over time. Instead of rediscovering the right approach to a recurring task, the system retrieves what worked before and starts from there.
Most teams do not build this layer at all. It is genuinely harder — it requires deciding what constitutes a "successful strategy," how to store it in a retrievable form, and when to apply past strategies versus explore new ones.
But the teams that build it have agents that improve with use instead of staying flat.
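A rough sketch of the smallest version of this layer: count how often each strategy succeeded per task type, reuse the best one most of the time, and occasionally explore. Everything here is an assumption for illustration, including the `ProceduralMemory` name, the idea that the caller supplies a boolean success signal, and the epsilon-greedy rule with a 10% explore rate.

```python
import random
from collections import defaultdict


class ProceduralMemory:
    """Hypothetical store of per-task strategy outcomes."""

    def __init__(self, explore_rate: float = 0.1, rng=None):
        self.explore_rate = explore_rate
        self.rng = rng or random.Random()
        # task_type -> strategy name -> (successes, attempts)
        self.stats = defaultdict(dict)

    def record(self, task_type: str, strategy: str, success: bool) -> None:
        # "Success" is whatever signal you trust: task completed, user accepted, etc.
        wins, tries = self.stats[task_type].get(strategy, (0, 0))
        self.stats[task_type][strategy] = (wins + int(success), tries + 1)

    def choose(self, task_type: str, candidates: list) -> str:
        """Reuse the best-known strategy, or explore with probability explore_rate."""
        known = self.stats.get(task_type, {})
        tried = [c for c in candidates if c in known]
        if not tried or self.rng.random() < self.explore_rate:
            return self.rng.choice(candidates)
        return max(tried, key=lambda c: known[c][0] / known[c][1])
```

The explore rate is the knob for the "apply past strategies versus explore new ones" decision: at zero the agent only repeats what it already knows, and at one it never learns from its history.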
The connection between layers
The layers are not independent. They work together through a memory manager — a component whose job is deciding what goes where and what comes back when.
At the start of a task: working memory is populated with a combination of episodic context (what the user has done before), relevant semantic memory (what the agent needs to know), and procedural context (how similar tasks have been handled).
During a task: working memory is maintained actively — old context is evicted, new results are added, the current goal stays prominent.
At the end of a task: significant events are written to episodic memory. Successful strategies are evaluated for procedural storage.
This cycle — populate, maintain, persist — is what makes memory feel coherent rather than accidental.
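The populate, maintain, persist cycle can be sketched as three methods on one coordinating class. The layer objects below are deliberately trivial stand-ins (a dict, a callable, a dict), and the names, the item cap, and the "last two episodes, top two passages" limits are assumptions for the example rather than recommendations.

```python
class MemoryManager:
    """Coordinates the layers; each layer here is a hypothetical stand-in."""

    def __init__(self, episodic, semantic, procedural, max_items=6):
        self.episodic = episodic      # dict: user_id -> list of episode summaries
        self.semantic = semantic      # callable: query -> list of passages
        self.procedural = procedural  # dict: task_type -> strategy hint
        self.max_items = max_items
        self.working = []

    def populate(self, user_id, task_type, goal):
        """Start of task: seed working memory from the other three layers."""
        self.working = [f"GOAL: {goal}"]
        self.working += self.episodic.get(user_id, [])[-2:]
        self.working += self.semantic(goal)[:2]
        if task_type in self.procedural:
            self.working.append(f"STRATEGY: {self.procedural[task_type]}")

    def maintain(self, new_result):
        """During task: add new results, evict the oldest non-goal items."""
        self.working.append(new_result)
        while len(self.working) > self.max_items:
            del self.working[1]  # index 0 is the goal and stays prominent

    def persist(self, user_id, task_type, summary, strategy, succeeded):
        """End of task: write the episode, keep the strategy only if it worked."""
        self.episodic.setdefault(user_id, []).append(summary)
        if succeeded:
            self.procedural[task_type] = strategy
        self.working = []
```

A usage pass looks like `populate(...)` once, `maintain(...)` after each tool call, and `persist(...)` once at the end. Nothing else in the system writes to working memory directly.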
How to build this incrementally
You do not need all four layers on day one.
Start with working memory and episodic memory. Together they tend to resolve most "my agent feels stateless" complaints. Users feel remembered. Sessions feel connected.
Add semantic memory when you have a knowledge domain worth retrieving from. This is when RAG enters the system — designed explicitly as a memory layer, not bolted on as a feature.
Add procedural memory when you have enough usage data to identify recurring task patterns. This is a later optimisation, not an early requirement.
The key constraint at every stage: each layer should have a single owner in your codebase. One component reads from and writes to episodic memory. One component manages working memory. One component handles retrieval from semantic storage.
Mixing these responsibilities — like mixing planning and execution in orchestration — is how you get a memory system that is impossible to debug and impossible to improve without breaking something unintended.
The question worth asking now
Look at your current agentic system and ask: if a user comes back tomorrow and picks up where they left off, what does the agent actually know?
The answer tells you which layer you are missing.