Context editing is interesting because most agents are built on the assumption that the KV cache is the most important thing to optimise, and are therefore very hesitant to remove parts of the context mid-task. It can also introduce hallucinations, because parts of the context were written with the assumption that, e.g., the tool results are still there, but they're not. Manus is one example [0]. Say you read file A, make changes to A, then prompt for some more changes. If you now remove the "read file A" tool results, not only do you break the cache; in my own agent implementations (on GPT-5 at least) the model can hallucinate, since my prompts all naturally point to tool content that is no longer there.
Plus, the model was trained and RLed on a continuous context, unless they have now also tuned it on contexts that were edited mid-run.
https://manus.im/blog/Context-Engineering-for-AI-Agents-Less...
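A minimal sketch of one mitigation, under my own assumptions: rather than deleting a stale tool result outright, swap it for an explicit stub so later turns don't reference content that silently vanished. The message shape here is illustrative, not any specific provider's schema.

    def evict_tool_result(messages, tool_call_id):
        """Replace a large tool result with an explicit placeholder."""
        for msg in messages:
            if msg.get("role") == "tool" and msg.get("tool_call_id") == tool_call_id:
                msg["content"] = (
                    "[tool result evicted to save context; "
                    "re-run the tool if you need this content]"
                )
        return messages

The stub still breaks the prefix cache, but at least the model is told the content is gone instead of being left to confabulate it.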
I really want to get into Anthropic.
For context: I have a background in CV and ML in general, and I'm currently reviewing and revising RL.
Any idea how I can get into RL?
I have 3 years of industry/research experience.
Whenever I see a post like this, it triggers massive FOMO and a sense of urgency that I should be working on these problems.
Not being able to work there is making me anxious.
What does it take for someone in a non-US/non-EU region to get into big labs such as these?
Do I really have to pursue a PhD? I'm already old enough that pursuing one is a huge burden I can't afford.
I'll address something else: FOMO is usually a symptom of a deeper underlying issue. Long before Anthropic/OpenAI, we had these same posts from people desperately wanting to get into Google. They got so unhealthily obsessed with this goal that they prepped for months, even years, documenting their journey on blogs, only to get rejected by someone who spent two seconds on their application. Getting in is more about luck than most people realize. And by the time you're in, you've held a skewed, romanticized fantasy of what it's like to work at this mythical company for so long (Anthropic isn't a startup by any stretch anymore) that you crash hard when you realize it's a classic corporate environment.
The pace is so fast that if you have FOMO, you've most probably already missed out. If you're interested in LLM-flavored RL, I'd suggest the prime-rl community (and their Discord), the Hugging Face RL courses with smol (you'll need Pro and will burn a couple of bucks), etc.
I'm willing to burn a few bucks here and there on the projects.
I really need to get my hands dirty here. I remember taking an RL course on Coursera during COVID in 2020, but I didn't have a chance to apply it to the problems I worked on afterwards.
I really want to start doing RL again, though. I'm particularly interested in world models and simulation for RL.
I'm trying to understand which part of this we could not already have hacked together ourselves as clients. Maybe the new Sonnet is RL'ed to use these memories in a better way?
Nice. When using the OpenAI Codex CLI, I find the /compact command very useful for large tasks; in a way it's similar to the context editing tool. Maybe I can ask it to use a dedicated directory to simulate the memory tool.
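Something like this could work as a rough sketch of that idea: back the "memory" with plain files in a dedicated directory the agent is told to read and write. The directory name and helpers are my assumptions, not Codex CLI features.

    from pathlib import Path

    MEMORY_DIR = Path(".agent_memory")  # hypothetical location
    MEMORY_DIR.mkdir(exist_ok=True)

    def remember(key: str, text: str) -> None:
        """Persist a note the agent can re-read in later sessions."""
        (MEMORY_DIR / f"{key}.md").write_text(text)

    def recall(key: str) -> str | None:
        """Return a stored note, or None if nothing was saved under key."""
        path = MEMORY_DIR / f"{key}.md"
        return path.read_text() if path.exists() else None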
Claude Code already compacts automatically.
That's powerful. Most of the differences I see between AI-generated output and human output come from the "broad but specific" context of the task: company culture, organization rules and politics, the larger team's focus and way of working. It may take time to build the required knowledge bases, but it must be worth it.
Interestingly, we rolled out a similar feature recently.
I am working on a World Atlas-based approach to computer-use agents. If the task and app environment are reused, building an atlas of states and policies might be better than observe-plan-execute: we don't rediscover from scratch how an app works every time we use it.
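As a minimal sketch of what such an atlas could look like as a data structure (all names here are hypothetical, including how states get fingerprinted):

    from dataclasses import dataclass, field

    @dataclass
    class AtlasEntry:
        state_fingerprint: str     # e.g. a hash of the visible UI elements
        known_actions: list[str]   # actions observed to work in this state
        policy_hint: str = ""      # e.g. "click 'Export', then confirm dialog"

    @dataclass
    class WorldAtlas:
        entries: dict[str, AtlasEntry] = field(default_factory=dict)

        def lookup(self, fingerprint: str) -> AtlasEntry | None:
            # On a hit, reuse the stored policy instead of re-planning.
            return self.entries.get(fingerprint)

        def record(self, entry: AtlasEntry) -> None:
            self.entries[entry.state_fingerprint] = entry

On a cache hit the agent skips the observe-plan step entirely; on a miss it falls back to planning and records what it learned.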
We are trying to solve a similar problem: putting long documents in context. We built an MCP server for Claude that lets you put long PDFs in your context window, beyond the context limits: https://pageindex.ai/mcp.
Just a heads-up: HN folks value transparency, so mentioning if it's yours usually builds more trust.
Thanks for the reminder, I have edited the comment.
This isn't about difficulty; it's about maintaining knowledge during long execution. What is IMO exciting is long-term memory in things like Claude Code, where the model could learn your preferences as you collaborate. (There is already some hard-disabled implementation of this in CC.)
So is this what Claude Code 2 uses under the hood? At least I got the impression it stays on track much better than the old version.
Why are both this new Memory API and the Filesystem as (evolving) Context releases available only in the Developer API, and not integrated into Claude Code?
I had this same thought. I'm not entirely following how I'm supposed to differentiate between these two things. I guess the API is for creating my own CC-type agent, but I've heard of folks creating agents that are CC-based as well.
Hopefully one day Anthropic will allow zipfile uploads like ChatGPT and Gemini have allowed for ages.
I work at Zenning AI, a generalist AI designed to replace entire jobs with just prompts. Our agents typically run autonomously for hours, so effective context management is critical. I'd say we invest most of our engineering effort into what is ultimately context management, such as:
1. Multi-agent orchestration
2. Summarising and chunking large tool and agent responses
3. Passing large context objects by reference between agents and tools (see the sketch after this list)
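A minimal illustration of item 3, under my own assumptions about the store and handle format: stash a large tool output once and hand agents a short reference instead of the full payload.

    import uuid

    _context_store: dict[str, str] = {}

    def put_context(payload: str) -> str:
        """Store a large object; return a small handle for the prompt."""
        ref = f"ctx://{uuid.uuid4().hex[:8]}"
        _context_store[ref] = payload
        return ref

    def get_context(ref: str) -> str:
        """Dereference only when an agent actually needs the content."""
        return _context_store[ref]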
Two things to note that might be interesting to the community:
Firstly, when managing context, I recommend adding some evals to your context management flow so you can measure effectiveness as you make improvements and changes.
For example, our evals measure the impact of using Anthropic's memory tool over time, allowing our team to make better-informed decisions about which tools to use with our agents.
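As a toy sketch of what such an eval could look like (run_agent is a hypothetical stand-in for however you drive your agent): run the same tasks with and without the memory tool and compare pass rates.

    def eval_memory_impact(tasks, run_agent) -> dict[str, float]:
        """Assumes run_agent(task, use_memory=...) returns True on success."""
        scores = {}
        for use_memory in (True, False):
            passed = sum(run_agent(t, use_memory=use_memory) for t in tasks)
            key = "with_memory" if use_memory else "without_memory"
            scores[key] = passed / len(tasks)
        return scores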
Secondly, there's a tradeoff not mentioned in this article: speed vs. accuracy. Faster summarisation (or "compaction") comes at a cost in accuracy; if you want good compaction, it can be slow. Depending on the use case, you should adjust your compaction strategy accordingly. For example (forgive my major generalisation), in consumer-facing products speed is usually preferred over a bump in accuracy, whereas in business settings accuracy is generally preferred over speed.
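To make the two ends of that tradeoff concrete, here's a sketch; summarize_with_llm stands in for whatever model call you would actually make:

    def compact_fast(history: list[str], keep_last: int = 10) -> list[str]:
        """Cheap and instant: drop old turns wholesale, losing detail."""
        return history[-keep_last:]

    def compact_accurate(history: list[str], summarize_with_llm) -> list[str]:
        """Slower: one extra model call, but key facts are preserved."""
        summary = summarize_with_llm("\n".join(history[:-10]))
        return [f"Summary of earlier conversation: {summary}"] + history[-10:]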