Interesting to see that feature launches are coming via the official website while usage restrictions are coming in via a team member's Twitter account - https://x.com/trq212/status/2037254607001559305.
Funnily, Anthropic's pricing etc. is why I'm using GLM-5 a bunch more outside of work. Definitely not Opus level, but surprisingly decent. Though I got lucky and got the Alibaba Coding Model lite plan, which is so cheap they got rid of it.
I've been doing something similar. I use Claude for analysis and non-coding work, GLM for most coding tasks (GLM's coding plan), and when I need to do a larger implementation project I use GLM and Claude to build out an in-depth plan and toss it to GitHub Copilot to Opus the implementation.
I was trying to get the Alibaba plan but missed the mark. I'm curious to try out the Minimax coding plan ($10/mo) or Kimi ($20/mo) at some point to see how they stack up.
For pricing: GLM was $180 for a year of their pro tier during a Black Friday sale and GHCP was $100/year, but they don't have the annual plan any more so it is now $120. Alibaba's only coding plan today is $50/mo, too rich for me.
And if you look closely at the usernames, you see that the same engineer from link 2 that said "nah it’s just a bonus 2x, it’s not that deep" (just two weeks ago) is now saying "we're going to throttle you during peak hours" (as predicted).
Yes, it was FUD, but ended up being correct. With the track record that Anthropic has (e.g. months long denial of dumbed down models last year, just to later confirm it as a "bug"), this just continues to erode trust, and such predictions are the result of that.
Anthropic fixing that bug way faster than Apple fixing iOS keyboard "bug". Anthropic even acknowledged it, Apple gave us the silent treatment for years.
I'm not sure it's a rug pull when their stats show 7% and 2% subscription-level impacts. We're back in the ISP days, and they never said unlimited.
I feel like we are just inching closer and closer to a world where rapid iteration of software will be the default. Like for example a trusted user makes feedback -> feedback gets curated into a ticket by an AI agent, then turned into a PR by an Agent, then reviewed by an Agent, before being deployed by an Agent. We are maybe one or two steps from the flywheel being completed. Or maybe we are already there.
I just don’t see it coming. I was fully in that camp 3 months ago, but I've come to realize that every step makes more mistakes. It leads into a deadlock when no human has the mental model anymore.
Don’t you guys have hard business problems that AI just can't solve, or solves only very slowly, presenting you 17 ideas until it finds the right one? I’m using the most expensive models.
I think the nature of AI might block that progress, and I think some companies woke up and others will wake up later.
The mistake rate is just too high. And every system you implement to reduce that rate has a mistake rate as well and increases complexity and the necessary exploration time.
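A back-of-the-envelope way to see the compounding: if each stage of a pipeline succeeds independently with some probability, the end-to-end success rate is the product of the stage rates. The stage names and numbers below are made up for illustration, and independence between stages is an assumption.

```python
from math import prod


def end_to_end_success(stage_success_rates):
    """Probability that every stage succeeds, assuming stages fail independently."""
    return prod(stage_success_rates)


# Hypothetical per-stage success rates: ticket triage, implementation, review, deploy.
pipeline = [0.95, 0.90, 0.95, 0.99]
print(f"end-to-end success: {end_to_end_success(pipeline):.3f}")
```

Every extra stage with a success rate below 1 pulls the product down, which is the sense in which a verification layer with its own mistake rate is not free.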
I think the bulk of people are where the early adopters were in December. AI can implement working functionality on a well-maintained codebase.
But it can’t write maintainable code itself. It actually makes you slower compared to assisted writing of the code, because assisted you are way more in the loop and you can stop a lot of small issues right away. And you iterate fast on everything.
I’ve not opened my IDE for 1 month and it became hell at a point. I’ve now deleted 30k lines and the amount of issues I’m seeing has been an eye-opening experience.
Unscalable performance issues, verbosity, straight up bugs, escape hatches against my verification layers, quindrupled types.
Now I could monitor the AI output closer, but then again I’m faster writing it myself. Because it’s one task. AI-assisted typing isn’t slower than my brain is.
Also, thinking more about it, FAANG pays $300 per line in production, so what are we really trying to achieve here? Speed was never the issue. A great coder writes 10 production lines per day.
Accuracy, architecture etc. is the issue. You do that by building good solid fundamental blocks that make feature additions easier over time and not slower.
I know it’s not your main point, but I’m curious where $300/line comes from. I don’t think I’ve ever seen a dollar amount attached to a line of production code before.
I think this sounds like a true yet short-sighted take. Keep in mind these features are immature but they exist to obtain a flywheel and corner the market. I don’t know why, but people seem to consistently miss two points and their implications:
- performance is continuing to increase incredibly quickly, even if you rightfully don’t trust a particular evaluation. See scaling laws like Chinchilla and RL scaling laws (both training and test time)
- coding is a verifiable domain
The second one is most important. Agent quality is NOT limited by human code in the training set, this code is simply used for efficiency: it gets you to a good starting point for RL.
Claiming that things will not reach superhuman performance, INCLUDING all end to end tasks: understanding a vague business objective poorly articulated, architecting a system, building it out, testing it, maintaining it, fixing bugs, adding features, refactoring, etc. is what requires the burden of proof because we literally can predict performance (albeit it has a complicated relationship with benchmarks and real world performance).
Yes definitely, error rates are too high so far for this to be totally trusted end to end but the error rates are improving consistently, and this is what explains the METR time horizon benchmark.
Scaling laws vs combinatorial explosion, who wins? In personal experience claude does exceedingly well on mundane code (do a migration, add a field, wire up this UI) and quite poorly on code that has likely never been written (even if it is logically simple for a human). The question is whether this is a quantitative or qualitative barrier.
Of course it's still valuable. A real app has plenty of mundane code despite our field's best efforts.
Combinatorial explosion? What do you mean? Again, your experiences are true, but they are improving with each release. The error rate on tasks continues to go down, even novel tasks (as far as we can measure them). Again this is where verifiable domains come in -- whatever problems you can specify the model will improve on them, and this improvement will result in better generalization, and improvements on unseen tasks. This is what I mean by taking your observations of today, ignoring the rate of progress that got us here and the known scaling laws, and then just asserting there will be some fundamental limitation. My point is while this idea may be common, it is not at all supported by literature and the mathematics.
The space of programs is incomprehensibly massive. Searching for a program that does what you need is a particularly difficult search problem. In the general case you can't solve search, there's no free lunch. Even scaling laws must bow to NFL. But depending on the type of search problem some heuristics can do well. We know human brains have a heuristic that can program (maybe not particularly well, but passably). To evaluate these agents we can only look at it experimentally, there is no sense in which they are mathematically destined to eventually program well.
How good are these types of algorithms at generalization? Are they learning how to code; or are they learning how to code migrations, then learning how to code caches, then learning how to code a command line arg parser, etc?
Verifiable domains are interesting. It is unquestionably why agents have come first for coding. But if you've played with claude you may have experienced it short-circuiting failing tests, cheating tests with code that does not generalize, writing meaningless tests, and at long last if you turn it away from all of these it may say something like "honest answer - this feature is really difficult and we should consider a compromise."
So what do you think the difference is between humans and an agent in this respect? What makes you think this has any relevance to the problem? everything is combinatorially explosive: the combination of words that we can string into sentences and essays is also combinatorially explosive and yet LLMs and humans have no problem with it. It's just the wrong frame of thinking for what's going on. These systems are obtaining higher and higher levels of abstractions because that is the most efficient thing for them to do to gain performance. That's what reasoning looks like: compositions of higher level abstractions. What you say may be true but I don't see how this is relevant.
"There is no sense in which they are mathematically destined to eventually program well"
- Yes there is, and this betrays an ignorance of the literature and how things work
- Again: RL has been around forever. Scaling laws have held empirically up to the largest scales we've tested. There are known RL scaling laws for both training and test time. It's ludicrous to state there is "no sense" in this, on the contrary, the burden of proof of this is squarely on yourself because this has already been studied and indeed is the primary reason why we're able to secure the eye-popping funding: contrary to popular HN belief, a trillion dollars of CapEx spend is based on rational evidence-based decision making.
> "How good are these types of algorithms at generalization"
> Verifiable domains are interesting. It is unquestionably why agents have come first for coding. But if you've played with claude you may have experienced it short-circuiting failing tests, cheating tests with code that does not generalize, writing meaningless tests, and at long last if you turn it away from all of these it may say something like "honest answer - this feature is really difficult and we should consider a compromise."
You say this and ignore my entire argument: you are right about all of your observations, yet
- Opus 4.6 compared to Sonnet 3.x is clearly more generalizable and less prone to these mistakes
- Verifiable domain performance SCALES, we have no reason to expect that this scaling will stop and our recursive improvement loop will die off. Verifiable domains mean that we are in alphago land, we're learning by doing and not by mimicking human data or memorizing a training set.
Hey man, it sounds like you're getting frustrated. I'm not ignoring anything; let's have a reasonable discussion without calling each other ignorant. I don't dispute the value of these tools nor that they're improving. But the no free lunch theorem is inexorable so the question is where this improvement breaks down - before or beyond human performance on programming problems specifically.
What difference do I think there is between humans and an agent? They use different heuristics, clearly. Different heuristics are valuable on different search problems. It's really that simple.
To be clear, I'm not calling either superior. I use agents every day. But I have noticed that claude, a SOTA model, makes basic logic errors. Isn't that interesting? It has access to the complete compendium of human knowledge and can code all sorts of things in seconds that require my trawling through endless documentation. But sometimes it forgets that to do dirty tracking on a pure function's output, it needs to dirty-track the function's inputs.
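To make the dirty-tracking point concrete, here is a minimal sketch. The class and method names are hypothetical and not from any particular framework. Because the function is pure, its output can only change when an input changes, so the dirty flag has to be driven by writes to the inputs.

```python
class Derived:
    """Caches a pure function's output and recomputes only when an input changes."""

    def __init__(self, fn, **inputs):
        self._fn = fn
        self._inputs = dict(inputs)
        self._dirty = True  # nothing has been computed yet
        self._value = None

    def set_input(self, name, value):
        # The dirty flag is set here, on the inputs, not on the output.
        if self._inputs.get(name) != value:
            self._inputs[name] = value
            self._dirty = True

    @property
    def value(self):
        if self._dirty:
            self._value = self._fn(**self._inputs)
            self._dirty = False
        return self._value
```

Reading `value` twice in a row calls the function once; calling `set_input` with a new value forces exactly one recomputation on the next read.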
What is unreasonable? I am saying the claims you are making are completely contradicted by the literature. I am calling you ignorant in the technical sense, not dumb or unintelligent, and I don't mean this as an insult. I am completely ignorant of many things, we all are.
I am saying you are absolutely right that Opus 4.6 is both SOTA and also colossally terrible in even surprisingly mundane contexts. But that is just not relevant to the argument you are making which is that there is some fundamental limitation. There is of course always a fundamental limitation to everything, but what we're getting at is where that fundamental limitation is and we are not yet even beginning to see it. Combinatorics here is the wrong lens to look at this, because it's not doing a search over the full combinatoric space, as is the case with us. There are plenty of efficient search "heuristics" as you call them.
> It's interesting that you mention AlphaGo. I was also very fascinated with it. There was recent research that the same algorithm cannot learn Nim: https://arstechnica.com/ai/2026/03/figuring-out-why-ais-get-.... Isn't that food for thought?
It's a long-known problem with RL in a particular regime and isn't relevant to coding agents. Things like Nim are a small, adversarially structured task family that is not representative of language / coding / real-world tasks. Nim is almost the worst possible case: the optimal policy is a brittle, discontinuous function.
AlphaGo is pure RL from scratch, which is quite challenging, inefficient, and unstable, and is why we don't do that with LLMs: we pretrain them first. RL is not used to discover invariants (aspects of the problem that don't change when surface details change) from scratch in coding agents as it is in this example. Pretraining takes care of that and RL is used for refinement, so it is a completely different scenario, one where RL is well suited.
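For context on why Nim's optimal policy is so brittle, here is the textbook solution for normal-play Nim (the player who takes the last object wins). This is the standard XOR rule, not something taken from the linked article, and the function names are my own.

```python
from functools import reduce
from operator import xor


def nim_sum(heaps):
    """XOR of all heap sizes; zero means the player to move is losing."""
    return reduce(xor, heaps, 0)


def winning_move(heaps):
    """Return (heap_index, new_size) for a winning move, or None if none exists."""
    x = nim_sum(heaps)
    if x == 0:
        return None
    for i, h in enumerate(heaps):
        target = h ^ x
        if target < h:  # a legal move must shrink the heap
            return i, target
    return None  # not reached when x != 0
```

Heaps of (1, 2, 3) are lost for the player to move, while (1, 2, 4) are won: one extra object flips the outcome, and nothing about "nearby" positions tells you that.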
I didn't make any claims contradicted by literature. The only thing I cited as bedrock fact, NFL, is a mathematical theorem. I'm not sure why Nim shouldn't be relevant, it's an exercise in logic.
> “AlphaZero excels at learning through association,” Zhou and Riis argue, “but fails when a problem requires a form of symbolic reasoning that cannot be implicitly learned from the correlation between game states and outcomes.”
> So what do you think the difference is between humans and an agent in this respect?
Humans learn.
Agents regurgitate training data (and quality training data is increasingly hard to come by).
Moreover, humans learn (somewhat) intangible aspects: human expectations, contracts, business requirements, laws, user case studies etc.
> Verifiable domain performance SCALES, we have no reason to expect that this scaling will stop.
Yes, yes we have reasons to expect that. And even if growth continues, a nearly flat logarithmic scale is just as useless as no growth at all.
For a year now all the amazing "breakthrough" models have been showing little progress (comparatively). To the point that all providers have been mercilessly cheating with their graphs and benchmarks.
> Where did I say that? I didn’t even mention money, just the broader resource term. A lot of business are mostly running experiments if the current set of tooling can match the marketing (or the hype). They’re not building datacenters or running AI labs. Such experiments can’t run forever.
I'm just going to ask that you read any of my other comments, this is not at all how coding agents work and seems to be the most common misunderstanding of HN users generally. It's tiring to refute it. RL in verifiable domains does not work like this.
> Humans learn.
Sigh, so do LLMs, in context.
> Moreover, humans learn (somewhat) intangible aspects: human expectations, contracts, business requirements, laws, user case studies etc.
Literally benchmarks on this all over the place, I'm sure you follow them.
> Yes, yes we have reasons to expect that. And even if growth continues, a nearly flat logarithmic scale is just as useless as no growth at all.
And yet it's not logarithmic? Consider the data flywheel, consistent algorithmic improvements, and synthetic data [basically: rejection sampling from a teacher model with a lot of test-time compute + high temperature].
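A toy sketch of what rejection sampling from a teacher means here, under loud assumptions: `toy_teacher` and `toy_verifier` are stand-ins invented so the code runs, not a real model or grader. The idea is to sample many candidates at nonzero temperature and keep only the ones that pass a check.

```python
import random

_rng = random.Random(0)  # fixed seed so the toy run is repeatable


def rejection_sample(teacher, verifier, prompt, n_samples=16, temperature=1.0):
    """Draw n_samples candidates from the teacher and keep those the verifier accepts."""
    candidates = [teacher(prompt, temperature) for _ in range(n_samples)]
    return [c for c in candidates if verifier(prompt, c)]


def toy_teacher(prompt, temperature):
    # Hypothetical teacher: answers a sum, sometimes off by one when temperature > 0.
    a, b = prompt
    noise = _rng.choice([-1, 0, 0, 1]) if temperature > 0 else 0
    return a + b + noise


def toy_verifier(prompt, answer):
    # Hypothetical verifier: checks the answer exactly.
    a, b = prompt
    return answer == a + b


kept = rejection_sample(toy_teacher, toy_verifier, (2, 3))
```

Everything in `kept` is verified correct, so it can be used as training data even though the teacher itself is noisy.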
> For a year now all the amazing "breakthrough" models have been showing little progress (comparatively). To the point that all providers have been mercilessly cheating with their graphs and benchmarks.
Benchmaxxing is for sure a real thing, not to mention even honest benchmarking is very difficult to do, but considering "all of the AI companies are just faking the performance data" to be the "story" is tremendously wrong. Consider AIME performance on 2025 (uncontaminated data), and the fact that companies have a _deep incentive_ to genuinely improve their models (and then of course market it as hard as possible, that's a given). People will experiment with different models, and no benchmaxxing is going to fool people for very long.
If you think Opus 4.6 compared to Sonnet 3.x is "little progress" I think we're beyond the point of logical argument.
You're missing the point though. "1 + 1" vs "one.add(1)" might both be "passable" and correct, but it's missing the forest for the trees, how do you know which one is "long-term the right choice, given what we know?", which is the engineering part of building software, and less about "coding" which tends to be the easy part.
How do you evaluate, score and/or benchmark something like that? Currently, I don't think we have any methodologies for this, probably because it's pretty subjective in the end. That's where the "creative" parts of software engineering become more important, and it's also way harder to verify.
While I agree we don't have any methodologies for this, it's also true that we can just "fail" more often.
Code is effectively becoming cheap, which means even bad design decisions can be overturned without prohibitive costs.
I wouldn't be surprised if in a couple of years we see several projects that approach the problem of tech debt like this:
1. Instruct AI to write tens of thousands of tests by using available information, documentation, requirements, meeting transcripts, etc. These tests MUST include performance AND availability related tests (along with other "quality attribute" concerns)
2. Have humans verify (to the best of their ability) that the tests are correct -- step likely optional
3. Ask another AI to re-implement the project while matching the tests
It sounds insane, but...not so insane if you think we will soon have models better than Opus 4.6. And given the things I've personally done with it, I find it less insane as the days go by.
I do agree with the original poster who said that software is moving in this direction, where super fast iteration happens and non-developers can get features to at least be a demo in front of them fast. I think it clearly is and am working internally to make this a reality. You submit a feature request and eventually a live demo is ready for you, deployed in isolation at some internal server, proxied appropriately if you need a URL, and ready for you to give feedback and have the AI iterate on it. Works for the kind of projects we have, and, though I get it might be trickier for much larger systems, I'm sure everyone will find a way.
For now, we still need engineers to help drive many decisions, and I think that'll still be the case. These days all I do when "coding" is talking (via TTS) with Opus 4.6 and iterating on several plans until we get the right one, and I can't wait to see how much better this workflow will be with smarter and faster models.
I'm personally trying to adapt everything in our company to have agents work with our code in the most frictionless way we can think of.
Nonetheless, I do think engineers with a product inclination are better off than those who are mostly all about coding and building systems. To me, it has never felt so magical to build a product, and I'm loving it.
> Code is effectively becoming cheap, which means even bad design decisions can be overturned without prohibitive costs.
I'm sorry, but only someone who never maintained software long-term would say something like this. The further along you are in development, the more the cost of changing things increases, maybe even exponentially.
Correcting the design before you've even written code might be 100x cheaper (or even 1000x) than changing that design 2 years later, after you've stored TBs of data in some format because of that decision, and lots of other parts of the company/product/project depend on those choices you made earlier.
You can't just pile on code on top of code, say "code is cheap" and hope for the best, it's just not feasible to run a project long-term that way, and I think if you had the experience of maintaining something long-term, you'd realize how this sounds.
The easiest part of "software engineering" is "writing code", and today "writing code" is even easier. But the hardest parts, actually designing, thinking and maintaining, remain the same as before, although some parts are easier, others are harder.
Don't get me wrong, I'm on the "agentic coding" train as much as everyone else, probably haven't written/edited code by myself for a year at this point, but it's important to be realistic about what it actually takes to produce "worthwhile software", not just slop out patchy and hacky code.
I've never maintained software long-term so I could be wrong, but I interpret "code is cheap" to mean that you can have coding agents refactor or rewrite the project from scratch around the design correction. I don't think 'code is cheap' ever should be interpreted to mean ship hacky code.
I think using agents to prototype code and design will be a big thing. Have the agent write out what you want, come back with what works and what doesn't, write a new spec, toss out the old code and have a fresh agent start again. Spec-driven development is the new hotness, but we know that the best spec is code: have the agent write the spec in code, rewrite the spec in natural language, then iterate.
because it has business context and better reasoning, and can ask humans for clarification and take direction.
You don't need to benchmark this, although it's important. We have clear scaling laws on true statistical performance that is monotonically related to any notion of what performance means.
I do benchmarks for a living and can attest: benchmarks are bad, but it doesn't matter for the point I'm trying to make.
I feel like you're missing the initial context of this conversation (no pun intended):
> Like for example a trusted user makes feedback -> feedback gets curated into a ticket by an AI agent, then turned into a PR by an Agent, then reviewed by an Agent, before being deployed by an Agent.
Once you add "humans for clarifications and take direction" then yeah, things can be useful, but that's far away from the non-human-involvement loop described earlier in this thread, which is what people are pushing back against.
Of course, involving people makes things better, that's the entire point here, and that by removing the human, you won't get as good results. Going back to benchmarks, obviously involving humans isn't possible here, so again we're back to being unable to score these processes at all.
I'm confused on the scenario here. There is a human in the loop, it's the feedback part...there is business context, it is either seeded or maintained by the human and expanded by the agent. The agent can make inferences about the world, especially when embodiment + better multimodal interaction is rolled out [embodiment taking longer].
Benchmarks ==> it's absolutely not a given that humans can't be involved in the loop of performance measurement. Why would that be the case?
> It doesn't because it doesn't learn. Every time you run it, it's a new dawn with no knowledge of your business or your business context
It does learn in context. And lack of continuous learning is temporary, that is a quirk of the current stack, expect this to change rather quickly. Also still not relevant, consider that agentic systems can be hierarchical and that they have no trouble being able to grok codebases or do internal searches effectively and this will only improve.
> It doesn't have better reasoning beyond very localized decisions.
Do you have any basis for this claim? It contradicts a large amount of direct evidence and measurement and theory.
> This is just a bunch of words stringed together, isn't it?
Maybe to yourself? Chinchilla scaling laws and RL scaling laws are measured very accurately based on next token test loss (Chinchilla). This scales very predictably. It is related to downstream performance, and that relationship is noisy but clearly monotonic.
It also doesn't help that every new context is a new dawn with no knowledge of things past.
> Also still not relevant, consider that agentic systems can be hierarchical and that they have no trouble being able
A bunch of Memento guys directing a bunch of other Memento guys don't make a robust system, or a system that learns, or a system that maintains and retains things like business context.
> and this will only improve.
We've heard this mantra for quite some time now.
> Do you have any basis for this claim?
Oh. Just the fact that in every single coding session even on a small 20kloc codebase I need to spend time cleaning up large amounts of duplicated code, undo quite a few wrong assumptions, and correct the agent when it goes on wild tangents and goose hunts.
> Maybe to yourself? Chinchilla scaling laws a
yap yap yap. The result is anything but your rosy description of these amazing reasoning learning systems that handle business context.
> It also doesn't help that every new context is a new dawn with no knowledge of things past.
Absolutely true that it doesn't help but: agents like Claude have access to older sessions, they can grok impressive amounts of data via tool use, they can compose agents into hierarchical systems that effectively have much larger context lengths at the expense of cost and coordination which needs improvement. Again this is a temporary and already partially solved limitation
> A bunch of Memento guys directing a bunch of other Memento guys don't make a robust system, or a system that learns, or a system that maintains and retains things like business context.
I think you are not understanding: hierarchical agents have long term memory maintained by higher level agents in the hierarchy, it's the whole point. It's annoying to reset model context, but yet you have a knowledge base of the business context persisted and it can grok it...
> We've heard this mantra for quite some time now.
yes you have, and it has held true and will continue to hold true. Have you read the literature on scaling laws? Do you follow benchmark progression? Do you know how RL works? If you do I don't think you will have this opinion.
> yap yap yap. The result is anything but your rosy description of these amazing reasoning learning systems that handle business context.
Well that's fine to call an entire body of literature "yap" but don't pretend like you have some intelligible argument, I don't see you backing up any argument you have here with any evidence, unlike the multitude of sources I have provided to you.
Do you argue things have not improved in the last year with reasoning systems? If so I would really love to hear the evidence for this.
I love it when people include links to papers that refute their words.
So, Anthropic (which is heavily reliant on hype and making models appear more than they are) authors a paper which clearly states: "tokens later in context are easier to predict and there's less loss of tokens. For no reason at all we decided to give this a new name, in-context learning".
> agents like Claude have access to older sessions, they can grok impressive amounts of data via tool use
That is they rebuild the world from scratch for every new session, and can't build on what was learned or built in the last one.
Hence continuous repeating failure modes.
10 years ago I worked in a team implementing royalties for a streaming service. I can still give you a bunch of details, including references to multiple national laws, about that. Agents would exhaust their context window just re-"learning" it from scratch, every time. And they would miss a huge amount of important context and business implications.
> Have you read the literature on scaling laws?
You keep referencing this literature as if it were the Holy Bible. Meanwhile the one you keep referring to, Chinchilla, clearly shows the very hard limits of those laws.
> Do you argue things have not improved in the last year with reasoning systems?
> Frankly, I find your aggressiveness quite tiring
having to answer for opinions with no basis in the literature is I'm sure very tiring for you. Your aggression being met is I'm sure uncomfortable.
> I love it when people include links to papers that refute their words.
> So, Anthropic (which is heavily reliant on hype and making models appear more than they are) authors a paper which clearly states: "tokens later in context are easier to predict and there's less loss of tokens. For no reason at all we decided to give this a new name, in-context learning".
well I don't really love it when people just totally misread a paper because they have an agenda to push and can't seem to accept that their opinions are contradicted by real evidence.
In-context learning is not "later tokens easier"; it’s task adaptation from examples in the prompt. I'm sure you realize this. Models can learn a mapping (e.g. word --> translation) from a few examples in the prompt and apply it to new inputs within the same forward pass. That is function learning at inference time, not just "predicting later tokens better".
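To show what "a mapping from a few examples in the prompt" looks like in practice, here is a small sketch that only builds the prompt text. The helper name and the example pairs are made up, and no model is called.

```python
def few_shot_prompt(examples, query):
    """Format input -> output pairs followed by an unanswered query."""
    lines = [f"{src} -> {dst}" for src, dst in examples]
    lines.append(f"{query} ->")
    return "\n".join(lines)


# Hypothetical English -> French pairs; the mapping is never stated, only shown.
prompt = few_shot_prompt([("sea", "mer"), ("sky", "ciel"), ("dog", "chien")], "cat")
print(prompt)
```

Nothing in the weights changes between the examples and the query; whatever the model does with the last line, it does from the examples sitting in the context.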
I'm sure also you're happy to chalk up any contradicting evidence to a grand conspiracy of all AI companies just gaming benchmarks and that this gaming somehow completely explains progress.
> That is they rebuild the world from scratch for every new session, and can't build on what was learned or built in the last one.
That they rebuild the world from scratch (wrong, they have priors from pretraining, but I accept your point here) does not mean they can't build on what was learned or built in the last one. They have access to the full transcript, and they have access to the full codebase, the diff history, whatever knowledge base is available. It's just disingenuous to say this, and then it also assumes (1) there is no mitigation for this, which I have presented twice before and you don't seem to understand it, (2) this is a temporary limitation, continual learning is one of the most important and well funded problems right now.
> 10 years ago I worked in a team implementing royalties for a streaming service. I can still give you a bunch of details, including references to multiple national laws, about that. Agents would exhaust their context window just re-"learning" it from scratch, every time. And they would miss a huge amount of important context and business implications.
also not an accurate understanding of how agents and their context work; you can use multiple sessions to digest and distill information useful in other sessions and in fact Claude does this automatically with subagents. It's a problem we have _already sort of solved today_ and that will continue to improve.
> You keep referencing this literature as if it were the Holy Bible. Meanwhile the one you keep referring to, Chinchilla, clearly shows the very hard limits of those laws.
You keep dismissing this literature as if you have understood it and that your opinion somehow holds more weight...Can you elaborate on why you think Chinchilla shows the hard limits of the scaling laws? Perhaps you're referring to the term capturing the irreducible loss? Is that what you're saying?
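For anyone following along, the term in question comes from the parametric loss fit in the Chinchilla paper. Reproduced here from memory as a sketch, so check the paper for the exact notation and fitted constants:

```latex
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```

Here N is the parameter count, D is the number of training tokens, and E is the irreducible loss: as N and D grow, L approaches E rather than zero.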
> Do you argue things have not improved in the last year with reasoning systems? I don't
Then are you arguing this progress will stop? I'm just not sure I understand, you seem to contradict yourself
For almost every task that people are tackling with agents, it’s either not worth doing, can be done better with scripts and software, or requires human oversight (that negates all the advantages).
I assume this is a troll because it's just so far removed from reality there's not much to say. "Almost every task" -- I'm sure you have great data to back this up. "It's not worth doing" well sure if you want to put your head in the sand and ignore even what systems today can do let alone the improvement trajectory. "can be done better with scripts and software" .... not sure if you realize this but agents write scripts and software. "or require human oversight (that negates all the advantages." it certainly does not; human oversight vs actual humans implementing the code is pretty dramatically more efficient and productive.
But the issue isn't coding, it's doing the right thing. I don't see anywhere in your plan some way of staying aligned to core business strategy, forethought, etc.
The number of devs will reduce but there will still be large activities that can't be farmed out without an overall strategy
Why do you think this is a problem? Reasoning is constantly improving, it has ample access to humans to gather more business context, it has access to the same industry data and other signals that humans do, and it can get any data necessary. It has Zoom meeting notes, I mean why do people think there's somehow a fundamental limit beyond coding?
The other thing you're missing here is generalizability. Better coding performance (which is verifiable and not limited by human data quality) generalizes to performance on other benchmarks. This is a long-known phenomenon.
Every investment has a date where there should be a return on that investment. If there’s no date, it’s a donation of resources (or a waste depending on perspective).
You may be OK with continuing to try to make things work. But others aren’t and have decided to invest their finite resources somewhere else.
Ah ok so you didn't really read my comment, what is your counter argument? Models are just fundamentally incapable of understanding business context? They are demonstrably already capable of this to a large extent.
> Every investment has a date where there should be a return on that investment. If there’s no date, it’s a donation of resources (or a waste depending on perspective).
What are you implying here? This convo now turns into the "AI is not profitable and this is a house of cards" theme? That's ok, we can ignore every other business model like say Uber running at a loss to capture what is ultimately an absolutely insane TAM. Little ol' Uber accumulated ~33B in losses over 14 years, and you're right they tanked and collapsed like a dying star...oh wait...hmm interesting I just looked at their market cap and it's 141 Billion.
> You may be OK with continuing to try to make things work. But others aren’t and have decided to invest their finite resources somewhere else.
I truly love that. If you want to code as a hobby that is fantastic, and we can go ahead and see in 2 years how your comment ages.
> They are demonstrably already capable of this to a large extent.
I’d very much like to see such a demonstration, where someone hands over a department to an agent and lets it make decisions.
> This convo now turns into the "AI is not profitable and this is a house of cards" theme?
Where did I say that? I didn’t even mention money, just the broader resource term. A lot of businesses are mostly running experiments to see if the current set of tooling can match the marketing (or the hype). They’re not building datacenters or running AI labs. Such experiments can’t run forever.
> I’d very much like to see such a demonstration, where someone hands over a department to an agent and lets it make decisions.
That's your bar for understanding business context? I thought we were talking about what you actually said which is: understanding business context. If I brainstorm about a feature it will be able to pull the compendium of knowledge for the business (reports, previous launches, infrastructure, an understanding of the problem space, industry, company strategy). That's business context.
> Where did I say that? I didn’t even mention money, just the broader resource term. A lot of businesses are mostly running experiments to see if the current set of tooling can match the marketing (or the hype). They’re not building datacenters or running AI labs. Such experiments can’t run forever.
I misunderstood you then, I wasn't sure what point you were trying to make. Is your point "companies are trying to cajole Claude to do X and it doesn't work and hasn't for the last year so they are giving up"? If so I think that is a wonderful opportunity for people that understand the nuance of these systems and the concept of timing.
I love everything about this direction except for the insane inference costs. I don’t mind the training costs, since models are commoditized as soon as they’re released. Although I do worry that if inference costs drop, the companies training the models will have no incentive to publish their weights because inference revenue is where they recoup the training cost.
Either way… we badly need more innovation in inference price per performance, on both the software and hardware side. It would be great if software innovation unlocked inference on commodity hardware. That’s unlikely to happen, but today’s bleeding edge hardware is tomorrow’s commodity hardware so maybe it will happen in some sense.
If Taalas can pull off burning models into hardware with a two month lead time, that will be huge progress, but still wasteful because then we’ve just shifted the problem to a hardware bottleneck. I expect we’ll see something akin to gameboy cartridges that are cheap to produce and can plug into base models to augment specialization.
But I also wonder if anyone is pursuing some more insanely radical ideas, like reverting back to analog computing and leveraging voltage differentials in clever ways. It’s too big brain for me, but intuitively it feels like wasting entropy to reduce a voltage spike to 0 or 1.
Inference costs at least seem like the thing that is easiest to bring down, and there's plenty of demand to drive innovation. There's a lot less uncertainty here than with architectural/capability scaling. To your point, tomorrow's commodity hardware will solve this for the demands of today at some point in the future (though we'll probably have even more inference demand then).
> I love everything about this direction except for the insane inference costs.
If this direction holds true, the cost is lower once you factor in ROI.
Instead of employing 4 people (Customer Support, PM, Eng, Marketing), you will have 3-5 agents, and the whole ticket flow might cost you ~$20.
But I hope we won't go this far, because when things fail every customer will be impacted, because there will be no one who understands the system to fix it
I worry about the costs from an energy and environmental impact perspective. I love that AI tools make me more productive, but I don't like the side effects.
The environmental impact of AI is greatly overstated. The average person will make a bigger positive impact on the environment by reducing their meat intake by 25% than by giving up both flying and AI use.
This is the wrong way to see it. If a technology gets cheaper, people will use more and more and more of it. If inference costs drop, you can throw way more reasoning tokens and a combination of many many agents to increase accuracy or creativity and such.
No company at the moment has enough money to operate with 10x the reasoning tokens of their competitors because they're bottlenecked by GPU capacity (or other physical constraints). Maybe in lab experiments but not for generally available products.
And I sense you would have to throw orders of magnitude more tokens to get meaningfully better results (If anyone has access to experiments with GPT 5 class models geared up to use marginally more tokens with good results please call me out though).
I think that as a user I'm so far removed from the actual (human) creation of software that if I think about it, I don't really care either way.
Take for example this article on Hacker News: I am reading it in a custom app someone programmed, which pulls articles hosted on Hacker News which themselves are on some server somewhere and everything gets transported across wires according to a specification. For me, this isn't some impressionist painting or heartbreaking poem - the entity that created those things is so far removed from me that it might be artificial already.
And that's coming from a kid of the 90s with some knowledge in cyber security, so potentially I could look up the documentation and maybe even the source code for the things I mentioned; if I were interested.
Dive into a forest, you'll find a couple of cool trees.
Art isn't about being cool. Art is about context.
When I tell people that art cannot be unpolitical, they react strongly, because they think about the left/right divide and how divided people are, where art is supposed to be unifying.
But art is like movement, you need an origin and a destination. Without that context, it will be just another... thing. Context makes it something.
It's not that you know the artist first and then say "this art is cool because I like the artist". The art is the means by which you know the artist. The more of their works you encounter, the closer you get to understanding the artist and what they are trying to communicate.
I think Anthropic will launch backend hosting off the back of their Bun acquisition very soon. It makes sense to basically run your entire business out of Claude, and share bespoke apps built by Claude code for whatever your software needs are.
100% it's going to happen. OpenAI will do the same; there were already rumors about them building an internal "GitHub", which is a stepping stone for that.
Also, it is a requirement for completing lock-in - the dream for these companies.
I think some types of tickets can be done like this, but your trusted-user assumption does a lot of work here. I don't see this getting better than that with the current architecture of LLMs. You can add all sorts of feedback mechanisms, which helps, but since LLMs are not conscious, drift is unavoidable unless there is a human in the loop who understands and steers what's going on.
But I do think even now with certain types of crud apps, things can be largely automated. And that's a fairly large part of our profession.
In the past three weeks a couple of projects I follow have implemented AI tools with their own GitHub accounts which have been doing exactly this. And they appear to be doing good work! Dozens of open issues iterated, tested and closed. At one point I had almost 50 notifications for one project's backlog being eradicated in 24 hours. The maintainer reviewed all of it and some were not merged.
What kind of software are people building where AI can just one shot tickets? Opus 4.6 and GPT 5.4 regularly fail when dealing with complicated issues for me.
I dunno if Rust async or native platform APIs which have existed for years count as new patterns, but if you throw even a small wrench in the works they really struggle. But that's expected really when you look at what the technology is - it's kind of insane we've even gotten to this point with what amounts to fancy autocomplete.
Of course not all tickets are complex. Last week I had to fix a ticket which was to display the update date on a blog post next to the publish date. Perfect use case for AI to one shot.
I don't see anyone sane trusting AI to this degree any time soon, outside of web dev. The chances of this strategy failing are still well above acceptable margins for most software, and in safety-critical instances it will be decades before standards allow for such adoption. Anyway, we are paying pennies on the dollar for compute at the moment - as soon as the gravy train stops rolling, all this intelligence will be out of access for most humans, unless some more efficient generalizable architecture is identified.
> as soon as the gravy train stops rolling, all this intelligence will be out of access for most humans. unless some more efficient generalizable architecture is identified.
All Chinese labs have to do to tank the US economy is to release open-weight models that can run on relatively cheap hardware before AI companies see returns.
Maybe that's why AI companies are looking to IPO so soon, gotta cash out and leave retail investors and retirement funds holding the bag.
i was under the impression that we were approaching performance bottlenecks both with consumer GPU architecture and with this application of transformer architecture. if my impression is incorrect, then i agree it is feasible for china to tank the US economy that way (unless something else does it first)
I think it just needs to be efficient or small enough for companies to deploy their own models on their hardware or cloud, for more inference providers to come out of the woodwork and compete on price, and/or for optimized models to run locally for users.
Regarding the latter, smaller models are really good for what they are (free) now, they'll run on a laptop's iGPU with LPDDR5/DDR5, and NPUs are getting there.
Even models that can fit in unified 64GB+ memory between CPU & iGPU aren't bad. Offloading to a real GPU is faster, but with the iGPU route you can buy cheaper SODIMM memory in larger quantities, still use it as unified memory, eventually use it with NPUs, all without using too much power or buying cards with expensive GDDR.
Qwen-3.5 locally is "good enough" for more than I expected, if that trend continues, I can see small deployable models eventually being viable & worthy competition, or at least being good enough that companies can run their own instead of exfiltrating their trade secrets to the worst people on the planet in real-time.
I don't think anybody is doubting its ability to generate thousands of PR's though. And yes, it's usually in the stuff that should have been automated already regardless of AI or not.
Depends on your circle. On HN I would argue that there are still a fair number of people that would be surprised to see what heavy organizational usage of AI actually looks like. In non-programming online groups, several of which I am a member of, people still think that AI agents are the same as they were in mid 2025 and that they can't answer "how many R's are in the following word:". Same thing even when chatting with my business owner friends. The majority of the public has no clue of the scale of recent advancement.
these companies contribute to swathes of the west's financial infrastructure, not quite safety critical but critical enough, insane to involve automation here to this degree
Even in webdev it rots your codebase unchecked. Although it's incredibly useful for generating UI components, which makes me a very happy webslopper indeed.
im grateful to have never bothered learning web dev properly, it was enlightening witnessing chat gpt transform my ten second ms paint job into a functional user interface
I don't know if this is the future, but if it is, why bother building one version of the software for everyone? We can have agents build the website for each user exactly the way they want. That would be the most exciting possibility to come out of AI-generated software.
A PR tells me what changed, but not how an AI coding session got there: which prompts changed direction, which files churned repeatedly, where context started bloating, what tools were used, and where the human intervened.
I ended up building a local replay/inspection tool for Claude Code / Cursor sessions mostly because I wanted something more reviewable than screenshots or raw logs.
I don't mean this as shade, but people who are not coders now seem to think "coding is now solved" and seem to be pushing absurd ideas like shipping software with Slack messages. These people are often high up in the chain and have never done serious coding.
Stripe is apparently pushing a gazillion PRs now from Slack, but their feature velocity has not changed. So what gives?
How is it that the number of PRs is now the primary metric of productivity, and no one cares about what is being shipped or whether we are shipping product faster? It's total madness right now. Everyone has lost their collective minds.
CTOs and CEOs are now feeling insane pressure to show how they are using AI, but it's not evident in output. So now they've resorted to blabbering publicly about PRs, lines of code, etc. to save face. And of course the people giving them a voice and a platform have their own agendas that prevent them from asking "so what exactly have you shipped, Stripe, from a million PRs/day?"
It's baffling to see these comments on Hacker News though. I guess you have to prove that you are not a luddite by making "AI forward" predictions and showing that you "get it".
I am already there with a project/startup with a friend. He writes up an issue in GitHub and there is a job that automatically triggers Claude to take a crack at it and throw up a PR. He can see the change in an ephemeral environment. He hasn't merged one yet, but it will get there one day for smaller items.
I am already at the point where because it is just the two of us, the limiting factor is his own needs, not my ability to ship features.
We don't have product managers or technical ticket writers of any sort.
But us devs are still choosing how to tackle the ticket. We def don't have to, as I’m solving the tickets with AI. I could automate my job away if I wanted, but I wouldn't trust the result, as I give a degree of input and steering, and there are bigger-picture considerations it's not good at juggling, for now.
> I feel like we are just inching closer and closer to a world where rapid iteration of software will be by default.
There's a lot of experimentation right now, but one thing that's guaranteed is that the data gatekeepers will slam the door shut[1] - or install a toll-booth when there's less money sloshing about, and the winners and losers are clear. At some point in the future, Atlassian and GitHub may not grant Anthropic access to your tickets unless you're on the relevant tier with the appropriate "NIH AI" surcharge.
1. AI does not suspend or supplant good old capitalism and the cult of profit maximization.
I feel like a lot of people and companies wanted to automate the web, but most website operators wouldn't let you and would block you. Now you put the name AI on it and you're allowed to do it.
I remember when I tried to set something up with the ChatGPT equivalent like "notify me only if there are traffic disruptions in my route every morning at 8am" and it would notify me every morning even if there was no disruption.
This is because for some reason all agentic systems think that slapping cron on it is enough, but that completely ignores decades of knowledge about prospective memory. Take a look at https://theredbeard.io/blog/the-missing-memory-type/ for a write-up on exactly that.
“A programmer is going to the store and his wife tells him to buy a gallon of milk, and if there are eggs, buy a dozen. So the programmer goes shopping, does as she says, and returns home to show his wife what he bought. But she gets angry and asks, ‘Why’d you buy 13 gallons of milk?’ The programmer replies, ‘There were eggs!’”
"I need to fly to NY next weekend, make the necessary arrangement".
Your AI assistant orders an experimental jetpack from a random startup lab. Would you have honestly guessed that the prompt was "ambiguous" before you knew how the AI was going to act on it?
This doesn't seem too hard to solve, except for the ever-recurring LLM output validation problem. If the true positive is rare, you don't know if the earthquake alert system works until there's an earthquake.
... just force the data into a structured format, then use "hard code" on the structure.
"Generate the following JSON formatted object array representing the interruptions in my daily traffic. If no results, emit []. Send this at 8am every morning. {some schema}. Then run jsonreporter.py"
Then just let jsonreporter.py discriminate however it likes. Keep the LLMs doing what they are good at, and keep hard code doing what it's good at.
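A minimal sketch of what that `jsonreporter.py` could look like, assuming the model was asked to emit a JSON array of objects. The field names `route` and `description` are hypothetical stand-ins for "{some schema}", and `build_report` is just an illustrative name. The point is that the "notify or stay silent" decision lives in ordinary code, and malformed output is treated as an error rather than as "no disruptions".

```python
import json
from typing import Optional


def build_report(raw: str) -> Optional[str]:
    """Return a notification message, or None when there is nothing to send."""
    try:
        items = json.loads(raw)
    except json.JSONDecodeError as exc:
        # Refuse to guess: malformed model output is an error, not "all clear".
        raise ValueError(f"model output was not valid JSON: {exc}") from exc
    if not isinstance(items, list):
        raise ValueError("expected a JSON array of interruptions")

    lines = []
    for item in items:
        # "route" and "description" are hypothetical field names for this sketch.
        if not isinstance(item, dict) or "description" not in item:
            raise ValueError(f"malformed interruption entry: {item!r}")
        lines.append(f"- {item.get('route', 'unknown route')}: {item['description']}")

    # The empty array is the "stay silent" case; no LLM judgment is involved.
    if not lines:
        return None
    return "Traffic disruptions on your route:\n" + "\n".join(lines)


# Example: an empty array produces no message, a non-empty one does.
print(build_report("[]"))
print(build_report('[{"route": "A4", "description": "lane closed"}]'))
```

Whatever scheduler runs this only sends a notification when the return value is not `None`, so the 8am "nothing happened" ping never goes out.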
I do feel people will end up using this for things where a deterministic rule could be used - more effective, faster and cheaper. See this starting to happen at work... "We need AI to solve X." "No, you don't."
Maybe. The problem of "execute task on a cron" is something I've noticed the industry seems to refuse to solve in general, as if intentionally denying this capability for regular people. Even without AI, it's the most basic block of automation, and is always mysteriously absent from programs and frameworks (at least at the basic level). AI only makes it more useful on "then" side, but reliable cron on "if" side is already useful.
Most of the industry today is educated to avoid manual hacky solutions on single servers. You need to have a fancy UI, frameworks with easy feedback, and layers on top of layers that maintain other layers. Cron is an ancient tool with arcane syntax which offers barely anything out of the box; you have to know it and work it to get something out of it.
And there is also the mindset to avoid boring loops, and prefer event driven solutions for optimal resource-usage. So people also have a kind of blind spot for this functionality.
I don’t recall if IFTTT had/has a basic cron or not, but it sure has/had put a lot of basic automations in the hands of the general public. Same for Apple Shortcuts, to some extent, or Zapier.
This is a larger topic that's worthy of a comparably large rant, which I really don't want to do right now, but to keep it short, in my subjective view:
- IFTTT was great when it started; at some point, it became... weird, in a "I don't even know what's going on on my screen, is this a poster or an app" kind of way.
- Zapier is an impenetrable mess that evidently targets marketers and other business users; discovery is hard, and even though it seems like it has everything, it - like all tools in this space - is always missing the one feature you actually need.
- Yahoo Pipes, I heard they were great, but I only learned about them after they shut down.
- Apple Shortcuts - not sure what you can do with those, but over the years of reading about them in HN comments, I think they may be the exception here, in being both targeting regular users and actually useful.
- Samsung Modes and Routines - only recently becoming remotely useful, so that's nice, even if vendor-restricted.
- Tasker - an Android tool that actually manages to offer useful automation, despite the entire platform/OS and app ecosystem trying its best to prevent it. Which is great, if your main computer is a phone. It sucks in a world of cloud/SaaS, because it creates a silly situation where e.g. I could nicely automate some things involving e-mail and calendars from Tasker + FairEmail, but... well my mailboxes and calendars live in the cloud so some of that would conflict with use of vendor (Fastmail) webapp or any other tool.
Or, in short: we need Tasker but for web (and without some of the legacy baggage around UI and variable handling).
The sorry state of automation is not entirely, or even mostly, the fault of the automation platforms. I may have issues with some UI and business choices some of these platforms made, but really, the main issue is that integrations are business deals and the integrated sides quickly learned to provide only a limited set of features - never enough to allow users to actually automate use of some product. There's always some features missing. You can read data but not write it. You can read files and create new files but not edit or delete them. You can add new tasks but can't get a list of existing ones. Etc.
It's another reason LLMs are such a great thing to happen - they make it easy (for now) to force interoperability between parties that desperately want to prevent it. After all, worst case, I can have the LLM operate the vendor site through a browser, pretending to be a human. Not very reliable, but much better than nothing at all.
Similarly short on reply here, but quickly: IFTTT: hah, I agree. It was awesome when it was more about IoT than Spotify to Google Sheets.
And re: Zapier: yes, that’s the key to Zapier, from my experience: usage in marketing and the “power user” base.
Re: shortcuts: (I live in the Apple ecosystem) Shortcuts + AppleScript is gold on macOS. Shortcuts + iOS is about to be game changing - it already changed the game, it’s just nobody has been playing it, because it’s not “fun”.
After Siri+Gemini+Shortcuts, everyone will be playing it, I suspect, even on Android, it will get built somehow.
Node RED is still unwieldy for the masses, as easy as it is for a consumer to install, it’s not necessarily as easy to use.
Consumer grade automations built on node-RED? I suppose it depends on the market, but most people aren’t going to want to fiddle with it, I suspect.
A plugin for Chrome might be able to take off though, or some killer mobile app, but it needs to run on a cheap phone and control things without having to keep track of loops and logic and variables and all the fun stuff.
None of the tools here are for the masses. Automation in itself is already hard to grasp for the average user, and while some of those are simpler to start with than others, they are all a wall to climb.
Agree. How would you solve this in general; what would be the ingredients? People use things like Zapier, n8n, and Node-RED to achieve this today, but in many cases they are overkill.
Honestly, you just need cron (and Ruby/Python/bash/whatever) on an EC2. It's not very fashionable, but it works, will continue to work forever, and costs hardly anything.
> Analyzing CI failures overnight and surfacing summaries
What would that look like on EC2 with Python? Because with Claude, it’s that prompt, and with your solution it’s infra + security groups + multiple APIs + whatever code you actually write.
I would suggest the prompt is an example of garbage in that's going to produce garbage out. Sitting down to confront the problem you're solving will show this, while Claude is going to happily spit out what looks like a plausibly functional system.
So for example the only "analysis" of CI failures is which systems failed and who/what committed the changes to those things. The only way AI would help me here is if the system was so jank that the sole primitive I can use is textual analysis of log files. Which granted is probably real for a lot of software firms, but I really hope I have better build and test infrastructure than that.
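For what it's worth, that kind of "analysis" is a few lines of ordinary code once the CI system hands you structured records. A rough sketch, assuming each run is a dict with hypothetical `job`, `status` and `author` keys (a real CI API will have its own shape, and `summarize_failures` is an illustrative name):

```python
from collections import defaultdict


def summarize_failures(runs):
    """Map each failed job to the sorted list of authors whose commits it ran on."""
    authors_by_job = defaultdict(set)
    for run in runs:
        # Only failed runs matter for the overnight summary.
        if run["status"] == "failed":
            authors_by_job[run["job"]].add(run["author"])
    return {job: sorted(authors) for job, authors in sorted(authors_by_job.items())}


# Hypothetical overnight data; in practice this would come from the CI system.
runs = [
    {"job": "unit-tests", "status": "failed", "author": "alice"},
    {"job": "unit-tests", "status": "failed", "author": "bob"},
    {"job": "lint", "status": "passed", "author": "alice"},
    {"job": "deploy", "status": "failed", "author": "bob"},
]
print(summarize_failures(runs))
```

No log parsing and no model call; it only works if the build infrastructure exposes structured results in the first place, which is the whole caveat.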
> I would suggest the prompt is an example of garbage in that's going to produce garbage out. Sitting down to confront the problem you're solving will show this, while Claude is going to happily spit out what looks like a plausibly functional system.
I think this shows the value.
> Which granted is probably real for a lot of software firms
Here's the rub though; for many many people it's a huge improvement over what they have right now.
I'd start with solving the UX issues, specifically expectations and UI around scheduling jobs.
Expectations - the functionality of "do X on a timer" needs to be offered to users as a proper end-user feature[0], not treated as a sysadmin feature (Windows, Linux) or not provided at all (Android). People start seeing it on their own devices, they'll start using it, then expecting it, and the web will adjust too[1].
UI - somehow this escapes every existing solution, from `cron` through Windows timers to any web "on timer" event trigger in any platform ever. There already exists a very powerful UI paradigm for managing recurring tasks, that most normies know how to use, because they're already using it daily at work and privately: a calendar. Yes, that thing where we can set and manage recurring events, and see them at a glance, in context of everything else that's going on in our lives.
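To make the calendar idea a bit more concrete: a recurring job is just a recurring event, and a calendar view only needs the expanded list of dated occurrences. A small sketch under that assumption (the function name and the weekday encoding, Monday = 0 as in Python's `date.weekday()`, are my own choices, not anything an existing tool exposes):

```python
from datetime import date, datetime, time, timedelta


def expand_weekly(start: date, weekdays: set, at: time, count: int) -> list:
    """Return the next `count` run times of a weekly job, starting from `start`."""
    if not weekdays or not all(0 <= d <= 6 for d in weekdays):
        raise ValueError("weekdays must be a non-empty set of ints in 0..6")
    occurrences = []
    day = start
    while len(occurrences) < count:
        # Monday is 0 and Sunday is 6, matching date.weekday().
        if day.weekday() in weekdays:
            occurrences.append(datetime.combine(day, at))
        day += timedelta(days=1)
    return occurrences


# "Every Monday and Wednesday at 08:00", rendered as calendar entries.
for run_at in expand_weekly(date(2024, 1, 1), {0, 2}, time(8, 0), 3):
    print(run_at.isoformat())
```

Each returned datetime is something a calendar can draw next to meetings and appointments, which is where a clash between a scheduled job and the rest of your day would become visible at a glance.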
--
<rant>
I know those are hard problems, but are hard mostly because everybody wants to be the fucking one platform owning users and the universe. This self-inflicted sickness in computing is precisely why people will jump at AI solutions for this. Why I too will jump on this: because it's easier than dealing with all the systems and platforms that don't want to cooperate.
After all, at this point, the easiest solution to the problems I listed above, and several others in this space, would be to get an AI agent that I can:
1) Run on a cron every 30 minutes or so (events are too complicated);
2) Give it read (at minimum) access to my calendar and todo lists (the ones I use, but I'm willing to compromise here);
3) Give it access to other useful tools
Which I guess brings us to the actual root problem here. "Run tasks on a cron" and "run tasks on trigger" are basically just another way of saying unattended/non-interactive usage. That is what is constantly being denied end users.
This is also the key to enabling most value of AI tools, too, and people understand it very well (see the popularity of that Open Claw thing as the most recent example), but the industry also lives in denial, believing that "lethal trifecta" is a thing that can be solved.
</rant>
--
[0] - This extends to event triggers ("if X happens, then") automation, and end-user automation in all of every-day life. I mean, it's beyond ridiculous that the only things normal people are allowed to run automatically are a dishwasher and a laundry machine (and in the previous era, VCRs).
[1] - As a side effect, it would quickly debullshitify "smart home" / "internet of things" spaces a lot. The whole consumer side of the market revolves around selling people basic automation capabilities - except vendor-locked, and without the most useful parts.
> See this starting to happen at work... "We need AI to solve X." "No, you don't."
Same. Sometimes it is just people overeager to play with new toys, but in our case there is a push from the top & outside too: we are in the process of being subsumed into a larger company (completion due on April the 1st, unless the whole thing is an elaborate joke!) and there is apparently a push from the investors there to use "AI" more in order to not "get left behind the competition".
It's self-perpetuating. I was talking to the CEO of a Series A level B2B SaaS company here in the UK recently. Most of the prospects his sales team are hitting are re-allocating their wallets to only looking for products that use AI, on the back of senior management pushing them to do so.
This company already does some pretty cool stuff with statistics for forecasting but now they are pivoting their roadmap to bake in GenAI into their offering over some other features that would be more valuable to their clients.
I'd say that's almost fine if they can start expressing intent correctly and thinking about what good looks like. They (or some automated thing if you're building "think for them" type of products instead of "give them tools and teach them to think how to use them") can then freeze determinism more and more where useful.
I wrote this to help people (not just Devs) reason about agent skills
I feel this would be more useful for tasks like "Check website X to see if there are any great deals today". Specifically, tasks that are loosely defined and require some form of intuition.
The problem I'd think, for the average user, would be writing the 'then' part of any deterministic rule — that would require coding, or at least some kind of automation script (visual or otherwise) that's basically coding in a trench coat, which for most people is still a barrier to entry and annoying. I think that's why they'd use AI tbh — they can just describe what they want in natural language with AI.
People are loading huge interpreted environments for stuff that can be done from the command line. Run computations on complex objects where it could be a single machine instruction etc. The trend has been around for a long time.
Standard pendulum swing. Most people want to disengage their thinking circuits most of the time, so problems can't be evaluated one by one. There is no such thing as "this is a good solution for some problems". It can only be "this is a good solution for all problems". When the pendulum swings this far, this hard, it will swing all the way back eventually.
I've recently switched from GitHub Copilot Pro to Claude Code Max (20x). While Claude is clearly superior in many aspects, one area where it falls short is remote/cloud agents.
Yesterday, I spent the entire day trying to set up "Claude on the web" for an Elixir project and eventually had to give up. Their network firewall kept killing Hex/rebar3 dependency resolution, even after I selected "full" network access.
The environment setup for "on the web" is just a bash script. And when something goes wrong, you only see the tail of the log. There is currently no way to view the full log for the setup script. It's really a pain to debug.
The Copilot equivalent to "Claude on the web" is "GitHub Copilot Coding Agents," which leverages GitHub Actions infrastructure and conventions (YAML files with defined steps). Despite some of the known flaws of GitHub Actions, it felt significantly more robust.
"Schedule task on the web" is based on the same infrastructure and conventions as "Claude on the web", so I'm afraid I'm gonna have the same troubles if I want to use this.
Looks like I'm limited to only 3 cloud scheduled tasks. And I'm on the Max 20x plan, too :(
"Your plan gets 3 daily cloud scheduled sessions. Disable or delete an existing schedule to continue."
But otherwise, this looks really cool. I've tried using local scheduled tasks in both Claude Code Desktop and the Codex desktop app, and very quickly got annoyed with permissions prompts, so it'll be nice to be able to run scheduled tasks in the cloud sandbox.
Here are the three tasks I'll be trying:
Every Monday morning: Run `pnpm audit` and research any security issues to see if they might affect our project. Run `pnpm outdated` and research into any packages with minor or major upgrades available. Also research if packages have been abandoned or haven't been updated in a long time, and see if there are new alternatives that are recommended instead. Put together a brief report highlighting your findings and recommendations.
Every weekday morning: Take a look at Sentry errors, logs, and metrics for the past few days. See if there are any new issues that have popped up, and investigate them. Take a look at logs and metrics, and see if anything seems out of the ordinary, and investigate as appropriate. Put together a report summarizing any findings.
Every weekday morning: Please look at the commits on the `develop` branch from the previous day, look carefully at each commit, and see if there are any newly introduced bugs, sloppy code, missed functionality, poor security, missing documentation, etc. If a commit references GitHub issues, look up the issue, and review the issue to see if the commit correctly implements the ticket (fully or partially). Also do a sweep through the codebase, looking for low-hanging fruit that might be good tasks to recommend delegating to an AI agent: obvious bugs, poor or incorrect documentation, TODO comments, messy code, small improvements, etc.
I ran all of these as one-off tasks just now, and they put together useful reports; it'll be nice getting these on a daily/weekly basis. Claude Code has a Sentry connector that works in their cloud/web environment. That's cool; it accurately identified an issue I've been working on this week.
I might eventually try having these tasks open issues or even automatically address issues and open PRs, but we'll start with just reports for now.
Scheduling is easy. The hard part is everything between "started" and "done" - task needs human approval at step 3, fails at step 5 (retry from 4 or from scratch?), takes 6 hours and something restarts. How do they handle tasks that span multiple inference calls? Is there checkpointing or does it start over?
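For what it's worth, here is a minimal sketch of what "checkpointing" could mean for a multi-step task, so a failure at step 5 resumes from step 5 instead of from scratch. Nothing here reflects how Anthropic's scheduler actually works; `run_with_checkpoints`, the step names, and the JSON state file are all hypothetical.

```python
import json
from pathlib import Path


def run_with_checkpoints(steps, state_path):
    """Run named steps in order, persisting progress after each one.

    `steps` is a list of (name, callable) pairs. Each callable receives the
    dict of results so far and returns a JSON-serializable value.
    (Hypothetical helper, not part of any real scheduler API.)
    """
    path = Path(state_path)
    # Load earlier progress if a previous run was interrupted.
    results = json.loads(path.read_text()) if path.exists() else {}
    for name, fn in steps:
        if name in results:
            continue  # finished in an earlier run, so skip it
        # If fn raises, the checkpoint file still holds every completed step.
        results[name] = fn(results)
        path.write_text(json.dumps(results))
    return results
```

Re-running the same call after a crash picks up at the first step missing from the state file. Human approval at step 3 would fit the same shape: the approval step raises until someone signs off, and the next scheduled run retries it.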
We need to fight model providers trying to own memory, workflows and tooling. Don't give them an inch more of your software than needed even if there is a slight inconvenience setting up.
Why? As a user of these tools, I love the convenience factor of having one tool rather than wrangling dozens. It's why in the past I've used an IDE (JetBrains), a language created by the provider of the IDE (Kotlin), web framework created by the same people (ktor), etc.
This is very different to a framework, language or IDE. More comparable to apple or amazon trying to create corporate anti competitive hellscapes of enslaved users that have no agency, no dignity and no real choice, reduced to rent extraction targets. Just with much more dire consequences and much more at stake. We still have the power to make ai providers have no moat and be interchangeable commodity. But we have to fight for them to not get control of the other layers they are trying to grab. We are in a war, people who can still use claude code or other of their garbage tools, after anthropic threatened and shut off opencode, are very naive and ignorant.
From an outside perspective, this sounds hyperbolic. I don’t know why task scheduling would be a part of a war.
In fact, I re-read the article before submitting this comment just to make sure I wasn’t missing something. What on earth is so polarizing about a prompt being run recurrently? It’s a long-awaited feature that I’ve personally needed.
If you want to win your war, you’ll need better propaganda to recruit people. Start with me. My mind is open. Why should I join?
Please tie your claims concretely to this new feature. I’m interested in how adding this could erode open source software. To me they seem completely independent, and it’s a welcome change.
I can't remove the YouTube app off my phone. The mobile phone is a locked up landscape that hates general purpose computing that puts the owner of the device in control. In the same way the big LLM want to give you stuff for free / subsidized then become very opinionated about how you use this stuff then pave up the entire landscape and monopolize it for themselves. Screw that.
We are in a war to defend control over our tools from AI companies that try to take over any adjacent technology and anything that can be turned into a platform with a lock-in effect. Subsidising subscriptions and locking people into their CLI is just the start.
"A scheduled task runs a prompt on a recurring cadence using Anthropic-managed infrastructure." >> There is no other way to read this: in this context it's just a small feature, but it's a land grab to run workflows, not just models, locked into their cloud. We don't fall for regimes in one go but one tiny piece at a time, like the frog in the water.
I paid a lot of attention to the opencode drama, and I still have a lot of respect for Dax, Adam, and the rest of that team. What I saw was a startup seeking to use API keys specific to Anthropic's subscription model, subsidized and intended for use solely by Anthropic's provided tooling. Anthropic also has an API usage-based model, for companies who want to create tooling around Anthropic models or integrate the models in their own products.
> Anthropic wants a world where they own your agent where it can't exist outside of the Claude desktop app or Claude Code.
Please. I'm sure you're referring to their locking down of subscription keys, which of course they are going to have restrictions on. It's a subsidized subscription model.
You've always been able to create a platform account and use API keys with usage-based billing, and that will never go away. Charging enough to make a profit on inference isn't exactly rent-seeking or whatever language you want to use to villainize a company trying to make enough revenue to survive.
hi, I don’t normally promote here, but I feel compelled to ask if you’d like to test my thing. it’s a personal agent / API for creating and managing background cloud agents that I’m 100% committed to keeping open source & accessible as an alternative platform to putting all your eggs in one basket. there is also a desktop app and expanding the api to involve storage. kind of like agentic dropbox that can also do coding and has a full computer and ability to spin up N agents
Very much like the idea. Thanks for sharing. Noticed that you are pushing this fully anonymously and wanted to chat with you regarding a project that I’m building. Mind contacting me on the address in my profile?
I can't pick the effort for the tasks run on Claude Web. I have a feeling Claude is using low or medium effort on those tasks, and I observe clear quality differences compared with tasks run on my local Claude Code, which uses high effort.
LinkedIn already employs anti-scraping measures, so I'd expect a lot of users to get flagged.
That's not unique to LinkedIn but what is somewhat unique is the strong linkage to real world identities, which raises the cost of Sybil attacks on personal networks with high trust.
i'm missing something basic here .... what does it actually do? It executes a prompt against a git repository. Fine - but then what? Where does the output go? How does it actually persist whatever the outcome of this prompt is?
Is this assuming you give it git commit permission and it just does that? Or it acts through MCP tools you enable?
i'd say it's more like intentionally choosing to use naive string interpolation for SQL queries than a trusted library's parameter substitution. Both work.
There is no "parameter substitution" equivalent possible. Prompt injection isn't like SQL injection, it has no technical solution (that isn't AGI-complete).
Prompt injection is "social engineering" but applied to LLMs. It's not a bug, it's fundamentally just a facet of its (LLM/human) general nature. Mitigations can be placed, at the cost of generality/utility of the system.
MCP itself is just an API. Unless the MCP server has a hidden LLM for some reason, it's still a piece of regular, deterministic software.
The security risk here is the LLM, not the MCP, and you cannot secure the LLM in such a system any more than you can secure a user - unless you put that LLM there and own it, at which point it becomes a question of whether it should've been there in the first place (and the answer might very well be "yes").
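To make the SQL side of that comparison concrete, here is a small sketch using Python's standard `sqlite3` module. The table and the hostile input are made up for illustration. The parameterized query works because the database has a separate channel for data; a prompt has no such channel, which is the point being made about prompt injection.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, is_admin INTEGER)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [("alice", 1), ("bob", 0)])

hostile = "nobody' OR '1'='1"  # illustrative attacker-controlled input

# Naive interpolation: the input is spliced into the query text and parsed as SQL.
unsafe_rows = conn.execute(
    f"SELECT name FROM users WHERE name = '{hostile}'"
).fetchall()

# Parameter substitution: the input travels out of band and is only ever a value.
safe_rows = conn.execute(
    "SELECT name FROM users WHERE name = ?", (hostile,)
).fetchall()

print(len(unsafe_rows), len(safe_rows))  # prints "2 0"
```

The interpolated query returns every row because the `OR '1'='1'` fragment becomes part of the statement; the parameterized one matches nothing.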
Oh my, did Anthropic invent Cron jobs as a service?
It's a game changer.
Edit: my mistake. It's inferior to a Cron job. If my repos happen to be self hosted with Forgejo or codeberg, then it won't even work. If I concede to use GitHub though I don't have to set up any env variables. Schedules lock-in, all over the web.
You jest, but for some reason the industry stubbornly refuses to solve the "cron job as a service" problem for end-users, whether on the web or in the OS.
I feel this is rooted in problems that extend beyond computing. Regular people are not allowed to automate things in their life. Consider that for most people, the only devices designed to allow unattended execution off a timer are a washing machine, some ovens and dishwashers, and an alarm clock (also VCRs in the previous era). Anything else requires manual actuation and staying in a synchronous loop.
> what happens when our paying users figure out they can get a better and cheaper model elsewhere.
They solved that with subscriptions. For end-users (and developers using AI for coding), it makes no sense to go for pay-as-you-go API use, as anything interesting will burn more than the monthly subscription worth of $$$ in API costs in few hours to days.
Yes but that's anthropic API pricing, some of the highest per token.
Sure subscription is a sort of tie in, but only if users are fooled into investing in workflows bound to anthropic. That's what the company is hooking them to do with this scheduler, banning open agentic framework and the rest.
The moat, if any, will be the tooling. Token is becoming a commodity, they know it.
That doesn't explain lack of such functionality at the OS/platform level. It technically exists on Linux and Windows, but is heavily optimized towards sysadmin use, and essentially hidden from regular users on the "normie UI surface". Most people don't even realize their computers could do things on a timer.
(And on Android, AFAIK there's exactly nothing at all. There's not even common support for any kind of basic automation; only recent exception is Samsung. From third-party apps, there's always been Tasker - very powerful, but the UX almost makes you want to learn to write Android apps instead.)
What is wrong with things like the Zapier scheduler? (ie https://zapier.com/apps/schedule/integrations) For running locally, there's also a plethora of cronlikes for every OS under the sun.
I think the core problem is not so much that it is not "allowed", but that even the most basic types of automation involves programming. I mean "programming" here in the abstract sense of "methodically breaking up a problem into smaller steps and control flows". Many people are not interested in learning to automate things, or are only interested until they learn that it will involve having to learn new things.
There is no secret conspiracy stopping people from learning to automate things, rather I think it's quite the opposite: many forces in society are trying to push people to automate more and more, but most are simply not interested in learning to do so. See for example the bazillion different "learn to code" programs.
It's not default. People don't need courses for this, they need availability and nudges. None of the platforms people use expose such features to users, much less encourage them to try. On the contrary, they hide or remove it from base UI layer entirely, and the UI choices made clearly suggest platform vendors don't even consider the possibility of regular people being interested.
Computing isn't, and has never been, demand-driven. It's all supply-driven. People choose from what's made available by vendors, and nobody bothers listening to user feedback.
Cron triggers (or specific triggers per connector like new email in Gmail, new linear issue, etc for built in connectors).
Then you can just ask in natural language when (whatever trigger+condition) happens do x,y and z with any configuration of connectors.
It creates an agentic chain to handle the events. Parent orchestrator with limited tools invoking workers who had access to only their specific MCP servers.
Official connectors are just custom MCP servers and you could add your own MCP servers.
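A rough sketch of the "orchestrator with limited tools, workers scoped to their own servers" shape, with plain Python callables standing in for MCP tools. The `Worker` and `Orchestrator` classes are hypothetical and are not the MCP SDK or the commenter's actual code.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

Tool = Callable[[str], str]


@dataclass
class Worker:
    name: str
    tools: Dict[str, Tool]  # the only tools this worker is allowed to call

    def call(self, tool: str, arg: str) -> str:
        if tool not in self.tools:
            raise PermissionError(f"{self.name} has no access to {tool}")
        return self.tools[tool](arg)


@dataclass
class Orchestrator:
    workers: Dict[str, Worker] = field(default_factory=dict)

    def dispatch(self, worker: str, tool: str, arg: str) -> str:
        # The orchestrator holds no tools of its own; it can only delegate.
        return self.workers[worker].call(tool, arg)


# Stand-in for a mail connector: a worker with a single stub tool.
mail = Worker("mail", {"summarize": lambda text: text[:10]})
orchestrator = Orchestrator({"mail": mail})
print(orchestrator.dispatch("mail", "summarize", "School trip on Friday"))
```

The design choice is that a compromised or confused worker can only reach the tools in its own allow-list, so the blast radius of one bad email stays inside the mail worker.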
I definitely had the most advanced MCP client on the planet at that point, supporting every single feature of the protocol.
I think that's why I wasn't blown away by OpenClaw, I had been doing my own form of it for a while.
I need to release more stuff for people to play around with.
My friends had use cases like "I get too many emails from my kids school I can't stay on top of everything".
So the automation was just asking "when I get an email from my kids school, let me know if there's anything actionable for me in it"
Better idea. Watch online feedback on this feature. Then implement things users want. Go niche. Join the forum and help them use Claude to its limits. Then be the next step for power users.
Welcome to the Amazon playbook replayed again: the most useful, profitable and popular use-cases will be implemented by the platform - and they will do it ruthlessly and quickly, as money needs to be recouped.
That feature was silently launched about a week ago for me.
I use it to:
- perform review of latest changes of code to update my documentation (security policies, user documentation etc.)
- perform review of latest changes of code, triage them, deduplicate and improve code - I review them, close them with comments for over-engineering / add review for auto-fix
- perform review of open GitHub issue with label, select the one with highest impact, comment with rationale, implement it and make pull request - I wake up and I have a few pull request to fix issues that I can approve /finish in existing Claude Code thread
I also want to use it to:
- review recent Sentry issues, make GitHub issues for the one with highest priority, make pull request with proposed fix - I can just wake up and see that some crash is ready to be resolved
The limit of 3 scheduled jobs is pretty impactful, but playing with it gave me a nice idea of how I can reduce my manual work.
interesting to see feature launches are coming via official website while usage restrictions are coming in with a team member's twitter account - https://x.com/trq212/status/2037254607001559305.
also, someone rightly predicted this rugpull coming in when they announced 2x usage - https://x.com/Pranit/status/2033043924294439147
To me it makes perfect sense for them to encourage people to do this, rather than eg making things more expensive for everyone.
The same as charging a different toll price on the road depending on the time of day.
If you use the cloud providers you accept this and more.
If you want stability, own the means of inference and buy a Mac Studio or Strix Halo computer.
Funnily, Anthropic's pricing etc. is why I'm using GLM-5 a bunch more outside of work. Definitely not Opus level, but surprisingly decent. Though I got lucky and got the Alibaba Coding Model lite plan, which is so cheap they got rid of it
I've been doing something similar. I use Claude for analysis and non-coding work, GLM for most coding tasks (GLM's coding plan) and when I need to do a larger implementation project I use GLM&Claude to build out an in depth plan and toss it to Github Copilot to Opus the implementation.
I was trying to get the Alibaba plan but missed the mark. I'm curious to try out the Minimax coding plan ($10/mo) or Kimi ($20/mo) at some point to see how they stack up.
For Pricing: GLM was $180 for a year of their pro tier during a black friday sale and GHCP was $100/year but they don't have the annual plan any more so it is now $120. Alibaba's only coding plan today is $50/mo, too rich for me.
Does GLM-5 have multimodality or are they still wanting you to load an MCP for vision support?
Text only still, sadly, though qwen3.5-plus on the same provider (Model Studio) is
If you read the replies to the second, you’ll see an engineer on Claude Code at Anthropic saying that it is false.
Someone spread FUD on the internet, incorrectly, and now others are spreading it without verifying.
And if you look closely at the usernames, you see that the same engineer from link 2 who said "nah it’s just a bonus 2x, it’s not that deep" (just two weeks ago) is now saying "we're going to throttle you during peak hours" (as predicted).
Yes, it was FUD, but ended up being correct. With the track record that Anthropic has (e.g. months long denial of dumbed down models last year, just to later confirm it as a "bug"), this just continues to erode trust, and such predictions are the result of that.
Anthropic fixing that bug way faster than Apple fixing iOS keyboard "bug". Anthropic even acknowledged it, Apple gave us the silent treatment for years.
I'm not sure it's a rug pull when their stats show 7% and 2% subscription-level impacts. We're back in the ISP days, and they never said unlimited.
I feel like we are just inching closer and closer to a world where rapid iteration of software will be by default. Like for example a trusted user makes feedback -> feedback gets curated into a ticket by an AI agent, then turned into a PR by an Agent, then reviewed by an Agent, before being deployed by an Agent. We are maybe one or two steps from the flywheel being completed. Or maybe we are already there.
I just don’t see it coming. I was fully in that camp 3 months ago, but I've come to realize that every step makes more mistakes. It leads into a deadlock where no human has the mental model anymore.
Don’t you guys have hard business problems that AI just can’t solve, or solves only very slowly, presenting you 17 ideas until it finds the right one? I’m using the most expensive models.
I think the nature of AI might block that progress, and I think some companies have woken up and others will wake up later.
The mistake rate is just too high. And every system you implement to reduce that rate has a mistake rate as well and increases complexity and the necessary exploration time.
I think the bulk of people are where the early adopters were in December. AI can implement working functionality on a well-maintained codebase.
But it can’t write maintainable code itself. It actually makes you slower compared to writing the code with AI assistance, because when assisted you are far more in the loop and can stop a lot of small issues right away. And you iterate fast on everything.
I’ve not opened my IDE for 1 month and it became hell at some point. I’ve now deleted 30k lines, and the amount of issues I’m seeing has been an eye-opening experience.
Unscalable performance issues, verbosity, straight up bugs, escape hatches against my verification layers, quindrupled types.
Now I could monitor the AI output more closely, but then again I’m faster writing it myself. Because it’s one task. AI-assisted typing isn’t slower than my brain is.
Also, thinking more about it, FAANG pays $300 per line in production, so what are we really trying to achieve here? Speed was never the issue. A great coder writes 10 production lines per day.
Accuracy, architecture etc. are the issue. You get there by building good, solid fundamental blocks that make feature additions easier over time and not slower.
I know it’s not your main point, but I’m curious where $300/line comes from. I don’t think I’ve ever seen a dollar amount attached to a line of production code before.
I think this sounds like a true yet short sighted take. Keep in mind these features are immature but they exist to obtain a flywheel and corner the market. I don’t know why but people seem to consistently miss two points and their implications
- performance is continuing to increase incredibly quickly, even if you rightfully don’t trust a particular evaluation. Scaling laws like chinchilla and RL scaling laws (both training and test time)
- coding is a verifiable domain
The second one is most important. Agent quality is NOT limited by human code in the training set, this code is simply used for efficiency: it gets you to a good starting point for RL.
Claiming that things will not reach superhuman performance, INCLUDING all end to end tasks: understanding a vague business objective poorly articulated, architecting a system, building it out, testing it, maintaining it, fixing bugs, adding features, refactoring, etc. is what requires the burden of proof because we literally can predict performance (albeit it has a complicated relationship with benchmarks and real world performance).
Yes definitely, error rates are too high so far for this to be totally trusted end to end but the error rates are improving consistently, and this is what explains the METR time horizon benchmark.
Scaling laws vs combinatorial explosion, who wins? In personal experience claude does exceedingly well on mundane code (do a migration, add a field, wire up this UI) and quite poorly on code that has likely never been written (even if it is logically simple for a human). The question is whether this is a quantitative or qualitative barrier.
Of course it's still valuable. A real app has plenty of mundane code despite our field's best efforts.
Combinatorial explosion? What do you mean? Again, your experiences are true, but they are improving with each release. The error rate on tasks continues to go down, even novel tasks (as far as we can measure them). Again this is where verifiable domains come in -- whatever problems you can specify the model will improve on them, and this improvement will result in better generalization, and improvements on unseen tasks. This is what I mean by taking your observations of today, ignoring the rate of progress that got us here and the known scaling laws, and then just asserting there will be some fundamental limitation. My point is while this idea may be common, it is not at all supported by literature and the mathematics.
The space of programs is incomprehensibly massive. Searching for a program that does what you need is a particularly difficult search problem. In the general case you can't solve search, there's no free lunch. Even scaling laws must bow to NFL. But depending on the type of search problem some heuristics can do well. We know human brains have a heuristic that can program (maybe not particularly well, but passably). To evaluate these agents we can only look at it experimentally, there is no sense in which they are mathematically destined to eventually program well.
How good are these types of algorithms at generalization? Are they learning how to code; or are they learning how to code migrations, then learning how to code caches, then learning how to code a command line arg parser, etc?
Verifiable domains are interesting. It is unquestionably why agents have come first for coding. But if you've played with claude you may have experienced it short-circuiting failing tests, cheating tests with code that does not generalize, writing meaningless tests, and at long last if you turn it away from all of these it may say something like "honest answer - this feature is really difficult and we should consider a compromise."
So what do you think the difference is between humans and an agent in this respect? What makes you think this has any relevance to the problem? everything is combinatorially explosive: the combination of words that we can string into sentences and essays is also combinatorially explosive and yet LLMs and humans have no problem with it. It's just the wrong frame of thinking for what's going on. These systems are obtaining higher and higher levels of abstractions because that is the most efficient thing for them to do to gain performance. That's what reasoning looks like: compositions of higher level abstractions. What you say may be true but I don't see how this is relevant.
"There is no sense in which they are mathematically destined to eventually program well"
- Yes there is, and this betrays an ignorance of the literature and how things work
- Again: RL has been around forever. Scaling laws have held empirically up to the largest scales we've tested. There are known RL scaling laws for both training and test time. It's ludicrous to state there is "no sense" in this, on the contrary, the burden of proof of this is squarely on yourself because this has already been studied and indeed is the primary reason why we're able to secure the eye-popping funding: contrary to popular HN belief, a trillion dollars of CapEx spend is based on rational evidence-based decision making.
> "How good are these types of algorithms at generalization"
There is a tremendously large literature and history of this. ULMFiT, BERT ==> NLP task generalization; https://arxiv.org/abs/2206.07682 ==> emergent capabilities, https://transformer-circuits.pub/2022/in-context-learning-an... ==> demonstrated circuits for in context learning as a mechanism for generalization, https://arxiv.org/abs/2408.10914 + https://arxiv.org/html/2409.04556v1 ==> code training produces downstream performance improvements on other tasks
> Verifiable domains are interesting. It is unquestionably why agents have come first for coding. But if you've played with claude you may have experienced it short-circuiting failing tests, cheating tests with code that does not generalize, writing meaningless tests, and at long last if you turn it away from all of these it may say something like "honest answer - this feature is really difficult and we should consider a compromise."
You say this and ignore my entire argument: you are right about all of your observations, yet
- Opus 4.6 compared to Sonnet 3.x is clearly more generalizable and less prone to these mistakes
- Verifiable domain performance SCALES, we have no reason to expect that this scaling will stop and our recursive improvement loop will die off. Verifiable domains mean that we are in alphago land, we're learning by doing and not by mimicking human data or memorizing a training set.
Hey man, it sounds like you're getting frustrated. I'm not ignoring anything; let's have a reasonable discussion without calling each other ignorant. I don't dispute the value of these tools nor that they're improving. But the no free lunch theorem is inexorable so the question is where this improvement breaks down - before or beyond human performance on programming problems specifically.
What difference do I think there is between humans and an agent? They use different heuristics, clearly. Different heuristics are valuable on different search problems. It's really that simple.
To be clear, I'm not calling either superior. I use agents every day. But I have noticed that claude, a SOTA model, makes basic logic errors. Isn't that interesting? It has access to the complete compendium of human knowledge and can code all sorts of things in seconds that require my trawling through endless documentation. But sometimes it forgets that to do dirty tracking on a pure function's output, it needs to dirty-track the function's inputs.
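As an illustration of the dirty-tracking point (my own toy sketch, not the code in question): a cached pure function can only know its output is stale if it watches its inputs. The `Cell` and `Derived` names are hypothetical, and the sketch assumes each cell feeds a single derived value.

```python
class Cell:
    """A mutable input that remembers whether it changed since it was last read."""

    def __init__(self, value):
        self._value = value
        self.dirty = True  # never read yet, so treat as changed

    def set(self, value):
        if value != self._value:
            self._value = value
            self.dirty = True

    def read(self):
        self.dirty = False
        return self._value


class Derived:
    """Caches a pure function's output and recomputes only if an input is dirty."""

    def __init__(self, fn, *inputs):
        self.fn = fn
        self.inputs = inputs
        self._cache = None
        self.computations = 0

    def get(self):
        # The output has no dirty flag of its own: staleness comes from the inputs.
        if any(cell.dirty for cell in self.inputs):
            self._cache = self.fn(*(cell.read() for cell in self.inputs))
            self.computations += 1
        return self._cache
```

With `a = Cell(2)`, `b = Cell(3)` and `total = Derived(lambda x, y: x + y, a, b)`, repeated `total.get()` calls compute once; after `a.set(10)` the next call recomputes.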
It's interesting that you mention AlphaGo. I was also very fascinated with it. There was recent research that the same algorithm cannot learn Nim: https://arstechnica.com/ai/2026/03/figuring-out-why-ais-get-.... Isn't that food for thought?
What is unreasonable? I am saying the claims you are making are completely contradicted by the literature. I am calling you ignorant in the technical sense, not dumb or unintelligent, and I don't mean this as an insult. I am completely ignorant of many things, we all are.
I am saying you are absolutely right that Opus 4.6 is both SOTA and also colossally terrible in even surprisingly mundane contexts. But that is just not relevant to the argument you are making which is that there is some fundamental limitation. There is of course always a fundamental limitation to everything, but what we're getting at is where that fundamental limitation is and we are not yet even beginning to see it. Combinatorics here is the wrong lens to look at this, because it's not doing a search over the full combinatoric space, as is the case with us. There are plenty of efficient search "heuristics" as you call them.
> They use different heuristics, clearly.
what is the evidence for this? I don't see that as true, take for instance: https://www.nature.com/articles/s42256-025-01072-0
> It's interesting that you mention AlphaGo. I was also very fascinated with it. There was recent research that the same algorithm cannot learn Nim: https://arstechnica.com/ai/2026/03/figuring-out-why-ais-get-.... Isn't that food for thought?
It's a long-known problem with RL in a particular regime and isn't relevant to coding agents. Things like Nim are a small, adversarially structured task family and not representative of language / coding / real-world tasks. Nim is almost the worst possible case: the optimal policy is a brittle, discontinuous function.
AlphaGo is pure RL from scratch; this is quite challenging, inefficient, and unstable, and is why we don't do that with LLMs: we pretrain them first. RL is not used to discover invariants (aspects of the problem that don't change when surface details change) from scratch in coding agents as it is in this example. Pretraining takes care of that and RL is used for refinement, so it's a completely different scenario, one where RL is well suited.
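For anyone who hasn't seen why Nim's optimal policy is called brittle: under the normal-play convention (whoever takes the last stone wins), the player to move is losing exactly when the XOR of the heap sizes is zero. A short sketch of that standard result, with function names of my own choosing:

```python
from functools import reduce
from operator import xor


def nim_sum(heaps):
    """XOR of all heap sizes; zero means the player to move is losing."""
    return reduce(xor, heaps, 0)


def winning_move(heaps):
    """Return (heap_index, new_size) for a winning move, or None if none exists.

    Assumes normal-play Nim, where taking the last stone wins.
    """
    total = nim_sum(heaps)
    if total == 0:
        return None
    for i, h in enumerate(heaps):
        target = h ^ total
        if target < h:
            # Shrinking heap i to `target` leaves the opponent a nim-sum of zero.
            return i, target


print(winning_move([3, 4, 5]))  # prints (0, 1)
print(winning_move([1, 4, 5]))  # prints None
```

Adding a single stone to one heap of a losing position, for example `[1, 4, 5]` to `[1, 4, 6]`, flips it to a winning one, so nearby states have opposite values. That is the discontinuity being described.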
I didn't make any claims contradicted by literature. The only thing I cited as bedrock fact, NFL, is a mathematical theorem. I'm not sure why Nim shouldn't be relevant, it's an exercise in logic.
> “AlphaZero excels at learning through association,” Zhou and Riis argue, “but fails when a problem requires a form of symbolic reasoning that cannot be implicitly learned from the correlation between game states and outcomes.”
Seems relevant.
> So what do you think the difference is between humans and an agent in this respect?
Humans learn.
Agents regurgitate training data (and quality training data is increasingly hard to come by).
Moreover, humans learn (somewhat) intangible aspects: human expectations, contracts, business requirements, laws, user case studies etc.
> Verifiable domain performance SCALES, we have no reason to expect that this scaling will stop.
Yes, yes we have reasons to expect that. And even if growth continues, a nearly flat logarithmic scale is just as useless as no growth at all.
For a year now all the amazing "breakthrough" models have been showing little progress (comparatively). To the point that all providers have been mercilessly cheating with their graphs and benchmarks.
> Where did I say that? I didn’t even mention money, just the broader resource term. A lot of businesses are mostly running experiments to see if the current set of tooling can match the marketing (or the hype). They’re not building datacenters or running AI labs. Such experiments can’t run forever.
I'm just going to ask that you read any of my other comments, this is not at all how coding agents work and seems to be the most common misunderstanding of HN users generally. It's tiring to refute it. RL in verifiable domains does not work like this.
> Humans learn.
Sigh, so do LLMs, in context.
> Moreover, humans learn (somewhat) intangible aspects: human expectations, contracts, business requirements, laws, user case studies etc.
Literally benchmarks on this all over the place, I'm sure you follow them.
> Yes, yes we have reasons to expect that. And even if growth continues, a nearly flat logarithmic scale is just as useless as no growth at all.
And yet it's not logarithmic? Consider the data flywheel, consistent algorithmic improvements, and synthetic data [basically: rejection sampling from a teacher model with a lot of test-time compute + high temperature].
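As a rough illustration of what "rejection sampling from a teacher" means here, a toy sketch: draw many candidates, keep only the ones a verifier accepts, and use the survivors as synthetic data. Everything in it is a stand-in; `toy_teacher` and `toy_verifier` are hypothetical and no real model or API is involved.

```python
import random


def rejection_sample(propose, accept, n_candidates, rng):
    """Draw n_candidates proposals and keep only those the verifier accepts."""
    kept = []
    for _ in range(n_candidates):
        candidate = propose(rng)
        if accept(candidate):
            kept.append(candidate)
    return kept


# Hypothetical stand-ins: the "teacher" guesses a factor pair for 391
# (a noisy, high-temperature proposer) and the verifier checks it exactly.
def toy_teacher(rng):
    return rng.randint(2, 30), rng.randint(2, 30)


def toy_verifier(pair):
    a, b = pair
    return a * b == 391


rng = random.Random(0)
synthetic = rejection_sample(toy_teacher, toy_verifier, 5000, rng)
print(len(synthetic), "verified samples kept out of 5000")
```

The quality of the kept data comes from the verifier, not from the proposer, which is why this only works cleanly in domains where answers can be checked.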
> For a year now all the amazing "breakthrough" models have been showing little progress (comparatively). To the point that all providers have been mercilessly cheating with their graphs and benchmarks.
Benchmaxxing is for sure a real thing, not to mention even honest benchmarking is very difficult to do, but considering "all of the AI companies are just faking the performance data" to be the "story" is tremendously wrong. Consider AIME performance on 2025 (uncontaminated data), and the fact that companies have a _deep incentive_ to genuinely improve their models (and then of course market it as hard as possible, that's a given). People will experiment with different models, and no benchmaxxing is going to fool people for very long.
If you think Opus 4.6 compared to Sonnet 3.x is "little progress" I think we're beyond the point of logical argument.
> - coding is a verifiable domain
You're missing the point though. "1 + 1" vs "one.add(1)" might both be "passable" and correct, but it's missing the forest for the trees, how do you know which one is "long-term the right choice, given what we know?", which is the engineering part of building software, and less about "coding" which tends to be the easy part.
How do you evaluate, score and/or benchmark something like that? Currently, I don't think we have any methodologies for this, probably because it's pretty subjective in the end. That's where the "creative" parts of software engineering become more important, and it's also way harder to verify.
While I agree we don't have any methodologies for this, it's also true that we can just "fail" more often.
Code is effectively becoming cheap, which means even bad design decisions can be overturned without prohibitive costs.
I wouldn't be surprised if in a couple of years we see several projects that approach the problem of tech debt like this:
1. Instruct AI to write tens of thousands of tests by using available information, documentation, requirements, meeting transcripts, etc. These tests MUST include performance AND availability related tests (along with other "quality attribute" concerns)
2. Have humans verify (to the best of their ability) that the tests are correct -- step likely optional
3. Ask another AI to re-implement the project while matching the tests
It sounds insane, but...not so insane if you think we will soon have models better than Opus 4.6. And given the things I've personally done with it, I find it less insane as the days go by.
I do agree with the original poster who said that software is moving in this direction, where super fast iteration happens and non-developers can get features to at least be a demo in front of them fast. I think it clearly is and am working internally to make this a reality. You submit a feature request and eventually a live demo is ready for you, deployed in isolation at some internal server, proxied appropriately if you need a URL, and ready for you to give feedback and have the AI iterate on it. Works for the kind of projects we have, and, though I get it might be trickier for much larger systems, I'm sure everyone will find a way.
For now, we still need engineers to help drive many decisions, and I think that'll still be the case. These days all I do when "coding" is talking (via TTS) with Opus 4.6 and iterating on several plans until we get the right one, and I can't wait to see how much better this workflow will be with smarter and faster models.
I'm personally trying to adapt everything in our company to have agents work with our code in the most frictionless way we can think of.
Nonetheless, I do think engineers with a product inclination are better off than those who are mostly all about coding and building systems. To me, it has never felt so magical to build a product, and I'm loving it.
> Code is effectively becoming cheap, which means even bad design decisions can be overturned without prohibitive costs.
I'm sorry, but only someone who never maintained software long-term would say something like this. The further along you are in development, the more it costs to change a decision, and the cost may even grow exponentially.
Correcting the design before you've even written code might be 100x cheaper (or even 1000x) than changing that design 2 years later, after you've stored TBs of data in some format because of that decision, and lots of other parts of the company/product/project depend on those choices you made earlier.
You can't just pile on code on top of code, say "code is cheap" and hope for the best, it's just not feasible to run a project long-term that way, and I think if you had the experience of maintaining something long-term, you'd realize how this sounds.
The easiest part of "software engineering" is "writing code", and today "writing code" is even easier. But the hardest parts, actually designing, thinking and maintaining, remain the same as before, although some parts are easier and others are harder.
Don't get me wrong, I'm on the "agentic coding" train as much as everyone else, probably haven't written/edited code by myself for a year at this point, but it's important to be realistic about what it actually takes to produce "worthwhile software", not just slop out patchy and hacky code.
I've never maintained software long-term so I could be wrong, but I interpret "code is cheap" to mean that you can have coding agents refactor or rewrite the project from scratch around the design correction. I don't think 'code is cheap' ever should be interpreted to mean ship hacky code.
I think using agents to prototype code and design will be a big thing. Have the agent write out what you want, come back with what works and what doesn't, write a new spec, toss out the old code and have a fresh agent start again. Spec-driven development is the new hotness, but we know that the best spec is code: have the agent write the spec in code, rewrite the spec in natural language, then iterate.
because it has business context and better reasoning, and can ask humans for clarification and take direction.
You don't need to benchmark this, although it's important. We have clear scaling laws on true statistical performance that is monotonically related to any notion of what performance means.
I do benchmarks for a living and can attest: benchmarks are bad, but it doesn't matter for the point I'm trying to make.
I feel like you're missing the initial context of this conversation (no pun intended):
> Like for example a trusted user makes feedback -> feedback gets curated into a ticket by an AI agent, then turned into a PR by an Agent, then reviewed by an Agent, before being deployed by an Agent.
Once you add "humans for clarifications and take direction" then yeah, things can be useful, but that's far away from the non-human-involvement loop described earlier in this thread, which is what people are pushing back against.
Of course, involving people makes things better, that's the entire point here, and that by removing the human, you won't get as good results. Going back to benchmarks, obviously involving humans isn't possible here, so again we're back to being unable to score these processes at all.
I'm confused about the scenario here. There is a human in the loop, it's the feedback part... there is business context, it is either seeded or maintained by the human and expanded by the agent. The agent can make inferences about the world, especially when embodiment + better multimodal interaction is rolled out [embodiment taking longer].
Benchmarks ==> it's absolutely not a given that humans can't be involved in the loop of performance measurement. Why would that be the case?
> because it has business context
It doesn't because it doesn't learn. Every time you run it, it's a new dawn with no knowledge of your business or your business context
> better reasoning
It doesn't have better reasoning beyond very localized decisions.
> and can ask humans for clarification and take direction.
And yet it doesn't, no matter how many .md files you throw at it, at crucial places in code.
> We have clear scaling laws on true statistical performance that is monotonically related to any notion of what performance means.
This is just a bunch of words stringed together, isn't it?
> It doesn't because it doesn't learn. Every time you run it, it's a new dawn with no knowledge of your business or your business context
It does learn in context. And lack of continuous learning is temporary, that is a quirk of the current stack, expect this to change rather quickly. Also still not relevant, consider that agentic systems can be hierarchical and that they have no trouble being able to grok codebases or do internal searches effectively and this will only improve.
> It doesn't have better reasoning beyond very localized decisions.
Do you have any basis for this claim? It contradicts a large amount of direct evidence and measurement and theory.
> This is just a bunch of words stringed together, isn't it?
Maybe to yourself? Chinchilla scaling laws and RL scaling laws are measured very accurately based on next-token test loss (Chinchilla). This scales very predictably. It is related to downstream performance; that relationship is noisy but clearly monotonic.
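For readers who haven't seen it, the Chinchilla paper (Hoffmann et al., 2022) fits next-token loss as L(N, D) = E + A / N^alpha + B / D^beta, where N is parameter count, D is training tokens, and E is an irreducible floor. A small sketch of that parametric form follows; the constants are the fitted values from the paper as I recall them, so treat them as illustrative rather than authoritative, and the 20-tokens-per-parameter ratio is only a rule of thumb.

```python
# Fitted constants from the Chinchilla parametric fit, quoted from memory;
# treat them as illustrative rather than authoritative.
E, A, B = 1.69, 406.4, 410.7
ALPHA, BETA = 0.34, 0.28


def chinchilla_loss(n_params, n_tokens):
    """Predicted next-token loss L(N, D) = E + A / N**alpha + B / D**beta."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA


for n_params in (1e9, 1e10, 1e11):
    n_tokens = 20 * n_params  # assumed tokens-per-parameter rule of thumb
    loss = chinchilla_loss(n_params, n_tokens)
    print(f"N={n_params:.0e}  D={n_tokens:.0e}  predicted loss={loss:.3f}")
```

The predicted loss falls monotonically as N and D grow, but it approaches E and never goes below it. The formula says nothing by itself about how downstream task performance tracks that loss.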
> It does learn in context
It quite literally doesn't.
It also doesn't help that every new context is a new dawn with no knowledge of things past.
> Also still not relevant, consider that agentic systems can be hierarchical and that they have no trouble being able
A bunch of Memento guys directing a bunch of other Memento guys don't make a robust system, or a system that learns, or a system that maintains and retains things like business context.
> and this will only improve.
We've heard this mantra for quite some time now.
> Do you have any basis for this claim?
Oh. Just the fact that in every single coding session even on a small 20kloc codebase I need to spend time cleaning up large amounts of duplicated code, undo quite a few wrong assumptions, and correct the agent when it goes on wild tangents and goose hunts.
> Maybe to yourself? Chinchilla scaling laws a
yap yap yap. The result is anything but your rosy description of these amazing reasoning learning systems that handle business context.
> It quite literally doesn't.
Awesome you've backed this up with real literature. Let's just include this for now to easily refute your argument which I don't know where it comes from: https://transformer-circuits.pub/2022/in-context-learning-an...
> It also doesn't help that every new context is a new dawn with no knowledge of things past.
Absolutely true that it doesn't help but: agents like Claude have access to older sessions, they can grok impressive amounts of data via tool use, they can compose agents into hierarchical systems that effectively have much larger context lengths at the expense of cost and coordination which needs improvement. Again this is a temporary and already partially solved limitation
> A bunch of Memento guys directing a bunch of other Memento guys don't make a robust system, or a system that learns, or a system that maintains and retains things like business context.
I think you are not understanding: hierarchical agents have long-term memory maintained by higher-level agents in the hierarchy, it's the whole point. It's annoying to reset model context, yet you have a knowledge base of the business context persisted and it can grok it...
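A toy sketch of the mechanism being described, under loud assumptions: `Coordinator` and `toy_worker` are hypothetical names, the "worker" is a plain function rather than a model call, and no real agent framework works exactly like this. It only shows a higher-level component persisting distilled notes and seeding each fresh worker context with them.

```python
from dataclasses import dataclass, field


@dataclass
class Coordinator:
    """Higher-level agent: owns notes that outlive any single worker session."""
    notes: list = field(default_factory=list)

    def run_session(self, task, worker):
        # Each worker starts with a fresh context seeded from persisted notes.
        context = list(self.notes) + [f"task: {task}"]
        summary = worker(context)
        # Only the distilled summary is persisted, not the whole transcript.
        self.notes.append(summary)
        return summary


def toy_worker(context):
    """Hypothetical stand-in for a model call: reports how many prior notes
    it was handed and returns a one-line summary."""
    prior = len(context) - 1
    return f"done '{context[-1]}' with {prior} prior note(s)"


coordinator = Coordinator()
print(coordinator.run_session("summarize codebase", toy_worker))
print(coordinator.run_session("fix ticket", toy_worker))
```

Whether notes carried across sessions like this amount to "learning", and how much is lost in the distillation step, is exactly what the two sides here disagree about; the sketch takes no position on that.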
> We've heard this mantra for quite some time now.
yes you have, and it has held true and will continue to hold true. Have you read the literature on scaling laws? Do you follow benchmark progression? Do you know how RL works? If you do I don't think you will have this opinion.
> yap yap yap. The result is anything but your rosy description of these amazing reasoning learning systems that handle business context.
Well that's fine to call an entire body of literature "yap" but don't pretend like you have some intelligible argument, I don't see you backing up any argument you have here with any evidence, unlike the multitude of sources I have provided to you.
Do you argue things have not improved in the last year with reasoning systems? If so I would really love to hear the evidence for this.
> Let's just include this for now to easily refute your argument which I don't know where it comes from: https://transformer-circuits.pub/2022/in-context-learning-an...
I love it when people include links to papers that refute their words.
So, Anthropic (which is heavily reliant on hype and making models appear more than they are) authors a paper which clearly states: "tokens later in context are easier to predict and there's less loss of tokens. For no reason at all we decided to give this a new name, in-context learning".
> agents like Claude have access to older sessions, they can grok impressive amounts of data via tool use
That is they rebuild the world from scratch for every new session, and can't build on what was learned or built in the last one.
Hence continuous repeating failure modes.
10 years ago I worked in a team implementing royalties for a streaming service. I can still give you a bunch of details, including references to multiple national laws, about that. Agents would exhaust their context window just re-"learning" it from scratch, every time. And they would miss a huge amount of important context and business implications.
> Have you read the literature on scaling laws?
You keep referencing this literature as if it were the Holy Bible. Meanwhile the one you keep referring to, Chinchilla, clearly shows the very hard limits of those laws.
> Do you argue things have not improved in the last year with reasoning systems?
I don't.
Frankly, I find your aggressiveness quite tiring
> Frankly, I find your aggressiveness quite tiring
having to answer for opinions with no basis in the literature is I'm sure very tiring for you. Your aggression being met is I'm sure uncomfortable.
> I love it when people include links to papers that refute their words.
> So, Anthropic (which is heavily reliant on hype and making models appear more than they are) authors a paper which clearly states: "tokens later in context are easier to predict and there's less loss of tokens. For no reason at all we decided to give this a new name, in-context learning".
well I don't really love it when people just totally misread a paper because they have an agenda to push and can't seem to accept that their opinions are contradicted by real evidence.
In-context learning is not "later tokens easier"; it’s task adaptation from examples in the prompt. I'm sure you realize this. Models can learn a mapping (e.g. word --> translation) from a few examples in the prompt and apply it to new inputs within the same forward pass. That is function learning at inference time, not just "predicting later tokens better".
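For anyone unfamiliar with the setup, this is what such a few-shot prompt looks like. The sketch only builds the text; the word pairs and the `->` layout are arbitrary choices for illustration, and no model is called.

```python
def few_shot_prompt(examples, query):
    """Format input/output pairs followed by an unanswered query."""
    lines = [f"{src} -> {dst}" for src, dst in examples]
    lines.append(f"{query} ->")
    return "\n".join(lines)


# Hypothetical English-to-French pairs; the mapping is never stated as a rule.
examples = [("cat", "chat"), ("dog", "chien"), ("house", "maison")]
print(few_shot_prompt(examples, "bread"))
```

The claim under discussion is that a model completing the last line has to infer the mapping from the pairs above it, with no weight update involved.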
I'm sure also you're happy to chalk up any contradicting evidence to a grand conspiracy of all AI companies just gaming benchmarks and that this gaming somehow completely explains progress.
> That is they rebuild the world from scratch for every new session, and can't build on what was learned or built in the last one.
That they rebuild the world from scratch (wrong, they have priors from pretraining, but I accept your point here) does not mean they can't build on what was learned or built in the last one. They have access to the full transcript, and they have access to the full codebase, the diff history, whatever knowledge base is available. It's just disingenuous to say this, and then it also assumes (1) there is no mitigation for this, which I have presented twice before and you don't seem to understand it, (2) this is a temporary limitation, continual learning is one of the most important and well funded problems right now.
> 10 years ago I worked in a team implementing royalties for a streaming service. I can still give you a bunch of details, including references to multiple national laws, about that. Agents would exhaust their context window just re-"learning" it from scratch, every time. And they would miss a huge amount of important context and business implications.
also not an accurate understanding of how agents and their context work; you can use multiple sessions to digest and distill information useful in other sessions and in fact Claude does this automatically with subagents. It's a problem we have _already sort of solved today_ and that will continue to improve.
> You keep referencing this literature as if it were the Holy Bible. Meanwhile the one you keep referring to, Chinchilla, clearly shows the very hard limits of those laws.
You keep dismissing this literature as if you have understood it and that your opinion somehow holds more weight...Can you elaborate on why you think Chinchilla shows the hard limits of the scaling laws? Perhaps you're referring to the term capturing the irreducible loss? Is that what you're saying?
> Do you argue things have not improved in the last year with reasoning systems? I don't
Then are you arguing this progress will stop? I'm just not sure I understand, you seem to contradict yourself
For almost every task that people are tackling with agents, it’s either not worth doing, can be done better with scripts and software, or requires human oversight (which negates all the advantages).
I assume this is a troll because it's just so far removed from reality there's not much to say. "Almost every task" -- I'm sure you have great data to back this up. "It's not worth doing" well sure if you want to put your head in the sand and ignore even what systems today can do let alone the improvement trajectory. "can be done better with scripts and software" .... not sure if you realize this but agents write scripts and software. "or require human oversight (that negates all the advantages." it certainly does not; human oversight vs actual humans implementing the code is pretty dramatically more efficient and productive.
But the issue isn't coding, it's doing the right thing. I don't see anywhere in your plan some way of staying aligned to core business strategy, forethought, etc.
The number of devs will reduce but there will still be large activities that can't be farmed out without an overall strategy
Why do you think this is a problem? Reasoning is constantly improving, it has ample access to humans to gather more business context, it has access to the same industry data and other signals that humans do, and it can get any data necessary. It has Zoom meeting notes, I mean why do people think there's somehow a fundamental limit beyond coding?
The other thing you're missing here is generalizability. Better coding performance (which is verifiable and not limited by human data quality) generalizes performance on other benchmarks. This is a long known phenomenon.
> Why do you think this is a problem?
Because it cannot do it?
Every investment has a date where there should be a return on that investment. If there’s no date, it’s a donation of resources (or a waste depending on perspective).
You may be OK with continuing to try to make things work. But others aren’t and have decided to invest their finite resources somewhere else.
> Because it cannot do it?
Ah ok so you didn't really read my comment, what is your counter argument? Models are just fundamentally incapable of understanding business context? They are demonstrably already capable of this to a large extent.
> Every investment has a date where there should be a return on that investment. If there’s no date, it’s a donation of resources (or a waste depending on perspective).
What are you implying here? This convo now turns into the "AI is not profitable and this is a house of cards" theme? That's ok, we can ignore every other business model like say Uber running at a loss to capture what is ultimately an absolutely insane TAM. Little ol' Uber accumulated ~33B in losses over 14 years, and you're right they tanked and collapsed like a dying star...oh wait...hmm interesting I just looked at their market cap and it's 141 Billion.
> You may be OK with continuing to try to make things work. But others aren’t and have decided to invest their finite resources somewhere else.
I truly love that. If you want to code as a hobby that is fantastic, and we can go ahead and see in 2 years how your comment ages.
> They are demonstrably already capable of this to a large extent.
I’d very much like to see such a demonstration, where someone hands over a department to an agent and lets it make decisions.
> This convo now turns into the "AI is not profitable and this is a house of cards" theme?
Where did I say that? I didn’t even mention money, just the broader resource term. A lot of businesses are mostly running experiments to see if the current set of tooling can match the marketing (or the hype). They’re not building datacenters or running AI labs. Such experiments can’t run forever.
> I’d very much like to see such a demonstration, where someone hands over a department to an agent and lets it make decisions.
That's your bar for understanding business context? I thought we were talking about what you actually said which is: understanding business context. If I brainstorm about a feature it will be able to pull the compendium of knowledge for the business (reports, previous launches, infrastructure, an understanding of the problem space, industry, company strategy). That's business context.
> Where did I say that? I didn’t even mention money, just the broader resource term. A lot of businesses are mostly running experiments to see if the current set of tooling can match the marketing (or the hype). They’re not building datacenters or running AI labs. Such experiments can’t run forever.
I misunderstood you then, I wasn't sure what point you were trying to make. Is your point "companies are trying to cajole Claude to do X and it doesn't work and hasn't for the last year so they are giving up"? If so I think that is a wonderful opportunity for people that understand the nuance of these systems and the concept of timing.
I love everything about this direction except for the insane inference costs. I don’t mind the training costs, since models are commoditized as soon as they’re released. Although I do worry that if inference costs drop, the companies training the models will have no incentive to publish their weights because inference revenue is where they recuperate the training cost.
Either way… we badly need more innovation in inference price per performance, on both the software and hardware side. It would be great if software innovation unlocked inference on commodity hardware. That’s unlikely to happen, but today’s bleeding edge hardware is tomorrow’s commodity hardware so maybe it will happen in some sense.
If Taalas can pull off burning models into hardware with a two month lead time, that will be huge progress, but still wasteful because then we’ve just shifted the problem to a hardware bottleneck. I expect we’ll see something akin to gameboy cartridges that are cheap to produce and can plug into base models to augment specialization.
But I also wonder if anyone is pursuing some more insanely radical ideas, like reverting back to analog computing and leveraging voltage differentials in clever ways. It’s too big brain for me, but intuitively it feels like wasting entropy to reduce a voltage spike to 0 or 1.
Inference costs at least seem like the thing that is easiest to bring down, and there's plenty of demand to drive innovation. There's a lot less uncertainty here than with architectural/capability scaling. To your point, tomorrow's commodity hardware will solve this for the demands of today at some point in the future (though we'll probably have even more inference demand then).
> I love everything about this direction except for the insane inference costs.
If this direction holds true, ROI cost is cheaper.
Instead of employing 4 people (Customer Support, PM, Eng, Marketing), you will have 3-5 agents and the whole ticket flow might cost you ~$20.
But I hope we won't go this far, because when things fail every customer will be impacted, because there will be no one who understands the system to fix it
I worry about the costs from an energy and environmental impact perspective. I love that AI tools make me more productive, but I don't like the side effects.
The environmental impact of AI is greatly overstated. The average person would have a bigger positive impact on the environment by cutting their meat intake by 25% than by giving up flying and AI use combined.
This is the wrong way to see it. If a technology gets cheaper, people will use more and more and more of it. If inference costs drop, you can throw way more reasoning tokens and a combination of many many agents to increase accuracy or creativity and such.
> throw way more reasoning tokens and a combination of many many agents to increase accuracy or creativity and such.
But this is just not true, otherwise companies that can already afford such high prices would have already outpaced their competitors.
No company at the moment has enough money to operate with 10x the reasoning tokens of their competitors because they're bottlenecked by GPU capacity (or other physical constraints). Maybe in lab experiments but not for generally available products.
And I sense you would have to throw orders of magnitude more tokens to get meaningfully better results (If anyone has access to experiments with GPT 5 class models geared up to use marginally more tokens with good results please call me out though).
I mean, theoretically, if there are many competitors, the cost of the product should generally drop because of competition.
Sadly enough I have not seen this happening in a long time.
I think that as a user I'm so far removed from the actual (human) creation of software that if I think about it, I don't really care either way. Take for example this article on Hacker News: I am reading it in a custom app someone programmed, which pulls articles hosted on Hacker News which themselves are on some server somewhere and everything gets transported across wires according to a specification. For me, this isn't some impressionist painting or heartbreaking poem - the entity that created those things is so far removed from me that it might be artificial already. And that's coming from a kid of the 90s with some knowledge in cyber security, so potentially I could look up the documentation and maybe even the source code for the things I mentioned; if I were interested.
Art is and has always been about the creator.
I don't want software that is built to be art. I want software that is built to provide facilities.
Cool, but it's actually not all about you (the consumer) at all.
Take a walk in any museum, I'm pretty sure you'll react to some of the art displayed there and find it cool before you read the name of the artist.
Dive into a forest, you'll find a couple of cool trees.
Art isn't about being cool. Art is about context.
When I tell people that art cannot be unpolitical, they react strongly, because they think about the left/right divide and how divided people are, where art is supposed to be unifying.
But art is like movement, you need an origin and a destination. Without that context, it will be just another... thing. Context makes it something.
It's not that you know the artist first and then say "this art is cool because I like the artist". The art is the means by which you know the artist. The more of their works you encounter, the closer you get to understanding the artist and what they are trying to communicate.
Of course. And yet, people still read the name and backstories anyways.
We haven’t been inching closer to users writing a half-decent ticket in decades though.
Solutions like https://bugherd.com/ might make the issue context capture part more accurate.
Maybe the agent can ask the user clarifying questions. Even better if it could do it at the point of submission.
Feedback loops like that would be an exercise in raising garbage-in->garbage-out to exponential terms.
It's the "robots will just build/repair themselves" trope but the robots are agents
Yes. Next they'll want nanobots that build/repair themselves.
Oh wait. That's already here and is working fine.
Trusted user like Jia Tan.
I think Anthropic will launch backend hosting off the back of their Bun acquisition very soon. It makes sense to basically run your entire business out of Claude, and share bespoke apps built by Claude Code for whatever your software needs are.
100% it's going to happen. OpenAI will do the same; there were already rumors about them building an internal "github", which is a stepping stone for that. It is also a requirement for completing lock-in, the dream for these companies.
Ha I just SPECed out a version of this. I have a simple static website that I want a few people to be able to update.
So, we will give these 3 or 4 trusted users access to an on-site chat interface to request updates.
Next, a dev environment is spun up, agent makes the changes, creates PR and sends branch preview link back to user.
Sort of an agent driven CMS for non-technical stakeholders.
Let’s see if it works.
Users are often incorrect about what the software should actually be doing and don’t see the bigger picture.
I think some types of tickets can be done like this, but your trusted-user assumption does a lot of work here. I don't see this getting better than that with the current architecture of LLMs. You can add all sorts of feedback mechanisms, which helps, but since LLMs are not conscious, drift is unavoidable unless there is a human in the loop who understands and steers what's going on.
But I do think even now with certain types of crud apps, things can be largely automated. And that's a fairly large part of our profession.
In the past three weeks a couple of projects I follow have implemented AI tools with their own GitHub accounts which have been doing exactly this. And they appear to be doing good work! Dozens of open issues iterated, tested and closed. At one point I had almost 50 notifications from one project's backlog being eradicated in 24 hours. The maintainer reviewed all of it and some were not merged.
What kind of software are people building where AI can just one shot tickets? Opus 4.6 and GPT 5.4 regularly fail when dealing with complicated issues for me.
GPT 5.4 straight up just dies with broken API responses sometimes, let alone when it struggles with an even moderately complex task.
I still can't get a good mental model for when these things will work well and when they won't. Really does feel like gambling...
Not just complicated, but even simple ones if the current software is too “new” of a pattern they’ve never seen before or trained on.
I dunno if Rust async or native platform APIs which have existed for years count as new patterns, but if you throw even a small wrench in the works they really struggle. But that's expected really when you look at what the technology is - it's kind of insane we've even gotten to this point with what amounts to fancy autocomplete.
Of course not all tickets are complex. Last week I had to fix a ticket which was to display the update date on a blog post next to the publish date. Perfect use case for AI to one shot.
I don't see anyone sane trusting AI to this degree any time soon, outside of web dev. The chances of this strategy failing are still well above acceptable margins for most software, and in safety-critical instances it will be decades before standards allow for such adoption. Anyway, we are paying pennies on the dollar for compute at the moment - as soon as the gravy train stops rolling, all this intelligence will be out of access for most humans. unless some more efficient generalizable architecture is identified.
> as soon as the gravy train stops rolling, all this intelligence will be out of access for most humans. unless some more efficient generalizable architecture is identified.
All Chinese labs have to do to tank the US economy is to release open-weight models that can run on relatively cheap hardware before AI companies see returns.
Maybe that's why AI companies are looking to IPO so soon, gotta cash out and leave retail investors and retirement funds holding the bag.
They could still eliminate relatively cheap hardware.
I mean, they have been doing that for at least a year, and I haven't seen signs of the US economy tanking... You need to find some better arguments.
I was under the impression that we were approaching performance bottlenecks both with consumer GPU architecture and with this application of the transformer architecture. If my impression is incorrect, then I agree it is feasible for China to tank the US economy that way (unless something else does it first).
I think it just needs to be efficient or small enough for companies to deploy their own models on their hardware or cloud, for more inference providers to come out of the woodwork and compete on price, and/or for optimized models to run locally for users.
Regarding the latter, smaller models are really good for what they are (free) now, they'll run on a laptop's iGPU with LPDDR5/DDR5, and NPUs are getting there.
Even models that can fit in unified 64GB+ memory between CPU & iGPU aren't bad. Offloading to a real GPU is faster, but with the iGPU route you can buy cheaper SODIMM memory in larger quantities, still use it as unified memory, eventually use it with NPUs, all without using too much power or buying cards with expensive GDDR.
Qwen-3.5 locally is "good enough" for more than I expected, if that trend continues, I can see small deployable models eventually being viable & worthy competition, or at least being good enough that companies can run their own instead of exfiltrating their trade secrets to the worst people on the planet in real-time.
Several fintechs like Block and Stripe are boasting thousands of AI-generated PRs with little to no human reviews.
Of course it's in the areas where it doesn't matter as much, like experiments, internal tooling, etc, but the CTOs will get greedy.
I don't think anybody is doubting its ability to generate thousands of PRs though. And yes, it's usually in the stuff that should have been automated already regardless of AI or not.
Depends on your circle. On HN I would argue that there are still a fair number of people who would be surprised to see what heavy organizational usage of AI actually looks like. In the non-programming online groups I belong to, people still think that AI agents are the same as they were in mid 2025 and that they can't answer "how many R's are in the following word:". Same thing even when chatting with my business owner friends. The majority of the public has no clue about the scale of recent advancement.
These companies contribute to swathes of the West's financial infrastructure: not quite safety critical, but critical enough. It's insane to involve automation here to this degree.
Even in webdev it rots your codebase unchecked. Although it's incredibly useful for generating UI components, which makes me a very happy webslopper indeed.
I'm grateful to have never bothered learning web dev properly. It was enlightening witnessing ChatGPT transform my ten-second MS Paint job into a functional user interface.
I don't know if this is the future, but if it is, why bother building one version of the software for everyone? We can have agents build the website for each user exactly the way they want. That would be the most exciting possibility to come out of AI-generated software.
"why bother building one version of the software for everyone?"
So one user's experience is relevant to another, so they can learn from one another?
The missing piece for me is post-hoc review.
A PR tells me what changed, but not how an AI coding session got there: which prompts changed direction, which files churned repeatedly, where context started bloating, what tools were used, and where the human intervened.
I ended up building a local replay/inspection tool for Claude Code / Cursor sessions mostly because I wanted something more reviewable than screenshots or raw logs.
I know a company already operating like this in the fintech space. I foresee a front page headline about their demise in their future.
I don't mean this as shade, but people who are not coders now seem to think "coding is now solved" and seem to be pushing absurd ideas like shipping software with Slack messages. These people are often high up in the chain and have never done serious coding.
Stripe is apparently pushing a gazillion PRs now from Slack, but their feature velocity has not changed. So what gives?
How is it that the number of PRs is now the primary metric of productivity, and no one cares about what is being shipped or whether we are shipping product faster? It's total madness right now. Everyone has lost their collective minds.
I ask myself the same question.
I'm not seeing the apps, SaaS, and other tools I use getting better, with either more features or fewer bugs.
Whatever is being shipped, as an end user, I'm just not seeing it.
CTOs and CEOs are now feeling insane pressure to show how they are using AI, but it's not evident in output. So now they've resorted to blabbering publicly about PRs, lines of code, etc. to save face. And of course the people giving them a voice and platform have their own agendas that prevent them from asking "so what exactly have you shipped, Stripe, from a million PRs/day?".
It's baffling to see these comments on Hacker News though. I guess you have to prove that you are not a luddite by making "AI forward" predictions and showing that you "get it".
I think a lot of SWE roles are really bullshit jobs (1) and these have been particularly susceptible to getting sniped with AI tools.
(1) https://en.wikipedia.org/wiki/Bullshit_Jobs
Or perhaps we end up where all software is self evolving via agents… adjusting dynamically to meet the users needs.
The "user" being the one that's in charge of the AI, not the person on the receiving end.
Instead of having a trusted user, you can also do statistics on many users.
(That's basically what A/B testing is about.)
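To make the "statistics on many users" idea concrete, here is a minimal sketch of a two-proportion z-test, one common way to compare a control group against a variant. The function name and the example numbers are made up for illustration; a real rollout system would also need to worry about sample-size planning and repeated peeking at the data.

```python
import math


def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int) -> tuple:
    """Compare conversion rates of groups A and B.

    Returns (z, p_value), where p_value is two-sided under the normal approximation.
    """
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both groups convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        raise ValueError("no variation in the data; the test is undefined")
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value: P(|Z| >= |z|) for a standard normal Z.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

An agent (or a plain script) could call something like `two_proportion_z_test(100, 1000, 150, 1000)` and only ramp the change up when the p-value falls below a threshold agreed on in advance.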
"Trusted user" also can be an Agent.
What you're describing is absolutely where we're headed.
But the entire SWE apparatus can be handled.
Automated A/B testing of the feature. Progressive exposure deployment of changes, you name it.
Haha sure, let's just let every user add their feedback to the software.
I think the AI agent will directly make a PR - tickets are for humans with limited mental capacity.
At least in my company we are close to that flywheel.
Tickets need to exist purely from a governance perspective.
Tickets may well not look like they do now, but some semblance of them will exist. I'm sure someone is building that right now.
No. It's not Jira.
Yes, so my point is that PRs act as that governance layer - with preview environments, you can see the complexity and risk of the change etc.
The agents have even more limited capacity
At the moment, maybe. But it's growing.
Even so they would probably still benefit from intermediate organisational steps.
For a while, sure.
I am already there with a project/startup with a friend. He writes up an issue in GitHub and there is a job that automatically triggers Claude to take a crack at it and throw up a PR. He can see the change in an ephemeral environment. He hasn't merged one yet, but it will get there one day for smaller items.
I am already at the point where because it is just the two of us, the limiting factor is his own needs, not my ability to ship features.
Must be nice working on simple stuff.
Why doesn’t he merge them?
He is not technical but a product guy, so he still wants me to check it over.
We do feedback to ticket automatically
We don't have product managers or technical ticket writers of any sort.
But us devs are still choosing how to tackle the ticket. We definitely don't have to, as I'm solving the tickets with AI. I could automate my job away if I wanted, but I wouldn't trust the result, as I give a degree of input and steering, and there are bigger-picture considerations it's not good at juggling, for now.
Then sets up telemetry and experiments with the change. Then if data looks good an agent ramps it up to more users or removes it.
> I feel like we are just inching closer and closer to a world where rapid iteration of software will be by default.
There's a lot of experimentation right now, but one thing that's guaranteed is that the data gatekeepers will slam the door shut[1] - or install a toll-booth when there's less money sloshing about, and the winners and losers are clear. At some point in the future, Atlassian and Github may not grant Anthropic access to your tickets unless you're on the relevant tier with the appropriate "NIH AI" surcharge.
1. AI does not suspend or supplant good old capitalism and the cult of profit maximization.
Um, we are already there...
I feel like a lot of people and companies wanted to automate the web, but most website operators wouldn't let you and would block you. Now you put the name AI on it and you're allowed to do it.
I remember when I tried to set something up with the ChatGPT equivalent like "notify me only if there are traffic disruptions in my route every morning at 8am" and it would notify me every morning even if there was no disruption.
This is because for some reason all agentic systems think that slapping cron on it is enough, but that completely ignores decades of knowledge about prospective memory. Take a look at https://theredbeard.io/blog/the-missing-memory-type/ for a write-up on exactly that.
“A programmer is going to the store and his wife tells him to buy a gallon of milk, and if there are eggs, buy a dozen. So the programmer goes shopping, does as she says, and returns home to show his wife what he bought. But she gets angry and asks, ‘Why’d you buy 13 gallons of milk?’ The programmer replies, ‘There were eggs!’”
You need to write a clearer prompt.
"I need to fly to NY next weekend, make the necessary arrangement".
Your AI assistant orders an experimental jetpack from a random startup lab. Would you have honestly guessed that the prompt was "ambiguous" before you knew how the AI was going to act on it ?
Did GP edit their comment? Or did you read the prompt they used somewhere else?
Why not set your own evals and something like pi-mono for that? https://github.com/badlogic/pi-mono/
You'll define exactly what good looks like.
Me too. It doesn't have the ability to alert only on a true positive. It has to also alert on a true negative. So dumb.
This doesn't seem too hard to solve, except for the ever-recurring LLM output validation problem. If the true positive is rare, you don't know if the earthquake alert system works until there's an earthquake.
... just force the data into a structured format, then use "hard code" on the structure.
"Generate the following JSON formatted object array representing the interruptions in my daily traffic. If no results, emit []. Send this at 8am every morning. {some schema}. Then run jsonreporter.py"
Then just let jsonreporter.py discriminate however it likes. Keep the LLMs doing what they are good at, and keep hard code doing what it's good at.
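A rough sketch of what the deterministic half could look like. The comment above only names `jsonreporter.py`; everything inside it here is an assumption, including the `route` and `summary` fields, which stand in for whatever schema you actually give the model. The point is that the "notify or stay silent" decision is made by plain code, and malformed output is reported loudly instead of being silently dropped.

```python
import json
from typing import Optional


def parse_disruptions(raw: str) -> list:
    """Parse the model output; raise ValueError unless it is a JSON array of objects."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, list) or not all(isinstance(item, dict) for item in data):
        raise ValueError("expected a JSON array of objects")
    return data


def build_message(raw: str) -> Optional[str]:
    """Return the notification text, or None when there is nothing to report."""
    try:
        disruptions = parse_disruptions(raw)
    except ValueError as exc:
        # Unusable output is itself worth an alert: silence should mean "checked, all clear".
        return f"Traffic check produced unusable output: {exc}"
    if not disruptions:
        return None  # the model emitted [], so no notification is sent
    lines = [
        f"- {item.get('route', '?')}: {item.get('summary', 'no details')}"
        for item in disruptions
    ]
    return "Traffic disruptions:\n" + "\n".join(lines)
```

The scheduler would feed the model's output to `build_message` and send a notification only when the result is not `None`. Feeding it a hand-written fake disruption now and then is a cheap way to check the alert path without waiting for a real one.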
I do feel people will end up using this for things where a deterministic rule could be used - more effective, faster and cheaper. See this starting to happen at work... "We need AI to solve X"... no you don't.
Maybe. The problem of "execute task on a cron" is something I've noticed the industry seems to refuse to solve in general, as if intentionally denying this capability for regular people. Even without AI, it's the most basic block of automation, and is always mysteriously absent from programs and frameworks (at least at the basic level). AI only makes it more useful on "then" side, but reliable cron on "if" side is already useful.
Most of the industry today is educated to avoid manual hacky solutions on single servers. You need to have a fancy UI, frameworks with easy feedback, and layers on top of layers that maintain other layers. Cron is an ancient tool with arcane syntax that offers barely anything out of the box; you have to know it and work it to get something out of it.
And there is also the mindset to avoid boring loops, and prefer event driven solutions for optimal resource-usage. So people also have a kind of blind spot for this functionality.
I don’t recall if IFTTT had/has a basic cron or not, but it sure has/had put a lot of basic automations in the hands of the general public. Same for Apple Shortcuts, to some extent, or Zapier.
This is a larger topic that's worthy of a comparably large rant, which I really don't want to do right now, but to keep it short, in my subjective view:
- IFTTT was great when it started; at some point, it became... weird, in a "I don't even know what's going on on my screen, is this a poster or an app" kind of way.
- Zapier is an impenetrable mess, evidently targets marketers and other business users; discovery is hard, and even though it seems like it has everything, it - like all tools in this space - is always missing the one feature you actually need.
- Yahoo Pipes, I heard they were great, but I only learned about them after they shut down.
- Apple Shortcuts - not sure what you can do with those, but over the years of reading about them in HN comments, I think they may be the exception here, in being both targeting regular users and actually useful.
- Samsung Modes and Routines - only recently becoming remotely useful, so that's nice, even if vendor-restricted.
- Tasker - an Android tool that actually manages to offer useful automation, despite the entire platform/OS and app ecosystem trying its best to prevent it. Which is great, if your main computer is a phone. It sucks in a world of cloud/SaaS, because it creates a silly situation where e.g. I could nicely automate some things involving e-mail and calendars from Tasker + FairEmail, but... well my mailboxes and calendars live in the cloud so some of that would conflict with use of vendor (Fastmail) webapp or any other tool.
Or, in short: we need Tasker but for web (and without some of the legacy baggage around UI and variable handling).
The sorry state of automation is not entirely, or even mostly, the fault of the automation platforms. I may have issues with some UI and business choices some of these platforms made, but really, the main issue is that integrations are business deals and the integrated sides quickly learned to provide only a limited set of features - never enough to allow users to actually automate use of some product. There's always some features missing. You can read data but not write it. You can read files and create new files but not edit or delete them. You can add new tasks but can't get a list of existing ones. Etc.
It's another reason LLMs are such a great thing to happen - they make it easy (for now) to force interoperability between parties that desperately want to prevent it. After all, worst case, I can have the LLM operate the vendor site through a browser, pretending to be a human. Not very reliable, but much better than nothing at all.
Similarly short on reply here, but quickly: IFTTT: hah, I agree. It was awesome when it was more about IoT than Spotify to Google Sheets.
And re: Zapier: yes, that’s the key to Zapier, from my experience: usage in marketing and the “power user” base.
Re: shortcuts: (I live in the Apple ecosystem) Shortcuts + AppleScript is gold on macOS. Shortcuts + iOS is about to be game changing - it already changed the game, it’s just nobody has been playing it, because it’s not “fun”.
After Siri+Gemini+Shortcuts, everyone will be playing it, I suspect, even on Android, it will get built somehow.
> Or, in short: we need Tasker but for web (and without some of the legacy baggage around UI and variable handling).
n8n, node-RED and others already exist. There are many tools for automations, and I guess most of them can also do cron-like jobs.
Node RED is still unwieldy for the masses, as easy as it is for a consumer to install, it’s not necessarily as easy to use.
Consumer grade automations built on node-RED? I suppose it depends on the market, but most people aren’t going to want to fiddle with it, I suspect.
A plugin for Chrome might be able to take off though, or some killer mobile app, but it needs to run on a cheap phone and control things without having to keep track of loops and logic and variables and all the fun stuff.
None of the tools here are for the masses. Automation in itself is already hard to grasp for the average user, and while some of those are simpler to start with than others, they are all a wall to climb.
Agree. How would you solve this in general, what would be the ingredients? People use things like Zapier, n8n, and Node-RED to achieve this today, but in many cases these are overkill.
Honestly, you just need cron (and Ruby/Python/bash/whatever) on an EC2. It's not very fashionable, but it works, will continue to work forever, and costs hardly anything.
To use an example in the article, what does
> Analyzing CI failures overnight and surfacing summaries
look like on EC2 with Python? Because with Claude, it's that prompt, and with your solution it's infra + security groups + multiple APIs + whatever code you actually write.
I would suggest the prompt is an example of garbage in that's going to produce garbage out. Sitting down to confront the problem you're solving will show this, while Claude is going to happily spit out what looks like a plausibly functional system.
So for example the only "analysis" of CI failures is which systems failed and who/what committed the changes to those things. The only way AI would help me here is if the system was so jank that the sole primitive I can use is textual analysis of log files. Which granted is probably real for a lot of software firms, but I really hope I have better build and test infrastructure than that.
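For what it's worth, the non-AI version of that "analysis" is only a few lines once the CI system hands you structured results. This is just a sketch: the record shape (`system` and `author` keys) is invented for illustration, and fetching the records from your actual CI API is the part left out.

```python
from collections import defaultdict


def summarize_failures(failures: list) -> str:
    """Group failed jobs by system and list who committed the changes involved.

    Each failure is a dict with hypothetical "system" and "author" keys.
    """
    authors_by_system = defaultdict(set)
    for failure in failures:
        authors_by_system[failure["system"]].add(failure["author"])
    if not authors_by_system:
        return "No CI failures overnight."
    lines = [
        f"{system}: {', '.join(sorted(authors))}"
        for system, authors in sorted(authors_by_system.items())
    ]
    return "Overnight CI failures:\n" + "\n".join(lines)
```

Run from cron, the output can go straight into an email or chat message. Whether that is "enough analysis" is exactly the disagreement in this subthread.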
> I would suggest the prompt is an example of garbage in that's going to produce garbage out. Sitting down to confront the problem you're solving will show this, while Claude is going to happily spit out what looks like a plausibly functional system.
I think this shows the value.
> Which granted is probably real for a lot of software firms
Here's the rub though; for many many people it's a huge improvement over what they have right now.
I'd start with solving the UX issues, specifically expectations and UI around scheduling jobs.
Expectations - the functionality of "do X on a timer" needs to be offered to users as a proper end-user feature[0], not treated as a sysadmin feature (Windows, Linux) or not provided at all (Android). People start seeing it on their own devices, they'll start using it, then expecting it, and the web will adjust too[1].
UI - somehow this escapes every existing solution, from `cron` through Windows timers to any web "on timer" event trigger in any platform ever. There already exists a very powerful UI paradigm for managing recurring tasks, that most normies know how to use, because they're already using it daily at work and privately: a calendar. Yes, that thing where we can set and manage recurring events, and see them at a glance, in context of everything else that's going on in our lives.
--
<rant>
I know those are hard problems, but are hard mostly because everybody wants to be the fucking one platform owning users and the universe. This self-inflicted sickness in computing is precisely why people will jump at AI solutions for this. Why I too will jump on this: because it's easier than dealing with all the systems and platforms that don't want to cooperate.
After all, at this point, the easiest solution to the problems I listed above, and several others in this space, would be to get an AI agent that I can:
1) Run on a cron every 30 minutes or so (events are too complicated);
2) Give it read (at minimum) access to my calendar and todo lists (the ones I use, but I'm willing to compromise here);
3) Give it access to other useful tools
Which I guess brings us to the actual root problem here. "Run tasks on a cron" and "run tasks on trigger" are basically just another way of saying unattended/non-interactive usage. That is what is constantly being denied end users.
This is also the key to enabling most value of AI tools, too, and people understand it very well (see the popularity of that Open Claw thing as the most recent example), but the industry also lives in denial, believing that "lethal trifecta" is a thing that can be solved.
</rant>
--
[0] - This extends to event triggers ("if X happens, then") automation, and end-user automation in all of every-day life. I mean, it's beyond ridiculous that the only things normal people are allowed to run automatically are dishwasher, and a laundry machine (and in the previous era, VCRs).
[1] - As a side effect, it would quickly debullshitify "smart home" / "internet of things" spaces a lot. The whole consumer side of the market revolves around selling people basic automation capabilities - except vendor-locked, and without the most useful parts.
> See this starting to happen at work...'We need AI to solve X....no you don't"
Same. Sometimes it is just people overeager to play with new toys, but in our case there is a push from the top & outside too: we are in the process of being subsumed into a larger company (completion due on April the 1st, unless the whole thing is an elaborate joke!) and there is apparently a push from the investors there to use "AI" more in order to not "get left behind the competition".
It's self-perpetuating. I was talking to the CEO of a Series A level B2B SaaS company here in the UK recently. Most of the prospects his sales team are hitting are re-allocating their wallets to only looking for products that use AI, on the back of senior management pushing them to do so.
This company already does some pretty cool stuff with statistics for forecasting but now they are pivoting their roadmap to bake in GenAI into their offering over some other features that would be more valuable to their clients.
I'd say that's almost fine if they can start expressing intent correctly and thinking about what good looks like. They (or some automated thing if you're building "think for them" type of products instead of "give them tools and teach them to think how to use them") can then freeze determinism more and more where useful.
I wrote this to help people (not just Devs) reason about agent skills
https://alexhans.github.io/posts/series/evals/building-agent...
And this one to address the drift of non-determinism (but depending on the audience it might not resonate as much)
https://alexhans.github.io/posts/series/evals/error-compound...
I feel this would be more useful for tasks like "Check website X to see if there are any great deals today". Specifically, tasks that are loosely defined and require some form of intuition.
The problem I'd think, for the average user, would be writing the 'then' part of any deterministic rule — that would require coding, or at least some kind of automation script (visual or otherwise) that's basically coding in a trench coat, which for most people is still a barrier to entry and annoying. I think that's why they'd use AI tbh — they can just describe what they want in natural language with AI.
AI will become this colleague who sucks at everything, but never says no, so he becomes the favorite go-to person.
People are loading huge interpreted environments for stuff that can be done from the command line. Run computations on complex objects where it could be a single machine instruction etc. The trend has been around for a long time.
Standard pendulum swing. Most people want to disengage their thinking circuits most of the time, so problems can't be evaluated one by one. There is no such thing as "this is a good solution for some problems". It can only be "this is a good solution for all problems". When the pendulum swings this far, this hard, it will swing all the way back eventually.
I've recently switched from GitHub Copilot Pro to Claude Code Max (20x). While Claude is clearly superior in many aspects, one area where it falls short is remote/cloud agents.
Yesterday, I spent the entire day trying to set up "Claude on the web" for an Elixir project and eventually had to give up. Their network firewall kept killing Hex/rebar3 dependency resolution, even after I selected "full" network access.
The environment setup for "on the web" is just a bash script. And when something goes wrong, you only see the tail of the log. There is currently no way to view the full log for the setup script. It's really a pain to debug.
The Copilot equivalent to "Claude on the web" is "GitHub Copilot Coding Agents," which leverages GitHub Actions infrastructure and conventions (YAML files with defined steps). Despite some of the known flaws of GitHub Actions, it felt significantly more robust.
"Schedule task on the web" is based on the same infrastructure and conventions as "Claude on the web", so I'm afraid I'm gonna have the same troubles if I want to use this.
Looks like I'm limited to only 3 cloud scheduled tasks. And I'm on the Max 20x plan, too :(
"Your plan gets 3 daily cloud scheduled sessions. Disable or delete an existing schedule to continue."
But otherwise, this looks really cool. I've tried using local scheduled tasks in both Claude Code Desktop and the Codex desktop app, and very quickly got annoyed with permissions prompts, so it'll be nice to be able to run scheduled tasks in the cloud sandbox.
Here are the three tasks I'll be trying:
Every Monday morning: Run `pnpm audit` and research any security issues to see if they might affect our project. Run `pnpm outdated` and research into any packages with minor or major upgrades available. Also research if packages have been abandoned or haven't been updated in a long time, and see if there are new alternatives that are recommended instead. Put together a brief report highlighting your findings and recommendations.
Every weekday morning: Take a look at Sentry errors, logs, and metrics for the past few days. See if there are any new issues that have popped up, and investigate them. Take a look at logs and metrics, and see if anything seems out of the ordinary, and investigate as appropriate. Put together a report summarizing any findings.
Every weekday morning: Please look at the commits on the `develop` branch from the previous day, look carefully at each commit, and see if there are any newly introduced bugs, sloppy code, missed functionality, poor security, missing documentation, etc. If a commit references GitHub issues, look up the issue, and review the issue to see if the commit correctly implements the ticket (fully or partially). Also do a sweep through the codebase, looking for low-hanging fruit that might be good tasks to recommend delegating to an AI agent: obvious bugs, poor or incorrect documentation, TODO comments, messy code, small improvements, etc.
I ran all of these as one-off tasks just now, and they put together useful reports; it'll be nice getting these on a daily/weekly basis. Claude Code has a Sentry connector that works in their cloud/web environment. That's cool; it accurately identified an issue I've been working on this week.
I might eventually try having these tasks open issues or even automatically address issues and open PRs, but we'll start with just reports for now.
0 7 * * 1-5 ANTHROPIC_API_KEY=sk-... /path/to/claude-cron.sh /path/to/repo >> ~/claude-reports.md 2>&1
Seems trivial.
A trivial way to rack up hundreds of dollars in API costs, sure.
But you can set up a claude -p call via a cronjob without too much hassle and that can use subscriptions.
Sure, now what happens if my laptop is asleep at 7am? Or if our scheduled build took an extra 30 minutes because of contention?
Scheduling is easy. The hard part is everything between "started" and "done" - task needs human approval at step 3, fails at step 5 (retry from 4 or from scratch?), takes 6 hours and something restarts. How do they handle tasks that span multiple inference calls? Is there checkpointing or does it start over?
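To illustrate the checkpointing question (this is not a description of how Anthropic's scheduled tasks work, which I don't know), here is a minimal sketch of resuming a multi-step task from the step that failed instead of from scratch. The function name and the JSON checkpoint layout are assumptions, and it only makes sense if each step is safe to re-run.

```python
import json
from pathlib import Path
from typing import Callable, Sequence


def run_with_checkpoint(steps: Sequence[Callable[[], None]], checkpoint: Path) -> int:
    """Run steps in order, skipping those already recorded as done in `checkpoint`.

    Returns the number of steps executed during this call.
    """
    done = json.loads(checkpoint.read_text())["done"] if checkpoint.exists() else 0
    executed = 0
    for index in range(done, len(steps)):
        # If this raises, the checkpoint still points at this step,
        # so the next run retries it rather than starting over.
        steps[index]()
        checkpoint.write_text(json.dumps({"done": index + 1}))
        executed += 1
    return executed
```

This answers "retry from 4 or from scratch?" with "retry the failed step", which is only right when earlier steps leave durable results behind. Human approval at step 3 would need something more, e.g. a step that raises until an approval flag shows up.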
Claude is moving fast.
https://grok.com/tasks
Grok has had this feature for some time now. I was wondering why others haven't done it yet.
This feature increases user stickiness. They give 10 concurrent tasks free.
I have had to extract specific news first thing in the morning across multiple sources.
This is a bit restrictive; it doesn't take screenshots. So you can't say "take screenshots of my homepage and send them to me via email".
It doesn't allow egress curl, apart from a few hardcoded domains.
I have created Cronbox in the cloud which has a better utility than above. Did a "Show HN: Cronbox – Schedule AI Agents" a few days back.
https://cronbox.sh
and a pelican riding a bicycle job -
https://cronbox.sh/jobs/pelican-rides-a-bicycle?variant=term...
We need to fight model providers trying to own memory, workflows and tooling. Don't give them an inch more of your software than needed even if there is a slight inconvenience setting up.
I have tasks files in the code base that Claude executes on a schedule. I can easily move to other agents.
Why? As a user of these tools, I love the convenience factor of having one tool rather than wrangling dozens. It's why in the past I've used an IDE (JetBrains), a language created by the provider of the IDE (Kotlin), web framework created by the same people (ktor), etc.
This is very different from a framework, language or IDE. It's more comparable to Apple or Amazon trying to create corporate anti-competitive hellscapes of enslaved users who have no agency, no dignity and no real choice, reduced to rent extraction targets. Just with much more dire consequences and much more at stake. We still have the power to make AI providers have no moat and be an interchangeable commodity. But we have to fight to keep them from getting control of the other layers they are trying to grab. We are in a war; people who can still use Claude Code or their other garbage tools, after Anthropic threatened and shut off opencode, are very naive and ignorant.
From an outside perspective, this sounds hyperbolic. I don’t know why task scheduling would be a part of a war.
In fact, I re-read the article before submitting this comment just to make sure I wasn’t missing something. What on earth is so polarizing about a prompt being run recurrently? It’s a long-awaited feature that I’ve personally needed.
If you want to win your war, you’ll need better propaganda to recruit people. Start with me. My mind is open. Why should I join?
Please tie your claims concretely to this new feature. I’m interested in how adding this could erode open source software. To me they seem completely independent, and it’s a welcome change.
I can't remove the YouTube app off my phone. The mobile phone is a locked up landscape that hates general purpose computing that puts the owner of the device in control. In the same way the big LLM want to give you stuff for free / subsidized then become very opinionated about how you use this stuff then pave up the entire landscape and monopolize it for themselves. Screw that.
We are at war, defending control over our tools from AI companies that try to take over any adjacent technology and anything that can be turned into a platform with a lock-in effect. Subsidising subscriptions and locking people into their CLI is just the start.
"A scheduled task runs a prompt on a recurring cadence using Anthropic-managed infrastructure." >> There is no other way to read this: in this context it's just a small feature, but it's a land grab to run workflows locked into their cloud, not just models. We don't fall for regimes in one go but one tiny piece at a time, like the frog in the water.
I paid a lot of attention to the opencode drama, and I still have a lot of respect for Dax, Adam, and the rest of that team. What I saw was a startup seeking to use API keys specific to Anthropic's subscription model, subsidized and intended for use solely by Anthropic's provided tooling. Anthropic also has an API usage-based model, for companies who want to create tooling around Anthropic models or integrate the models in their own products.
Except you can write Kotlin and ktor outside of JetBrains' IDEs.
Anthropic wants a world where they own your agent where it can't exist outside of the Claude desktop app or Claude Code.
There could exist a world where your agent isn't confined by the whims of a corporation.
> Anthropic wants a world where they own your agent where it can't exist outside of the Claude desktop app or Claude Code.
Please. I'm sure you're referring to their locking down of subscription keys, which of course they are going to have restrictions on. It's a subsidized subscription model.
You've always been able to create a platform account and use API keys with usage-based billing, and that will never go away. Charging enough to make a profit on inference isn't exactly rent-seeking or whatever language you want to use to villainize a company trying to make enough revenue to survive.
I wish there was a company that was easy to use but wouldn't sell out in this arena.
hi, I don’t normally promote here, but I feel compelled to ask if you’d like to test my thing. it’s a personal agent / API for creating and managing background cloud agents that I’m 100% committed to keeping open source & accessible as an alternative platform to putting all your eggs in one basket. there is also a desktop app and expanding the api to involve storage. kind of like agentic dropbox that can also do coding and has a full computer and ability to spin up N agents
https://tinyfat.com
Very much like the idea. Thanks for sharing. Noticed that you are pushing this fully anonymously and wanted to chat with you regarding a project that I’m building. Mind contacting me on the address in my profile?
that's like looking for a unicorn.
>slight inconvenience
You misspelt ">95% discount relative to API pricing" ;)
I can't pick the effort for the tasks run on Claude Web. I have a feeling Claude is using low or medium effort on those tasks, and I observe clear quality differences compared with the tasks run on my local Claude Code, which uses high effort.
One interesting restriction is that it won’t do anything with people’s faces.
I run conferences and I like to have photos of delegates on the page so you can see who else is attending.
I wanted to automate this by having Claude go to the person’s LinkedIn profile and save the image to the website.
But it seems it won’t do that because it’s been instructed not to.
LinkedIn already employs anti-scraping measures, so I'd expect a lot of users to get flagged.
That's not unique to LinkedIn but what is somewhat unique is the strong linkage to real world identities, which raises the cost of Sybil attacks on personal networks with high trust.
i'm missing something basic here .... what does it actually do? It executes a prompt against a git repository. Fine - but then what? Where does the output go? How does it actually persist whatever the outcome of this prompt is?
Is this assuming you give it git commit permission and it just does that? Or it acts through MCP tools you enable?
MCP tools. We're doing some MCP bundling and giving it here, pretty cool stuff.
wasn't MCP a critical link in the recent litellm attack?
And if it was?
It's a bit like asking if "an API" was a critical link in some cybersec incident. Yes, it probably was, and?
i'd say it's more like intentionally choosing to use naive string interpolation for SQL queries than a trusted library's parameter substitution. Both work.
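For readers who haven't seen the SQL contrast drawn here, a minimal sketch using Python's standard `sqlite3` module. The table, the helper names and the payload are made up for illustration and are not from the thread.

```python
import sqlite3


def make_demo_db():
    # Hypothetical in-memory table with two users, for illustration only.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT)")
    conn.executemany("INSERT INTO users VALUES (?)", [("alice",), ("bob",)])
    return conn


def find_user_interpolated(conn, name):
    # Naive string interpolation: the input becomes part of the SQL text,
    # so a crafted value can change the meaning of the query.
    query = f"SELECT name FROM users WHERE name = '{name}'"
    return [row[0] for row in conn.execute(query)]


def find_user_parameterized(conn, name):
    # Parameter substitution: the value travels separately from the SQL
    # text and is only ever treated as data.
    query = "SELECT name FROM users WHERE name = ?"
    return [row[0] for row in conn.execute(query, (name,))]


if __name__ == "__main__":
    conn = make_demo_db()
    payload = "nobody' OR '1'='1"
    print(find_user_interpolated(conn, payload))   # matches every row
    print(find_user_parameterized(conn, payload))  # matches nothing
```

Both functions "work" on honest input; only the interpolated one lets the payload rewrite the WHERE clause.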
There is no "parameter substitution" equivalent possible. Prompt injection isn't like SQL injection, it has no technical solution (that isn't AGI-complete).
Prompt injection is "social engineering" but applied to LLMs. It's not a bug, it's fundamentally just a facet of its (LLM/human) general nature. Mitigations can be placed, at the cost of generality/utility of the system.
> It's not a bug, it's fundamentally just a facet of its (LLM/human) general nature
Fair enough but then that means that MCP is not "a bit like asking if "an API" was a critical link in some cybersec incident"
Because I can secure an API but I can't secure the "(LLM/human) general nature."
MCP itself is just an API. Unless the MCP server had a hidden LLM for some reason, it's still a piece of regular, deterministic software.
The security risk here is the LLM, not the MCP, and you cannot secure the LLM in such a system any more than you can secure a user - unless you put that LLM there and own it, at which point it becomes a question of whether it should've been there in the first place (and the answer might very well be "yes").
We use it to do automated sec audits weekly on the codebase and post the results on Slack
so is slack posting an MCP tool it has? or a skill it just knows?
In Claude it is a "connector" which is essentially an mcp tool.
Oh my, did Anthropic invent Cron jobs as a service?
It's a game changer.
Edit: my mistake. It's inferior to a Cron job. If my repos happen to be self hosted with Forgejo or codeberg, then it won't even work. If I concede to use GitHub though I don't have to set up any env variables. Schedules lock-in, all over the web.
You jest, but for some reason the industry stubbornly refuses to solve the "cron job as a service" problem for end-users, whether on the web or in the OS.
I feel this is rooted in problems that extend beyond computing. Regular people are not allowed to automate things in their life. Consider that for most people, the only devices designed to allow unattended execution off a timer are a washing machine, some ovens and dishwashers, and an alarm clock (also VCRs in the previous era). Anything else requires manual actuation and staying in a synchronous loop.
There is nothing to solve. It's already there, a VPS, a container platform, just push your script and schedule it.
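A rough sketch of what "push your script and schedule it" can look like on a box you control, using only the Python standard library. The function names and the `task.py` job are hypothetical, and a real setup would more likely just use cron or a systemd timer.

```python
import subprocess
import time
from datetime import datetime, timedelta


def next_run(now: datetime, hour: int, minute: int) -> datetime:
    """Return the first hour:minute slot strictly after `now`."""
    candidate = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if candidate <= now:
        # Today's slot has already passed, so use tomorrow's.
        candidate += timedelta(days=1)
    return candidate


def run_daily(command: list[str], hour: int, minute: int) -> None:
    """Block forever, running `command` once a day at hour:minute local time."""
    while True:
        now = datetime.now()
        delay = (next_run(now, hour, minute) - now).total_seconds()
        time.sleep(delay)
        # check=False: a failed run should not kill the scheduler loop.
        subprocess.run(command, check=False)
```

Calling `run_daily(["python3", "task.py"], hour=6, minute=0)` would run the hypothetical `task.py` every morning; the cron equivalent is a single crontab line such as `0 6 * * * python3 task.py`.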
Of course a provider can offer convenient shortcuts, but at the cost of getting tied into their ecosystem.
Anthropic is clearly battling an existential threat: what happens when our paying users figure out they can get a better and cheaper model elsewhere.
> what happens when our paying users figure out they can get a better and cheaper model elsewhere.
They solved that with subscriptions. For end-users (and developers using AI for coding), it makes no sense to go for pay-as-you-go API use, as anything interesting will burn more than the monthly subscription worth of $$$ in API costs in a few hours to days.
Yes but that's anthropic API pricing, some of the highest per token.
Sure, a subscription is a sort of tie-in, but only if users are fooled into investing in workflows bound to Anthropic. That's what the company is hooking them to do with this scheduler, banning open agentic frameworks and the rest.
The moat, if any, will be the tooling. Token is becoming a commodity, they know it.
> for some reason the industry stubbornly refuses to solve the "cron job as a service" problem for end-users, whether on the web or in the OS.
Such a service will always be destroyed by the bell-ends who want to run spam or worse activities.
That doesn't explain lack of such functionality at the OS/platform level. It technically exists on Linux and Windows, but is heavily optimized towards sysadmin use, and essentially hidden from regular users on the "normie UI surface". Most people don't even realize their computers could do things on a timer.
(And on Android, AFAIK there's exactly nothing at all. There's not even common support for any kind of basic automation; only recent exception is Samsung. From third-party apps, there's always been Tasker - very powerful, but the UX almost makes you want to learn to write Android apps instead.)
What is wrong with things like the Zapier scheduler? (ie https://zapier.com/apps/schedule/integrations) For running locally, there's also a plethora of cronlikes for every OS under the sun.
I think the core problem is not so much that it is not "allowed", but that even the most basic types of automation involve programming. I mean "programming" here in the abstract sense of "methodically breaking up a problem into smaller steps and control flows". Many people are not interested in learning to automate things, or are only interested until they learn that it will involve having to learn new things.
There is no secret conspiracy stopping people from learning to automate things, rather I think it's quite the opposite: many forces in society are trying to push people to automate more and more, but most are simply not interested in learning to do so. See for example the bazillion different "learn to code" programs.
It's not the default. People don't need courses for this, they need availability and nudges. None of the platforms people use expose such features to users, much less encourage them to try. On the contrary, they hide or remove it from the base UI layer entirely, and the UI choices made clearly suggest platform vendors don't even consider the possibility of regular people being interested.
Computing isn't, and has never been, demand-driven. It's all supply-driven. People choose from what's made available by vendors, and nobody bothers listening to user feedback.
I built this last year because I thought it was overdue back then already.
https://imgur.com/a/apero-TWHSKmJ
Cron triggers (or specific triggers per connector like new email in Gmail, new linear issue, etc for built in connectors).
Then you can just ask in natural language when (whatever trigger+condition) happens do x,y and z with any configuration of connectors.
It creates an agentic chain to handle the events. Parent orchestrator with limited tools invoking workers who had access to only their specific MCP servers.
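The shape of that chain, as a toy sketch rather than the actual implementation: plain Python callables stand in for MCP tools, and a fixed routing table stands in for the LLM-driven parent orchestrator. Every name here (`Worker`, `Orchestrator`, `build_demo`, the event and tool names) is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable

# Stand-in for an MCP tool: takes a text payload, returns a text result.
Tool = Callable[[str], str]


@dataclass
class Worker:
    name: str
    tools: dict[str, Tool]  # only the tools this worker is scoped to

    def call(self, tool_name: str, payload: str) -> str:
        if tool_name not in self.tools:
            raise PermissionError(f"{self.name} has no access to {tool_name}")
        return self.tools[tool_name](payload)


@dataclass
class Orchestrator:
    # The parent holds no tools itself; it only decides which worker runs.
    workers: dict[str, Worker]
    routes: dict[str, list[tuple[str, str]]] = field(default_factory=dict)

    def on(self, event_type: str, worker_name: str, tool_name: str) -> None:
        self.routes.setdefault(event_type, []).append((worker_name, tool_name))

    def handle(self, event_type: str, payload: str) -> list[str]:
        results = []
        for worker_name, tool_name in self.routes.get(event_type, []):
            results.append(self.workers[worker_name].call(tool_name, payload))
        return results


def build_demo():
    mail = Worker("mail", {"summarize": lambda text: f"summary({text})"})
    chat = Worker("chat", {"notify": lambda text: f"notified({text})"})
    orch = Orchestrator({"mail": mail, "chat": chat})
    orch.on("new_email", "mail", "summarize")
    orch.on("new_email", "chat", "notify")
    return orch, mail
```

The point of the split is that a worker can only reach the tools it was handed, so a confused or injected worker cannot touch another connector.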
Official connectors are just custom MCP servers and you could add your own MCP servers.
I definitely had the most advanced MCP client on the planet at that point, supporting every single feature of the protocol.
I think that's why I wasn't blown away by OpenClaw, I had been doing my own form of it for a while.
I need to release more stuff for people to play around with.
My friends had use cases like "I get too many emails from my kids' school I can't stay on top of everything".
So the automation was just asking "when I get an email from my kids' school, let me know if there's anything actionable for me in it"
So this is basically just Anthropic’s version of Open Claw that they manage for you and you pay them.
What's the per-unit-time compute cost (independent of tokens)? Compute deadline etc.? They don't charge for the Cloud Environment https://code.claude.com/docs/en/claude-code-on-the-web#cloud... currently running?
Here goes my project.
Better idea. Watch online feedback on this feature. Then implement things users want. Go niche. Join the forum and help them use Claude to its limits. Then be the next step for power users.
What were you working on?
Welcome to the Amazon playbook replayed again: the most useful, profitable and popular use cases will be implemented by the platform - and they will do it ruthlessly and quickly, as money needs to be recouped.
it would be easier to use claude to write a cronjob that does the same thing for you but accurately
And yet it probably covers 90% of what people use OpenClaw for.
Is this free? I don’t see pricing info. I guess just a way to make you forget that you’re spending money on tokens?
You don't spend money on tokens. It is a subscription.
The PHP script from a cron tab is back!
lmao
Is only Github supported as a repository?
This is powerful. Combined with MCPs, you can pretty much automate a ton of work.
Can you give some examples?
That feature was silently launched about a week ago for me.
I use it to:
- perform review of the latest code changes to update my documentation (security policies, user documentation etc.)
- perform review of the latest code changes, triage them, deduplicate and improve code - I review them, close them with comments for over-engineering / add a review for auto-fix
- perform review of open GitHub issues with a label, select the one with the highest impact, comment with rationale, implement it and make a pull request - I wake up and I have a few pull requests to fix issues that I can approve / finish in an existing Claude Code thread
I also want to use it to:
- review recent Sentry issues, make GitHub issues for the ones with the highest priority, make a pull request with a proposed fix - I can just wake up and see that some crash is ready to be resolved
The limit of 3 scheduled jobs is pretty impactful, but playing with it gave me a nice idea of how I can reduce my manual work.