Gemini 3.0 Pro – early tests

(twitter.com)

215 points | by ukuina 3 days ago ago

128 comments

  • simonw 3 days ago ago

    I've seen a bunch of tweets like this recently, as far as I can tell they're all from people using https://aistudio.google.com/ who got served an A/B test.

    A few more in this genre:

    https://x.com/cannn064/status/1973818263168852146 - "Make a SVG of a PlayStation 4 controller"

    https://x.com/cannn064/status/1973415142302830878 "Create a single, self-contained HTML5 file that mimics a macOS Sonoma-style desktop: translucent menu bar with live clock, magnifying dock, draggable/resizable windows, and a dynamic wallpaper. No external assets; use inline SVG for icons."

    https://x.com/synthwavedd/status/1973405539708056022 "Write full HTML, CSS and Javascript for a very realistic page on Apple's website for the new iPhone 18"

    I've not seen it myself so I'm not sure how confident they are that it's Gemini 3.0.

    • ajcp 3 days ago ago

      At this point until I see one run through the Pelican Benchmark I can't really take a new model seriously.

      • diggan 3 days ago ago

        Unfortunately, as with every public benchmark, once it ends up in the training sets and/or the developers are aware of it, it stops being effective, and I think we've started to reach that point.

        The only thing I've found to give me some sort of quantitative idea of how good a new model is, is my own private benchmarks. It doesn't cover everything I want to use LLMs for, and only has 20-30 tests per "category", but at least I'm 99% sure it isn't in the training datasets.
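
        Roughly, the harness is nothing fancy - something like the sketch below, where the categories, prompts, and call_model helper are made-up placeholders (the real tests stay offline):

          # Tiny private benchmark runner (illustrative sketch, not the real tests).
          TESTS = {
              "extraction": [
                  {"prompt": "Pull the invoice total out of: ...", "expect": "42.50"},
              ],
              "coding": [
                  {"prompt": "Write a Python one-liner that reverses a string.", "expect": "[::-1]"},
              ],
          }

          def call_model(model: str, prompt: str) -> str:
              """Placeholder for whatever API client you use (OpenAI, Gemini, a local model, ...)."""
              raise NotImplementedError

          def run(model: str) -> None:
              for category, cases in TESTS.items():
                  passed = sum(1 for c in cases if c["expect"] in call_model(model, c["prompt"]))
                  print(f"{model} {category}: {passed}/{len(cases)}")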

        • simonw 3 days ago ago

          I have a few "SVG of an X riding a Y" tests that I don't publish online which I run occasionally to see if a model is suspiciously better at drawing a pelican riding a bicycle than some other creature on some other form of transport.

          I would be so entertained if I found out an AI lab had wasted their time cheating on my dumb benchmark!

          • ajcp 3 days ago ago

            > I would be so entertained if I found out an AI lab had wasted their time cheating on my dumb benchmark!

            Cue intro: "The gang wastes their time cheating on a dumb benchmark"

            • mcny 3 days ago ago

              A shower thought I just had: there must be some AI training company somewhere that has ingested all of It's Always Sunny in Philadelphia, not just the text but all the video from every episode somehow...

          • diggan 3 days ago ago

            > I would be so entertained if I found out an AI lab had wasted their time cheating on my dumb benchmark!

            I don't think it's necessarily "cheating", it just happens as they're discovering and ingesting large ranges of content. That's the problem with public content: it's bound to be included sooner or later, directly or indirectly.

            Nice to hear you have some sort of contingency though, and I'm looking forward to the inevitable blog post announcing the change to a different bird and vehicle :)

            • svachalek 3 days ago ago

              The thing is, most of the discussion about it is embarrassingly bad SVGs, so training on them would actually hurt their performance.

              • JSR_FDED 2 days ago ago

                Regrettably AI is still better at SVG than I am

          • jgalt212 2 days ago ago

            Your benchmark may or may not be dumb, but it is definitely widely followed. So much so this is what Bing AI has to say on the matter.

            > Absolutely — the “pelican riding a bicycle” SVG test is a quirky but clever benchmark created by Simon Willison to evaluate how well different large language models (LLMs) can generate SVG (Scalable Vector Graphics) images from a prompt that’s both unusual and unlikely to be in their training data.

          • fragmede 2 days ago ago

            But how would you know it's from what you would consider cheating as opposed to pelicans on bicycles existing in the latest training data? Obviously your blog gets fed into the training set for GPT-6, as well as everyone else talking about your test, so how would the comparison to a secret X riding a Y tell you if an AI lab is cheating as opposed to merely there being more examples in the training data?

            • simonw 2 days ago ago

              Mainly because if they train on the pelican on bicycle SVGs from my blog they are going to get some very weird looking pelicans riding some terrible looking bicycles.

              • fragmede a day ago ago

                It's not that I'm claiming they're training on SVG pelicans on bicycles from your blog, it's that thanks to your popularity, there are simply now more pictures of pelicans on bicycles floating around on the Internet and thus in ChatGPT's training data. Eg https://www.reddit.com/r/ColoredPencils/comments/1l9l4fq/pel...

                How would you determine that improvements to SVG pelicans on bicycles (and not your secret X on Ys) are from an OpenAI employee cheating your benchmark vs an improvement on pelicans on bicycles thanks to that picture from Reddit and everywhere else in the training data?

          • Imustaskforhelp 3 days ago ago

            Please do let us know through your blog if you ever find AI labs cheating on your benchmark.

            But now I am worried that since you have shared that you do the SVG of an X riding a Y thing, maybe these models will try to cheat on the whole SVG of X riding Y thing instead of hyper-focusing on the pelican.

            So now I suppose you might need to come up with an entirely new thing though :)

            • throwup238 3 days ago ago

              There are so many X and Y combinations that I find it hard to believe they could realistically train for even a small fraction of them. Someone has to generate the graphics output for the training.

              A duck billed platypus riding a unicycle? A man o' war riding a pyrosome? A chicken riding a Quetzalcoatlus? A tardigrade riding a surf board?

              • gnatolf 3 days ago ago

                You're assuming that given the collection of simonw's publicly available blog posts, the creativity of those combinations can't be narrowed down. Simply reverse engineer his brain this way and you'll get your Xs and Ys ;)

                • throwup238 3 days ago ago

                  I feel like that would overfit on various snakes like pythons.

              • brianjking 2 days ago ago

                I must say that I loved the idea of a tardigrade riding a surfboard. You're welcome.

                Granted not an SVG, but still awesome.

                https://imgur.com/a/KsbyVNP

              • fragmede 3 days ago ago

                If we accept ChatGPT telling me that there are approximately 200k common nouns in English, and then we square that, we get 40 billion combinations. At one second per pair, that's ~1,270 years, but if we parallelize it on a supercomputer that can do 100,000 per second, it would only take about five days. Given that ChatGPT was trained on all of the Internet and every book ever written, that no longer seems infeasible to me.
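
                The back-of-the-envelope arithmetic, for anyone who wants to check it:

                  # Rough feasibility estimate for brute-forcing "X riding a Y" pairs.
                  nouns = 200_000                  # ChatGPT's (unverified) count of common English nouns
                  pairs = nouns ** 2               # 40,000,000,000 ordered combinations

                  seconds_per_year = 60 * 60 * 24 * 365
                  print(pairs / seconds_per_year)  # ~1,270 years at one pair per second
                  print(pairs / 100_000 / 86_400)  # ~4.6 days at 100,000 pairs per second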

                • throwup238 2 days ago ago

                  It still can't satisfactorily draw a pelican on a bicycle because that's either not in the training data or the signal is too weak, so why would it be able to satisfactorily draw every random noun-riding-noun combination just because you threw a for loop at it?

                  The point is that in order to cheat on @simonw's benchmark across any arbitrary combination, they'd have to come up with an absurd number of human-crafted input-output training pairs with human-produced drawings. You can't just ask ChatGPT to generate every combination because all it'll produce is garbage that gets a lot worse the further you get from a pelican riding a bicycle.

                  It might work at first for the pelican and a few other animal/transport combinations, but what does it even mean for a man o' war riding a pyrosome? I asked every model I have access to to generate an SVG for a "man o' war riding a pyrosome" and not a single one managed to draw anything resembling a pyrosome. Most couldn't even produce something resembling a man o' war, except as a generic ellipsoid-shaped jellyfish with a few tentacles.

                  Expand that to every weird noun-noun combination and it's just not practical to train even a tiny fraction of them.

                  • fragmede 2 days ago ago

                    https://chatgpt.com/share/68def5c5-8ca4-8009-bbca-feabbe0651...

                    Man o' war on a pyrosome. I don't know what you expected it to look like, maybe it could be more whitish translucent instead of orange, but it looks fairly reasonable to me. Took a bit over a minute with the ChatGPT app.

                    Simonw's test is for the text-only output from an LLM to write an SVG, not "can a multimodal AI in 2025 generate a PNG". Because people who read his blog wanted to see one, there are now raster images from image generation models in the training data that fairly convincingly look like the thing described. Now that there are PNGs of pelicans on bicycles, we would expect GPT-6 to be better at generating SVGs of something it's already "seen".

                    We don't know what simonw's secret combo X and Y is, nor do I want to know, because that would ruin the benchmark (if it isn't ruined already by virtue of him having asked it). 200k nouns is definitely high though. A bit of thought could cut it down to exclude concepts and lot of other things. How much spare GPU capacity OpenAI has, I have no idea. But if I were there, I'd want the GPUs to be running as hot as the cloud provider would let me run them, because they're paying per hour, not per watt, and have a low-priority queue of jobs for employees to generate whatever extra training data they can think of on their off hours.

                    Oh and here's the pelican PNG so the other platforms can crawl this comment and slurp it up.

                    https://chatgpt.com/share/68def958-3008-8009-91fa-99127fc053...

          • reissbaker 3 days ago ago

            I doubt they'd cheat that obviously... But "SVG of X" has become common enough that I suspect most frontier labs train on it, especially since the models are multimodal now anyway.

            Not that I mind; I want models to be good at generating SVG! Makes icons much simpler.

          • 3 days ago ago
            [deleted]
        • latemedium 3 days ago ago

          We need to know if big AI labs are explicitly training models to generate SVGs of pelicans on bicycles. I wouldn't put it past them. But it would be pretty wild if they did!

        • londons_explore 3 days ago ago

          As soon as you use your private tests, all the AI companies vacuum up the input to use to train the next model.

          Obviously they're only getting the question and not a perfect answer, but with today's process of generating hundreds of potential answers and getting another model to choose the best/correct one for training, I don't think that matters.

          • astrange 2 days ago ago

            Are the models capable of judging a good SVG? They can't read ASCII art.

            • londons_explore 2 days ago ago

              If you give the 'judge' models tool use, they could easily fire up a web browser to render an SVG and then use an ImageNet classifier or something to see how 'pelican-y' the result is.
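
              Something like this untested sketch, assuming cairosvg for rasterization and a stock torchvision ImageNet classifier (the class indices and the 'pelican-y' scoring heuristic are guesses, not anything the labs are known to run):

                # Render an SVG and ask an ImageNet classifier how pelican-y it looks.
                import io

                import cairosvg
                import torch
                from PIL import Image
                from torchvision import models

                weights = models.ResNet50_Weights.IMAGENET1K_V2
                classifier = models.resnet50(weights=weights).eval()
                preprocess = weights.transforms()

                def pelican_score(svg_text: str) -> float:
                    png = cairosvg.svg2png(bytestring=svg_text.encode("utf-8"),
                                           output_width=224, output_height=224)
                    image = Image.open(io.BytesIO(png)).convert("RGB")
                    with torch.no_grad():
                        probs = classifier(preprocess(image).unsqueeze(0)).softmax(dim=-1)[0]
                    # ImageNet has no "pelican on a bicycle" class, so sum a few related
                    # classes (pelican, albatross, mountain bike) as a crude proxy.
                    return float(probs[144] + probs[146] + probs[671])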

        • Workaccount2 2 days ago ago

          I honestly think people really blow out of proportion the effect of "being in the training set". The internet is riddled with examples of problem/solution posts that many models definitely trained on, but still get wrong.

          More important would be post-training, where the labs specifically train on the exact question. But it doesn't seem like this is happening for most amateur benchmarks, at least. All the models that are good at the pelican-on-a-bicycle test have also been good at whatever else you throw at them to draw in SVG.

        • ajcp 3 days ago ago

          That's the move right there.

    • ceejayoz 3 days ago ago

      > a very realistic page on Apple's website…

      Is this supposed to be a good example?

      It looks like something I'd put together, and you don't want me doing design work.

    • epolanski 2 days ago ago

      Jm2c, I couldn't care less about these vibe-code style benchmarks and I'm sick of them.

      The things that I see represented here may or may not be impressive, but they sure as hell have never been the major blockers to achieving progress on complex tasks and software.

      I understand you're merely reporting, thank you for that, not criticizing you, but those tests are absolutely irrelevant.

      • mnk47 a day ago ago

        In my experience, the model's performance in silly tasks like these is usually (not always) correlated with its performance in other areas except tool use/agent stuff.

  • strongpigeon 3 days ago ago

    Google's biggest problem in my opinion (and I'm saying that as an ex-googler) is that Google doesn't have a product culture. Google had the tech for something like ChatGPT for a long time, but couldn't come up with that product. Instead it had to rely on another company showing it the way and then copy them and try to out-engineer them...

    I still think ultimately (and somewhat sadly) Google will win the AI race due to its engineering talent and the sheer amount of data it has (and Android integration potential).

    • thewebguyd 3 days ago ago

      > is that Google doesn't have a product culture.

      This is evident in Android and the Pixel lineup, which could be my favorite phone if not for some of the most baffling and frustrating decisions that lead to a very weirdly disjointed app experience (compared to something like iOS's first-party tools).

      Like removing location-based reminders from Google Tasks, for some reason? Still no Apple Shortcuts-like automation built in. Keep can still do location-based reminders, but it's a notes app, so which am I supposed to use, Google Tasks or Keep? Well, Gemini adds reminders to Google Tasks and not Keep, if I wanted to use Keep primarily.

      If they just spent some time polishing and integrating these tools, and added some of their ML magic, they'd blow Apple out of the water.

      All of Google's tech is cool and interesting from a tech standpoint, but it's not well integrated into a full consumer experience.

      • rubslopes 2 days ago ago

        I still can't fathom how one of my favorite Android features simply disappeared years ago: the 'time to leave' notification for calendar appointments with address info.

      • xooooogler 3 days ago ago

        Google recently let go ALL -- EVERY SINGLE -- L3/L4/L5 UX Researcher

        https://www.thevoiceofuser.com/google-clouds-cuts-and-the-bi...

        Could it be argued that perhaps UX Research was not working at all? Or that their recommendations were not being incorporated? Or that things will get even worse now without them?

        • tanaros 2 days ago ago

          The link says:

          > Some teams in the Google Cloud org just laid off all UX researchers below L6

          That’s not all UX researchers below L6 in the entire company. It doesn’t even sound like it’s all UX researchers below L6 in Google Cloud.

        • seemaze 3 days ago ago

          Maybe Apple should follow suit.. I jest, but I’m still processing the liquid glass debacle.

          • thewebguyd 3 days ago ago

            At least it's uniform. Unlike Material 3 Expressive, which might look different depending on the app, or not be implemented at all, or be only half implemented even in some of Google's own apps, much like with every other Android redesign.

            I get Google can't force it on all the OEMs with their custom skins, but they can at least control their own PixelOS and their own apps.

            • layer8 2 days ago ago

              It’s not uniform at all. Some parts of the interface and of their apps get it, others don’t. Some parts look more glassy, some more frosty. It’s all over the place in terms of consistency. It’s also quite different between Apple’s various OSs, although allegedly the purpose was to unify their look.

    • byefruit 3 days ago ago

      And even when it does copy other products, it seems to be doing a terrible job of them.

      Google's AI offering is a complete nightmare to use. Three different APIs, at least two different subscriptions, documentation that uses them interchangeably.

      For Gemini's API it's often much simpler to actually pay OpenRouter the 5% surcharge to BYOK than deal with it all.

      I still can't use my Google AI Pro account with gemini-cli..

      • gardnr 3 days ago ago

        Then there's the billing dashboards...

        It's amazing how they can show useless data while completely obfuscating what matters.

        • ur-whale 3 days ago ago

          Yeah, the whole billing death march is what ended up making me pick OpenAI as my main workhorse instead of GOOG.

          Not enough brain cycles to figure out a way to give Google money, whereas the OpenAI subscription was basically a no-brainer.

      • cshores 3 days ago ago

        As of this week you can use gemini-cli with Google AI Pro

      • specproc 3 days ago ago

        I had great fun this week with the batch API. A good morning lost trying to work out how to do a not particularly complex batch request via JSONL.

        The python library is not well documented, and has some pretty basic issues that need looking at. Terrible, unhelpful errors, and "oh, so this works if I put it in camel-case" sort of stuff.

      • leobg 2 days ago ago

        litellm + gemini API key?

        I find Gemini is their first API that works like that. Not like their pre-Gemini vision, speech recognition, Sheets, etc. Those were/are a nightmare to set up indeed.
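
        For what it's worth, the litellm route really is about this short (the model name is just an example; swap in whichever Gemini version you're targeting):

          # pip install litellm; set GEMINI_API_KEY in the environment.
          from litellm import completion

          response = completion(
              model="gemini/gemini-2.5-pro",  # litellm's provider/model naming for Google AI Studio
              messages=[{"role": "user", "content": "Summarize this error log: ..."}],
          )
          print(response.choices[0].message.content)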

    • sho_hn 3 days ago ago

      To be fair, according to OpenAI they started ChatGPT as a demo/experiment and were taken by surprise when it went viral.

      It may well be that they also didn't have a product culture as an organization, but were willing to experiment or let small teams do so.

      It's still a lesson, but maybe a different one.

      With organizational scale it becomes harder and harder to launch experiments under the brand. Red tape increases, outside scrutiny increases. Retaining the ability to do that is difficult.

      Google does experiment a fair bit (including in AI, e.g. NotebookLM and its podcast feature are, I think, a standout example of trying to see what sticks) but they also tend to try to hide their experiments in developer portals nowadays, which makes it difficult to get a signal from a general consumer audience.

      • dudeinhawaii 2 days ago ago

        If I can take a slight tangent: this is what I will remember OpenAI for. Not the closed vs. open debate. They caused the democratization of access to AI models. Prior to ChatGPT, I would hear about these great models DeepMind and Google were developing. They'd always stay closed behind the walls of Google.

        OpenAI forced Google to release, and as a result we have all of the AI tooling, integrations, and models. Meta's leaning into the leaked Llama weights took this further and sparked the open-source LLM revolution (in addition to the myriad contributors and researchers who built on that).

        If we had left it to Google, I suspect they'd release tooling (as they did with TensorFlow) but not an LLM that might compete with their core product.

      • thereitgoes456 2 days ago ago

        According to Karen Hao's Empire of AI, this is only half accurate. And I trust what Karen Hao says a lot more.

        OpenAI mistakenly thought Anthropic was about to launch a chatbot, and ChatGPT was a scrappy, rushed-out-the-door product made from an intermediate version of GPT-4, meant to one-up them. Of course, they were surprised at how popular it became.

        • kristianp 2 days ago ago

          Do you mean an intermediate version of GPT-3? That's more the timeline I'm thinking.

      • ajcp 3 days ago ago

        > With organizational scale it becomes harder and harder to launch experiments under the brand

        I feel like Google tried to solve for this with their `withgoogle.com` domain and it just ends up being confusing or, worse still, frustrating when you see something awesome and then nothing ever comes of it.

      • strongpigeon 3 days ago ago

        Google is definitely good at experimenting (and yeah, NotebookLM is really cool), which is a product of the bottom-up culture. The lack of a consistent story with regard to AI products, however, is a testament to the lack of product vision from the top.

        • ajcp 3 days ago ago

          NotebookLM came out of Google Labs though, and in collaboration with outside stakeholders. I'm not sure I would call it a success of "bottom-up" culture, but a well realized idea from a dedicated incubator. That doesn't necessarily mean the rest of the company is so empowered or product oriented.

    • jvolkman 2 days ago ago

      I don't think Google was ever going to be the first to productize an LLM. LLMs say stupid shit - especially in the early days - and would've just attracted even more bad press if Google had been the front runner. OpenAI came along as a small, move-fast-and-break-things entity and introduced this tech to the public, and Google (and others) was able to join the fray after that seal was broken.

      • elcritch 2 days ago ago

        Good point, if Google had released the first version of Bard or whatnot as the first LLM it probably would've received some good press but also a lot of "eh just another Google toy side project". I could've seen myself saying that.

        • danielbln 2 days ago ago

          It would've joined the Google graveyard for sure.

    • stingraycharles 2 days ago ago

      This has plagued Google internally for decades. I'm reminded of Steve Yegge's Google rant [1] from 14 years ago, and ChatGPT is evidence that they still haven't fixed it.

      It's amazing how pervasive company cultures can be, how this comes from the top, and how it can only be fixed by replacing leadership with an extremely talented CEO who knows the company inside out and can change its course. Nadella at Microsoft comes to mind, although that was more about Microsoft going back to its roots (replacing sales-oriented leadership with product-oriented leadership again).

      Google never had product oriented leadership in the same way that Amazon, Apple and Microsoft had.

      I don’t think this will ever change at this point.

      For those who haven’t read it, Steve Yegge’s rant about Google is worth your time:

      [1] https://gist.github.com/chitchcock/1281611

    • xnx 3 days ago ago

      > Google doesn't have a product culture

      Fair criticism that it took someone else to make something of the tech that Google initially invented, but Google has been furiously experimenting across all their active products since Sundar's "code red" memo.

    • renewiltord 3 days ago ago

      Well, they had an internal ethics team that told them that their technology was garbage. That can't help. The other guys' ethics teams are all like "Our stuff is too awesome for people to use. No one should have this kind of unbridled power. We must muzzle the beast before a tourist rides him" and Google's ethics team was like "our shit sucks lol this is just a Markov chain parrot doesn't do shit it's garbage".

      • Filligree 3 days ago ago

        Which, to be fair—we're talking about the pre-GPT-3.5 era—it kind of was?

        • charcircuit 2 days ago ago

          Don't you remember all of the scaremongering around how unethical it would be to release a GPT-3 model publicly?

          Google personally reached out to someone trying to reproduce GPT-3 and convinced him to abandon his plan of releasing it to the public.

          • mlsu 2 days ago ago

            There was scaremongering about releasing GPT-2.

            GPT-2!!

          • Imustaskforhelp 2 days ago ago

            And here we are, after DeepSeek and the Qwen models and so much more, like GLM 4.6, which are reaching SOTA of sorts.

          • pixl97 2 days ago ago

            I mean, the number of scams that have occurred since then due to LLMs has increased, so it's not exactly wrong.

        • renewiltord 3 days ago ago

          The unfortunate truth when you're on the cusp of a new technology: it isn't good yet. Keeping a team of guys around whose sole job it is to tell you your stuff sucks is probably not aligned with producing good stuff.

          • elcritch 2 days ago ago

            There's almost an "uncanny valley" type situation with good products. New technologies start out promising but rough. Then, as they get better and come closer to being a "good product", the ways in which they're not there yet stand out more, so they can feel worse than a mediocre product. Until it's done.

          • nicr_22 2 days ago ago

            There's a world of difference between saying "our stuff sucks" vs "here are the specific ways our stuff isn't ready for launch". The former is just whining, the latter is what a good PM does.

    • raincole 2 days ago ago

      And we (average users) are really lucky for that. Imagine a world where Google had been pushing AI products from the start. OpenAI and other competitors would not have stood a chance, and it would have had ads by 2024. They'd have captured hundreds of billions of value by now.

      The fact that Attention Is All You Need was freely available online was, in hindsight, unbelievably fortunate.

    • HarHarVeryFunny 2 days ago ago

      OpenAI were the ones that came up with RLHF, which is what made ChatGPT viable.

      Without RLHF, LLM-based chat was a psychotic liability.

    • adventured 3 days ago ago

      Along with its engineering talent and resource scale, I think their in-house chips are one of their core advantages. They can scale in a way that their peers are going to struggle to match, and at much lower cost. Nvidia's extreme margins are Google's opportunity.

    • 3 days ago ago
      [deleted]
    • wmf 3 days ago ago

      Didn't Google have Bard internally around the same time as ChatGPT?

      • blueg3 2 days ago ago

        Bard came out shortly after ChatGPT as a prototype of what would become Gemini-the-chatbot.

        There were other, less-available prototypes prior to that.

      • Rebelgecko 2 days ago ago

        Meena/LaMDA were around the same time as GPT-2

      • eternal_braid 3 days ago ago

        Search for Meena from Google.

    • dyauspitr 2 days ago ago

      Why sadly? I’d rather the originators of the technology win.

      • lurking_swe 2 days ago ago

        It's a different skill set, and also partially company culture.

        For example, does a CSS expert know how to design a great website? _Maybe_… but knowing the CSS spec in its entirety doesn't (by itself) help you understand how to make useful or delightful products.

    • killerstorm 3 days ago ago

      ChatGPT-3.5 was more of a novelty than a product.

      It would be weird to release that as a serious company. They tried making a deliberately-wacky chatbot but it was not fun.

      Letting OpenAI release it first was the right move.

      • Imustaskforhelp 3 days ago ago

        To me, ChatGPT 3 and ChatGPT 3.5 were the phenomenal leap in intelligence, and I appreciated ChatGPT 3 a lot, more so than even the current models. It had its quirks, but it was such a good model, man.

        I remember building a really simple, dead simple SvelteKit website with ChatGPT 3. It was good, it was mind-blowing, and I was proud of it.

        The only interactivity was a button which would go from one color to another and then lead to a PDF.

        If I am going to be honest, the UI was genuinely good. It was great and still gives me more nostalgia and good vibes than current models. Em-dashes weren't that common in ChatGPT 3 IIRC, but I have genuinely forgotten what it was like to talk to it.

    • londons_explore 3 days ago ago

      > Android integration potential

      Nearly all the people that matter use iPhone... Yet Apple really hasn't had much success in the AI world, despite being in a position to win if their product were even vaguely passable.

  • robots0only 3 days ago ago

    In all of these posts there is someone claiming Claude is the best, then somebody else claiming they have tried a bunch of times and for them Gemini is the best while others find GPT-5 is supreme. Obviously, all of these are subjective narrow experiences. My conclusion is that all frontier models are both good and bad with no clear winner and making good evals is really hard.

    • SkyPuncher 3 days ago ago

      I'll be that person:

      * Gemini has the highest ceiling out of all of the models, but has consistently struggled with token-level accuracy. In other words, its conceptual thinking is well beyond other models, but it sometimes makes stupid errors when talking. This makes it hard to reliably use for tool calling or structured output. Gemini is also very hard to steer, so when it's wrong, it's really hard to correct.

      * Claude is extremely consistent and reliable. It's very, very good at the details - but will start to forget things if things get too complex. The good news is Claude is very steerable and will remember those details if you remind it.

      * GPT-5 seems to be completely random for me. It's so inconsistent that it's extremely hard to use.

      I tend to use Claude because I'm the most familiar with it and I'm confident that I can get good results out of it.

      • artdigital 3 days ago ago

        I'd say GPT-5 is the best at following and remembering instructions. After an initial plan it can easily continue with said plan for the next 30-60 minutes without human intervention, and come back with a complete, working, finished feature/product.

        It's honestly crazy how good it is, coming from Claude. I never thought I could already pass it a design doc and have it one-shot the entire thing with such a level of accuracy. Even with Opus, I always need to either steer it, or fix the stuff it forgot by hand / have another phase afterwards to get it from 90% to 100%.

        Yes the Codex TUI sucks but the model with high reasoning is an absolute beast, and convinced me to switch from Claude Max to ChatGPT Pro

      • Workaccount2 2 days ago ago

        Gemini is also the best for staying on the ball (when it does) over long contexts.

        It's really the only model that can do large(er) codebase work.

        • brulard 2 days ago ago

          Claude can do large code bases too, you just need to make it focus on parts that matter. Most of the coding tasks should not involve all parts of the code, right?

      • bcrosby95 3 days ago ago

        GPT-5 seems best at analyzing the codebase for me. It can pick up nuances and infer strategies Claude and Gemini seem to fail at.

      • Alex-Programs 3 days ago ago

        Personally I prefer Gemini because I still use AI via chat windows, and it can do a good ~90k tokens before it starts getting stupid. I've yet to find an agent that's actually useful and doesn't constantly fuck up everywhere while burning money.

    • Keyframe 3 days ago ago

      The answer is a classic programming one - it depends. There are definitely differences in strengths and weaknesses among them.

      I run Claude CLI as a primary and just ask it nicely to consult Gemini CLI (but not let it do any coding). It works surprisingly well. OpenAI just fell out of my view; I even cancelled my ChatGPT subscription. Gemini is leaping forward and it _feels like_ GPT-5 is a regression. I can't put my finger on it, tbh.

    • Robdel12 3 days ago ago

      Yeah, my take is it’s sort of up to the person using the LLM and maybe how they match to that LLM. That’s my hunch as to why we hear wildly different takes on these LLMs working for people. Gemini can be the most productive model for some while others find it entirely unworkable.

      • jiggawatts 3 days ago ago

        Not just personalities and preferences, but the purpose for which the AI is being used also affects the results. I primarily use AIs for complex troubleshooting along the lines of: "Here's a megabyte of logs, an IaC template, and a gibberish error code. What's the reason?" Right now, only Gemini Pro 2.5 has any chance of providing a useful output given those inputs, because its long-context attention is better than any other model's.

    • binary132 3 days ago ago

      The fact that there is so much astroturf out there also makes it difficult to evaluate these claims

    • smoe 3 days ago ago

      Capability-wise, they seem close enough that I don't bother re-evaluating them against each other all the time.

      One advantage Gemini had (or still has, I'm not sure about the other providers) was its large context window combined with the ability to use PDF documents. It probably saved me weeks of work on an integration with a government system: I could upload hundreds of pages of documentation and immediately start asking questions, generating rules, and troubleshooting payloads that were leading to generic, computer-says-no errors.

      No need to go through RAG shenanigans, and all of it within the free token allowance.
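
      The workflow was roughly the sketch below, using the google-genai SDK (file name and question are placeholders, and exact method names may differ between SDK versions):

        # pip install google-genai; set GEMINI_API_KEY in the environment.
        from google import genai

        client = genai.Client()

        # Upload the documentation once; it rides along in the prompt context
        # instead of going through a RAG pipeline.
        doc = client.files.upload(file="integration_spec.pdf")

        response = client.models.generate_content(
            model="gemini-2.5-pro",
            contents=[doc, "Why would this payload be rejected with a generic error?"],
        )
        print(response.text)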

    • qaq 3 days ago ago

      In my experience Gemini is good at writing specs, hit or miss at reviewing code, and not really usable for iterating on code. Codex is slow but can crack issues that Claude Code struggles with. So my workflow has been to use all three to iterate on specs, have Claude Code work on the implementation, and have Codex review Claude Code's work (sometimes I have Gemini double-check it).

    • mlsu 2 days ago ago

      Because how good a model is is mostly just what the training data is at this point.

      It's like the personality of a person. Employee A is better at talking to customers than Employee B, but Employee B is better at writing code than Employee A. Is one better than the other? Is one smarter than the other? Nope. Different training data.

  • ACCount37 3 days ago ago

    I hope this is the one that unfucks the multi-turn instruction following.

    One of the biggest issues holding Gemini back, IMO, compared to the competitors.

    Many LLMs are still plagued by "it's easier to reset the conversation than to unfuck the conversation", but Gemini 2.5 is among the worst.

    • solarkraft 3 days ago ago

      Gemini's loops are a real problem. Within a few minutes of using it in the CLI it happened to me ("I can verify that I fulfilled the user's request, I can verify that I fulfilled the user's request ..."). It's telling that the CLI has detection for this.

      The other day I asked 2.5 Pro for suggestions. It would provide one, which I rejected with some reasoning. It would provide another, which I also rejected. Asked for more, it would then loop between the two, repeating the previous suggestions verbatim. It went on like that 3-4 times, even after being told to reflect on it, and even though it was able to recite the rejection reasons.

  • maerch 3 days ago ago

    I still have a bad taste in my mouth after all those GPT-5 hype articles that claimed the model was just one step away from AGI.

    • gardnr 3 days ago ago

      TBF, they all believed that scaling reinforcement learning would achieve the next level. They had planned to "war-dial" reasoning "solutions" to generate synthetic datasets which achieved "success" on complex reasoning tasks. This only really produced incremental improvements at the cost of test-time compute.

      Now Grok is publicly boasting PhD level reasoning while Surge AI and Scale AI are focusing on high quality datasets curated by actual PhD humans.

      Surge AI is boasting $1B in revenue, and I am wondering how much of that was paid in X.ai stock: https://podcasts.apple.com/us/podcast/the-startup-powering-t...

      In my opinion the major advancements of 2025 have been more efficient models. They have made smaller models much, much better (including MoE models) but have failed to meaningfully push the SoTA on huge models; at least when looking at the USA companies.

      • ACCount37 2 days ago ago

        Raw model size is still pegged by the hardware.

        You can try to build a monster the size of GPT-4.5, but even if you could actually make the training stable and efficient at this scale, you still would suffer trying to serve it to the users.

        Next generation of AI hardware should put them in reach, and I expect that model scale would grow in lockstep with new hardware becoming available.

      • svachalek 3 days ago ago

        Same. Qwen3 Omni blows my mind for what a 30B-A3B model can do. I had a video chat with it and it correctly identified plant species I showed it.

    • adastra22 2 days ago ago

      Without defining “AGI” that’s always true, and trivially so.

  • vunderba 3 days ago ago

    Outside of the aesthetic, the very first example on that twitter post is "balls bouncing around a constrained rotating rigid physics environment" which has been trivially one-shottable since Claude Code was first announced.

    It was one of the first things I tried when Claude Code went GA:

    https://gondolaprime.pw/hex-balls

    • Synaesthesia 3 days ago ago

      They have differing degrees of fidelity to the simulation; this one looks pretty good and it's got parameters. But yes, the LLMs are really advanced now in what they can do. I was actually blown away during the Gemini 2.5 announcement by some of the demos people came up with.

  • daft_pink 2 days ago ago

    Did they fix the fact that they train on your data on personal plans that you pay for unless you disable chat history?

    They are literally the worst major provider in terms of privacy for consumer paid service.

  • theknarf 2 days ago ago

    The current problem with Gemini 2.5 Pro is not that it's not intelligent or can't one-shot problems; the problem is that it's _terrible_ at tool calling and wastes most of its context trying to correct itself from mistaken tool calls. If they can solve that with 3.0 then they may have a useful model for agentic coding; if not, it's not keeping up with Anthropic and OpenAI.

    • XCSme 2 days ago ago

      I wanted to use Gemini 2.5 Pro, but couldn't, because their structured output response is broken (it doesn't support all JSON schema properties, or often simply returns garbage).
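
      For reference, this is roughly the structured output path I mean, with the google-genai SDK (the schema and prompt are illustrative); the complaint is that only a subset of JSON schema keywords is honored and the result sometimes doesn't match the schema at all:

        # Structured output with a response schema (illustrative sketch, google-genai SDK).
        from google import genai
        from pydantic import BaseModel

        class Invoice(BaseModel):
            vendor: str
            total: float

        client = genai.Client()
        response = client.models.generate_content(
            model="gemini-2.5-pro",
            contents="Extract the vendor and total from: ACME Corp, $42.50",
            config={
                "response_mime_type": "application/json",
                "response_schema": Invoice,  # only a subset of JSON schema is supported
            },
        )
        print(response.text)  # JSON string; may still need validation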

  • nharada 3 days ago ago

    Gemini has always been the leader in multimodal work like images and video. I expect this won't be any different, but I am interested to see how it is.

  • whywhywhywhy 3 days ago ago

    These influencer tests are so pointless and don't represent the reality of model use at all, since things are constantly being downgraded once people actually use them.

    Not to mention every team will have the bouncing balls in the polygon in their dataset now.

  • Incipient 2 days ago ago

    All these AI reviews seem to be following the axiom(?) "proof of the pudding is in the eating" but frankly I don't think that applies to code.

    I can't get even GPT-5 to create a new feature without generating completely awful code - making up facts where it can't work out how the feature fits into the rest of the code - and spawning error-ridden, unmaintainable functionality.

    I've spent this whole week debugging AI trash. And it's not fun.

  • geraldalewis 2 days ago ago

    This seems like a parody, but I think it's not.

  • baxuz 2 days ago ago

    Is there a source that isn't Twitter?

  • alberth 2 days ago ago

    Am I the only one who struggles to even find where to use Google's AI offerings?

    It took me way too long to figure out how to even access & use Veo 3.

    It’s like Google doesn’t know how to package a product.

    • seandoe 2 days ago ago

      gemini.google.com

  • 3 days ago ago
    [deleted]
  • Oras 3 days ago ago

    These tests mean nothing; I've yet to see a model that is better than Sonnet 4 for coding. I've tried many, and all of them are sub-par, even with a small code base.

    • nnevatie 3 days ago ago

      Well, Codex with GPT-5 High beats Claude Sonnet 4.5 - this is anecdotal, but I've used both extensively.

      • solarkraft 3 days ago ago

        At what speed? At some point you’ll have to compare to Opus.

        • adastra22 2 days ago ago

          And Sonnet 4.5 is better than Opus.

    • Bolwin 3 days ago ago

      Well yeah no surprise. You should try glm 4.6

      • Oras 2 days ago ago

        I tried it, and it was shockingly bad compared to their benchmarks and to Claude Sonnet 4.

        I tried it with the Claude Code CLI; it didn't follow instructions correctly (I had a Claude.md file with clear instructions), stopped after only a few implementation steps (less than 3 minutes), and produced code that does not work.

        To give it the benefit of the doubt, I changed the instructions to target Next.js, as I thought it's a well-known framework and it might do better, but still, same quality issues.

    • adastra22 2 days ago ago

      Well, Sonnet 4.5 is better.

  • esafak 3 days ago ago

    We can't see the code and the challenge is pedestrian. Nothing to see here.

  • renewiltord 3 days ago ago

    Every three months there's some mind blowing hype around a Google product, lots of people talk about it, and then when I use it it's not nearly as good.