Diffusion for World Modeling

(diamond-wm.github.io)

376 points | by francoisfleuret 12 hours ago ago

176 comments

  • smusamashah 9 hours ago ago

    This video https://x.com/Sentdex/status/1845146540555243615 looks way too much like my dreams. This is almost exactly what happens when I sometimes try to jump high: it transports me to a different place just like that. Things keep changing in the same way. It's amazing to see how close it is to a real dream experience.

    • kleene_op 7 hours ago ago

      I noticed that all text looked garbled when I had some lucid dreams. When diffusion models started to gain attention, I made the connection that text in generated images also looked garbled.

      Maybe all of those are clues that parts of the human subconscious mind operate pretty close to the principles behind diffusion models.

      • qwertox 4 hours ago ago

        I don't think lucid dreaming is a requirement for this. Whenever I dream, my environment morphs into another one, scene by scene. Things I try to get details from, like the content of a text, refuse to appear clearly enough to extract any meaningful information, no matter what I try.

      • smusamashah 6 hours ago ago

        I also lucid dream occasionally. Very rarely are things very detailed; most often the colors and details are just as bleak and blurry, and keep changing, as in these videos. I walk down a street, take a turn (or not), and it's almost guaranteed I can't go back to where I came from. I usually appreciate it when I can retrace the same path.

      • sci_prog 7 hours ago ago

        Also the AI-generated images that can't get the fingers right. Have you ever tried to look at your hands while lucid dreaming and count your fingers? There are some really interesting parallels between dreams and diffusion models.

        • dartos 6 hours ago ago

          Of course, due to the very nature of dreams, your awareness of diffusion models and their output colors how you perceive even past dreams.

          Our brains love retroactively altering fuzzy memories.

          • Jackson__ 4 hours ago ago

            This may be a joke, but counting your fingers to lucid dream has been a thing for a lot longer than diffusion models.

            That being said, your reality will influence your dreams if you're exposed to some things enough. I used to play Minecraft on a really bad PC back in the day, and in my lucid dreams I would encounter the same slow chunk loading as I saw in the game.

          • hombre_fatal 4 hours ago ago

            On the other hand, psychedelics give you perceptions similar even to early DeepDream genAI images.

            On LSD, I was swimming in my friend’s pool (for six hours…) amazed at all the patterns on his pool tiles underwater. I couldn’t get enough. Every tile had a different sophisticated pattern.

            The next day I went back to his place (sober) and commented on how cool his pool tiles were. He had nfi what I was talking about.

            I walk out to the pool and sure enough it’s just a grid of small featureless white tiles. Upon closer inspection they have a slight grain to them. I guess my brain was connecting the dots on the grain and creating patterns.

            It was quite a trip to be so wrong about reality.

            Not really related to your claim I guess but I haven’t thought of this story in 10 years and don’t want to delete it.

    • siavosh 2 hours ago ago

      What's amazing is that if you really start paying attention, it seems like the mind is often doing the same thing when you're awake: less noticeable in your visual field, but more noticeable with attention and the thoughts themselves.

      • smusamashah an hour ago ago

        This is a very interesting thought. I never thought of the mind doing anything like that in the waking state. I know I will now be thinking about this idea every time I recall those dreams.

    • jvanderbot 2 hours ago ago

      This is why I'm excited, in a limited way. Clearly something is disconnected in the dream state, and there's an analogous disconnect here.

      I think these models lack a world model, something with strong spatial reasoning and continuity expectations that animals have.

      Of course that's probably learned too.

    • thegabriele 6 hours ago ago

      We are unconsciously (pun intended) implementing how brains work in both dream and waking states. Can't wait until we add some kind of (lossless) memory to these models.

      • hackernewds 6 hours ago ago

        Any evidence to back this lofty claim?

      • soraki_soladead 6 hours ago ago

        We have lossless memory for models today. That's the training data. You could consider this the offline version of a replay buffer which is also typically lossless.

        The online, continuous, and lossy version of this problem is more like how our memory works, and is still largely unsolved.

    • earnesti 6 hours ago ago

      That looks way too much like the one time I did DMT-5

  • francoisfleuret 11 hours ago ago

    This is a 300M-parameter model (1/1300th of the big Llama 3) trained on 5M frames over 12 days on an RTX 4090.

    This is what a big tech company was doing in 2015.

    The same stuff at industrial scale à la large LLMs would be absolutely mind blowing.

    • gjulianm 8 hours ago ago

      What exactly would be the benefit of that? We already have Counter-Strike running far more smoothly than this, without wasting tons of compute.

      • ben_w 8 hours ago ago

        As with diffusion models in general, the point isn't the specific example but that it's generalisable.

        5 million frames of video data with corresponding accelerometer data, and you could get this with genuine photorealism.

        • gjulianm 4 hours ago ago

          Generalisable how? The model completely hallucinates on invalid input, it's not even high quality, and it required CSGO to work. What's the output you expect from this, and what alternatives are there?

          • ben_w 2 hours ago ago

            It did not require CSGO, that was simply one of their examples. The very first video in the link shows a bunch of classic Atari games, and even the video which is showing CSGO is captioned "DIAMOND's diffusion world model can also be trained to simulate 3D environments, such as CounterStrike: Global Offensive (CSGO)" — I draw your attention to "such as" being used rather than "only".

            And I thought I was fairly explicit about video data, but just in case that's ambiguous: the stuff you record with your phone camera set to video mode, synchronised with the accelerometer data instead of player keyboard inputs.

            As for output, with the model as it currently stands, I'd expect training on a 24h video at 60fps to yield something "photorealistic and with similar weird hallucinations". Which is still interesting, even without combining this with a ControlNet the way Stable Diffusion can.

      • stale2002 4 hours ago ago

        To answer your question directly, the benefit is that we could make something different from Counter-Strike.

        You see, there are these things called "proofs of concept" that are not meant to be a product, but instead to show off capabilities.

        Counter-Strike is an example, meant to show off complex capabilities. It is not meant to suggest that the useful thing about these models is to literally recreate Counter-Strike.

        • gjulianm 4 hours ago ago

          Which capabilities are being shown off here? The ability to take an already existing world model and spend lots of compute to get a worse, less correct model?

          • stale2002 2 hours ago ago

            The capability to have mostly working, real-time generation of images that represent a world model.

            If that capability is possible, then it could be possible to take 100 examples of separate world models that exist, and then combine those world models together in interesting ways.

            Combining world models is an obvious next step (i.e., not shown off in this proof of concept, but a logical/plausible future capability).

            Having multiple world models combined in new and interesting ways is almost like creating an entirely new world model, even though that's not exactly the same.

      • eproxus 8 hours ago ago

        But please, think of the shareholders!

      • nuz 7 hours ago ago

        "What would be the point of creating a shooter set in the middle east? We already have pong and donkey kong"

    • GaggiX 10 hours ago ago

      If 12 days with an RTX 4090 is all you need, some random people on the Internet will soon start training their own.

    • cs702 4 hours ago ago

      Came here to say pretty much the same thing, and saw your comment.

      The rate of progress has been mind-blowing indeed.

      We sure live in interesting times!

    • Sardtok 9 hours ago ago

      Two 4090s, but yeah.

      • Sardtok 9 hours ago ago

        Never mind, the repo on GitHub says 12 days on a 4090, so I'm unsure why the title here says two.

  • marcyb5st 9 hours ago ago

    So, this is pretty exciting.

    I can see how this could already be used to generate realistic physics approximations in a game engine. You create a bunch of snippets of gameplay using a much heavier, more realistic physics engine (perhaps even CGI). The model learns to approximate the physics and, boom, now you have a lightweight physics engine. Perhaps you can even have several that are specialized (e.g. one for smoke dynamics, one for explosions, ...). Even if it hallucinates, it wouldn't be worse than the physics bugs that are so common in games.
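
    A minimal sketch of what that distillation setup could look like, assuming PyTorch; the dimensions and the `expensive_sim_batches` generator are hypothetical stand-ins for (state, action, next state) triples logged from the heavyweight engine:

        import torch
        import torch.nn as nn

        class PhysicsSurrogate(nn.Module):
            def __init__(self, state_dim: int, action_dim: int):
                super().__init__()
                self.net = nn.Sequential(
                    nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
                    nn.Linear(256, 256), nn.ReLU(),
                    nn.Linear(256, state_dim),  # predicts the next state
                )

            def forward(self, state, action):
                return self.net(torch.cat([state, action], dim=-1))

        def expensive_sim_batches(n=100, batch=64):
            # Hypothetical stand-in: in practice, load (state, action,
            # next_state) triples recorded from the heavyweight engine.
            for _ in range(n):
                s, a = torch.randn(batch, 12), torch.randn(batch, 4)
                yield s, a, s + 0.1 * a.sum(dim=-1, keepdim=True)

        model = PhysicsSurrogate(state_dim=12, action_dim=4)
        opt = torch.optim.Adam(model.parameters(), lr=1e-3)
        for s, a, s_next in expensive_sim_batches():
            loss = nn.functional.mse_loss(model(s, a), s_next)
            opt.zero_grad(); loss.backward(); opt.step()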

    • monsieurbanana 9 hours ago ago

      > Even if it hallucinates, it wouldn't be worse than the physics bugs that are so common in games.

      I don't know about that. Physics bugs are common, but you can prioritize and fix the worst (gamebreaking) ones. If you have a blackbox model, it becomes much harder to do that.

    • bobsomers 4 hours ago ago

      What makes you think the network inference is less expensive? Newtonian physics is already extremely well understood and pretty efficient to compute.

      How would a "function approximation" of Newtonian physics, with billions of parameters, be cheaper to compute?

      It seems like this would both be more expensive and less correct than a proper physics simulation.

    • twic 8 hours ago ago

      Do you think that inference on a thirteen million parameter neural network is more lightweight than running a conventional physics engine?

      • procgen 6 hours ago ago

        Convincing liquid physics (e.g. surf interacting with a beach, rocks, the player character) might be a good candidate.

      • tiagod 5 hours ago ago

        In some cases, the model will be lighter. There is no need for 14M parameters for physics simulations, and there's a lot of promising work in that area.

      • epolanski 8 hours ago ago

        Every software that can be implemented in a JavaScript, ehm, LLM, will eventually be implemented in an LLM.

        • kendalf89 6 hours ago ago

          Are you predicting node.llm right now?

    • amelius 5 hours ago ago

      > boom, now you have a lightweight physics engine

      lightweight, but producing several hundred watts of heat.

    • crazygringo 4 hours ago ago

      Yeah, I definitely wouldn't trust it to replace basic physics of running, jumping, bullets, objects shattering, etc.

      But it seems extremely promising for fiery explosions, smoke, and especially water. Anything whose dynamics are inherently complex.

      Also for lighting -- both to get things like skin right with subsurface scattering, as well as global ray-traced lighting.

      You can train specific lightweight models for these things, and the important thing is that their output is correct at the macro level. E.g., a tree should be casting a shadow that looks like the right shadow at the right angle for that type of tree and its types of leaves and general shape. Nobody cares if each individual leaf shadow corresponds to an individual leaf 10 feet above or is just hallucinated.

    • Thorrez 8 hours ago ago

      Would that work for multiplayer? If it's a visual effect only, I guess it would be ok. But if it affects gameplay, wouldn't different players get different results?

      • killerstorm 8 hours ago ago

        Well, it doesn't make sense to use this exact model - this is just a demonstration that it can learn a world model from pixels.

        An obvious next step towards a more playable game is to add a state vector to the inputs of the model: it is easier to learn to render the world from pixels + state vectors than from pixels alone.

        Then it depends what we want to do. If we want normal Counter-Strike gameplay but with new graphics, we can keep the existing CS game server and train only the rendering part.

        If you want to make Dream-Counter-Strike, where the rules are more bendable, then you might want to train a state-update model as well...
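
        A toy sketch of that conditioning idea (the layer sizes and the fusion scheme are my own, not anything from the paper): the renderer gets an explicit game-state vector alongside the previous frame.

            import torch
            import torch.nn as nn

            class StateConditionedRenderer(nn.Module):
                def __init__(self, state_dim: int = 64):
                    super().__init__()
                    self.encoder = nn.Conv2d(3, 32, kernel_size=3, padding=1)
                    self.state_proj = nn.Linear(state_dim, 32)  # broadcast over H x W
                    self.decoder = nn.Conv2d(32, 3, kernel_size=3, padding=1)

                def forward(self, prev_frame, state_vec):
                    h = self.encoder(prev_frame)                      # (B, 32, H, W)
                    s = self.state_proj(state_vec)[:, :, None, None]  # (B, 32, 1, 1)
                    return self.decoder(torch.relu(h + s))            # next frame

            frame = torch.randn(1, 3, 64, 64)  # previous rendered frame
            state = torch.randn(1, 64)         # positions, health, ammo, ...
            next_frame = StateConditionedRenderer()(frame, state)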

    • hobs 9 hours ago ago

      A physics bug would be a consistent problem you can fix. There's no such guarantee about an ML model. This would likely only be ok in the context of a game specifically made to be janky.

      • fullstackwife 9 hours ago ago

        This is one of the fallacies of the current AI research space: they don't focus on the end user much. In this case the end user would be the gamer, and while playing games you expect valid gameplay, so those kinds of hallucinations are not acceptable, even though I'm pretty sure they give the AI research authors a strong dopamine hit. We have a hammer and now we are looking for a nail, when we should ask a question first: what is the problem we are trying to solve here?

        Real world usage will be probably different, and maybe even unexpected by the authors of this research.

        • jsheard 9 hours ago ago

          > This is one of the fallacies of the current AI research space: they don't focus on the end user much. In this case the end user would be the gamer

          Or from another angle the end-user is a game developer trying to actually work with this kind of technology, which is just a nightmarish prospect. Nobody in the industry is asking for a game engine that runs entirely on vibes and dream logic, gamedev is already chaotic enough when everything is laid out in explicit code and data.

        • stale2002 4 hours ago ago

          > they don't focus on the end-user too much.

          Of course they don't. Stuff like this is a proof of concept.

          If they had a product that worked, they wouldn't be in academia. Instead, they would leave the world of research and create a multi billion dollar company.

          Almost by definition, anything in academia isn't going to be productized, because if it was, then the researchers would just stop researching and make a bunch of money selling the product to consumers.

          Such research is still useful for society, though, as it means that someone else can spend the millions and millions of dollars making a better version and then selling that.

        • badpun 8 hours ago ago

          The whole purpose of academia is literally to nerd out on cool, impractical things, which will occasionally turn out to have some real-life relevance years or decades later. This (hallucinated CS) is still more relevant to the real world than 99% of what happens in academic research.

          • dartos 6 hours ago ago

            Yes to the first part, no to the random “99% useless” number you made up.

            I’m no fan of academia, but it undeniably produces useful and meaningful knowledge regularly.

      • kqr 9 hours ago ago

        This obsession people have with determinism! I'd much rather take a low rate of weird bugs than common consistent ones. I don't believe reproducibility of bugs makes for better gameplay generally.

        • paulryanrogers 8 hours ago ago

          Reproducibility does make bugs more likely to be fixed, or at least fixable.

          Also, games introduce randomness in a controlled way so users don't get frustrated by it appearing in unexpected places. I don't want characters to randomly appear and disappear. It's fine if bullet trajectory varies more randomly as they get further away.

          • skydhash 6 hours ago ago

            Also most engines have been worked on for years. So more often than not, core elements like audio, physics, input,... are very stable and the remaining bugs are either "can't fix" or "won't fix".

        • NotMichaelBay 8 hours ago ago

          It might be fine for casual players, but it would prevent serious and pro players from getting into the game. In Counter-Strike, for example, pro players (and other serious players) practice specific grenade throws so they can use them reliably in matches.

          • kqr 8 hours ago ago

            I'm not saying one can make specifically Counter-Strike on a non-deterministic engine -- that seems like strawmanning my argument.

            People play and enjoy many games with varying levels of randomness as a fundamental component, some even professionally (poker, stock market). This could be made such a game.

            • monsieurbanana 7 hours ago ago

              Either the physics engine matters, in which case you want a deterministic engine as you said, or it doesn't, like in a poker game, and you don't want to spend many resources (manpower, compute cycles) on it.

              Which also means an off-the-shelf deterministic engine.

        • mrob 7 hours ago ago

          The whole hobby of speedrunning relies heavily on exploiting deterministic game bugs.

        • dartos 6 hours ago ago

          You don’t play a lot of games, huh?

          Consistent bugs you can anticipate and play/work around; random ones you can't. Just look at pretty much any speedrunning community for games before 1995.

          Say goodbye to any real competitive scene with random, unfixable, potentially one-off bugs.

        • hobs 6 hours ago ago

          Make a fun game with this as a premise and I will try it, but it just sounds like an annoying concept.

  • croo 11 hours ago ago

    For anyone who actually tried it:

    Does it respect/build some kind of game map in the process, or is it just a bizarre psychedelic dreamwalk experience where you cannot go back to the same place twice and spatial dimensions are just funny? Is the game map finite?

    • InsideOutSanta 10 hours ago ago

      Just looking at the first video, there's a section where structures just suddenly appear in front of the player, so this does not appear to build any kind of map, or have any kind of meaningful awareness of something resembling a game state.

      This is similar to LLM-based RPGs I've played, where you can pick up a sword and put it in your empty bag, and then pull out a loaf of bread and eat it.

      • anal_reactor 9 hours ago ago

        > you can pick up a sword and put it in your empty bag, and then pull out a loaf of bread and eat it

        Mondays

    • aidos 11 hours ago ago

      Just skimmed the article but my guess is that it’s a dream type experience where if you turned around 180 and walked the other direction it wouldn’t correspond to where you just came from. More like an infinite map.

      • lopuhin 9 hours ago ago

        I don't think so; what they show in the CS video is exactly the Dust2 map, not just something similar/inspired by it.

        • twic 8 hours ago ago

          It's trained on moving around dust2, so as long as the previous frame was a view of dust2, the next frame is very likely to be a plausible subsequent view of dust2. In some sense, this encodes a map; but it's not what most people think of when they think about maps.

          I'd be interested to see what happens if you look down at your feet for a while, then back up. If the ground looks the same everywhere, do you come up in a random place?

        • arendtio 7 hours ago ago

          It probably depends on what you see. As long as you have a broad view over a part of the map, you should stay in that region, but I guess that if you look at a mono-color wall, you probably find yourself in a very different part of the map when you look around yourself again.

          But I am just guessing, and I haven't tried it yet.

    • delusional 11 hours ago ago

      Just tried it out, and no. It doesn't have any sort of "map" awareness. It's very much in the "recall/replay" category of "AI", where it seems to accurately recall stuff that is part of the training dataset, but as soon as you do something not in there (like walk into a wall), it completely freaks out and spits out gibberish. Plausible gibberish, but gibberish nonetheless.

      • neongreen 11 hours ago ago

        Can you upload a screen recording? I don’t think I can run the model locally but it’d be super interesting to see what happens if you run into a wall

      • kqr 9 hours ago ago

        This should mainly be a matter of giving it more training though, right? It sounds like the amount of training it's gotten is relatively sparse.

        • treyd 8 hours ago ago

          It doesn't have any ability to reason about what you did more than a couple of seconds ago. Its memory is what's currently on the screen and what the user's last few inputs were.

        • delusional 4 hours ago ago

          Theoretically. In practice, that's not clear. As you add more training data you have to ask yourself what the point is. We already have a pretty good simulation of Counter-Strike.

  • cousin_it 11 hours ago ago

    I continue to be puzzled by people who don't notice the "noise of hell" in NN pictures and videos. To me it's always recognizable and terrifying, has been from the start.

    • npteljes 10 hours ago ago

      What do you mean by noise of hell in particular? I do notice that the images are almost always uncanny in a way, but maybe we're not meaning the same thing. Could you elaborate on what you experience?

    • taneq 10 hours ago ago

      Like a subtle but unsettling babble/hubbub/cacophony? If so then I think I kind of know what you mean.

      • TechDebtDevin 5 hours ago ago

        There's definitely a bit of an uncanny valley in the land of top-tier diffusion models. A generative video of someone smiling is way more likely to elicit this response for me than a generative image or single frame. It definitely has something to do with the movement.

      • cousin_it 4 hours ago ago

        Yes, that's exactly it.

    • HKH2 11 hours ago ago

      Eyes have a lot of noise too.

  • mk_stjames 10 hours ago ago

    This was Schmidhuber's group in 2018:

    https://worldmodels.github.io/

    Just want to point that out.

    • afh1 8 hours ago ago

      Ahead of its time for sure. Dream is an accurate term here, that driving scene does resemble driving in dreams.

  • jmchambers 10 hours ago ago

    I _think_ I understand the basic premise behind stable diffusion, i.e., reversing the noising process to generate realistic images, but, as far as I know, this is always done at the pixel level. Is there any research attempting to do this at the 3D asset level, i.e., subbing in game engine assets (with position and orientation) until a plausible scene is recreated? If it were possible to do it that way, couldn't it "dream" up real maps, with real physics, and so avoid the somewhat noisy output these types of demos generate?

    • desdenova 10 hours ago ago

      I think the closest we have right now is 3D gaussian splatting.

      So far it's only been used to train a scene from photographs from multiple angles and rebuild it volumetrically by adjusting densities in a point-cloud.

      But it might be possible to train a model on multiple different scenes, and perform diffusion on a random point cloud to generate new scenes.

      Rendering a point cloud in real time is also very efficient, so it could be used to create insanely realistic game worlds instead of polygonal geometry.

      It seems someone already thought of that: https://ar5iv.labs.arxiv.org/html/2311.11221

      • jmchambers 9 hours ago ago

        Interesting, I guess that takes things even further and removes the need for hand-crafted 3D assets altogether, which is probably how things will end up going in gaming, long-term.

        I was suggesting a more modest approach, I guess, one where the reverse-denoising process involves picking and placing existing 3D assets, e.g., those in GTA 5, so that the process is actually building a plausible map, using those 3D assets, but on the fly...

        Turn your car right and a plausible street decorated with buildings, trees and people is dreamt up by the algorithm. All the lighting and physics would still be done in-engine, with stable diffusion acting as a dynamic map creator, with an inherent knowledge of how to decorate a street with a plausible mix of assets.

        I suppose it could form the basis of a procedurally generated game world where, given the same random seed, it could generate whole cities or landscapes that would be the same on each player's machine. Just an idea...

        • skydhash 6 hours ago ago

          The thing is, there are generators that can do exactly this; no need to have an LLM as the middleman. Things like terrain generation, city generation, crowd control, and character generation can be done quite easily with far less compute and energy.

      • magicalhippo 6 hours ago ago

        Technically I guess one could do a stable-diffusion-like model except on voxels, where instead of pixel intensity values it produces a scalar field, which you could turn into geometry using marching cubes or something similar.

        Not sure how efficient that would be though, and would only work for assets like teapots and whatnot, not whole game maps say.
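
        A toy sketch of the marching-cubes half of that idea, assuming scikit-image; a noisy sphere SDF stands in for whatever scalar field such a voxel diffusion model might output:

            import numpy as np
            from skimage import measure  # pip install scikit-image

            n = 64
            ax = np.linspace(-1, 1, n)
            x, y, z = np.meshgrid(ax, ax, ax, indexing="ij")
            field = np.sqrt(x**2 + y**2 + z**2) - 0.7    # sphere SDF
            field += 0.05 * np.random.randn(n, n, n)     # fake "model output"

            # Extract the zero-level isosurface as mesh geometry.
            verts, faces, normals, values = measure.marching_cubes(field, level=0.0)
            print(verts.shape, faces.shape)  # mesh vertices and triangles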

    • furyofantares 2 hours ago ago

      > but, as far as I know, this is always done at the pixel level

      Image models are NOT denoised at the pixel level - diffusion happens in latent space. This was one of the big breakthroughs that made all of this work well.

      There's a model for encoding/decoding between pixels and latent space. Latent space is able to encode whatever concepts it needs in whichever of its dimensions it needs, and is generally lower dimensional than pixel space. So we get a noisy latent space, denoise it using the diffusion model, then use the other model (variational autoencoder) to decode into pixel space.
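
      A shape-only sketch of that pipeline; the three modules below are untrained stubs standing in for a real VAE encoder/decoder and a U-Net denoiser, and the update rule is a toy, not an actual sampler:

          import torch
          import torch.nn as nn

          vae_encode = nn.Conv2d(3, 4, kernel_size=8, stride=8)    # pixels -> latents
          vae_decode = nn.ConvTranspose2d(4, 3, kernel_size=8, stride=8)
          denoiser = nn.Conv2d(4, 4, kernel_size=3, padding=1)     # stand-in for a U-Net

          pixels = torch.randn(1, 3, 512, 512)
          z = vae_encode(pixels)          # (1, 4, 64, 64): far fewer values than pixels

          latents = torch.randn_like(z)   # generation starts from pure noise
          for t in range(50):             # iterative denoising in latent space
              latents = latents - 0.02 * denoiser(latents)  # toy update rule
          image = vae_decode(latents)     # back to (1, 3, 512, 512) pixels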

    • jampekka 8 hours ago ago

      Not exactly 3D assets, but diffusion models are used to generate e.g. traffic (vehicle trajectories) for evaluating autonomous vehicle algorithms. These vehicles tend to crash quite a lot.

      For example https://github.com/NVlabs/CTG

      Edit: fixed link

    • tiborsaas 8 hours ago ago

      Generating this at the pixel level is the next-level thing. The reverse-engineering method you described is probably appealing because it's easier to understand.

      Focusing on pixel-level generation is the right approach, I think. The somewhat noisy output will probably be improved upon in a short timeframe. Now that they proved with Doom (https://gamengen.github.io/) and this that it's possible, more research is probably happening right now to nail the correct architecture to scale this to HD with minimal hallucination. It happened with video already, so we should see a similar level of breakthrough soon.

    • gliptic 10 hours ago ago

      > I _think_ I understand the basic premise behind stable diffusion, i.e., reverse the denoising process to generate realistic images but, as far as I know, this is always done at the pixel level.

      It's typically not done at the pixel level, but at the "latent space" level of e.g. a VAE. The image generation is done in this space, which has fewer outputs than the pixels of the final image, and then converted to the pixels using the VAE.

      • jmchambers 9 hours ago ago

        Frantically Googles VAE...

        Ah, okay, so the work is done at a different level of abstraction, didn't know that. But I guess it's still a pixel-related abstraction, and it is converted back to pixels to generate the final image?

        I suppose in my proposed (and probably implausible) algorithm, that different level of abstraction might be loosely analogous to collections of related game engine assets that are often used together, so that the denoising algorithm might be effectively saying things like "we'll put some building-related assets here-ish, and some park-related flora assets over here...", and then that gets crystallised into actual placement of individual assets in the post-processing step.

        • StevenWaterman 7 hours ago ago

          (High level, specifics are definitely wrong here)

          The VAE isn't really pixel-level, it's semantic-level. The most significant bits in the encoding are like "how light or dark is the image" and then towards the other end bits represent more niche things like "if it's an image of a person, make them wear glasses". This is way more efficient than using raw pixels because it's so heavily compressed, there's less data. This was one of the big breakthroughs of stable diffusion compared to previous efforts like disco diffusion that work on the pixel level.

          The VAE encodes and decodes images automatically. It's not something that's written, it's trained to understand the semantics of the images in the same way other neural nets are.

  • DrSiemer 11 hours ago ago

    Where it gets really interesting is if we can train a model on the latest GTA, plus maybe related real life footage, and then use it to live upgrade the visuals of an old game like Vice City.

    The lack of temporal consistency will still make it feel pretty dreamlike, but it won't matter that much, because the base is consistent and it will look amazing.

    • InsideOutSanta 9 hours ago ago

      Just redrawing images drawn by an existing game engine works, and generates amazing results, although like you point out, temporal consistency is not great. It might interpret the low-res green pixels on a far-away mountain as fruit trees in one frame, and as pines in the next.

      Here's a demo from 2021 doing something like that: https://www.youtube.com/watch?v=3rYosbwXm1w

    • davedx 10 hours ago ago

      A game like GTA has way too much functionality and complex branching for this to work, I think (beyond e.g. doing aimless drives around the city, which would be very cool though)

      • DrSiemer 4 hours ago ago

        GTA 5 has everything Vice City has and more. In the Doom AI dream it's possible to shoot people. Maybe in this CS model as well?

        I think the model does not have to know anything about the functionality. It can just dream up what is most probable to happen based on the training data.

    • taneq 31 minutes ago ago

      Using it as a visual upgrade is pretty close to what DLSS does so that sounds plausible.

    • sorenjan 9 hours ago ago

      In addition to the sibling comment's older example there's new work done with GTA too.

      https://www.reddit.com/r/aivideo/comments/1fx6zdr/gta_iv_wit...

      • DrSiemer 4 hours ago ago

        Cool! Looks fairly consistent as well.

        I wonder if this type of AI upscaling could eventually also fix things like slightly janky animations, but I guess that would be pretty hard without predetermined input and some form of look ahead.

        Limiting character motion to only allow correct, natural movement would introduce a strange kind of input lag.

    • skydhash 6 hours ago ago

      Why not just create the assets at a higher resolution?

      • DrSiemer 4 hours ago ago

        Because that is a lot more work, will only work for a single game, potentially requires more resources to run and will not get you the same level of realism.

  • ilaksh 7 hours ago ago

    I wonder if there is some way to combine this with a language model, or somehow have the language model in the same latent space or something.

    Is that what vision-language models already do? Somehow all of the language should be grounded in the world model. For models like Gemini that can answer questions about video, it must have some level of this grounding already.

    I don't understand how this stuff works, but compressing everything to one dimension as in a language model for processing seems inefficient. The reason our language is serial is that we can only make one sound at a time.

    But suppose the "game" trained on was a structural engineering tool. The user asks about some scenario for a structure and somehow that language is converted to an input visualization of the "game state". Maybe some constraints to be solved for are encoded also somehow as part of that initial state.

    Then when it's solved (by an agent trained through reinforcement learning that uses each dreamed game state as input?), the resulting "game state" is converted somehow back into language and combined with the original user query to provide an answer.

    But if I understand properly, the biggest utility of this is that there is a network that understands how the world works, and that part of the network can be utilized for predicting useful actions or maybe answering questions, etc.?

    • LarsDu88 6 hours ago ago

      To combine this with a language model, simply replace the action vector with a language-model latent.

      Alternatively, as of last year there are now purely diffusion-based text decoder models.

  • mungoman2 11 hours ago ago

    This is getting ridiculous!

    Curious: since this is a tight loop of old frame + input -> new frame, what happens if a non-CS image is used to start it off? Or a map the model has never seen? Will the model play ball, or will it drift back to known CS maps?
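
    For reference, the loop in question is roughly the following, sketched with a stub `world_model`; the real model conditions on several recent frames and actions and runs a full diffusion sampler per frame:

        import torch

        def world_model(recent_frames, action):
            # Stub: a trained model would denoise the next frame conditioned
            # on the recent frames and the player input.
            return recent_frames[-1] + 0.01 * torch.randn_like(recent_frames[-1])

        frames = [torch.zeros(3, 64, 64)]  # the seed image: CS or otherwise
        for t in range(100):
            action = torch.zeros(8)        # keyboard/mouse input at step t
            frames.append(world_model(frames[-4:], action))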

    • Arch-TK 11 hours ago ago

      Looks like it only knows Dust 2 since every single "dream" (I'm going to call them that since looking at this stuff feels like dreaming about Dust 2) is of that map only.

  • fancyfredbot 11 hours ago ago

    Strangely the paper doesn't seem to give much detail on the CSGO example. Actually, the paper explicitly mentions it's limited to discrete control environments. Unless I'm missing something, the mouse input for Counter-Strike isn't discrete and wouldn't work.

    I'm not sure why the title says it was trained on 2x4090 either, as I can't see this in either the linked page or the paper. The paper mentions a GPU-year of 4090 compute was used to train the Atari model.

    • c1b 11 hours ago ago

      The CSGO model is only 1.5 GB & training took 12 days on a 4090

      https://github.com/eloialonso/diamond/tree/csgo?tab=readme-o...

      • fancyfredbot 11 hours ago ago

        Thanks, that's the detail I was looking for on the training. It's amazing that results like this can be achieved at such low cost! I thought this kind of work was out of reach for the GPU-poor.

        The part about the continuous control still seems weird to me though. If anyone understands that, I'd be very interested to hear more.

  • shahzaibmushtaq 9 hours ago ago

    As someone who used to play CS 1.6 and CS:GO in my free time before the pandemic, I can tell this playable CS diffusion world model was trained by a noob player for research purposes.

    After reading the comments I can assume that if you play outside of the scope it was trained on, the game loses its functionality.

    Nevertheless, R&D for a good cause is something we all admire.

    • crossroadsguy 9 hours ago ago

      How is the latest version, CS 2.0 (I think)? It's been free to play like GO, I guess. Is it like GO, where the physics felt too dramatised (could just be my opinion)? Or realistic in a snappy way, like 1.6?

      • shahzaibmushtaq 7 hours ago ago

        Honestly, I first heard about CS 2.0 from you. And you're right in what you said about GO.

  • Zealotux 8 hours ago ago

    Could we imagine parts of games becoming "targets" for models? For example, hair and fur physics have been notoriously difficult to nail, but it should be easier to use AI to simulate some fake physics on top of the rendered frame, right? Is anyone working on that?

  • ThouYS 11 hours ago ago

    I don't really understand the intuition for why this helps RL. The original game has a lot more detail; why can't it be used directly?

    • jampekka 11 hours ago ago

      It is used as a predictive model of the environment for model-based RL. I.e. agents can predict consequences of their actions.
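
      A hedged sketch of what "predicting consequences" can look like in code: a random-shooting planner scores candidate actions by rolling them out in the learned model instead of the real environment. The `dynamics` and `reward` functions here are stubs standing in for trained networks:

          import torch

          def dynamics(state, action):   # learned next-state predictor (stub)
              return state + 0.1 * action

          def reward(state):             # learned reward/value head (stub)
              return -state.abs().sum().item()

          def plan(state, candidates, horizon=5):
              best, best_ret = None, float("-inf")
              for action in candidates:
                  s, ret = state, 0.0
                  for _ in range(horizon):   # imagined rollout, no env steps
                      s = dynamics(s, action)
                      ret += reward(s)
                  if ret > best_ret:
                      best, best_ret = action, ret
              return best

          state = torch.randn(4)
          action = plan(state, [torch.randn(4) for _ in range(16)])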

      • ThouYS 11 hours ago ago

        Oh, I see. I was somehow under the impression that the simulation was the game the RL agent learns to play (which kinda seemed nonsensical).

    • visarga 10 hours ago ago

      It can use the game directly, but if you try this with real-life robots, then it is better to do a neural simulation before performing an action that could result in injury or damage. We don't need to drive our cars off the road many times to learn to stay on the road, because we can imagine the consequences. Same thing here.

    • FeepingCreature 11 hours ago ago

      In the real world, you can't just boot up a copy of reality to play out strategies. You need an internal model.

      • tourmalinetaco 11 hours ago ago

        So, effectively, these video game models are proofs of concept to say "we can make models with extremely accurate predictions using minimal resources"?

        • usrusr 8 hours ago ago

          Not sure where you see the "minimal resources" here? But I'd counter all questions about "why" with the blanket response of "for understanding natural intelligence": the way biology innovates is that it throws everything against the wall, and rather than picking the one thing that sticks as the winner and focusing on that mechanism, it keeps the sticky bits, and everything else too, as long as the cost isn't prohibitive. Symbolic modeling ("this is an object that can fall down"), prediction chains based on visual similarity patterns (this), hardwired reflexes (we tend not to trust anything that looks and moves like a spider or snake), and who knows what else: it's all there, it all runs in parallel, invited or not, and they all influence each other in subtle and less subtle ways.

          The interaction is not engineered; it's more like crosstalk that's allowed to happen and has more upside than downside, or else evolution would have preferred variations of the setup with less of that kind of crosstalk. But in our quest to understand ourselves, it's super exciting to see candidates for processes that perhaps play some role in our minds, in isolation, no matter whether that role is big or small.

        • vbezhenar 9 hours ago ago

          Maybe I'm wrong, but my understanding is that you can film some area using, say, dashcams and then generate this kind of neural model. Then you can train a robot to walk in this area with this neural model. It can perform billions of training sessions without touching the physical world. Alternatively you can somehow perform a 3D scan of the area, recreate its 3D model and use, say, a game engine to simulate it, but that probably requires more effort and isn't necessarily better.

          • usrusr 8 hours ago ago

            And the leg motions we sometimes see in sleeping dogs suggest that this is very much one way in which having dreams is useful!

  • thenthenthen 11 hours ago ago

    When my game starts to look like this, I know it is time to quit haha. Maybe a helpful tool in gaming addiction therapy? The morphing of the gun/skins and the environment (the sandbags), wow. I'd like to play this and see what happens when you walk backwards, turn around quickly, or use 'noclip' :D

  • LarsDu88 6 hours ago ago

    Iterative denoising diffusion is such a hurdle for getting this sort of thing running at reasonable fps

  • advael 11 hours ago ago

    Dang this is the first paper I've seen in a while that makes me think I need new GPUs

  • akomtu 2 hours ago ago

    The current batch of ML models looks a lot like filling in holes in a wall of text, drawings or movies: you erase a part of the wall and tell it to fix it. It fills in the hole using colors from the nearby walls in the kitchen and from similar walls, and we watch this in awe, thinking it must've figured out the design rules of the kitchen. However, what it's really done is interpolate the gaps with some sort of basis functions, trigonometric polynomials for example, and it used thousands of those. This solution wouldn't occur to us because our limited memory isn't enough for thousands of polynomials: we have to find a compact set of rules or give up entirely. So when these ML models predict the motion of planets, they approximate Newton's laws with a long series of basis functions.
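
    To make the analogy concrete, here is a toy version of that interpolation, assuming NumPy; the target function and all constants are arbitrary:

        import numpy as np

        x = np.linspace(0, 1, 200)
        y = np.sign(np.sin(7 * x)) + 0.1 * np.random.randn(x.size)  # "observations"

        K = 1000  # thousands of basis functions instead of one compact rule
        A = np.column_stack(
            [np.sin(2 * np.pi * k * x) for k in range(1, K + 1)]
            + [np.cos(2 * np.pi * k * x) for k in range(K + 1)]
        )
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)  # min-norm least squares
        y_hat = A @ coef  # fits the samples well, says nothing about "why"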

  • w-m 11 hours ago ago

    If you're not bored with it yet, here's a Deep Dive (NotebookLM, generated podcast). I fed it the project page, the arXiv paper, the GitHub page, and the two twitter threads by the authors.

    https://notebooklm.google.com/notebook/a240cb12-8ca1-41b4-ab... (7m59s)

    As always, it's not actually much of a technical deep dive, but gives a quite decent overview of the pieces involved, and its applications.

    • thierrydamiba 10 hours ago ago

      How did you get the output to be so long? My podcasts are 3 mins max…

      • w-m 4 hours ago ago

        Oh wow, really? Even if you feed it whole research papers? The ones I've tried until now were more in the 8-10 minute range. I haven't looked into how to control the output yet. Hopefully that'll get a little more transparent and controllable soon.

  • delusional 11 hours ago ago

    I just checked it out real quick. It works perfectly well on an AMD card with ROCm PyTorch.

    It seems decent in short bursts. As it goes on it quite quickly loses detail and the weapon has a tendency to devolve into colorful garbage. I would also like to point out that none of the videos show what happens when you walk into a wall. It doesn't handle it very gracefully.

  • gadders 8 hours ago ago

    Cool achievement, but I want AI to give me smarter NPCs, not simulate the map.

    • thelastparadise 8 hours ago ago

      The NPCs need a model of the world in their brain in order to act normal.

  • styfle 8 hours ago ago

    But does it work on macOS?

    (The latest CS removed support for macOS)

  • iwontberude 9 hours ago ago

    This is crazy looking, I know it’s basically useless but it’s cool anyways.

  • mixtureoftakes 10 hours ago ago

    This is crazy.

    When trying to run it on a Mac it only plays in a very small window; how can this be configured?

  • 6510 11 hours ago ago

    Can it use a seed that makes the same map every time?

  • madaxe_again 11 hours ago ago

    I earnestly think this is where all gaming will go in the next five years - it’s going to be so compelling that stuff already under development will likely see a shift to using diffusion models. As this is demonstrating, a sufficiently honed model can produce realtime graphics - and some of the demos floating around where people are running GTA San Andreas through non-realtime models hint as to where this will go.

    I give it the same five years before there are games entirely indistinguishable from reality, and I don’t just mean graphical fidelity - there’s no reason that the same or another model couldn’t provide limitless physics - bust a hole through that wall, set fire to this refrigerator, whatever.

    • qayxc 11 hours ago ago

      I think you're missing the most important point: these models need to be trained on something and that something is a fully developed, working game.

      You're basically saying that game development would need to do the work twice: step 1: develop a fully functional game, step 2: spend ridiculous effort (in terms of time and compute) on training a model to emulate the game in a half-baked fashion.

      It's a solution looking for a problem.

      • manmal 11 hours ago ago

        The world model can still be rendered in very low res, and then the diffusion skin/remaster is applied.

        And this would also be an exciting route to remastering old games. I'd pay a lot to play NFS Porsche again, with photorealism. Or imagine Command & Conquer: Red Alert, "rendered" with such a model.

        • qayxc 11 hours ago ago

          NVIDIA's RTX Remix [1] suite of tools already does that. It doesn't require any model training or dozens of hours of pre-recorded gameplay either.

          You can drop in low-res textures and have AI tools upscale them. Models can be replaced, as well as lighting, and the best part: it's all under your control. You're not at the mercy of obscure training material that might or might not result in a consistent look and feel. More knobs, more control, less compute required.

          [1] https://www.nvidia.com/en-us/geforce/rtx-remix/

          • manmal 10 hours ago ago

            TIL, thanks for posting. The workflow I was sketching out is simpler though: render a legacy game or low-fidelity modern game as-is, and run it through a diffusion model in real time.

      • FeepingCreature 11 hours ago ago

        You can crosstrain on reality.

    • casenmgreen 11 hours ago ago

      Not a chance.

      There are fundamental limitations with what are, in the end, all essentially neural nets; there is no understanding, only prediction. Prediction alone is not enough to emulate reality, which is why, for example, genuinely self-driving cars have not, and will not, emerge. A fundamental advance in AI technology will be required for that, something which leads to genuine intelligence, and we are no closer to that than we ever were.

      • fancyfredbot 11 hours ago ago

        Looking at the examples of Atari 2600 games in the paper, I'm not sure you can tell that they are just predictions.

        Have you considered how you'd tell the difference between a prediction and understanding in practice?

      • francoisfleuret 11 hours ago ago

        "there is no understanding, only prediction"

        I have no idea what this means.

        • nonrandomstring 10 hours ago ago

          > > "there is no understanding, only prediction"

          > I have no idea what this means.

          You can throw a ball up in the air and predict that it will fall again and bounce. You have no understanding of mass, gravity, acceleration, momentum, impulse, elasticity...

          You can press a button that makes an Uber car appear in reality and take you home. You have no understanding of apps, operating systems, radio, internet, roads, wheels, internal combustion engines, driving, GPS, maps...

          This confusion of understanding and prediction affects a lot of people who use technology in a "machine-like" way, purely instrumental and utilitarian... "how does this get me what I want immediately?"

          You can take any complex reality and deflate it, abstract it, reduce it down to a mere set of predictions that preserve all the utility for a narrow task (in this case visual facsimile) but strip away all depth of meaning. The models, of both the system and the internal working model of the user are flattened. In this sense "AI" is probably the greatest assault on actual knowledge since the book burning under totalitarian regimes of the mid 20th century.

          • binary132 10 hours ago ago

            I think GP is saying that understanding is measured by the predictive capability of the theory

            and in case you hadn’t noticed, that kind of uncomprehending slopthink has been going on for a lot longer than the AI fad

          • GaggiX 10 hours ago ago

            What if the model actually understands that the ball will fall and bounce because of mass, gravity, acceleration, momentum, impulse, elasticity? I mean you can just ask ChatGPT and Claude, I guess you would answer that in this case it's just prediction, but if they were human then it would be understanding.

            • nonrandomstring 10 hours ago ago

              > I guess you would answer that in this case it's just prediction,

              No I would answer that it is indeed understanding, to upend your "guess" (prediction) and so prove that while you think you can "predict" the next answer you lack understanding of what the argument is really about :)

              • GaggiX 9 hours ago ago

                I think I understand the topic quite well, since you deliberately avoided answering the question. You gave a practical example that doesn't really work in practice.

        • tourmalinetaco 11 hours ago ago

          The ML model has no idea what it's making, where you are in the map, what you left behind, and what you picked up. It can accurately predict what comes next, but if you pick up an item and do a 360° turn, the item will be back, and you can repeat the process.

        • GaggiX 10 hours ago ago

          When a human does it, it's understanding; when an AI does it, it's prediction. I think it's very clear /s

          • therouwboat 10 hours ago ago

            Does what? In a normal game world, things tend to stay where they are without the player having to do anything.

            • GaggiX 9 hours ago ago

              We are talking about neural networks in general, not this one or that one. If you train a bad model, or the model is untrained, it indeed would not understand much of anything.

      • killerstorm 10 hours ago ago

        That's bs. You have no understanding of understanding.

        Hooke's law was pure curve-fitting. Hooke definitely did not understand the "why". And yet we don't consider that bad physics.

        Newton's laws can be derived from curve fitting. How is that different from "understanding"?

        • madaxe_again 10 hours ago ago

          Einstein couldn’t even explain why general relativity occurred. Sure, spacetime is curved by mass, but why? What a loser.

          • killerstorm 8 hours ago ago

            It's very illustrative to look into the history of discovery of laws of motion, as it's quite well documented.

            People have an intuitive understanding of motion - we see it literally every day, we throw objects, etc.

            And yet it took literally thousands of years after the discovery of mathematics (geometry, etc.) to formulate a concept of force, momentum, etc.

            Ancient Greek mathematicians could do integration, so they were not lacking mathematical sophistication. And yet their understanding of motion was so primitive:

            Aristotle, an extremely smart man, was muttering something about "violent" and "natural" motion: https://en.wikipedia.org/wiki/Newton%27s_laws_of_motion#Anti...

            People started to understand the conservation of the quantity of motion only in the 17th century.

            So we have two possibilities:

            * everyone until 17th century was dumb af (despite being able to do quite impressive calculations)

            * scientific discovery is really a heuristic-driven search process where people try various things until they find a good fit

            I.e. millions of people were somehow failing to understand motion for literally thousands of years until they collected enough assertions about motion that they were able to formulate the rule of conservation, test it, and confirm it fits. And only then it became understanding.

            You can literally see conservation of momentum on a billiard table: you "violently" hit one ball, it hits other balls and they start to move, but slower, etc. So you really transfer something from one ball to the rest. And yet people could not see it for thousands of years.

            What this shows is that there's nothing fundamental about understanding: it's just a sense of familiarity, it is a sense that your model fits well. Under the hood it's all prediction and curve fitting.

            We literally have prediction hardware in our brains: the cerebellum has specialized cells which can predict, e.g., motion. So people with a damaged cerebellum have impaired movement: they can still move, but their movements are not precise. When do you think we'll find specialized understanding cells in the human brain?

            • mrob 7 hours ago ago

              It seems to me that your evidence supports the exact opposite of your conclusion. Familiarity was only enough to find ad-hoc heuristics for specific situations. It let us discover intuitive methods to throw stones, drive carts, play ball games, etc. but never discovered the general principle behind them. A skilled archer does not automatically know that the same rules can be used to aim a mortar.

              Ad-hoc heuristics are not the same thing as understanding. It took formal reasoning for humans to actually understand motion, of a type that modern AI does not use. There is something fundamental about understanding that no amount of familiarity can substitute for. Modern AI can gain enormous amounts of familiarity but still fail to understand, e.g. this Counter-Strike simulator not knowing what happens when the player walks into a wall.

              • killerstorm 6 hours ago ago

                People found that `m * v` is the quantity which is conserved.

                There's no understanding. It's just a formula which matches the observations. It also matches our intuition (a heavier object is hard to move, etc), and you feel this connection as understanding.

                Centuries later people found that conservation laws are linked to symmetries. But again, it's not some fundamental truth, it's just a link between two concepts.

                An LLM can link two concepts too. So why do you believe that an LLM cannot understand?

                In middle school I did extremely well in physics classes - I could solve complex problems which my classmates couldn't, because I could visualize the physical process (e.g. the motion of an object) and link that to formulas. This means I understood it, right?

                Years later I thought "But what *is* motion, fundamentally?". I grabbed the Landau-Lifshitz mechanics textbook. How do they define motion? Apparently, bodies move in a way that minimizes some integral. They can derive the rest from it. But it doesn't explain what motion is. Some of the best physicists in the world cannot define it.

                So I don't think there's anything to understanding except a feeling of connection between different things. "X is like Y except for Z".

                • mrob 6 hours ago ago

                  Understanding is finding the simplest general solution. Newton's laws are understanding. Catching the ball is not. LLMs take billions of parameters to do anything and don't even generalize well. That's obviously not understanding.

                  • killerstorm 5 hours ago ago

                    You're confusing two meanings of the word "understanding":

                    1. Finding a comprehensive explanation

                    2. Having a comprehensive explanation which is usable

                    99.999% of people on Earth do not discover any new laws, so I don't think you can use #1 as a fundamental deficiency of LLMs.

                    And nobody is saying that just training a LLM produces understanding of new phenomena. That's a strawman.

                    The thesis is that a more powerful LLM, together with more software, more models, etc., can potentially discover something new. That's not been observed yet. But I'd say it would be weird if an LLM could match the capabilities of average folk but never match Newton. It's not like Newton's brain is fundamentally different.

                    Also worth noting that formulas can be discovered by enumeration. E.g. `m * v` should not be particularly hard to discover. And the fact that it took people centuries implies that that's what happened: people tried different formulas until they found one which works. It doesn't have to be some fancy Newton magic.

                    • mrob 5 hours ago ago

                      I'm certain that people did not spend centuries trying different formulas for the laws of motion before finding one that worked. The crucial insight was applying any formula at all. Once you have that then the rest is relatively easy. I don't see LLMs making that kind of discovery.

      • madaxe_again 9 hours ago ago

        Yet we have no understanding, only prediction. We can describe a great many things in detail, how they interact - and we can claim to understand things, yet if you recursively ask “why?” everybody, and I mean everybody, will reach a point where they say “I don’t know” or “god”.

        An incomplete understanding is no understanding at all, and I would argue that we can only predict, and we can certainly emulate reality, otherwise we would not be able to function within it. A toddler can emulate reality, anticipate causality - and they certainly can’t be said to be in possession of a robust grand unified theory.

      • jiggawatts 11 hours ago ago

        For simulations like games, it's a trivial matter to feed the neural game engine pixel-perfect metadata.

        Instead of rendering the final shaded and textured pixels, the engine would output just the material IDs, motion vectors, and similar "meta" data that would normally be the inputs into a real-time shader.

        The AI can use this as inputs to render a photorealistic output. It can be trained using offline-rendered "ground-truth" raytraced scenes. Potentially, video labelled in a similar way could be used to give it a flair of realism.

        This is already what NVIDIA DLSS and similar AI upscaling tech uses. The obvious next step is not just to upscale rendered scenes, but to do the rendering itself.
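
        A toy sketch of that input/output contract (the layers are made up, and a real system would be a U-Net trained against the ray-traced ground truth): per-pixel material IDs and motion vectors in, shaded RGB out.

            import torch
            import torch.nn as nn

            n_materials = 32
            material_embed = nn.Embedding(n_materials, 16)  # per-pixel material feature
            head = nn.Sequential(
                nn.Conv2d(16 + 2, 64, kernel_size=3, padding=1), nn.ReLU(),
                nn.Conv2d(64, 3, kernel_size=3, padding=1),  # RGB out
            )

            ids = torch.randint(0, n_materials, (1, 128, 128))  # material ID G-buffer
            motion = torch.randn(1, 2, 128, 128)                # motion-vector G-buffer
            feats = material_embed(ids).permute(0, 3, 1, 2)     # (1, 16, H, W)
            rgb = head(torch.cat([feats, motion], dim=1))       # (1, 3, H, W)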

    • viraptor 11 hours ago ago

      It's not that great yet.

      Given a model which can generate the game view in ~real time and a model which can generate the models and textures, why would you ever use the first option, apart from a cool tech demo? I'm sure there's space for new dreamy games where invisible space behind you transforms when you turn around, but for other genres... why? Destructible environments have been possible for quite a while, but once you allow that everywhere, you can get games into an unplayable state. They need to be designed around that mechanic to work well: Noita, Worms, Teardown, etc. I don't believe the "limitless physics" would matter after a few minutes.

    • Arch485 11 hours ago ago

      It seems extremely unlikely to me that ML models will ever run entire games. Nobody wants a game that's "entirely indistinguishable from reality" anyways. If they did, they would go outside.

      I think it's possible specific engine components could be ML-driven in the future, like graphics or NPC interactions. This is already happening to a certain degree.

      Now, I don't think it's impossible for an ML model to run an entire game. I just don't think making + running your game in a predictive ML model will ever be more effective than making a game the normal way.

      • jsheard 11 hours ago ago

        Yep, the fuzziness and opaqueness of ML models makes developing an entire game state inside one a non-starter in my opinion. You need precise rules, and you need to be able to iterate on those rules quickly, neither of which are feasible with our current understanding of ML models. Nobody wants a version of CS:GO where fundamental constants like weapon damage run on dream logic.

        If ML has any place in games it's for specific subsystems which don't need absolute precision, NPC behaviour, character animation, refining the output of a renderer, that kind of thing.

    • advael 11 hours ago ago

      I'm not sure that's a warranted assumption based on this result. Exciting as it is, we are still seeing replication of an extant, testable world model rather than extrapolation that can produce novel mechanics that aren't in the training data. I'm not saying this isn't a stepping stone to that; I just think your prediction's a little optimistic given the scope of that problem.

    • TinkersW 11 hours ago ago

      It requires a monster GPU to run at 10 fps at what looks like sub-720p... I think it may be a bit more than 5 years.

  • snickerer 11 hours ago ago

    I see where this is going.

    The next step to create training data is a real human with a bodycam. One would only need to map the real body movements (stepping forward, turning left, etc.) to typical keyboard-and-mouse game control events, to feed them into the model too.

    I think that is what the devs here are dreaming about.

    • CaptainFever 10 hours ago ago

      Or a cockpit cam for the world's most realistic flight simulator. /lighthearted

    • devttyeu 11 hours ago ago

      The "We live in a simulation" argument just started looking a lot more conceivable.

      • tiborsaas 8 hours ago ago

        I'm already very suspicious, we just got the same room number in the third hotel in a row. Someone got lazy with the details :)

      • iwontberude 9 hours ago ago

        Not really, because is it simulators all the way down? Simulation theory explains nothing and only adds more unexplainable phenomena.

    • devttyeu 10 hours ago ago

      Could probably make a decent dataset from VR headset tracking cameras + motion sensors + passthrough output + decoded hand movements

  • TealMyEal 11 hours ago ago

    What's the end goal here? Personalised games for everyone? Ultra-graphics? I don't really see how this is going to be better than our engine-based systems.

    I love being a horse in the 1900s: that automobile will never take off /s

    • visarga 10 hours ago ago

      The goal is to train agents that can imagine consequences before acting. But it could also become a cheap way to create experiences and user interfaces on the fly; imagine if you could have any app UI dreamed up like that, not just games. Generative visual interfaces could be a big leap over text mode.

    • qayxc 11 hours ago ago

      It's a research paper. Not everything that comes out of research has an immediate real-world application in mind.

      Games are just an accessible and easy-to-replicate context to work in; they're not an end goal or target application.

      The research is about AI agents interacting with and creating world models. Such world models could just as well be of alien environments - i.e. the kind of thing an interstellar or even interplanetary probe would need to be able to build, as two-way communication over large distances is impractical.