Nvidia releases NVLM 1.0 72B open weight model

(huggingface.co)

153 points | by mirekrusin a day ago

54 comments

  • imjonse 20 hours ago

    It is a family of multimodal models based on pretrained Qwen2-72B-Instruct LLM and InterViT vision encoder. There are three variants differentiated by the way the vision tokens are used: decoder-only (like the majority of existing VLM), using cross-attention, and a hybrid. Only the first seems to be on huggingface at the moment.

    Also they seem to only train on publicly available data, concluding that quality is more important than scale.

  • keyboardsamurai 21 hours ago

    It has a non-commercial cc-by-nc-4.0 license. I would guess the only way to use this in production is to use Nvidia's data centers to host it? Or are there other ways?

    • orlp 20 hours ago

      Not a lawyer, not legal advice, but... the legal status quo is that neural network outputs are not copyrightable. They are currently considered not made by humans nor considered a derivative work from the training material / network weights (assuming it's not regurgitating copyrighted material verbatim).

      The cc-by-nc-4.0 license applies to the network weights. The only thing non-commercial about the license is that it restricts how you may reproduce the licensed material:

      > reproduce and Share the Licensed Material, in whole or in part, for NonCommercial purposes only; and

      As long as you are not selling the network weights themselves, nothing in the license prevents you from evaluating the neural network for commercial purposes and selling the outputs. In 'production' you will have to download the weights directly from Nvidia themselves (or another third party that is distributing the network weights non-commercially in good faith), though; you can't copy the network weights onto your commercial inference server from another one of your commercial deployment servers. Or at least, it gets dicier there and may be considered commercial reproduction, so it is better to avoid it.

      For similar reasons you may 3D print a CC-BY-NC model of a tool and use that tool in your commercial workshop, you may use a CC-BY-NC compiler of a language to compile commercial programs, etc.

      • SonOfLilit 18 hours ago

        Not a lawyer, but work with lawyers a lot, and this type of rules-lawyering doesn't tend to work in the legal profession. Consult a lawyer before trying any of this.

      • Majromax 14 hours ago

        > The cc-by-nc-4.0 license applies to the network weights.

        I'm not even sure if network weights are copyrightable independently of the code and data used to generate them. In my personal (not a lawyer) view, the weights of a neural network are the product of a mechanical transformation process much like a compiler or assembler, and we don't consider a compiled binary to have a copyright independent of its source code.

        I still wouldn't notoriously try to violate a purported weights license, mind you, both because it's rude to ignore the authors' wishes and because it would not be fun being sued by Nvidia or any other deep-pocket AI company.

      • dindresto 19 hours ago

        First time I've read this interpretation regarding CC-BY-NC model weights. Are there any sources to back it?

      • Tepix 18 hours ago

        It's an interesting question indeed!

        Creative Commons themselves write at https://creativecommons.org/faq/#can-i-apply-a-creative-comm... :

        "Can I apply a Creative Commons license to software? We recommend against using Creative Commons licenses for software. Instead, we strongly encourage you to use one of the very good software licenses which are already available."

        Of course, LLM weights aren't traditional software...

      • impossiblefork 18 hours ago

        Even selling the network weights shouldn't matter, since there's no copyright.

        The problem is if you happen to sign any agreement with NVIDIA in order to get the weights. The problem is whatever contracts you may be bound by.

      • resource_waste 16 hours ago

        > the legal status quo is that neural network outputs are not copyrightable.

        Can't this flip on a dime and a billion dollar company lose billions?

  • rd42 18 hours ago

    I think the only relevant part to note here is that this model showed improved text-only performance after multimodal training. Wonder if this translates to Llama models also? Is it possible to extend Llama 3.1 405B with multimodal training to create another SOTA large model?

    • reissbaker 15 hours ago

      I think the answer here is "it depends." The Llama-3.2 series is an extended version of the Llama-3.1 series with multimodal (image) training, but they kept the language model weights frozen and only updated the new image weights. So in the end, the 3.2 series benchmarks identically to 3.1 on text-only tasks; the image weights provided no value to the language model weights.

      Allowing the language model weights to be updated during training could potentially result in better performance on both tasks, though, if Nvidia's result replicates. I could believe that it might: after all, more diverse data is more diverse data, and the model will be forced during training to generalize more.

    • imjonse 18 hours ago

      Llama-3-V models do that, but are not published.

  • jftuga 17 hours ago

    How much GPU RAM would be needed to run this with just one GPU?

    • reissbaker 15 hours ago

      144GB VRAM to load the weights at FP16, 72GB quantized to FP8. To figure out the KV cache size you'll need for an LLM, you can use the following formula: https://x.com/AlpinDale/status/1841305040545329535

      Simplified for posterity:

          kv_bytes = kv_bits / 8
          hidden_per_head = hidden_size // num_attention_heads
          total_heads = hidden_per_head * num_key_value_heads
          kv_bytes_per_token = 2 * kv_bytes * num_hidden_layers * total_heads
      
      (Edit: I accidentally swapped in some of the vision config bytes in my original calculation; these are the corrected numbers.) So, for NVLM 1.0 72B, that works out to 640kb per token assuming FP16 KV cache. If you use the entire 32k context length, that's an extra ~20GB of overhead for the KV cache. Then depending on how you're running the LLM, there might be extra overhead e.g. compiled CUDA graphs.
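
      A minimal Python sketch of the formula above, for anyone who wants to plug in numbers. The function name and the config values in the example are assumptions for illustration (typical of a 72B-class model with grouped-query attention), not read from the NVLM release; substitute the values from the model's config.json. With other config values or a different cache precision the result will differ, so it need not match the per-token figure quoted in this comment.

```python
def kv_cache_bytes_per_token(hidden_size, num_attention_heads,
                             num_key_value_heads, num_hidden_layers,
                             kv_bits=16):
    """Bytes of KV cache needed for one token, summed over all layers."""
    kv_bytes = kv_bits // 8  # bytes per cached element (assumes a multiple of 8 bits)
    head_dim = hidden_size // num_attention_heads  # "hidden_per_head" in the formula
    kv_dim = head_dim * num_key_value_heads        # "total_heads" in the formula
    # The factor of 2 covers one key vector and one value vector per layer.
    return 2 * kv_bytes * num_hidden_layers * kv_dim


# Assumed, illustrative config values; check the real config.json.
example = dict(hidden_size=8192, num_attention_heads=64,
               num_key_value_heads=8, num_hidden_layers=80)
per_token_fp16 = kv_cache_bytes_per_token(**example, kv_bits=16)
per_token_fp8 = kv_cache_bytes_per_token(**example, kv_bits=8)
print(per_token_fp16, per_token_fp8)
```

      Multiplying the per-token result by the context length you plan to serve gives the cache size for one full-length sequence.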

      You can cut this down lower by using grouped query attention as described here: https://medium.com/@plienhar/llm-inference-series-4-kv-cachi... This allows you to divide that number by the number of grouped heads, although it trades off accuracy for VRAM usage.

      But TLDR, a minimum of around 164GB of VRAM at full accuracy. To me that seems fairly low, and I think vLLM would OOM without significantly more than that, but that's about as low as you could go in theory if you're running everything at FP16. Half that, of course, for FP8.
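
      The arithmetic behind that total can be sketched in a few lines of Python. This is only a back-of-the-envelope estimate using the figures from this comment: the helper name is made up, "640kb" is read as 640 * 1024 bytes and "32k" as 32 * 1024 tokens (both assumptions), and runtime overhead such as compiled CUDA graphs is ignored.

```python
def estimate_vram_gb(num_params, weight_bits, kv_bytes_per_token, context_tokens):
    """Rough lower bound: model weights plus KV cache for one full-length sequence."""
    weight_bytes = num_params * weight_bits / 8
    kv_cache_bytes = kv_bytes_per_token * context_tokens
    return (weight_bytes + kv_cache_bytes) / 1e9  # decimal gigabytes


# FP16 weights and FP16 KV cache, using the figures quoted in the comment.
fp16_total = estimate_vram_gb(72e9, 16, 640 * 1024, 32 * 1024)
# FP8 halves both the weight bytes and the per-token cache bytes.
fp8_total = estimate_vram_gb(72e9, 8, 320 * 1024, 32 * 1024)
print(round(fp16_total, 1), round(fp8_total, 1))
```

      Under those assumptions the FP16 estimate lands within a couple of GB of the figure above, and the FP8 estimate is exactly half of it.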

      You'll typically need to have a copy of the KV cache per GPU, if you're using multiple GPUs, so multiply the KV cache overhead by the number of GPUs you're using. This will depend on what the specs for the GPUs you're using are; for example, you'll need 3 H100s (really four, since vLLM wants the number of heads to be evenly divisible by the number of GPUs); if you're using L40Ses, you'll need eight of them; but most likely only a single AMD MI300x.

    • paulluuk 17 hours ago

      I haven't tested it, but likely around 170GB, regardless of if you're using only one GPU or spreading it out over several ones.

  • optimalsolver 21 hours ago

    Reminder that Nvidia is still the only company making any money out of the "AI revolution".

    • danpalmer 21 hours ago

      That's natural given that they mostly produce hardware several layers of abstraction distant from the end user value; companies need to buy the hardware before they can start delivering their own value. AI model training is not value by itself if there's no use-case for the model that can be charged for.

      I see it playing out one of two ways. Either Nvidia are selling shovels in a gold rush, the rush will end, and the business will dry up (after they have made a lot of money!). Or AI sticks/takes off, and Nvidia are selling a commodity too far from the value, like most electronic component manufacturers, and they'll maintain significant market share but have their margins reduced to a fraction of what they were before (after they made a lot of money!).

      The human value doesn't come from ML training or inference, it comes from taking a better photo. The business value comes from drafting a better email. Those companies closer to that value will likely do better in the long run, as they always have done.

    • Bloedcoins 17 hours ago

      I'm pretty sure https://www.topazlabs.com/ is also making money with the AI revolution.

      Also Klarna threw out 700 people, they probably make money with AI.

      And I found this article: https://www.ft.com/content/a9a192e3-bfbc-461e-a4f3-112e63d0b...

    • Bloedcoins 17 hours ago

      It's a revolution. Don't undersell this.

      There was never any technology before LLMs that came close to what ChatGPT and co. can do in terms of understanding random human input.

      My startup doesn't need to make money with it directly, but for us it increased our data quality on text and images.

      I'm also quite happy to pay $10-20 per month for random things LLMs do quite well for different use cases, like creating some scripts, etc.

    • a2128 21 hours ago

      "When there is a gold rush, sell shovels"

      • amelius 18 hours ago

        They started the gold rush.

        • jiggawatts 17 hours ago

          I'm pretty sure OpenAI started it, they just used NVIDIA shovels to dig the first mines.

          • throwaway48476 17 hours ago

            Nvidia created CUDA and seeded the ML industry for a decade before ChatGPT. They aren't given enough credit for their foresight and strategy. Most companies would have choked the community to death with greed before it ever took off.

            There is a reason why CUDA works on every NV GPU but ROCm support is spotty at best and only guaranteed on data center GPUs.

            • jiggawatts 17 hours ago

              My analogy still holds. NVIDIA just created good shovels that are useful in both the garden and in a gold mine.

              AMD and Intel insisted on selling only flimsy garden shovels.

              • throwaway48476 17 hours ago

                AMD's and Intel's shovels (hardware) are fine. The ecosystem is the problem. The fundamental difference is that AMD/Intel see it as an upsell, whereas Nvidia is willing to invest in long-term organic growth. The problem is the C-suite and the difference between companies run by founders and those run by bean counters.

                • jiggawatts 17 hours ago

                  We're actually in agreement, it's just that analogies are a blunt instrument.

                  I'm saying that Intel and AMD made single-purpose GPUs useful only for graphics. Whether that's because of the software or hardware is immaterial. Effectively, it's one product in the same sense that an iPhone is one product to a consumer, but technically it's the iPhone device + iOS the software + Apple services such as iCloud, music, etc...

                  • throwaway48476 17 hours ago

                    It's not single-purpose hardware or software. If you crawl over enough broken glass you can get anything to work on AMD/Intel.

                    The distinction is one of business strategy not technology.

    • Der_Einzige 20 hours ago

      Wrong

      Midjourney is profitable. All the acquired startups (i.e. Streamlit or MosaicML) who made millions per employee "made money" for the people who cared.

      • dartos 18 hours ago

        Midjourney is one, but the others are not. Plenty of people “made money” at Twitter, but the company is a money pit.

        OP was likely talking about profitability.

        FWIW I wouldn’t really count Streamlit as an AI company.

        • saagarjha 18 hours ago

          Twitter was (mildly) profitable.

    • GaggiX 21 hours ago

      That's not true, there are plenty of companies that make a profit, Midjourney, for example, an obvious one.

      • dartos 18 hours ago

        Are there others?

        • GaggiX 14 hours ago

          I use NovelAI and that's also profitable. I would be surprised if ElevenLabs wasn't profitable right now.

    • Refusing23 20 hours ago

      I have yet to hear of anyone actually using AI for something properly.

      The only exception I'm excited about is the non-main characters in video games, where a lot of the random NPCs can now actually bring some more fun to the game.

      • PeterStuer 19 hours ago

        I run in production a system that uses LLM translation and summarization from hundreds of sources in dozens of languages. Users are extremely satisfied with the results, which are far cheaper and far higher quality than what was available before.

        • Filligree 11 hours ago

          Which system is this?

      • Bloedcoins 17 hours ago

        I have seen plenty of very good internal AI demos which we are adding to our products: from GenAI stuff, to image analysis, to lightweight agents that answer proper questions.

        I used ChatGPT 3 days ago to generate a script for me. Saved me probably an hour too.

        We also use it in my startup for tasks which we wouldn't even have tried without ML models because the quality of old libraries was too bad, like PDF catalog to text, image classification, and segmentation.

      • lynx23 20 hours ago

        Vision models are a godsend for blind users. I use a vision model to sort my laundry, for instance...

        And translation and grammar/spell checking are also at a level which was unthinkable before LLMs hit.

        But that's it, really. The "talking machine" aspect of it is more and more uncovered as totally useless.

        • riffraff 20 hours ago

          > I use a vision model to sort my laundry

          You built a robot that sorts laundry? Tell us more!

          • lynx23 20 hours ago

            No, I never said that. But you already know that. The robot in this case is me holding a smart phone.

            • indigo945 19 hours ago

              Is that faster than just determining by touch what type of garment something is? Or is this about sorting by color?

              • lynx23 19 hours ago

                It's for sorting by color/print. Some things you remember instantly by touch, others not so much.

                • 1dom 18 hours ago

                  This sounds really cool - so you point it at individual items of clothing and it reads out the type of clothing and colour?

                  Do you have any more info or links about the setup?

                  • Bloedcoins 17 hours ago

                    https://www.bemyeyes.com/ you can scroll down to the new AI version.

                  • lynx23 16 hours ago

                    It's basically GPT-4o in disguise. The feature is called BeMyAI, and it is being released via BeMyEyes.

                    I would have answered earlier, but the silly HN rate limiter prevented me from passing the link to you.

                    I don't want to look it up yet again.

                    And I don't want to use HN anymore; this rate limit time-waster really just killed my sympathy for this site.

      • tourmalinetaco 18 hours ago

        Claiming no one is using MLMs “properly” despite the various scientific and industrial use cases (vision systems, robots, protein folding, drug simulation, etc) while being “excited” for something as pathetically trivial as a text generator with a text-to-speech tacked on for your mass-produced open world games. Truly peak HN.

  • cjtrowbridge 21 hours ago

    I love how they include a helpful chart that shows this model scores worse than everything else.

    • kibibu 21 hours ago

      Am I looking at the wrong table? It dominates everything on visual interpretation benchmarks.

      Edit: specifically OCRBench and VQAv2

    • butterfly42069 21 hours ago

      All jokes aside (and that did make me laugh) at least they're not training just to hit the benchmarks, which seem to be more meaningless as a quality indicator with each passing day.

    • miffy900 20 hours ago

      I see a few models (3 models in MMMU) that score lower than Nvidia's. But putting that aside, they at least get points for apparent objectivity. At least they probably aren't fudging numbers.

    • Der_Einzige 21 hours ago

      It's not that bad, and I'd much rather that they be honest instead of lying like everyone else does.

    • GaggiX 21 hours ago

      Well but it actually doesn't, unless you're looking only at MMMU.

      • dr_kiszonka 11 hours ago

        Exactly. On some benchmarks it is close to or better than GPT-4o.

        I wonder if one of the reasons they released it was to respond to OpenAI's plans to enter the chipmaking market.