11 comments

  • embedding-shape 21 hours ago

    Video understanding is still fairly new, especially when done well, and it would be great if it also works well with UI and UX. Current agents already struggle a bit with 2D space in normal screenshots of unconventional UIs, so I wonder if this model would do better with actual recordings of navigating and using applications. It feels like it could at least help a lot with understanding UX. Will be fun to play around with :)

  • wxw 18 hours ago

    What’s SOTA for video understanding? AFAIK most video search is powered by transcription and not the actual video. This seems impressive.

  • bguberfain a day ago

    Any plans to port to SGLang or vLLM?

    • cleardusk 2 hours ago

      vllm-omni support is on the way : )

  • nkvdev a day ago

    Great quality. Forked it and going to try it out.

  • Tsarp a day ago

    Nice work. Wish they had picked another name given how popular lance/lancedb is.

  • popalchemist a day ago

    Seems like the video output is crippled. Resolution is low (720p or so), as is the frame rate. The samples are shown upscaled and frame-interpolated.

    Why do that? Seems strange to be building sub-HD resolution video models in 2026.

    • jadbox a day ago

      Sure, but again, it's a micro 3B model. Perhaps it can't be used for general video work, but it might be able to do basic edits like remove an object from a table in a shot.

      • MattRix a day ago

        It’s not a micro model at all; it requires 40 GB of VRAM. The 3B is just the active parameters.

  • asadm a day ago

    last dance for lance vance!