We're running out of benchmarks to upper bound AI capabilities

(lesswrong.com)

15 points | by gmays 18 hours ago

9 comments

  • nikisweeting 17 hours ago

    We can definitely make harder evals; the problem is that a good eval set is indistinguishable from good training data / market edge, so no one is incentivized to share their best eval sets publicly.

  • WarmWash 17 hours ago

    Start front-loading the models with 5k, 10k, 50k, 100k tokens of messy, quasi-related context, and then run the benchmarks.

    These models are ridiculously powerful with a blank slate. It's when they get loaded down with all the necessary (and inevitably unnecessary) context to complete the task that they really start to crumble and fold.

    • jballanc 16 hours ago

      We need benchmarks that can distinguish between continuous learning and long-context extrapolation.

  • UltraSane 16 hours ago

    This is the least true thing ever. All LLMs are terrible at ARC-AGI-3. Every video game can be used as a benchmark. You could rank LLMs on how long they can keep a game of Dwarf Fortress running or how fast they can beat GTA5.

    • ttoinou 15 hours ago

      We already have specialized AI to play video games

      • UltraSane 15 hours ago

        We are talking about LLMs. A true AGI would be able to beat every video game.

        • conception 15 hours ago

          Until Arc-Battletoads is passed, I’m not buying it.

          • UltraSane 13 hours ago

            More like ARC-SegaMasterSystem-ALF
