Modded-NanoGPT: NanoGPT (124M) quality in 3.25B tokens

(github.com)

78 points | by ocean_moist 19 hours ago

9 comments

  • Scene_Cast2 18 hours ago

    I wonder how much of the improvement is owed to each change. I've also never heard of "Muon - Momentum Orthogonalized by Newton-Schulz" being used.

    EDIT: there's a bit more info on his Twitter - https://x.com/kellerjordan0

    It looks like he created this optimizer. It works on 2D matrices only.
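
    For context, here is a minimal sketch of the classic cubic Newton-Schulz iteration that this kind of orthogonalization builds on: it maps a matrix toward the nearest semi-orthogonal matrix (the U @ V.T factor of its SVD), which is what "momentum orthogonalized by Newton-Schulz" suggests is done to the momentum matrix before it is applied as an update. The function name and step count are illustrative, and the 1.5/-0.5 coefficients are the textbook cubic variant; the actual Muon code may use different, tuned coefficients.

        import torch

        def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 20) -> torch.Tensor:
            # Cubic Newton-Schulz iteration X <- 1.5*X - 0.5*X @ X.T @ X,
            # which converges to the nearest semi-orthogonal matrix to G.
            # Convergence requires ||X0||_2 <= 1; dividing by the Frobenius
            # norm (an upper bound on the spectral norm) guarantees that.
            assert G.ndim == 2  # as noted above, this applies to 2D matrices only
            X = G / (G.norm() + 1e-7)
            transposed = G.size(0) > G.size(1)
            if transposed:
                # Iterate on the wide orientation so X @ X.T is the
                # smaller of the two Gram matrices.
                X = X.T
            for _ in range(steps):
                X = 1.5 * X - 0.5 * X @ X.T @ X
            return X.T if transposed else X

        # Usage: rows of the result are near-orthonormal.
        G = torch.randn(64, 128)
        Q = newton_schulz_orthogonalize(G)
        print((Q @ Q.T - torch.eye(64)).norm())  # close to zero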

  • molticrystal 16 hours ago

    Just needs a Zero To Hero series episode with line-by-line commentary, so you can follow along on why each choice was made over the alternatives.

  • whiplash451 18 hours ago

    Cool work. No license?

  • byyoung3 11 hours ago

    Do you have a baseline of the regular implementation with a 3x learning rate?

  • m3kw9 16 hours ago

    So it compresses info better.

  • gavindean90 18 hours ago

    Seems like this is a modded NanoGPT, not the original.

    • munchler 18 hours ago

      Yes. It’s literally called “Modded-NanoGPT”.