Modded-NanoGPT: NanoGPT (124M) quality in 3.25B tokens

(github.com)

78 points | by ocean_moist 19 hours ago

9 comments

  • Scene_Cast2 18 hours ago

    I wonder how much of the improvement is owed to each change. I've also never heard of "Muon - Momentum Orthogonalized by Newton-Schulz" being used.

    EDIT: there's a bit more info on his Twitter - https://x.com/kellerjordan0

    It looks like he created this optimizer. It works on 2D matrices only.
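
    For context, here is a minimal sketch of the classic cubic Newton-Schulz iteration that this kind of orthogonalization builds on: it maps a matrix toward the nearest semi-orthogonal matrix (the U @ V.T factor of its SVD), which is what "momentum orthogonalized by Newton-Schulz" suggests is done to the momentum matrix before it is applied as an update. The function name and step count are illustrative, and the 1.5/-0.5 coefficients are the textbook cubic variant; the actual Muon code may use different, tuned coefficients.

        import torch

        def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 20) -> torch.Tensor:
            # Cubic Newton-Schulz iteration X <- 1.5*X - 0.5*X @ X.T @ X,
            # which converges to the nearest semi-orthogonal matrix to G.
            # Convergence requires ||X0||_2 <= 1; dividing by the Frobenius
            # norm (an upper bound on the spectral norm) guarantees that.
            assert G.ndim == 2  # as noted above, this applies to 2D matrices only
            X = G / (G.norm() + 1e-7)
            transposed = G.size(0) > G.size(1)
            if transposed:
                # Iterate on the wide orientation so X @ X.T is the
                # smaller of the two Gram matrices.
                X = X.T
            for _ in range(steps):
                X = 1.5 * X - 0.5 * X @ X.T @ X
            return X.T if transposed else X

        # Usage: rows of the result are near-orthonormal.
        G = torch.randn(64, 128)
        Q = newton_schulz_orthogonalize(G)
        print((Q @ Q.T - torch.eye(64)).norm())  # close to zero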

  • molticrystal 16 hours ago

    Just needs a Zero To Hero series episode with line-by-line commentary, so you can follow along on why each choice was made over the alternatives.

  • whiplash451 18 hours ago

    Cool work. No license?

  • byyoung3 11 hours ago

    Do you have a baseline of the regular implementation with a 3x learning rate?

  • m3kw9 16 hours ago

    So it compresses info better.

  • gavindean90 18 hours ago

    Seems like this is a modded NanoGPT, not the original.

    • munchler 18 hours ago

      Yes. It’s literally called “Modded-NanoGPT”.