It isn't as impressive as they'd like you to believe.
They fine-tuned it for that test.
What we are seeing is a marketing trick to keep markets and investors excited about AI. It's a trillion-dollar industry for Nvidia and the other players. Fake it until you make it.
If you look deeper, there's been very little change since GPT-3.5, and Anthropic has caught up with everything OpenAI has built so far.
Sora was a huge flop, with other companies clearly ahead of it. It's also mostly useless.
The numbers don't add up.
Is it not true that the ARC test is designed so that the rules are dynamic? I.e., every one of the tasks is different from the others in an absolute sense: learning about one tells you nothing of substance about another, unless of course you, or the model, are capable of meta-learning.
Finetuning has been looked down upon because all it does is rearrange weights to learn the style of the finetuning dataset. It does not teach the model anything new, which is in contrast to the hopes behind finetuning.
If a model was able to ace the ARC test just by merit of being finetuned, does it not imply there is something of absolute substance here? I.e., the model is capable of meta-learning, and all it needs to adapt to a new task is a bit of finetuning, which, I emphasize again, is the lowest tier among ways of training a model.
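To make the "dynamic rules" point concrete, here's a toy Python sketch. The two tasks and the candidate rules are made up for illustration, and the dict layout only loosely mirrors the public ARC JSON format; the point is that each task ships a few demonstration pairs governed by its own hidden rule, so the solver has to re-induce the rule per task from the demonstrations.

    # Toy ARC-style tasks (made-up contents, not real ARC data).
    # Each task has demonstration pairs ("train") and held-out inputs ("test").

    task_a = {  # hidden rule: mirror each row horizontally
        "train": [
            {"input": [[1, 0], [2, 0]], "output": [[0, 1], [0, 2]]},
            {"input": [[3, 4], [0, 0]], "output": [[4, 3], [0, 0]]},
        ],
        "test": [{"input": [[5, 0], [0, 6]]}],
    }

    task_b = {  # hidden rule: recolor every nonzero cell to 9
        "train": [
            {"input": [[1, 0], [2, 0]], "output": [[9, 0], [9, 0]]},
        ],
        "test": [{"input": [[0, 7], [7, 0]]}],
    }

    def solve(task, candidate_rules):
        """Return test outputs from the first rule consistent with all demos."""
        for rule in candidate_rules:
            if all(rule(p["input"]) == p["output"] for p in task["train"]):
                return [rule(t["input"]) for t in task["test"]]
        return None

    rules = [
        lambda g: [row[::-1] for row in g],                     # mirror rows
        lambda g: [[9 if c else 0 for c in row] for row in g],  # recolor to 9
    ]

    print(solve(task_a, rules))  # [[[0, 5], [6, 0]]]
    print(solve(task_b, rules))  # [[[0, 9], [9, 0]]]

Nothing transfers between task_a and task_b except the ability to search for a rule from a handful of examples, which is exactly the meta-learning capacity in question.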
yeah, you're right, the poster above you is just in denial
With all the hype around o1 and gpt-4o, sonnet-3.5 still performs better in the field. OpenAI, Google, and Qwen have been smashing all these benchmarks, but they have less market share in the field itself compared to the little guy ignoring benchmarks. OpenAI used to be that little guy.
What if they fine-tune it to do your job?
Just like a human, I expect AI to make mistakes on its first iterations. But after a while, it's chaos.