Show HN: Llama-8B Teaches Itself Baby Steps to Deep Research Using RL

(github.com)

35 points | by diegocaples 2 days ago ago

3 comments

mdp2021 2 days ago ago
It seems like a very nailed project. I understand this works also as an engine to optimize an interface over a body of knowledge (dataset) you input?
Questions:
-- Does a training over a body of data export into better performance over subsequent bodies of data - as you should also be training meta-skills?
-- Your benchmark revealed a growth from 23% to 53% after an hour: and after further training? If it plateaus, why does it?
[-]
dantodor a day ago ago
Try to use QWen. There has been a paper later that shows the influence of pre-training on the bump they get via RL.