1 comment

  • jonathanlight 5 hours ago

    Hi HN — I'm one of the authors.

    This paper looks at a problem that comes up in RL post-training of large models: the training data mixture (or curriculum) is often manually tuned and static, even though the policy keeps changing during training.

    We propose Actor-Curator, a framework where a learned "curator" adaptively selects training problems while the actor policy is being optimized. The curator is trained to maximize a policy-improvement objective, effectively learning which data is most useful for improving the policy at each stage of training.

    Conceptually it’s a co-adaptive system:

    - the actor learns the policy
    - the curator learns the training curriculum
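
    To make the loop concrete, here is a minimal sketch of one way a curator could adaptively pick training pools, using a simple epsilon-greedy bandit that favors pools yielding larger measured policy improvement. The pool names, the improvement signal, and the bandit formulation are all illustrative assumptions, not the paper's actual method:

    ```python
    import random

    class BanditCurator:
        """Hypothetical curator: a bandit over named data pools, rewarded
        by observed policy improvement (illustrative, not the paper's objective)."""

        def __init__(self, pools, epsilon=0.1, lr=0.3):
            self.values = {p: 0.0 for p in pools}  # running value estimate per pool
            self.epsilon = epsilon                 # exploration rate
            self.lr = lr                           # value-update step size

        def select_pool(self, rng=random):
            # epsilon-greedy: mostly exploit the pool with the highest estimate
            if rng.random() < self.epsilon:
                return rng.choice(list(self.values))
            return max(self.values, key=self.values.get)

        def update(self, pool, policy_improvement):
            # nudge the pool's value estimate toward the observed improvement
            self.values[pool] += self.lr * (policy_improvement - self.values[pool])

    # Usage: the actor trains on a batch from the chosen pool, then reports
    # how much its eval reward improved; the curator adapts its choices.
    curator = BanditCurator(["easy_math", "hard_math", "code"])
    curator.update("easy_math", 0.05)
    curator.update("hard_math", 0.20)
    ```

    The co-adaptive part is the feedback loop: as the actor improves, the improvement signal per pool shifts, and the curator's preferences shift with it.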

    Happy to answer questions or discuss!