Hi HN — I'm one of the authors.
This paper looks at a problem that comes up in RL post-training of large models: the training data mixture (or curriculum) is often manually tuned and static, even though the policy keeps changing during training.
We propose Actor-Curator, a framework where a learned "curator" adaptively selects training problems while the actor policy is being optimized. The curator is trained to maximize a policy-improvement objective, effectively learning which data is most useful for improving the policy at each stage of training.
Conceptually it’s a co-adaptive system:

- the actor learns the policy
- the curator learns the training curriculum
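If it helps to make the loop concrete, here's a toy sketch in Python. Everything in it (the bandit-style curator, the scalar "skill", the difficulty buckets) is illustrative shorthand I'm making up for this comment, not our implementation; in the paper the actor is an LLM updated with RL and the curator is a learned model trained on a policy-improvement signal.

```python
import random

# Toy sketch of the co-adaptive loop. The "actor" here is a scalar skill
# level and the "curator" is an epsilon-greedy bandit over difficulty
# buckets, standing in for the learned models in the paper.
BUCKETS = ["easy", "medium", "hard"]
curator_value = {b: 0.0 for b in BUCKETS}  # running improvement estimates
actor_skill = 0.1                          # toy stand-in for policy quality

def actor_improvement(bucket: str, skill: float) -> float:
    """Toy learning dynamics (stand-in for an RL update): problems help
    most when their difficulty is near the actor's current skill."""
    target = {"easy": 0.2, "medium": 0.5, "hard": 0.9}[bucket]
    return max(0.0, 0.05 * (1.0 - abs(skill - target)))

for step in range(200):
    # Curator: pick a bucket, mostly greedily on estimated usefulness.
    if random.random() < 0.1:
        bucket = random.choice(BUCKETS)
    else:
        bucket = max(BUCKETS, key=curator_value.get)

    # Actor: "train" on that data and measure the policy improvement.
    delta = actor_improvement(bucket, actor_skill)
    actor_skill = min(1.0, actor_skill + delta)

    # Curator: update its estimate toward the observed improvement, so
    # its preferences track the actor as the actor gets stronger.
    curator_value[bucket] += 0.1 * (delta - curator_value[bucket])

print(f"final skill={actor_skill:.2f}, curator values={curator_value}")
```

The point is the feedback loop: the curator's reward is the actor's measured improvement, so which data is "best" shifts over training instead of being fixed up front.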
Happy to answer questions or discuss!