吴正龙’s distillation Bookmarks
27 OCT 2025
[Thinking Machines] On-Policy Distillation
The reliable performance of on-policy training with the cost-efficiency of a dense reward signal.