吴正龙’s distillation Bookmarks

27 OCT 2025
The reliable performance of on-policy training with the cost-efficiency of a dense reward signal.