Hi, I see the gradient norm curve comparison between Adam and the Muon hybrid. Do you also have an evaluation loss curve? Is it expected for the Muon optimizer to have a better loss curve than the Adam optimizer? I'd like to hear your insights on this, thanks!
Guokai Ma
delock
Recent Activity
commented on an article about 1 month ago
Muon vs MuonClip vs Muon+AdamW for Fine-Tuning
new activity about 2 months ago
moonshotai/Moonlight-16B-A3B: fix(modeling): add training-path MoE dispatch and KV cache API compat
updated a model about 2 months ago
delock/Moonlight-16B-A3B-finetune-fixed