Guokai Ma

delock

AI & ML interests

None yet

Recent Activity

commented on an article about 1 month ago

Muon vs MuonClip vs Muon+AdamW for Fine-Tuning

new activity about 2 months ago

moonshotai/Moonlight-16B-A3B: fix(modeling): add training-path MoE dispatch and KV cache API compat

updated a model about 2 months ago

delock/Moonlight-16B-A3B-finetune-fixed

Organizations

None yet

commented on Muon vs MuonClip vs Muon+AdamW for Fine-Tuning about 1 month ago

Hi, I see the gradient norm curve comparison between Adam and the Muon hybrid. Do you also have evaluation loss curves? Is the Muon optimizer expected to have a better loss curve than the Adam optimizer? I'd like to hear your insights on this, thanks!

New activity in moonshotai/Moonlight-16B-A3B about 2 months ago

fix(modeling): add training-path MoE dispatch and KV cache API compat

#9 opened about 2 months ago by

delock

updated a model about 2 months ago

delock/Moonlight-16B-A3B-finetune-fixed

Updated Apr 13

published a model about 2 months ago

delock/Moonlight-16B-A3B-finetune-fixed

Updated Apr 13

New activity in microsoft/Phi-3-small-128k-instruct almost 2 years ago

Move flash_attn assert from __init__ into calling func

👍 1

#32 opened almost 2 years ago by

rogerxfeng8

liked a model over 2 years ago

Qwen/Qwen-14B-Chat

Text Generation • 14B • Updated Dec 13, 2023 • 1.77k • 373
