吴正龙’s mixture-of-experts Bookmarks

27 DEC 2024
To achieve efficient inference and cost-effective training, DeepSeek-V3 adopts Multi-head Latent Attention (MLA) and DeepSeekMoE architectures.
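The core idea of MLA is that keys and values are reconstructed from a small per-token latent, so the KV cache only has to store that latent. Below is a minimal sketch of that compression step, not DeepSeek-V3's actual implementation; all dimensions, module names (`LatentKVAttention`, `kv_down`, `k_up`, `v_up`), and the omission of causal masking and rotary embeddings are assumptions made for illustration.

```python
# Illustrative sketch of latent KV compression (MLA-style), not DeepSeek-V3's code.
# The KV cache stores only the small latent; keys/values are re-expanded at attention time.
import torch
import torch.nn as nn

class LatentKVAttention(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_latent=64):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.kv_down = nn.Linear(d_model, d_latent)  # compress hidden state -> cached latent
        self.k_up = nn.Linear(d_latent, d_model)     # expand cached latent -> keys
        self.v_up = nn.Linear(d_latent, d_model)     # expand cached latent -> values
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x, kv_cache=None):
        # x: (batch, seq, d_model); causal masking omitted for brevity
        b, t, d = x.shape
        latent = self.kv_down(x)                      # (b, t, d_latent) — what the KV cache holds
        if kv_cache is not None:
            latent = torch.cat([kv_cache, latent], dim=1)
        q = self.q_proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        k = self.k_up(latent).view(b, -1, self.n_heads, self.d_head).transpose(1, 2)
        v = self.v_up(latent).view(b, -1, self.n_heads, self.d_head).transpose(1, 2)
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, t, d)
        return self.out_proj(out), latent             # latent doubles as the updated KV cache
```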
11 DEC 2023
MoEs are pretrained much faster than dense models and have faster inference than a model with the same number of parameters, but they require high VRAM because all experts are loaded in memory.
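A minimal sketch of why that VRAM cost arises: in a top-k gated MoE layer, every expert's weights are instantiated and resident in memory, even though each token only runs through k of them. Everything below (class name `MoELayer`, the sizes, the naive routing loop) is illustrative and not taken from any specific implementation.

```python
# Minimal top-k gated MoE layer (illustrative, not a production router).
import torch
import torch.nn as nn

class MoELayer(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, k=2):
        super().__init__()
        self.k = k
        # All experts live in memory at once; routing only decides which ones run per token.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.router = nn.Linear(d_model, n_experts)

    def forward(self, x):                   # x: (n_tokens, d_model)
        scores = self.router(x)              # (n_tokens, n_experts)
        weights, idx = scores.topk(self.k, dim=-1)
        weights = torch.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        # Each token's output is a weighted sum of its k selected experts;
        # compute stays proportional to k, but parameters scale with n_experts.
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out
```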