MoE on Richelieu's Blog

A Deep Dive into the DeepSeek V4 MegaMoE Kernel

Tue, 02 Jun 2026 00:00:00 +0800

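Preface

The Classic MoE Computation

Before digging into MegaMoE, it helps to walk through the full computation of a classic MoE (Mixture of Experts) layer. Take the following concrete configuration as an example: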

T = 8 # number of tokens in the current batch
H = 7168 # hidden size
I = 2048 # intermediate size (FFN intermediate dimension of each expert)
E = 256 # total number of experts
K = 6 # number of experts activated per token (top-K)

Step 1: Routing

The input hidden_states has shape [T, H] (that is, [8, 7168]).

It first passes through the gate linear layer:

gate = hidden_states @ W_gate^T # W_gate: [E, H]
→ gate: [T, E] = [8, 256] # score of each token for each expert

A scoring function (such as softmax or sqrt(softplus)) is applied to each token's scores, and the top-K experts are then selected:
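The routing step described above can be sketched in pure Python as follows. This is only an illustration under stated assumptions: the function names `softplus` and `route` are hypothetical, the dimensions are shrunk so the loops run quickly, and renormalizing the selected scores so they sum to 1 is an assumption of this sketch rather than something the excerpt states.

```python
import math
import random


def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def route(hidden_states, w_gate, k):
    """Return (topk_ids, topk_weights), one list per token.

    hidden_states: [T, H] nested lists; w_gate: [E, H] nested lists.
    """
    topk_ids, topk_weights = [], []
    for token in hidden_states:
        # gate = hidden_states @ W_gate^T: one logit per expert.
        logits = [sum(h * w for h, w in zip(token, row)) for row in w_gate]
        # sqrt(softplus) scoring; softplus is strictly positive.
        scores = [math.sqrt(softplus(z)) for z in logits]
        # Keep the K experts with the highest scores.
        ids = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)[:k]
        picked = [scores[e] for e in ids]
        total = sum(picked)
        # Assumption of this sketch: renormalize the selected scores.
        topk_ids.append(ids)
        topk_weights.append([s / total for s in picked])
    return topk_ids, topk_weights


# Toy sizes (the post uses T=8, H=7168, E=256, K=6).
T, H, E, K = 8, 16, 32, 6
rng = random.Random(0)
hidden_states = [[rng.gauss(0, 1) for _ in range(H)] for _ in range(T)]
w_gate = [[rng.gauss(0, 1) for _ in range(H)] for _ in range(E)]
topk_ids, topk_weights = route(hidden_states, w_gate, K)
```

Each token ends up with K expert indices and K matching weights, which is the information the later dispatch and combine stages need.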

DeepSeek V4 MoE Quantization Explained

Tue, 02 Jun 2026 00:00:00 +0800

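Preface

This post is compiled from a technical discussion of the DeepSeek V4 MoE module in the vLLM codebase. It covers a comparison of the MXFP4 and NVFP4 quantization schemes, the design rationale behind Block Quantized GEMM, the packed FP4 storage format, and the concrete implementation of FP8×FP4 on Blackwell hardware in the DeepGEMM library.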

1. Overview of the Core DeepSeek V4 MoE Optimizations

The vLLM implementation of the DeepSeek V4 MoE module includes a large number of optimizations:

Optimization	Description
DeepGEMM MegaMoE	Fuses EP dispatch + L1 GEMM + SwiGLU + L2 GEMM + EP combine into a single mega-kernel, overlapping NVLink communication with computation
FP4 (MXFP4/NVFP4) weight quantization	4-bit floating-point weights + UE8M0 block scale
Expert Parallelism backends	Multiple all-to-all strategies, including DeepEP, FlashInfer NVLink, MORI, and NIXL
Fused TopK Bias Routing	sqrt(softplus) scoring function, e_score_correction_bias, hash MoE
EPLB	Tracks expert load per layer and rebalances dynamically
Fused MLA Kernel	Fuses Q-norm + RoPE + KV quant + cache insert into a single CUDA kernel
MTP (Multi-Token Prediction)	Speculative decoding that shares the MoE architecture

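2. Differences Between MXFP4 and NVFP4

DeepSeek V4 Flash uses FP4 weights, with two available schemes: MXFP4 (an open OCP standard) and NVFP4 (an NVIDIA proprietary format). The choice is controlled by the moe_quant_algo field in the HuggingFace config.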
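To make the MXFP4 side concrete, here is a minimal sketch of dequantizing one MXFP4 block in pure Python. It follows the OCP Microscaling layout: 32 E2M1 elements share one UE8M0 scale, which is a power of two with bias 127. The helper names are hypothetical, and the nibble order inside each packed byte (low nibble first) is an assumption of this sketch, not something taken from vLLM or DeepGEMM. NVFP4 uses a different block size and scale encoding, which this sketch does not model.

```python
# E2M1 magnitudes indexed by the low 3 bits of a 4-bit code; bit 3 is the sign.
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def decode_e2m1(nibble):
    """Decode one 4-bit E2M1 code into a Python float."""
    magnitude = E2M1_VALUES[nibble & 0x7]
    return -magnitude if nibble & 0x8 else magnitude


def decode_ue8m0(byte):
    """Decode an unsigned 8-bit exponent-only scale: 2 ** (byte - 127)."""
    # The NaN encoding (0xFF) is deliberately not handled in this sketch.
    return 2.0 ** (byte - 127)


def dequant_mxfp4_block(packed, scale_byte):
    """Dequantize one block; `packed` stores two FP4 codes per byte."""
    scale = decode_ue8m0(scale_byte)
    out = []
    for b in packed:
        # Assumption: the low nibble holds the even-indexed element.
        out.append(decode_e2m1(b & 0xF) * scale)
        out.append(decode_e2m1(b >> 4) * scale)
    return out


# 16 packed bytes hold the 32 FP4 values of one MXFP4 block.
packed = bytes([0x21, 0xF7] + [0x00] * 14)
values = dequant_mxfp4_block(packed, 128)  # scale byte 128 means 2 ** 1
```

Because the scale is a pure power of two, applying it only shifts the exponent of each element, which is one reason this format is cheap to handle in hardware.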

When the DeepSeek-v2 Routed Scaling Factor Is Applied

Mon, 01 Jun 2026 11:40:00 +0800

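Background

The DeepSeek-V2/V3 family of models uses an MoE (Mixture of Experts) architecture, in which routed_scaling_factor is an important hyperparameter that scales the output of the routed experts. The factor comes from the model config and is initialized in DeepseekV2MoE.__init__: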

self.routed_scaling_factor = config.routed_scaling_factor

The default is usually 1.0, but models in the DeepSeek-V2 family (such as deepseek-v2 and deepseek-coder-v2) typically set it to 2.5 or 1.0, depending on the specific variant.

The Control Switch

In vLLM's deepseek_v2.py, the key code is as follows:

apply_routed_scale_to_output = not self.is_rocm_aiter_moe_enabled

routed_scaling_factor=self.routed_scaling_factor,
apply_routed_scale_to_output=not self.is_rocm_aiter_moe_enabled,

This bool determines who handles routed_scaling_factor: the kernel internally, or the runner externally.
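The two paths can be illustrated with a small pure-Python sketch. Everything here is hypothetical: `combine_experts` stands in for a fused MoE kernel and `moe_forward` for the surrounding runner, and neither is vLLM's actual API. The sketch reads the flag as "the caller applies the factor to the kernel's output", which is an interpretation of the excerpt above. The point is only that both paths should produce the same result, so the factor must be applied exactly once.

```python
import math


def combine_experts(expert_outputs, topk_weights, scale=1.0):
    """Weighted sum of per-expert outputs for one token.

    Stand-in for a fused kernel that can optionally fold `scale` in.
    """
    hidden = len(expert_outputs[0])
    return [
        scale * sum(w * out[i] for w, out in zip(topk_weights, expert_outputs))
        for i in range(hidden)
    ]


def moe_forward(expert_outputs, topk_weights, routed_scaling_factor,
                apply_routed_scale_to_output):
    if apply_routed_scale_to_output:
        # The kernel returns unscaled output; the caller scales it afterwards.
        out = combine_experts(expert_outputs, topk_weights)
        return [routed_scaling_factor * v for v in out]
    # Otherwise the kernel is assumed to apply the factor internally.
    return combine_experts(expert_outputs, topk_weights,
                           scale=routed_scaling_factor)


# Toy data: one token, two selected experts, hidden size 2.
expert_outputs = [[1.0, 2.0], [3.0, 4.0]]
topk_weights = [0.25, 0.75]
outside = moe_forward(expert_outputs, topk_weights, 2.5, True)
inside = moe_forward(expert_outputs, topk_weights, 2.5, False)
```

If the factor were applied both inside the kernel and again by the caller, the routed output would be scaled by the square of the factor, which is the kind of bug this switch is meant to prevent.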

DeepSeek V3 MoE Module: Computation and Communication Logic Explained

Mon, 01 Jun 2026 10:10:00 +0800

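Overview

The MoE (Mixture of Experts) module is a core component of DeepSeek V3 and uses a Shared + Routed Expert architecture. Based on the vLLM codebase, this post analyzes its computation flow, communication patterns, quantization schemes, and performance optimization strategies in depth.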