DeepSeek-V2 Routed Scaling Factor: When and Where It Is Applied

Background

DeepSeek-V2/V3 models use a Mixture of Experts (MoE) architecture, in which routed_scaling_factor is a hyperparameter that scales the output of the routed experts. The value comes from the model config and is stored in DeepseekV2MoE.__init__:

self.routed_scaling_factor = config.routed_scaling_factor

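The default is usually 1.0, which means no scaling. The configured value varies by checkpoint: in the DeepSeek-V2 family (for example deepseek-v2 and deepseek-coder-v2), values such as 2.5 or 1.0 are typical, so read it from the config of the specific sub-model rather than assuming one.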

The control switch

In vLLM's deepseek_v2.py, the relevant code is:

apply_routed_scale_to_output = not self.is_rocm_aiter_moe_enabled

routed_scaling_factor=self.routed_scaling_factor,
apply_routed_scale_to_output=not self.is_rocm_aiter_moe_enabled,

This boolean decides who applies routed_scaling_factor: the kernel internally, or the runner after the kernel returns.
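The effect of the switch can be sketched as a small pure-Python function. This is only an illustration of the hand-off and not vLLM code: the helper name split_routed_scale is hypothetical, and it assumes the factor is split so that one side receives the real value and the other a neutral 1.0.

```python
def split_routed_scale(
    routed_scaling_factor: float,
    apply_routed_scale_to_output: bool,
) -> tuple[float, float]:
    """Return (kernel_factor, runner_factor) for a given switch value.

    Hypothetical helper: it mirrors the two conditional expressions
    used when the layer and the runner are constructed.
    """
    if apply_routed_scale_to_output:
        # The runner scales the output afterwards; the kernel sees 1.0.
        return 1.0, routed_scaling_factor
    # The kernel scales internally; the runner sees 1.0.
    return routed_scaling_factor, 1.0


for flag in (True, False):
    kernel_factor, runner_factor = split_routed_scale(2.5, flag)
    # Exactly one side holds the real value, so the product is always 2.5.
    print(flag, kernel_factor, runner_factor, kernel_factor * runner_factor)
```

Because exactly one of the two holders carries the real value, the factor is applied once regardless of the path.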

Path A: CUDA (`apply_routed_scale_to_output=True`)

On NVIDIA CUDA, is_rocm_aiter_moe_enabled is False, so apply_routed_scale_to_output=True.

Step 1: FusedMoE `__init__`

In layer.py, the layer's own member is set to 1.0, so the kernel never sees the real factor:

self.routed_scaling_factor = (
    routed_scaling_factor if not apply_routed_scale_to_output else 1.0
)

Step 2: Creating the runner

The runner holds the real routed_scaling_factor and later uses it to scale the output on the Python side:

routed_scaling_factor=routed_scaling_factor
    if apply_routed_scale_to_output
    else 1.0,

Step 3: Kernel forward

The kernel (for example MarlinMoE / FusedMoEKernel) receives routed_scaling_factor=1.0, applies no extra scaling, and returns the raw weighted sum of the expert outputs.
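As a rough mental model of what "raw weighted sum" means here, the following pure-Python sketch combines the outputs of the selected experts for a single token. The function toy_moe_kernel and its inputs are hypothetical stand-ins; a real fused kernel operates on batched GPU tensors and also performs the expert matrix multiplications.

```python
def toy_moe_kernel(expert_outputs, topk_weights, routed_scaling_factor=1.0):
    """Weighted sum over the selected experts for one token, times a scale.

    expert_outputs: one output vector per selected expert.
    topk_weights:   the router weight of each selected expert.
    """
    hidden = len(expert_outputs[0])
    out = [0.0] * hidden
    for weight, expert_out in zip(topk_weights, expert_outputs):
        for i in range(hidden):
            out[i] += weight * expert_out[i]
    # With the default factor of 1.0 this multiplication is a no-op.
    return [routed_scaling_factor * v for v in out]


expert_outputs = [[1.0, 2.0], [3.0, 4.0]]  # two selected experts
topk_weights = [0.25, 0.75]

raw = toy_moe_kernel(expert_outputs, topk_weights)
print(raw)  # [2.5, 3.5]
```

With the factor left at 1.0, the result is exactly the router-weighted combination, which is what the runner then scales in the next step.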

Step 4: Runner post-processing

In moe_runner.py, the _maybe_apply_routed_scale_to_output method scales the output:

def _maybe_apply_routed_scale_to_output(self, shared_output, fused_output):
    if self.routed_scaling_factor != 1.0:
        if fused_output.dtype != torch.float16 or shared_output is None:
            fused_output *= self.routed_scaling_factor
        elif shared_output is not None:
            shared_output *= 1.0 / self.routed_scaling_factor
    return shared_output, fused_output

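The FP16 case deserves attention: multiplying fused_output directly by routed_scaling_factor (for example 2.5) can overflow the FP16 range. vLLM therefore scales shared_output down instead, dividing it by routed_scaling_factor. Writing s for the factor, shared_output / s + fused_output = (shared_output + s * fused_output) / s, so the later shared_output + fused_output addition keeps the intended ratio between the two terms and differs from the directly scaled sum only by a uniform factor of 1/s, which the rest of the FP16 path has to absorb. When there is no shared_output, the code scales fused_output directly even in FP16.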
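The overflow argument can be checked with a small sketch that emulates FP16 rounding using the standard library's half-precision struct format. The helper to_fp16 and the activation values are hypothetical, chosen only so that the direct product exceeds the FP16 maximum of 65504.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round x to the nearest FP16 value; out-of-range values become inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses to pack values beyond the FP16 range, so we
        # emulate the hardware behaviour of saturating to infinity.
        return math.copysign(math.inf, x)


s = 2.5          # routed_scaling_factor
fused = 40000.0  # hypothetical routed-expert activation, exact in FP16
shared = 100.0   # hypothetical shared-expert activation

# Direct scaling: 40000 * 2.5 = 100000, beyond the FP16 maximum.
direct = to_fp16(fused * s)

# Alternative: shrink shared_output instead, then add.
rescaled = to_fp16(to_fp16(shared / s) + fused)

print(math.isinf(direct))       # True
print(math.isfinite(rescaled))  # True
```

The rescaled sum stays finite and, multiplied back by s, matches the directly scaled sum up to FP16 rounding.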

Path B: ROCm aiter (`apply_routed_scale_to_output=False`)

On AMD ROCm with the aiter kernels enabled, is_rocm_aiter_moe_enabled=True, so apply_routed_scale_to_output=False.

Step 1: FusedMoE `__init__`

The layer's member keeps the real value, so the aiter kernel can read it directly:

self.routed_scaling_factor = routed_scaling_factor

Step 2: Creating the runner

The runner receives routed_scaling_factor=1.0 and does no post-processing:

routed_scaling_factor=1.0

Step 3: Aiter kernel forward

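The aiter kernel reads routed_scaling_factor directly and, after computing the weighted expert outputs, multiplies by it inside the GPU kernel. The fused_output it returns is already scaled.

Step 4: Runner post-processing

_maybe_apply_routed_scale_to_output sees routed_scaling_factor=1.0 and returns its inputs unchanged.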
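To see that the two paths agree outside the FP16 special case, here is a toy end-to-end comparison in pure Python. All names (weighted_sum, path_a, path_b) and values are hypothetical; the sketch assumes the kernel-internal scaling is a plain multiplication of the weighted sum, which is how the steps above describe it.

```python
import math


def weighted_sum(expert_outputs, topk_weights, scale=1.0):
    """Toy stand-in for a fused MoE kernel for one token."""
    hidden = len(expert_outputs[0])
    out = [0.0] * hidden
    for weight, expert_out in zip(topk_weights, expert_outputs):
        for i in range(hidden):
            out[i] += weight * expert_out[i]
    return [scale * v for v in out]


def path_a(expert_outputs, topk_weights, shared, factor):
    """Kernel sees 1.0; the runner multiplies afterwards."""
    fused = weighted_sum(expert_outputs, topk_weights, scale=1.0)
    fused = [factor * v for v in fused]  # runner post-processing
    return [sh + fu for sh, fu in zip(shared, fused)]


def path_b(expert_outputs, topk_weights, shared, factor):
    """Kernel applies the factor itself; the runner sees 1.0."""
    fused = weighted_sum(expert_outputs, topk_weights, scale=factor)
    return [sh + fu for sh, fu in zip(shared, fused)]


experts = [[1.0, 2.0], [3.0, 4.0]]
weights = [0.25, 0.75]
shared_out = [0.5, -0.5]

result_a = path_a(experts, weights, shared_out, 2.5)
result_b = path_b(experts, weights, shared_out, 2.5)
print(result_a)  # [6.75, 8.25]
print(result_b)  # [6.75, 8.25]
```

Only the location of the multiplication changes between the two functions; the arithmetic performed is the same.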

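Summary table

Stage	CUDA	ROCm aiter
FusedMoE `self.routed_scaling_factor`	`1.0` (no effect in the kernel)	real value (available to the kernel)
runner `routed_scaling_factor`	real value	`1.0`
Where scaling happens	`_maybe_apply_routed_scale_to_output` (Python side, after the kernel returns)	inside the aiter kernel
FP16 overflow guard	yes (scale moved onto shared_output)	depends on the kernel implementation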

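Key takeaways

Both paths are intended to produce the same result; they differ only in when and where routed_scaling_factor is applied:

CUDA path: the kernel returns unscaled values, and the runner multiplies afterwards on the Python side, as a separate operation on the output tensor
ROCm aiter path: the aiter kernel performs the scaling internally, and the runner does nothing

The underlying reason is that ROCm's aiter library expects the scaling factor to be handled inside the kernel, whereas CUDA kernels such as marlin do not support internal scaling, so the runner has to apply it afterwards. The design reflects an engineering trade-off in vLLM's support for heterogeneous hardware: a thin layer of abstraction lets the same model code run correctly on both CUDA and ROCm with consistent numerical results.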
