Scaling Factor

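Background

The DeepSeek-V2/V3 family of models uses an MoE (Mixture of Experts) architecture, in which routed_scaling_factor is an important hyperparameter that scales the output of the routed experts. The factor comes from the model config and is initialized in DeepseekV2MoE.__init__:

    self.routed_scaling_factor = config.routed_scaling_factor

The default is usually 1.0, but for the DeepSeek-V2 series (for example deepseek-v2 and deepseek-coder-v2) the typical value is 2.5 or 1.0, depending on the specific sub-model; the value actually used is whatever that checkpoint's config specifies.

The switch that controls it lives in vLLM's deepseek_v2.py. The flag is effectively computed as:

    apply_routed_scale_to_output = not self.is_rocm_aiter_moe_enabled

and both the factor and the flag are passed along as keyword arguments:

    routed_scaling_factor=self.routed_scaling_factor,
    apply_routed_scale_to_output=not self.is_rocm_aiter_moe_enabled,

This bool decides who is responsible for applying routed_scaling_factor: the kernel internally, or the runner outside the kernel.

...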
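To make the "who applies the scale" question concrete, here is a minimal pure-Python sketch. It is not vLLM code: fake_kernel and moe_forward are hypothetical names, and the reading that a True flag means "scale the output outside the kernel" is an assumption based on the flag's name. The only point it illustrates is that the factor has to be applied exactly once, on one side or the other.

```python
def fake_kernel(expert_outputs, router_weights, scale, scale_in_kernel):
    """Hypothetical stand-in for a fused MoE kernel (not a vLLM API).

    Combines the routed expert outputs with their router weights and,
    if asked to, applies the scaling factor internally.
    """
    combined = sum(w * x for w, x in zip(router_weights, expert_outputs))
    return combined * scale if scale_in_kernel else combined


def moe_forward(expert_outputs, router_weights,
                routed_scaling_factor, apply_routed_scale_to_output):
    """Apply routed_scaling_factor exactly once, inside or outside the kernel.

    Assumption: apply_routed_scale_to_output=True means the caller (the
    "runner") scales the kernel output; False means the kernel does it.
    """
    scale_in_kernel = not apply_routed_scale_to_output
    out = fake_kernel(expert_outputs, router_weights,
                      routed_scaling_factor, scale_in_kernel)
    if apply_routed_scale_to_output:
        # Runner-side scaling: the kernel left the output unscaled.
        out = out * routed_scaling_factor
    return out


if __name__ == "__main__":
    outputs, weights, factor = [1.0, 2.0], [0.5, 0.25], 2.5
    # Both paths should agree, because the factor is applied once either way.
    print(moe_forward(outputs, weights, factor, True))
    print(moe_forward(outputs, weights, factor, False))
```

If both sides applied the factor, or neither did, the routed output would be off by a factor of routed_scaling_factor, which is why a single boolean is used to assign the responsibility.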