Tech Bookmarks
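Papers I have read, am reading, or want to read on LLM inference optimization, along with open-source projects I follow and tools I use day to day.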
📄
Paper
2024-05
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
DeepSeek-V2 introduces Multi-head Latent Attention (MLA) and DeepSeekMoE for efficient inference.
📄
Paper
2024-12
DeepSeek-V3 Technical Report
Technical report for the 671B-parameter DeepSeek-V3 model: FP8 training, Multi-Token Prediction, and load balancing.
📄
Paper
2025-02
TurboQuant: LLM KV Cache Compression via Hadamard Transform & Lloyd-Max Quantization
ICLR 2026: 3-4 bit KV cache compression using Hadamard rotation plus Lloyd-Max optimal scalar quantization.
📄
Paper
2023-09
vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention
SOSP 2023: the classic paper on PagedAttention and the vLLM inference engine.
📄
Paper
2023-12
SGLang: Efficient Execution of Structured Language Model Programs
An LLM inference framework driven by RadixAttention and compiler optimizations.
📄
Paper
2025-01
Fast and Expressive LLM Inference with RadixAttention and SGLang
SGLang v0.4 technical report: RadixAttention, Tree Attention, and compiler optimizations.
📄
Paper
2022-05
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
NeurIPS 2022: IO-aware exact attention computation using tiling and recomputation.
📄
Paper
2023-07
FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
Improved work partitioning and parallelism, giving about a 2× speedup over FlashAttention.
📄
Paper
2022-11
SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
ICML 2023: W8A8 quantization enabled by smoothing outliers in the activations.
📄
Paper
2023-06
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
MLSys 2024: INT4 weight quantization that identifies the important weights from the activation distribution.
🔧
Tool
2023-03
NVIDIA H100 Tensor Core GPU Architecture
Hopper architecture whitepaper: FP8 Tensor Cores, Transformer Engine, and DPX instructions.
📄
Paper
2019-09
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
NVIDIA's distributed training framework: tensor parallelism, pipeline parallelism, and sequence parallelism.