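Tech Bookmarks

Papers I have read, am reading, or want to read on LLM inference optimization, plus the open-source projects I follow and the tools I use day to day.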

📄 Paper 2024-05
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

DeepSeek-V2 introduces Multi-head Latent Attention (MLA) and DeepSeekMoE for efficient inference.

MoE, DeepSeek, MLA, LLM
📄 Paper 2024-12
DeepSeek-V3 Technical Report

Technical report for the 671B-parameter DeepSeek-V3: FP8 training, Multi-Token Prediction, and load balancing.

MoE, DeepSeek, Multi-Token Prediction, FP8
📄 Paper 2025-02
TurboQuant: LLM KV Cache Compression via Hadamard Transform & Lloyd-Max Quantization

ICLR 2026: 3-4 bit KV cache compression using a Hadamard rotation followed by Lloyd-Max optimal scalar quantization.

quantization, KV Cache, TurboQuant
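The two ingredients named in this entry can be sketched in a few lines. This is only a toy illustration of a Hadamard rotation followed by Lloyd-Max scalar quantization, not the paper's algorithm. The function names, the toy vector, and the 4-level codebook are all assumptions made for the example.

```python
import math


def hadamard(x):
    """Orthonormal fast Walsh-Hadamard transform; len(x) must be a power of two."""
    x = list(x)
    n = len(x)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                x[j], x[j + h] = x[j] + x[j + h], x[j] - x[j + h]
        h *= 2
    return [v / math.sqrt(n) for v in x]


def lloyd_max(samples, levels, iters=20):
    """1-D Lloyd-Max: alternate nearest-level assignment and centroid update."""
    lo, hi = min(samples), max(samples)
    codebook = [lo + (hi - lo) * (i + 0.5) / levels for i in range(levels)]
    for _ in range(iters):
        buckets = [[] for _ in codebook]
        for s in samples:
            idx = min(range(levels), key=lambda i: abs(s - codebook[i]))
            buckets[idx].append(s)
        # Empty buckets keep their previous level.
        codebook = [sum(b) / len(b) if b else c for b, c in zip(buckets, codebook)]
    return codebook


def quantize(x, codebook):
    """Map each value to the index of its nearest codebook level."""
    return [min(range(len(codebook)), key=lambda i: abs(v - codebook[i])) for v in x]


vec = [8.0, 0.1, -0.2, 0.05, 0.0, 0.1, -0.1, 0.2]   # toy vector with one large outlier
rotated = hadamard(vec)                              # outlier energy is spread over all entries
codebook = lloyd_max(rotated, levels=4)              # hypothetical 2-bit codebook
codes = quantize(rotated, codebook)
restored = hadamard([codebook[c] for c in codes])    # the orthonormal transform is its own inverse
```

The rotation matters because a single outlier would otherwise stretch the quantization range for every other entry. After rotating, the values have a much smaller dynamic range.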
📄 Paper 2023-09
vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention

SOSP 2023: the classic paper on PagedAttention and the vLLM inference engine.

vLLM, PagedAttention, LLM Inference
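A rough sketch of the paging idea behind PagedAttention: each sequence keeps a block table that maps its logical KV-cache blocks to physical blocks taken from a shared free pool, so memory is reserved one block at a time rather than for the maximum sequence length. The class `BlockAllocator`, its methods, and the block size are hypothetical and are not vLLM's API.

```python
class BlockAllocator:
    """Toy paged KV-cache allocator (hypothetical, not vLLM's implementation)."""

    def __init__(self, num_blocks: int, block_size: int) -> None:
        self.block_size = block_size
        self.free = list(range(num_blocks))   # physical block ids not in use
        self.tables = {}                      # sequence id -> list of physical block ids
        self.lengths = {}                     # sequence id -> number of tokens stored

    def append_token(self, seq_id: str):
        """Reserve a slot for one new token; return (physical block, offset)."""
        table = self.tables.setdefault(seq_id, [])
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:          # last block is full, or none allocated yet
            if not self.free:
                raise MemoryError("no free KV-cache blocks")
            table.append(self.free.pop())
        self.lengths[seq_id] = n + 1
        return table[n // self.block_size], n % self.block_size

    def release(self, seq_id: str) -> None:
        """Return all blocks of a finished sequence to the free pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


alloc = BlockAllocator(num_blocks=4, block_size=2)
slots = [alloc.append_token("a") for _ in range(3)]   # three tokens span two blocks
```

Because physical blocks need not be contiguous, a finished sequence's blocks can be handed to any other sequence without fragmentation.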
📄 Paper 2023-12
SGLang: Efficient Execution of Structured Language Model Programs

An LLM inference framework driven by RadixAttention and compiler optimizations.

SGLang, LLM Inference, Compilation
📄 Paper 2025-01
Fast and Expressive LLM Inference with RadixAttention and SGLang

SGLang v0.4 technical report: RadixAttention, Tree Attention, and compiler optimizations.

SGLang, RadixAttention, Tree Attention
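The prefix-reuse idea behind RadixAttention can be illustrated with a plain trie over token ids: a new request looks up the longest prefix already cached and only needs to compute the KV cache for the remainder. This is a simplified sketch under that assumption. SGLang itself uses a radix tree with eviction, and the class and method names here are hypothetical.

```python
class PrefixCache:
    """Toy prefix cache over token ids (a plain trie, hypothetical names)."""

    def __init__(self) -> None:
        self.root = {}

    def insert(self, tokens) -> None:
        """Record that the KV cache for this token sequence is available."""
        node = self.root
        for t in tokens:
            node = node.setdefault(t, {})

    def match_prefix(self, tokens) -> int:
        """Length of the longest cached prefix of `tokens`."""
        node, n = self.root, 0
        for t in tokens:
            if t not in node:
                break
            node, n = node[t], n + 1
        return n


cache = PrefixCache()
cache.insert([1, 2, 3, 4])                   # e.g. a shared system prompt
reused = cache.match_prefix([1, 2, 3, 9, 9])  # only the last two tokens need prefill
```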
📄 Paper 2022-05
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

NeurIPS 2022: IO-aware exact attention built on tiling and recomputation.

Attention, CUDA, Tiling
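The tiling part of this entry rests on an online softmax: keys and values are visited one block at a time while a running maximum and running sum are kept, so the full score row is never materialized. The sketch below shows that for a single query in pure Python. It omits the 1/sqrt(d) scaling, the backward-pass recomputation, and everything GPU-specific, and the function names are made up for the example.

```python
import math


def attention_reference(q, keys, values):
    """Plain softmax attention for one query vector."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def attention_tiled(q, keys, values, block=2):
    """Same result, computed block by block with an online softmax."""
    dim = len(values[0])
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * dim                              # unnormalized output accumulator
    for start in range(0, len(keys), block):
        k_blk = keys[start:start + block]
        v_blk = values[start:start + block]
        scores = [sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        scale = math.exp(running_max - new_max)    # rescales earlier partial results
        weights = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * scale + sum(weights)
        acc = [a * scale + sum(w * v[d] for w, v in zip(weights, v_blk))
               for d, a in enumerate(acc)]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [1.0, 0.5]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -1.0], [2.0, 0.3]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
out_ref = attention_reference(q, keys, values)
out_tiled = attention_tiled(q, keys, values, block=2)
```

The two outputs agree up to floating-point rounding, which is why the method counts as exact attention rather than an approximation.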
📄 Paper 2023-07
FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

Improved work partitioning and parallelism, giving roughly a 2× speedup over FlashAttention.

Attention, CUDA, Tiling
📦 Repo 2023-06
vLLM

High-performance LLM inference engine with PagedAttention, continuous batching, and several quantization schemes.

vLLM, Inference, CUDA, Python
📦 Repo 2024-01
SGLang

Execution engine for structured LLM programs, built on RadixAttention and compiler optimizations.

SGLang, Inference, Python, CUDA
📦 Repo 2023-03
llama.cpp

LLM inference implemented in C/C++ with GGUF format support; widely used on edge devices.

GGML, Inference, C++, Quantization
📦 Repo 2023-08
TensorRT-LLM

NVIDIA's LLM inference optimization platform, with FP8/INT4/INT8 quantization support.

TensorRT, NVIDIA, Inference, C++
📦 Repo 2024-03
torchao (PyTorch Architecture Optimization)

Official PyTorch library for quantization, pruning, and sparsity, with INT4/INT8/FP8 quantization support.

PyTorch, Quantization, CUDA, Pruning
🔧 Tool 2024-01
CUDA Toolkit Documentation

Official CUDA documentation; the CUDA C++ Programming Guide is required reading.

CUDA, NVIDIA, GPU
🔧 Tool 2024-01
NVIDIA Nsight Compute

CUDA kernel profiler with roofline analysis, occupancy analysis, and more.

CUDA, Profiling, GPU, NVIDIA
🔧 Tool 2024-01
Hugging Face LLM Leaderboard

Open-source LLM leaderboard tracking model performance on MMLU, GSM8K, and HumanEval.

HuggingFace, Benchmark, LLM
📝 Blog 2024-10
PyTorch LLM Inference Optimization (official tutorial)

A comprehensive guide from the PyTorch team on optimizing LLM inference performance.

PyTorch, Inference, Optimization
📄 Paper 2025-02
The Super Weight in LLM Quantization

The finding, from work such as SpinQuant and QuIP#, that a small number of LLM weights are especially sensitive to quantization.

quantization, LLM, outlier
📄 Paper 2022-11
SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models

ICML 2023: enables W8A8 quantization by smoothing outliers in the activations.

quantization, W8A8, LLM
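The smoothing step can be shown on a tiny example. Each input channel j gets a scale s_j = max|X_j|^alpha / max|W_j|^(1 - alpha). Activations are divided by s_j and the matching weight row is multiplied by it, so the product X·W is unchanged while the activation outlier shrinks. The helper names and the toy matrices are assumptions for illustration, and the quantization itself is left out.

```python
def smooth(X, W, alpha=0.5):
    """X: tokens x channels, W: channels x outputs. Returns (X_s, W_s, s).

    Assumes every channel has a nonzero activation and weight maximum.
    """
    n_ch = len(W)
    s = []
    for j in range(n_ch):
        act_max = max(abs(row[j]) for row in X)
        w_max = max(abs(w) for w in W[j])
        s.append(act_max ** alpha / w_max ** (1 - alpha))
    X_s = [[row[j] / s[j] for j in range(n_ch)] for row in X]   # easier to quantize
    W_s = [[w * s[j] for w in W[j]] for j in range(n_ch)]       # absorbs the scale
    return X_s, W_s, s


def matmul(A, B):
    """Naive matrix product for small nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


X = [[0.1, 20.0], [-0.2, -16.0]]      # channel 1 carries an activation outlier
W = [[0.5, -0.3], [0.4, 0.2]]
X_s, W_s, s = smooth(X, W, alpha=0.5)
Y, Y_s = matmul(X, W), matmul(X_s, W_s)
```

With alpha = 0.5 the largest activation magnitude drops from 20 to about 2.83, while `Y` and `Y_s` match up to rounding.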
📄 Paper 2023-06
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

MLSys 2024: INT4 weight quantization that picks out important weights based on the activation distribution.

quantization, W4A16, LLM
📄 Paper 2025-02
BlockLM: Block-wise Quantization for LLMs

A new block-wise quantization paradigm that achieves higher compression ratios while preserving accuracy.

quantization, block-wise, LLM
🔧 Tool 2023-03
NVIDIA H100 Tensor Core GPU Architecture

Hopper architecture whitepaper: FP8 Tensor Cores, Transformer Engine, and DPX instructions.

NVIDIA, GPU, H100, FP8, Tensor Core
📄 Paper 2019-09
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

NVIDIA's distributed training framework: tensor parallelism, pipeline parallelism, and sequence parallelism.

distributed, training, Tensor Parallelism
📝 Blog 2026-06
DeepSeek MLA

DeepSeek MLA

MLA
📦 Repo 2026-06
Needle

Needle

play
📦 Repo 2026-06
WeKnora

WeKnora

play
