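Tech Bookmarks

Papers I have read, am reading, or want to read on LLM inference optimization, along with the open-source projects I follow and the tools I use day to day.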

📄 Paper 2024-05

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

DeepSeek-V2 introduces Multi-head Latent Attention (MLA), which compresses the KV cache into a latent vector, and DeepSeekMoE for efficient inference.

MoE, DeepSeek, MLA, LLM

📄 Paper 2024-12

DeepSeek-V3 Technical Report

Technical report for the 671B-parameter DeepSeek-V3 model: FP8 training, Multi-Token Prediction, and load balancing.

MoE, DeepSeek, Multi-Token Prediction, FP8

📄 Paper 2025-02

TurboQuant: LLM KV Cache Compression via Hadamard Transform & Lloyd-Max Quantization

ICLR 2026: compresses the KV cache to 3-4 bits using a Hadamard rotation followed by Lloyd-Max optimal scalar quantization.

quantization, KV Cache, TurboQuant
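
The two ingredients named in this entry can be sketched in a few lines of plain Python. This is only an illustration of a Hadamard rotation followed by a 1-D Lloyd-Max codebook, not the paper's algorithm or code; `fwht`, `lloyd_max`, and `quantize` are hypothetical helper names, and the input length is assumed to be a power of two.

```python
import math

def fwht(vec):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two)."""
    out = list(vec)
    n = len(out)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = out[j], out[j + h]
                out[j], out[j + h] = a + b, a - b  # butterfly step
        h *= 2
    scale = 1.0 / math.sqrt(n)  # makes the transform its own inverse
    return [x * scale for x in out]

def lloyd_max(samples, levels, iters=20):
    """1-D Lloyd iteration: nearest-level assignment, then centroid update."""
    lo, hi = min(samples), max(samples)
    # Assumption: start from evenly spaced levels across the sample range.
    codebook = [lo + (hi - lo) * (i + 0.5) / levels for i in range(levels)]
    for _ in range(iters):
        buckets = [[] for _ in range(levels)]
        for x in samples:
            idx = min(range(levels), key=lambda i: abs(x - codebook[i]))
            buckets[idx].append(x)
        # Empty buckets keep their previous level.
        codebook = [sum(b) / len(b) if b else c for b, c in zip(buckets, codebook)]
    return codebook

def quantize(vec, codebook):
    """Map each value to the index of its nearest codebook level."""
    return [min(range(len(codebook)), key=lambda i: abs(x - codebook[i])) for x in vec]

# One large outlier dominates the original vector.
x = [8.0, 0.1, -0.2, 0.0, 0.3, -0.1, 0.2, 0.0]
rotated = fwht(x)                        # outlier energy is spread across all entries
codebook = lloyd_max(rotated, levels=8)  # 8 levels = 3 bits per value
codes = quantize(rotated, codebook)
restored = fwht([codebook[c] for c in codes])  # dequantize, then rotate back
```

The rotation matters because a scalar quantizer wastes levels when one coordinate is far larger than the rest; after the transform the values have a much narrower range.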

📄 Paper 2023-09

vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention

SOSP 2023: the classic paper behind PagedAttention and the vLLM inference engine.

vLLM, PagedAttention, LLM Inference

📄 Paper 2023-12

SGLang: Efficient Execution of Structured Language Model Programs

An LLM inference framework driven by RadixAttention and compiler optimizations.

SGLang, LLM Inference, Compilation

📄 Paper 2025-01

Fast and Expressive LLM Inference with RadixAttention and SGLang

SGLang v0.4 technical report: RadixAttention, Tree Attention, and compiler optimizations.

SGLang, RadixAttention, Tree Attention

📄 Paper 2022-05

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

NeurIPS 2022: IO-aware exact attention, built on tiling and recomputation.

Attention, CUDA, Tiling
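
The tiling idea in this entry rests on an online softmax: keys and values are visited one block at a time while a running maximum and running sum keep the result exact. The sketch below shows that for a single query vector in plain Python. It is not the paper's CUDA kernel; `naive_attention`, `tiled_attention`, and `block_size` are illustrative names, and the usual 1/sqrt(d) scaling is left out for brevity.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def naive_attention(q, keys, values):
    """Reference: softmax(q . K^T) V, materializing every score at once."""
    scores = [dot(q, k) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(dim)]

def tiled_attention(q, keys, values, block_size=2):
    """Same output, but only one block of scores exists at any time."""
    dim = len(values[0])
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * dim  # unnormalized output accumulator
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [dot(q, k) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new maximum.
        rescale = math.exp(running_max - new_max)
        running_sum *= rescale
        acc = [a * rescale for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * vj for a, vj in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]

q = [0.1, 0.2]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5], [2.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
reference = naive_attention(q, keys, values)
tiled = tiled_attention(q, keys, values, block_size=2)
```

Because each block only needs the running maximum, sum, and accumulator, the full score matrix never has to be written to slow memory, which is where the IO savings come from.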

📄 Paper 2023-07

FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

Improved work partitioning and parallelism, giving roughly a 2× speedup over FlashAttention.

Attention, CUDA, Tiling

📦 Repo 2023-06

vLLM

High-performance LLM inference engine with PagedAttention, continuous batching, and multiple quantization schemes.

vLLM, Inference, CUDA, Python

📦 Repo 2024-01

SGLang

Execution engine for structured LLM programs, built on RadixAttention and compiler optimizations.

SGLang, Inference, Python, CUDA

📦 Repo 2023-03

llama.cpp

LLM inference implemented in C/C++ with GGUF format support; widely used on edge devices.

GGML, Inference, C++, Quantization

📦 Repo 2023-08

TensorRT-LLM

NVIDIA's LLM inference optimization platform, with FP8/INT4/INT8 quantization support.

TensorRT, NVIDIA, Inference, C++

📦 Repo 2024-03

torchao (PyTorch Architecture Optimization)

PyTorch's official library for quantization, pruning, and sparsity, with INT4/INT8/FP8 quantization support.

PyTorch, Quantization, CUDA, Pruning

🔧 Tool 2024-01

CUDA Toolkit Documentation

Official CUDA documentation; the CUDA C++ Programming Guide is required reading.

CUDA, NVIDIA, GPU

🔧 Tool 2024-01

NVIDIA Nsight Compute

CUDA kernel profiler with roofline analysis, occupancy analysis, and more.

CUDA, Profiling, GPU, NVIDIA

🔧 Tool 2024-01

Hugging Face LLM Leaderboard

Leaderboard for open-source LLMs, tracking model performance on MMLU/GSM8K/HumanEval.

HuggingFace, Benchmark, LLM

📝 Blog 2024-10

PyTorch LLM Inference Optimization (official tutorial)

A comprehensive guide from the PyTorch team on optimizing LLM inference performance.

PyTorch, Inference, Optimization

📄 Paper 2025-02

The Super Weight in LLM Quantization

A small number of weights in an LLM are especially sensitive to quantization, a finding connected to work such as SpinQuant and QuIP#.

quantization, LLM, outlier

📄 Paper 2022-11

SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models

ICML 2023: enables W8A8 quantization by smoothing activation outliers, which shifts quantization difficulty from activations to weights.

quantization, W8A8, LLM
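
The smoothing step in this entry can be written as a per-input-channel rescaling that leaves the matrix product unchanged. The sketch below uses the smoothing factor from the SmoothQuant paper, s_j = max|X_j|^alpha / max|W_j|^(1 - alpha), in plain Python. `smooth` and `matmul` are hypothetical helper names, the weight matrix is assumed to have shape (in_channels, out_channels), and every channel is assumed to contain at least one nonzero value.

```python
def matmul(A, B):
    """Plain nested-list matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def smooth(X, W, alpha=0.5):
    """Divide activations and multiply weights by a per-input-channel factor."""
    n_in = len(W)
    act_max = [max(abs(row[j]) for row in X) for j in range(n_in)]
    w_max = [max(abs(w) for w in W[j]) for j in range(n_in)]
    # Assumption: no channel is all zeros, so no division by zero occurs.
    s = [act_max[j] ** alpha / w_max[j] ** (1 - alpha) for j in range(n_in)]
    X_s = [[row[j] / s[j] for j in range(n_in)] for row in X]
    W_s = [[w * s[j] for w in W[j]] for j in range(n_in)]
    return X_s, W_s, s

# Channel 1 of the activations carries large outliers.
X = [[0.5, 40.0], [-0.2, -60.0]]
W = [[0.3, -0.1], [0.02, 0.01]]
X_s, W_s, s = smooth(X, W, alpha=0.5)
before = matmul(X, W)
after = matmul(X_s, W_s)  # mathematically the same product
```

With alpha = 0.5 the activation and weight ranges of each channel end up equal, so neither side is left with an outlier channel that dominates the 8-bit quantization range.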

📄 Paper 2023-06

AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

MLSys 2024: INT4 weight quantization that identifies salient weights from the activation distribution.

quantization, W4A16, LLM

📄 Paper 2025-02

BlockLM: Block-wise Quantization for LLMs

A new block-wise quantization paradigm that reaches a higher compression ratio while preserving accuracy.

quantization, block-wise, LLM

🔧 Tool 2023-03

NVIDIA H100 Tensor Core GPU Architecture

Hopper architecture whitepaper: FP8 Tensor Cores, Transformer Engine, and DPX instructions.

NVIDIA, GPU, H100, FP8, Tensor Core

📄 Paper 2019-09

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

NVIDIA's distributed training framework: Tensor Parallelism, Pipeline Parallelism, and sequence parallelism.

distributed, training, Tensor Parallelism

📝 Blog 2026-06

DeepSeek MLA

MLA

📦 Repo 2026-06

Needle

play

📦 Repo 2026-06

WeKnora

play
