This book is an introduction to the field of inference engineering. Behind every technique that powers large-scale, high-performance inference lies endless depth worth exploring.
If you are looking for another book to continue your learning, I have three recommendations:
- *AI Engineering: Building Applications with Foundation Models* by Chip Huyen (O’Reilly Media, 2025): this highly popular book surveys the full landscape of AI engineering.
- *Build a Large Language Model (From Scratch)* by Sebastian Raschka (Manning, 2024): this hands-on book covers LLM architecture in detail.
- *AI Systems Performance Engineering: Optimizing Model Training and Inference Workloads with GPUs, CUDA, and PyTorch* by Chris Fregly (O’Reilly Media, 2025): this new book focuses on performance engineering.
The AI industry moves quickly, with new models, research, and implementations emerging constantly. My colleagues and I publish our latest work on the Baseten blog.
This chapter collects papers, documentation, books, and blog posts to support your continued learning as an inference engineer. Resources are organized by topic and listed alphabetically by title within each section.
8.1 Architecture
- Attention Is All You Need, by Ashish Vaswani et al. (Neural Information Processing Systems, 2017)
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, by Jacob Devlin et al. (North American Chapter of the Association for Computational Linguistics, 2019)
- BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models, by Junnan Li et al. (International Conference on Machine Learning, 2023)
- *Deep Learning* by Ian Goodfellow, Yoshua Bengio, and Aaron Courville (MIT Press, 2016)
- *Deep Learning with Python*, 2nd edition, by François Chollet (Manning, 2021)
- Denoising Diffusion Probabilistic Models, by Jonathan Ho et al. (ArXiv abs/2006.11239, 2020)
- DiT: Scalable Diffusion Models with Transformers, by William Peebles and Saining Xie (International Conference on Computer Vision (ICCV), 2023)
- FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness, by Tri Dao et al. (ArXiv abs/2205.14135, 2022)
- FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning, by Tri Dao (ArXiv abs/2307.08691, 2023)
- FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision, by Jay Shah et al. (ArXiv abs/2407.08608, 2024)
- FlashAttention-4, by Tri Dao (Dao AI Research Lab, 2025)
- Imagen Video: High Definition Video Generation with Diffusion Models, by Jonathan Ho et al. (ArXiv abs/2210.02303, 2022)
- Language Models Are Few-Shot Learners, by Tom Brown et al. (ArXiv abs/2005.14165, 2020)
- Learning Transferable Visual Models from Natural Language Supervision, by Alec Radford et al. (International Conference on Machine Learning, 2021)
- Longformer: The Long-Document Transformer, by Iz Beltagy et al. (ArXiv abs/2004.05150, 2020)
- Mamba: Linear-Time Sequence Modeling with Selective State Spaces, by Albert Gu and Tri Dao (ArXiv abs/2312.00752, 2023)
- Matryoshka Representation Learning, by Aditya Kusupati et al. (Neural Information Processing Systems, 2022)
- Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer, by Noam Shazeer et al. (ArXiv abs/1701.06538, 2017)
- Reformer: The Efficient Transformer, by Nikita Kitaev et al. (ArXiv abs/2001.04451, 2020)
- Robust Speech Recognition via Large-Scale Weak Supervision, by Alec Radford et al. (International Conference on Machine Learning, 2023)
- RoFormer: Enhanced Transformer with Rotary Position Embedding, by Jianlin Su et al. (ArXiv abs/2104.09864, 2021)
- SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis, by Dustin Podell et al. (ArXiv abs/2307.01952, 2023)
- Segment Anything, by Alexander Kirillov et al. (IEEE/CVF International Conference on Computer Vision (ICCV), 2023)
- Sentence-BERT: Sentence Embeddings Using Siamese BERT-Networks, by Nils Reimers and Iryna Gurevych (ArXiv abs/1908.10084, 2019)
- The Llama 3 Herd of Models, by Aaron Grattafiori et al. (ArXiv abs/2407.21783, 2024)
- Video Diffusion Models, by Jonathan Ho et al. (ArXiv abs/2204.03458, 2022)
- Visual Instruction Tuning, by Haotian Liu et al. (ArXiv abs/2304.08485, 2023)
8.2 Developer Tools
- BitsAndBytes, by bitsandbytes-foundation
- ComfyUI, by comfyanonymous
- CUDA by Example: An Introduction to General-Purpose GPU Programming, by Jason Sanders and Edward Kandrot (Addison-Wesley, 2010)
- CUDA C++ Programming Guide, Release 13.0, by NVIDIA (2025)
- CUDA cuBLAS, Release 13.0, by NVIDIA (2025)
- CUTLASS, by NVIDIA
- DeepGEMM, by DeepSeek-ai
- Hugging Face Diffusers, by Hugging Face
- LMCache, by LMCache Project
- NVIDIA Dynamo Documentation, by NVIDIA (2025)
- NVIDIA Nsight Systems, by NVIDIA
- NVIDIA Triton Inference Server, by NVIDIA
- ONNX Runtime, by Microsoft
- PyTorch Performance Tuning Guide, by Szymon Migacz (PyTorch Foundation, 2020)
- PyTorch Profiler, by Shivam Raikundalia (PyTorch Foundation, 2021)
- SGLang Project, by LMSYS Org
- TensorRT Documentation, by NVIDIA (2025)
- TensorRT-LLM, by NVIDIA
- Transformers, by Hugging Face
- vLLM Project, by The Linux Foundation
8.3 Frontier Open-Source Models
- DeepSeek, by DeepSeek AI
- FLUX, by Black Forest Labs
- Gemma, by Google
- GLM, by Z.ai
- GPT OSS, by OpenAI
- Kimi, by Moonshot AI
- Llama, by Meta Llama
- MiniMax, by MiniMax AI
- Mistral, by Mistral AI
- Nemotron, by NVIDIA
- Orpheus, by Canopy Labs
- Qwen, by Alibaba Qwen
- Wan, by Wan-AI
- Whisper, by OpenAI
8.4 GPU Infrastructure
- *Designing Data-Intensive Applications* by Martin Kleppmann (O’Reilly Media, 2017)
- Grace Hopper / Grace Blackwell Systems, by NVIDIA
- GPU Glossary, by Frye et al. (Modal, 2025)
- InfiniBand, by NVIDIA
- Kubernetes Documentation, by The Kubernetes Authors (The Linux Foundation, 2025)
- NVIDIA Blackwell Architecture Technical Brief: Built for the Age of AI Reasoning (NVIDIA, 2025)
- NVIDIA H100 Tensor Core GPU Architecture: Exceptional Performance, Scalability and Security for the Data Center (NVIDIA, 2023)
- NVIDIA Tesla: A Unified Graphics and Computing Architecture, by E. Lindholm et al. (IEEE Micro, March-April 2008)
- NVLink / NVSwitch, by NVIDIA
- *Programming Massively Parallel Processors: A Hands-on Approach* by Wen-mei Hwu, David Kirk, and Izzat El Hajj (Morgan Kaufmann, 2022)
- SemiAnalysis, by Dylan Patel (SemiAnalysis, 2025)
- *Site Reliability Engineering: How Google Runs Production Systems*, edited by Betsy Beyer et al. (O’Reilly Media, 2016)
8.5 Inference Optimization Research
- Adversarial Diffusion Distillation, by Axel Sauer et al. (European Conference on Computer Vision, 2024)
- Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation, by Ofir Press, Noah Smith, and Mike Lewis (ArXiv abs/2108.12409, 2021)
- AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration, by Song Han (MIT, 2024)
- Cache-DIT, by Vipshop
- CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion, by Jiayi Yao et al. (Proceedings of the Twentieth European Conference on Computer Systems, 2025)
- Adding Conditional Control to Text-to-Image Diffusion Models, by Lvmin Zhang et al. (International Conference on Computer Vision, 2023)
- Beyond the Buzz: A Pragmatic Take on Inference Disaggregation, by Tiyasa Mitra et al. (ArXiv abs/2506.05508, 2025)
- Break the Sequential Dependency of LLM Inference Using Lookahead Decoding, by Yichao Fu et al. (ArXiv abs/2402.02057, 2024)
- EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty, by Yuhui Li et al. (ArXiv abs/2401.15077, 2024)
- EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees, by Yuhui Li et al. (Conference on Empirical Methods in Natural Language Processing, 2024)
- EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test, by Yuhui Li et al. (ArXiv abs/2503.01840, 2025)
- FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving, by Ye et al. (ArXiv abs/2501.01005, 2025)
- GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers, by Elias Frantar et al. (ArXiv abs/2210.17323, 2022)
- High-Resolution Image Synthesis with Latent Diffusion Models, by Robin Rombach et al. (Conference on Computer Vision and Pattern Recognition (CVPR), 2022)
- Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference, by Simian Luo et al. (ArXiv abs/2310.04378, 2023)
- LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale, by Tim Dettmers et al. (ArXiv abs/2208.07339, 2022)
- Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads, by Tianle Cai et al. (ArXiv abs/2401.10774, 2024)
- Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism, by Mohammad Shoeybi et al. (ArXiv abs/1909.08053, 2019)
- Efficient Memory Management for Large Language Model Serving with PagedAttention, by Woosuk Kwon et al. (Proceedings of the 29th Symposium on Operating Systems Principles, 2023)
- Fast Inference from Transformers via Speculative Decoding, by Yaniv Leviathan et al. (International Conference on Machine Learning, 2023)
- Ring Attention with Blockwise Transformers for Near-Infinite Context, by Hao Liu et al. (ArXiv abs/2310.01889, 2023)
- SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration, by Jintao Zhang et al. (ArXiv abs/2410.02367, 2024)
- Sequence/Context Parallelism, by NVIDIA Megatron-LM
- SmoothQuant, by Song Han (MIT)
- SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot, by Elias Frantar and Dan Alistarh (ArXiv abs/2301.00774, 2023)
- SpecVLM: Fast Speculative Decoding in Vision-Language Models, by Haiduo Huang et al. (ArXiv abs/2509.11815, 2025)
- TeaCache: Timestep Embedding Tells: It’s Time to Cache for Video Diffusion Model, by Feng Liu et al. (Alibaba Tongyi Lab, ArXiv abs/2411.19108, 2024)
8.6 Evaluating Intelligence
- ARC AGI Prize, by Greg Kamradt (2025)
- *Evals for AI Engineers: Systematically Measuring and Improving AI Applications* by Shreya Shankar and Hamel Husain (O’Reilly Media, forthcoming 2026)
- Grade School Math: Training Verifiers to Solve Math Word Problems, by Karl Cobbe, Vineet Kosaraju, et al. (ArXiv abs/2110.14168, 2021)
- How to Fine-Tune Qwen3 to GPT-4o Level Performance, by Greg Schoeninger (Fine-Tune Fridays, Oxen AI, 2025)
- Humanity’s Last Exam, by Long Phan et al. (Center for AI Safety and Scale AI, ArXiv abs/2501.14249, 2025)
- HumanEval: Evaluating Large Language Models Trained on Code, by Mark Chen et al. (OpenAI, 2021)
- MMLU: Measuring Massive Multitask Language Understanding, by Dan Hendrycks et al. (Proceedings of the International Conference on Learning Representations (ICLR), 2021)
- MTEB: Massive Text Embedding Benchmark, by Niklas Muennighoff et al. (Conference of the European Chapter of the Association for Computational Linguistics, 2023)
- SWE-Bench: Can Language Models Resolve Real-World GitHub Issues?, by Carlos Jimenez et al. (Proceedings of the International Conference on Learning Representations (ICLR), 2024)