This book is an introduction to the field of inference engineering. Behind every technique that powers large-scale, high-performance inference lies endless depth worth exploring.
If you are looking for another book to continue your learning, I have three recommendations:
- *AI Engineering: Building Applications with Foundation Models* by Chip Huyen (O’Reilly Media, 2025): this highly popular book surveys the full landscape of AI engineering.
- *Build a Large Language Model (From Scratch)* by Sebastian Raschka (Manning, 2024): this hands-on book covers LLM architecture in detail.
- *AI Systems Performance Engineering: Optimizing Model Training and Inference Workloads with GPUs, CUDA, and PyTorch* by Chris Fregly (O’Reilly Media, 2025): this new book focuses on performance engineering.
The AI industry moves quickly, with new models, research, and implementations emerging constantly. My colleagues and I publish our latest work on the Baseten blog.
This chapter collects papers, documentation, books, and blog posts to support your continued learning as an inference engineer. Resources are organized by topic and listed alphabetically by title within each section.
8.1 Architecture
- Attention Is All You Need, by Ashish Vaswani et al. (Neural Information Processing Systems, 2017)
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, by Jacob Devlin et al. (North American Chapter of the Association for Computational Linguistics, 2019)
- BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models, by Junnan Li et al. (International Conference on Machine Learning, 2023)
- *Deep Learning* by Ian Goodfellow, Yoshua Bengio, and Aaron Courville (MIT Press, 2016)
- *Deep Learning with Python*, 2nd edition, by François Chollet (Manning, 2021)
- Denoising Diffusion Probabilistic Models, by Jonathan Ho et al. (ArXiv abs/2006.11239, 2020)
- DiT: Scalable Diffusion Models with Transformers, by William Peebles and Saining Xie (International Conference on Computer Vision (ICCV), 2023)
- FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness, by Tri Dao et al. (ArXiv abs/2205.14135, 2022)
- FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning, by Tri Dao (ArXiv abs/2307.08691, 2023)
- FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision, by Jay Shah et al. (ArXiv abs/2407.08608, 2024)
- FlashAttention-4, by Tri Dao (Dao AI Research Lab, 2025)
- Imagen Video: High Definition Video Generation with Diffusion Models, by Jonathan Ho et al. (ArXiv abs/2210.02303, 2022)
- Language Models Are Few-Shot Learners, by Tom Brown et al. (ArXiv abs/2005.14165, 2020)
- Learning Transferable Visual Models from Natural Language Supervision, by Alec Radford et al. (International Conference on Machine Learning, 2021)
- Longformer: The Long-Document Transformer, by Iz Beltagy et al. (ArXiv abs/2004.05150, 2020)
- Mamba: Linear-Time Sequence Modeling with Selective State Spaces, by Albert Gu and Tri Dao (ArXiv abs/2312.00752, 2023)
- Matryoshka Representation Learning, by Aditya Kusupati et al. (Neural Information Processing Systems, 2022)
- Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer, by Noam Shazeer et al. (ArXiv abs/1701.06538, 2017)
- Reformer: The Efficient Transformer, by Nikita Kitaev et al. (ArXiv abs/2001.04451, 2020)
- Robust Speech Recognition via Large-Scale Weak Supervision, by Alec Radford et al. (International Conference on Machine Learning, 2023)
- RoFormer: Enhanced Transformer with Rotary Position Embedding, by Jianlin Su et al. (ArXiv abs/2104.09864, 2021)
- SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis, by Dustin Podell et al. (ArXiv abs/2307.01952, 2023)
- Segment Anything, by Alexander Kirillov et al. (IEEE/CVF International Conference on Computer Vision (ICCV), 2023)
- Sentence-BERT: Sentence Embeddings Using Siamese BERT-Networks, by Nils Reimers and Iryna Gurevych (ArXiv abs/1908.10084, 2019)
- The Llama 3 Herd of Models, by Aaron Grattafiori et al. (ArXiv abs/2407.21783, 2024)
- Video Diffusion Models, by Jonathan Ho et al. (ArXiv abs/2204.03458, 2022)
- Visual Instruction Tuning, by Haotian Liu et al. (ArXiv abs/2304.08485, 2023)
8.2 Developer Tools
- BitsAndBytes, by bitsandbytes-foundation
- ComfyUI, by comfyanonymous
- CUDA by Example: An Introduction to General-Purpose GPU Programming, by Jason Sanders and Edward Kandrot (Addison-Wesley, 2010)
- CUDA C++ Programming Guide, Release 13.0, by NVIDIA (2025)
- CUDA cuBLAS, Release 13.0, by NVIDIA (2025)
- CUTLASS, by NVIDIA
- DeepGEMM, by DeepSeek-ai
- Hugging Face Diffusers, by Hugging Face
- LMCache, by LMCache Project
- NVIDIA Dynamo Documentation, by NVIDIA (2025)
- NVIDIA Nsight Systems, by NVIDIA
- NVIDIA Triton Inference Server, by NVIDIA
- ONNX Runtime, by Microsoft
- PyTorch Performance Tuning Guide, by Szymon Migacz (PyTorch Foundation, 2020)
- PyTorch Profiler, by Shivam Raikundalia (PyTorch Foundation, 2021)
- SGLang Project, by LMSYS Org
- TensorRT Documentation, by NVIDIA (2025)
- TensorRT-LLM, by NVIDIA
- Transformers, by Hugging Face
- vLLM Project, by The Linux Foundation
8.3 Frontier Open-Source Models
- DeepSeek, by DeepSeek AI
- FLUX, by Black Forest Labs
- Gemma, by Google
- GLM, by Z.ai
- GPT OSS, by OpenAI
- Kimi, by Moonshot AI
- Llama, by Meta Llama
- MiniMax, by MiniMax AI
- Mistral, by Mistral AI
- Nemotron, by NVIDIA
- Orpheus, by Canopy Labs
- Qwen, by Alibaba Qwen
- Wan, by Wan-AI
- Whisper, by OpenAI
8.4 GPU Infrastructure
- *Designing Data-Intensive Applications* by Martin Kleppmann (O’Reilly Media, 2017)
- Grace Hopper / Grace Blackwell Systems, by NVIDIA
- GPU Glossary, by Frye et al. (Modal, 2025)
- InfiniBand, by NVIDIA
- Kubernetes Documentation, by The Kubernetes Authors (The Linux Foundation, 2025)
- NVIDIA Blackwell Architecture Technical Brief: Built for the Age of AI Reasoning (NVIDIA, 2025)
- NVIDIA H100 Tensor Core GPU Architecture: Exceptional Performance, Scalability and Security for the Data Center (NVIDIA, 2023)
- NVIDIA Tesla: A Unified Graphics and Computing Architecture, by E. Lindholm et al. (IEEE Micro, March-April 2008)
- NVLink / NVSwitch, by NVIDIA
- *Programming Massively Parallel Processors: A Hands-on Approach* by Wen-mei Hwu, David Kirk, and Izzat El Hajj (Morgan Kaufmann, 2022)
- SemiAnalysis, by Dylan Patel (SemiAnalysis, 2025)
- *Site Reliability Engineering: How Google Runs Production Systems*, edited by Betsy Beyer et al. (O’Reilly Media, 2016)
8.5 Inference Optimization Research
- Adversarial Diffusion Distillation, by Axel Sauer et al. (European Conference on Computer Vision, 2024)
- Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation, by Ofir Press, Noah Smith, and Mike Lewis (ArXiv abs/2108.12409, 2021)
- AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration, by Song Han (MIT, 2024)
- Cache-DIT, by Vipshop
- CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion, by Jiayi Yao et al. (Proceedings of the Twentieth European Conference on Computer Systems, 2025)
- Adding Conditional Control to Text-to-Image Diffusion Models, by Lvmin Zhang et al. (International Conference on Computer Vision, 2023)
- Beyond the Buzz: A Pragmatic Take on Inference Disaggregation, by Tiyasa Mitra et al. (ArXiv abs/2506.05508, 2025)
- Break the Sequential Dependency of LLM Inference Using Lookahead Decoding, by Yichao Fu et al. (ArXiv abs/2402.02057, 2024)
- EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty, by Yuhui Li et al. (ArXiv abs/2401.15077, 2024)
- EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees, by Yuhui Li et al. (Conference on Empirical Methods in Natural Language Processing, 2024)
- EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test, by Yuhui Li et al. (ArXiv abs/2503.01840, 2025)
- FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving, by Ye et al. (ArXiv abs/2501.01005, 2025)
- GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers, by Elias Frantar et al. (ArXiv abs/2210.17323, 2022)
- High-Resolution Image Synthesis with Latent Diffusion Models, by Robin Rombach et al. (Conference on Computer Vision and Pattern Recognition (CVPR), 2022)
- Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference, by Simian Luo et al. (ArXiv abs/2310.04378, 2023)
- LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale, by Tim Dettmers et al. (ArXiv abs/2208.07339, 2022)
- Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads, by Tianle Cai et al. (ArXiv abs/2401.10774, 2024)
- Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism, by Mohammad Shoeybi et al. (ArXiv abs/1909.08053, 2019)
- Efficient Memory Management for Large Language Model Serving with PagedAttention, by Woosuk Kwon et al. (Proceedings of the 29th Symposium on Operating Systems Principles, 2023)
- Fast Inference from Transformers via Speculative Decoding, by Yaniv Leviathan et al. (International Conference on Machine Learning, 2023)
- Ring Attention with Blockwise Transformers for Near-Infinite Context, by Hao Liu et al. (ArXiv abs/2310.01889, 2023)
- SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration, by Jintao Zhang et al. (ArXiv abs/2410.02367, 2024)
- Sequence/Context Parallelism, by NVIDIA Megatron-LM
- SmoothQuant, by Song Han (MIT)
- SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot, by Elias Frantar and Dan Alistarh (ArXiv abs/2301.00774, 2023)
- SpecVLM: Fast Speculative Decoding in Vision-Language Models, by Haiduo Huang et al. (ArXiv abs/2509.11815, 2025)
- TeaCache: Timestep Embedding Tells: It’s Time to Cache for Video Diffusion Model, by Feng Liu et al. (Alibaba Tongyi Lab, ArXiv abs/2411.19108, 2024)
8.6 Evaluating Intelligence
- ARC AGI Prize, by Greg Kamradt (2025)
- *Evals for AI Engineers: Systematically Measuring and Improving AI Applications* by Shreya Shankar and Hamel Husain (O’Reilly Media, forthcoming 2026)
- Grade School Math: Training Verifiers to Solve Math Word Problems, by Karl Cobbe, Vineet Kosaraju, et al. (ArXiv abs/2110.14168, 2021)
- How to Fine-Tune Qwen3 to GPT-4o Level Performance, by Greg Schoeninger (Fine-Tune Fridays, Oxen AI, 2025)
- Humanity’s Last Exam, by Long Phan et al. (Center for AI Safety and Scale AI, ArXiv abs/2501.14249, 2025)
- HumanEval: Evaluating Large Language Models Trained on Code, by Mark Chen et al. (OpenAI, 2021)
- MMLU: Measuring Massive Multitask Language Understanding, by Dan Hendrycks et al. (Proceedings of the International Conference on Learning Representations (ICLR), 2021)
- MTEB: Massive Text Embedding Benchmark, by Niklas Muennighoff et al. (Conference of the European Chapter of the Association for Computational Linguistics, 2023)
- SWE-Bench: Can Language Models Resolve Real-World GitHub Issues?, by Carlos Jimenez et al. (Proceedings of the International Conference on Learning Representations (ICLR), 2024)