🍊 Latent Atlas 🍉


Inference

March 29, 2026 · 1 min read

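LLM inference optimization and deployment, covering the decoding path, KV Cache, attention acceleration, quantization, serving, compression, and performance evaluation.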

Inference Pipeline

Decoding: autoregressive decoding, sampling, beam search, and speculative decoding.
KV Cache and Memory: KV Cache, PagedAttention, prefix caching, and GPU memory management.
Attention Acceleration: FlashAttention, FlashDecoding, and attention kernels.

Deployment and Optimization

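Quantization: weight-only quantization, AWQ, GPTQ, FP8, and KV Cache quantization.
Serving Systems: vLLM, continuous batching, request scheduling, and prefill/decode (PD) disaggregation.
Compression: model compression, pruning, distillation, and low-rank compression.
Performance: latency, throughput, TTFT, TPOT, and benchmarks.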

7 notes in this folder.

April 25, 2026
Performance
April 18, 2026
Compression
April 12, 2026
Serving Systems
April 11, 2026
Quantization
April 5, 2026
Attention Acceleration
April 4, 2026
KV Cache and Memory
March 29, 2026
Decoding

🍊 Latent Atlas 🍉 · An AI knowledge atlas built with Quartz © 2026