🍊 Latent Atlas 🍉

Fast Transformer Decoding

June 1, 2026 · 1 min read

source
paper
attention
mqa
kv-cache

Basic information

Title: Fast Transformer Decoding: One Write-Head is All You Need
Source type: paper
Related topic notes: Multi-Query Attention, KV Cache

TODO

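Read the original paper and summarize how Multi-Query Attention reduces the cost of writing and reading the K/V cache during autoregressive decoding.
Fill in the differences between MHA and MQA in number of KV heads, memory bandwidth, and quality trade-offs.
Compare against GQA and trace the middle-ground path from MQA to grouped sharing.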
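
As a placeholder until the notes above are filled in, here is a minimal sketch of the KV cache arithmetic behind the MHA / GQA / MQA comparison. It assumes the usual accounting of one K and one V vector per KV head, per layer, per cached position. The helper `kv_cache_bytes` and the model shape (32 layers, 32 query heads, head dimension 128, 4096 cached positions, 2-byte elements) are hypothetical values chosen for illustration, not numbers from the paper.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Bytes needed to cache keys and values for all layers and positions."""
    # The factor 2 accounts for the separate K and V tensors.
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Hypothetical model shape, for illustration only.
N_LAYERS, N_QUERY_HEADS, HEAD_DIM, SEQ_LEN = 32, 32, 128, 4096

# MHA: every query head has its own KV head.
mha = kv_cache_bytes(N_LAYERS, N_QUERY_HEADS, HEAD_DIM, SEQ_LEN)
# GQA: here 8 KV heads, each shared by a group of 4 query heads (assumed grouping).
gqa = kv_cache_bytes(N_LAYERS, 8, HEAD_DIM, SEQ_LEN)
# MQA: a single KV head shared by all query heads.
mqa = kv_cache_bytes(N_LAYERS, 1, HEAD_DIM, SEQ_LEN)

for name, size in [("MHA", mha), ("GQA", gqa), ("MQA", mqa)]:
    print(f"{name}: {size / 2**20:.0f} MiB, MHA/{name} ratio = {mha // size}")
```

Under these assumptions the cache shrinks in proportion to the number of KV heads, which is also the amount of K/V data that must be re-read from memory at each decoding step. The sketch says nothing about the quality side of the trade-off, which still needs to be filled in from the paper.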

Backlinks

Attention
Grouped-Query Attention
Attention
Multi-Query Attention
Papers