In the age of AI dominated by Transformer models, the attention mechanism is the core component and the undisputed star. Yet its \(O(N^2)\) complexity poses a formidable hardware challenge: in practice, the linearly growing KV Cache of standard Softmax Attention rapidly consumes scarce and expensive DRAM (working memory)[1]. This memory consumption imposes a critical cost and capacity limit on the scalability of long-context models.
To solve this, Linear Attention provides an elegant mathematical solution. It not only reduces computational complexity to \(O(N)\), but, more critically, it achieves a fixed-size \(O(1)\) state.
This post explores the core mechanism of Linear Attention, deriving its state accumulation property and demonstrating how its constant memory bypasses the DRAM bottleneck, laying the foundation for economical and efficient ultra-long sequence generation.
Let's first recall the standard Softmax attention formula:
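$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^T}{\sqrt{d}}\right) V
$$

where \(d\) is the Key dimension used for scaling.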
Here, \(Q\), \(K\), and \(V\) are the Query, Key, and Value matrices. The bottleneck lies in the \(QK^T\) matrix multiplication, which creates an \(N \times N\) attention score matrix. This single step is responsible for the \(O(N^2)\) complexity.
The core idea of Linear Attention is to "aggregate first, query later," cleverly avoiding the creation of this massive \(N \times N\) matrix. It achieves this by applying a feature map \(\phi(\cdot)\) (such as ReLU or ELU+1) to the Queries and Keys and then leveraging the associative property of matrix multiplication to change the order of computation.
While standard attention computes \((QK^T)V\), linear attention reorders it to \(Q(K^T V)\). Let's look at the formula:
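$$
O_i = \phi(Q_i)^T \sum_{j=1}^{N} \phi(K_j)\, V_j^T
$$

(A normalizing denominator \(\phi(Q_i)^T \sum_{j=1}^{N} \phi(K_j)\), playing the role of the softmax denominator, is often included as well; it is omitted here for clarity.)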
At first glance, this still seems to require iterating through all \(j\) for each \(i\). But the key insight is that the term \(\sum_{j=1}^{N} (\phi(K_j) V_j^T)\) (a \(d \times d\) matrix) can be pre-computed and reused for all queries \(Q_i\). More importantly, in an autoregressive (causal) setting, this can be expressed as an accumulating state.
In autoregressive generation, the model produces a new token at each step, attending to all historical tokens (the context). To avoid recomputing the Keys and Values of the entire history at every step, the KV Cache mechanism is essential for handling long contexts.
However, the two attention mechanisms utilize fundamentally different KV Cache implementations, directly determining their capacity to handle ultra-long sequences.
For standard Softmax Attention, when generating the output \(O_t\) at position \(t\), the Query \(Q_t\) must compute the dot product against all past Keys \(K_{1:t}\) to produce the weights:
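$$
O_t = \sum_{j=1}^{t} \frac{\exp\!\left(Q_t K_j^T / \sqrt{d}\right)}{\sum_{k=1}^{t} \exp\!\left(Q_t K_k^T / \sqrt{d}\right)}\, V_j
$$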
The KV Cache implementation, as shown on the left side of Figure 4, requires storing the entire history of Keys \(K_{1:t}\) and Values \(V_{1:t}\). With every generation step, a new \(K_t\) and \(V_t\) must be appended to the cache, so its memory size grows linearly with the sequence length \(N\) (or the current timestep \(t\)), giving \(O(N)\) memory complexity. This massive, growing KV Cache quickly exhausts memory in ultra-long context settings, and it is also a computational bottleneck, since the Query \(Q_t\) must attend to the entire expanding sequence.
Linear Attention completely avoids the need to store and re-access all historical Keys and Values through mathematical reordering:
As illustrated on the right side of Figure 4, the core of Linear Attention lies in defining a fixed-size hidden state \(S_i\) that compresses all historical Key-Value information into a single matrix. This state \(S_i\) is defined as the accumulated sum of the Key and Value outer products:
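$$
S_i = \sum_{j=1}^{i} \phi(K_j)\, V_j^T
$$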
As seen in Figure 4, the new state \(S_i\) is obtained recursively, using only the previous state \(S_{i-1}\) and the current input \(K_i, V_i\):
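$$
S_i = S_{i-1} + \phi(K_i)\, V_i^T, \qquad S_0 = 0
$$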
The implementation achieves constant memory because the model does not need to store \(K_{1:t}\) and \(V_{1:t}\). Instead, it only needs to store and update a single, fixed-size matrix \(S\) (with dimension \(d \times d\)). The memory complexity is strictly \(O(1)\) (with respect to sequence length \(N\)), as the size of \(S\) depends exclusively on the model's hidden dimension \(d\). This guarantees that no matter how long the sequence becomes, the memory footprint remains constant, offering highly efficient inference where the Query \(Q_t\) only multiplies the fixed state \(S\).
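To make this concrete, here is a minimal single-head sketch of recurrent linear-attention decoding in PyTorch. It is illustrative rather than a reference implementation: the feature map `phi` (ELU + 1), the `LinearAttnState` class, and its `step` method are assumptions of this sketch, not names from any particular library.

```python
import torch
import torch.nn.functional as F

def phi(x: torch.Tensor) -> torch.Tensor:
    # Feature map: ELU + 1 keeps features positive (one common choice).
    return F.elu(x) + 1.0

class LinearAttnState:
    """Constant-size recurrent state: one d x d matrix, regardless of sequence length."""
    def __init__(self, d: int):
        self.S = torch.zeros(d, d)  # accumulated sum of phi(K_j) V_j^T
        self.z = torch.zeros(d)     # optional normalizer: accumulated sum of phi(K_j)

    def step(self, q_t: torch.Tensor, k_t: torch.Tensor, v_t: torch.Tensor) -> torch.Tensor:
        # Update with the current token only: S_t = S_{t-1} + phi(K_t) V_t^T
        self.S = self.S + torch.outer(phi(k_t), v_t)
        self.z = self.z + phi(k_t)
        # Readout: the query multiplies the fixed-size state; no history is touched.
        return (phi(q_t) @ self.S) / (phi(q_t) @ self.z + 1e-6)

# Usage: memory stays O(d^2) no matter how many tokens are generated.
d = 64
state = LinearAttnState(d)
for _ in range(10_000):                      # arbitrarily long generation loop
    q, k, v = (torch.randn(d) for _ in range(3))
    o_t = state.step(q, k, v)                # shape (d,), computed from S alone
```

Note that memory usage is dominated by the \(d \times d\) matrix `S`, independent of how many iterations the loop runs.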
The same property holds at the block level: when computing block \(i\) of a long sequence, we only need the cached state accumulated over the previous \(i-1\) blocks (\(S_{1 \rightarrow i-1}\)) and the information from the current block (\(K_i, V_i\)). SANA-Video[2] leverages this theoretical elegance in long video generation, where a generated video consists of thousands of frames, making it an impossibly long sequence for traditional attention mechanisms.
By utilizing the state accumulation property of linear attention, we can implement an efficient long video generation model: each block of the video is generated against the cached state, and the state is then updated with that block's Keys and Values before moving on to the next block.
With this method, even when the model is generating the n-th block of the video, its attention calculation implicitly incorporates all historical information from the very first frame. This enables the model to maintain long-term temporal consistency, generating videos that are coherent and logically sound, without "forgetting" what happened at the beginning.
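As a rough illustration of block-wise state caching (a generic sketch, not SANA-Video's actual pipeline; `attend_block` and the random Q/K/V stand-ins are assumptions of this example), the following PyTorch snippet carries only the \(d \times d\) state across blocks:

```python
import torch
import torch.nn.functional as F

def phi(x: torch.Tensor) -> torch.Tensor:
    return F.elu(x) + 1.0

def attend_block(Q: torch.Tensor, K: torch.Tensor, V: torch.Tensor, S_prev: torch.Tensor):
    """Causal linear attention over one block, given the state cached from all earlier blocks.

    Q, K, V: (block_len, d); S_prev: (d, d) summarizing blocks 1..i-1.
    Returns the block outputs and the updated state to cache for the next block.
    """
    Qf, Kf = phi(Q), phi(K)
    S_run = S_prev.clone()
    outs = []
    for t in range(Q.shape[0]):
        S_run = S_run + torch.outer(Kf[t], V[t])   # S_t = S_{t-1} + phi(K_t) V_t^T
        outs.append(Qf[t] @ S_run)                 # each query reads only the fixed-size state
    return torch.stack(outs), S_run

# Block-by-block generation: only the d x d state S is carried across blocks;
# the Keys and Values of earlier blocks are never stored or revisited.
d, block_len = 64, 16
S = torch.zeros(d, d)
for i in range(100):                               # could just as well be thousands of blocks
    Q, K, V = (torch.randn(block_len, d) for _ in range(3))  # stand-ins for the model's features
    O, S = attend_block(Q, K, V, S)
```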
However, the above simple linear accumulation is just the beginning. In recent years, researchers have discovered that designing more sophisticated online update rules can significantly enhance model performance and expressive power. These methods transform the state update itself into a learnable, dynamic process.
The table below shows how several advanced models have proposed their own unique state update mechanisms:
From this table, we can see a clear progression: these advanced update rules allow models to achieve far greater modeling capacity than simple linear accumulation, all while maintaining linear complexity. The "state" evolves from a simple "memory accumulator" into a complex, dynamic memory unit.
In this section, we introduce how state updating strategies are motivated and designed in LLMs.
While \(O(N)\) complexity is achieved, vanilla Linear Attention runs into a hard mathematical limit on exact Associative Retrieval tasks. Next-generation models employ gating and error correction to overcome this limitation.
The accumulated state \(S\) is the sum of outer products of all Key \(K_i\) and Value \(V_i\): \(S = \sum_{i=1}^{L} V_i K_i^T\).
Definition and Expansion of the Retrieval Operation:
The operation to retrieve \(V_j\) using \(K_j\) (as the query) is \(\hat{V}_j = S K_j\). Substituting the definition of \(S\):
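$$
\hat{V}_j = \left(\sum_{i=1}^{L} V_i K_i^T\right) K_j = \sum_{i=1}^{L} \left(K_i^T K_j\right) V_i
$$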
Separation into Target Term and Interference Term:
We separate the sum into the target term (\(i=j\)) and the interference term (\(i \neq j\)):
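$$
\hat{V}_j = \underbrace{\left(K_j^T K_j\right) V_j}_{\text{target term}} \;+\; \underbrace{\sum_{i \neq j} \left(K_i^T K_j\right) V_i}_{\text{interference term}}
$$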
Derivation of Perfect Retrieval Conditions:
For perfect retrieval (\(\hat{V}_j = V_j\)), the target coefficient \(K_j^T K_j\) must equal 1 (i.e., Keys are normalized) and the Interference Term must vanish, which requires the coefficient of every other \(V_i\) to be zero:
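$$
K_j^T K_j = 1, \qquad K_i^T K_j = 0 \quad \text{for all } i \neq j
$$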
This is the definition of mutual orthogonality. Therefore, the maximum number of interference-free memories is strictly limited by the Key/Value dimension \(d\) (Max \(L \le d\)). Once sequence length \(L\) exceeds \(d\), retrieval error is inevitable.
DeltaNet[3,4] introduces proactive error correction by treating state update as an online learning process that minimizes the Mean Squared Error (MSE).
Defining the Loss Function and Calculating the Gradient:
At each step \(t\), the state update is framed as one step of minimizing the MSE loss \(L_t(S)\), which measures how well the current state maps \(K_t\) to \(V_t\):
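$$
L_t(S) = \frac{1}{2} \left\| S K_t - V_t \right\|^2
$$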
We calculate the gradient of the loss with respect to \(S\):
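$$
\nabla_S L_t(S) = \left( S K_t - V_t \right) K_t^T
$$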
Deriving DeltaNet's Recurrent Formula:
Substituting the gradient into the gradient descent update rule \(S_t = S_{t-1} - \eta_t \nabla L_t(S_{t-1})\), with the learning rate \(\eta_t\) set to \(\beta_t\):
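$$
S_t = S_{t-1} - \beta_t \left( S_{t-1} K_t - V_t \right) K_t^T
$$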
Formula Expansion and Mechanism Separation:
Expanding and regrouping the terms, we derive DeltaNet's final recurrent formula:
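$$
S_t = S_{t-1} \left( I - \beta_t K_t K_t^T \right) + \beta_t V_t K_t^T
$$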
Mechanism Summary: The matrix \((I - \beta_t K_t K_t^T)\) acts as a dynamic projection matrix. It utilizes the prediction error to correct the old state \(S_{t-1}\), enabling proactive control and elimination of the interference component along the direction of \(K_t\), thereby breaking the \(L>d\) retrieval limitation.
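A minimal PyTorch sketch of this update rule may help; `deltanet_step` and the variable names are illustrative assumptions, and the snippet simply verifies the algebra above rather than reproducing any official implementation.

```python
import torch

def deltanet_step(S_prev: torch.Tensor, k_t: torch.Tensor, v_t: torch.Tensor, beta_t: float) -> torch.Tensor:
    """One DeltaNet-style state update: correct the memory along k_t, then write the new association.

    S_prev: (d, d) state; k_t, v_t: (d,) Key/Value vectors; beta_t: scalar write strength.
    """
    v_pred = S_prev @ k_t                              # what the old state retrieves for k_t
    error = v_pred - v_t                               # prediction error (the "delta")
    # S_t = S_{t-1} - beta_t (S_{t-1} k_t - v_t) k_t^T
    #     = S_{t-1} (I - beta_t k_t k_t^T) + beta_t v_t k_t^T
    return S_prev - beta_t * torch.outer(error, k_t)

# Sanity check: writing into an empty memory with a normalized key,
# then retrieving with that key, returns beta_t * v.
d = 64
S = torch.zeros(d, d)
k, v = torch.randn(d), torch.randn(d)
k = k / k.norm()                                       # normalized key, as in the retrieval analysis
S = deltanet_step(S, k, v, beta_t=0.5)
assert torch.allclose(S @ k, 0.5 * v, atol=1e-5)
```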
The practical success of these mechanisms in LLMs relies on translating the theory into hardware-efficient implementations and adopting modern architectural best practices.
Linear Attention represents a pivotal shift from the prohibitive \(O(N^2)\) approach to a scalable, hardware-conscious paradigm. By mathematically reordering the attention calculation, it enables the fundamental state accumulation property (\(S_i = S_{i-1} + \phi(K_i) V_i^T\)) that collapses memory usage from linear to constant \(O(1)\), directly bypassing the costly DRAM bottleneck.
Furthermore, advanced strategies like delta-rule error correction and gated recurrences successfully enhance this fixed state's modeling capacity, solving the crucial associative retrieval problem while preserving linear complexity. This elegant combination of constant memory efficiency and dynamic state capacity is the key to unlocking truly scalable and powerful generative models across long-form video, LLMs, and beyond.