A summary of the two-hour interview with Albert Gu, co-author of Mamba, on SSMs

Introduction
- The field is looking toward new architectures that might even outperform the Transformer architecture; one candidate family is SSMs such as Mamba
- A refresher on Transformers:
- Self-attention is a major reason why transformers work so well. It enables an uncompressed view of the entire sequence with fast training.
- The downside? When generating each new token, attention must be computed over the entire sequence so far, even over tokens we have already generated, so per-token cost grows with context length and generating a full sequence scales quadratically
- Mamba (discussed below) is closely related to RNNs (Recurrent Neural Networks), which scale linearly with sequence length
- When generating output, an RNN only needs the previous hidden state and the current input, so it avoids recomputing over all previous tokens the way a Transformer does (see the sketch after this list)
- Sequence models are transformations from input sequence to output sequence
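A minimal sketch of that constant-cost recurrence (my own NumPy illustration, not code from the interview): each new token touches only the previous hidden state and the current input, so per-token cost does not depend on how much context has already been generated.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                  # toy hidden-state size
W_h = rng.normal(size=(d, d)) * 0.1    # hidden-to-hidden weights
W_x = rng.normal(size=(d, d)) * 0.1    # input-to-hidden weights

def rnn_step(h, x):
    # One generation step: only the previous hidden state and the
    # current input are needed, so the per-token cost is O(d^2),
    # independent of how many tokens came before.
    return np.tanh(W_h @ h + W_x @ x)

h = np.zeros(d)
for t in range(1000):                  # sequence length never changes per-step cost
    x_t = rng.normal(size=d)           # stand-in for the current token embedding
    h = rnn_step(h, x_t)               # previous states are never revisited
```

A Transformer, by contrast, would attend over all 1000 previous positions to produce that same next output.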
A Brief History of SSMs
- HiPPO (the name is a nod to the hippocampus) took a very mathematical approach, which worked well only for certain modalities (audio and video, closer to raw signals and more continuous in nature) but not as well for language
- S4 (Structured State Space sequence model) - highly efficient and expressive
- Recurrent state: uses a recurrent state in its operations (at inference time).
- Training pass: during training, the recurrence is rewritten as a convolution.
- State concept: there is no explicit state during the training pass, but the convolutional form is mathematically equivalent to the recurrent one (see the sketch after this list).
- Mamba (a selective, time-varying type of SSM)
- Compresses context or information into a state - stripping out unnecessary things
- SSMs rely on intelligent compression, whereas attention must remember everything
- The formulation was gradually simplified until it could be implemented efficiently
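A small numeric sketch (my own illustration, not S4's code) of the equivalence mentioned above: unrolling the time-invariant recurrence x_k = A x_{k-1} + B u_k, y_k = C x_k gives y_k = Σ_j C A^j B u_{k-j}, i.e. a convolution of the input with the kernel K_j = C A^j B.

```python
import numpy as np

rng = np.random.default_rng(0)
N, L = 4, 16                              # toy state size and sequence length
A = np.diag(rng.uniform(0.1, 0.9, N))     # stable diagonal state matrix
B = rng.normal(size=(N, 1))
C = rng.normal(size=(1, N))
u = rng.normal(size=L)                    # 1-D input signal

# Recurrent view: step through time, carrying an explicit state x.
x = np.zeros((N, 1))
y_rec = []
for k in range(L):
    x = A @ x + B * u[k]
    y_rec.append((C @ x).item())

# Convolutional view: no explicit state, just a precomputed kernel K.
K = np.array([(C @ np.linalg.matrix_power(A, j) @ B).item() for j in range(L)])
y_conv = [sum(K[j] * u[k - j] for j in range(k + 1)) for k in range(L)]

assert np.allclose(y_rec, y_conv)         # the two views produce identical outputs
```

This only works because A, B, and C do not change over time; once they become input-dependent (as in Mamba), the convolutional shortcut no longer applies.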
Mamba 1 vs Mamba 2
- The goal has been to use the state as effectively as possible, focusing primarily on how information is put into the state and how it is read out
- It turns out a linear mechanism is nearly all you need, plus a little nonlinearity from the selection mechanism
- Older RNNs struggled because their recurrence was nonlinear: it made optimization harder and squashed the state in unexpected ways
- Initializing, defining, and parameterizing these models is much easier with linearity
- Associative recall: this is where recurrent models traditionally fall short of attention, since retrieving key-value pairs from a compressed state is hard; still, Mamba 1 is pretty good at it and Mamba 2 is extremely good at it
- Mamba 1
- At each step the same kind of linear update is applied to the state: the entire state vector is multiplied by a constant that shrinks (decays) everything in it
- It can skip over time steps if necessary (e.g., discard useless 'ums' in audio); this idea led to the selection mechanism (see the sketch below)
- The core SSM computation doesn't leverage tensor cores (the units all modern hardware is specialized for, a consequence of the hardware lottery), so the method needed a practical, hardware-aware implementation to be worthwhile, unless you want to spend years building your own hardware
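To make the decay and selection ideas above concrete, here is a toy NumPy sketch of an input-dependent (selective) recurrence in the spirit of Mamba 1. It is my own illustration, not the actual Mamba kernel, and the names (`W_dt`, `W_B`, `W_C`, `A_log`) are placeholders: a per-input step size controls how strongly the old state is decayed and how much of the current input gets written in, which is what lets the model effectively skip uninformative steps.

```python
import numpy as np

def selective_ssm(u, W_dt, W_B, W_C, A_log):
    """Toy selective-SSM scan (illustrative only, not the Mamba kernel)."""
    L, D = u.shape                          # sequence length, channels
    N = W_B.shape[1]                        # state size per channel
    A = -np.exp(A_log)                      # (D, N); negative => decaying state
    h = np.zeros((D, N))
    ys = []
    for t in range(L):
        # dt, B, C are computed FROM the current input: the selection mechanism.
        dt = np.log1p(np.exp(u[t] @ W_dt))[:, None]   # softplus step size, (D, 1)
        B_t = u[t] @ W_B                               # (N,)
        C_t = u[t] @ W_C                               # (N,)
        decay = np.exp(dt * A)                         # (D, N), values in (0, 1)
        # dt near 0  -> decay near 1 and a tiny write: the step is effectively skipped
        # dt large   -> the old state is forgotten and the current input dominates
        h = decay * h + dt * B_t[None, :] * u[t][:, None]
        ys.append(h @ C_t)                             # read out, (D,)
    return np.stack(ys)

rng = np.random.default_rng(0)
L, D, N = 32, 4, 8
y = selective_ssm(rng.normal(size=(L, D)),
                  rng.normal(size=(D, D)) * 0.1,      # W_dt
                  rng.normal(size=(D, N)) * 0.1,      # W_B
                  rng.normal(size=(D, N)) * 0.1,      # W_C
                  rng.normal(size=(D, N)))            # A_log
print(y.shape)                                        # (32, 4)
```

Because dt, B, and C depend on the input, this scan cannot be rewritten as a single convolution the way S4's can, which is why an efficient hardware-aware implementation mattered so much.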