A summary of the two-hour interview with Albert Gu, co-author of Mamba, on SSMs

Introduction
- The field is looking toward new architectures that might even outperform the Transformer architecture; one candidate family is SSMs such as Mamba
- A refresher on Transformers:
- Self-attention is a major reason why transformers work so well. It enables an uncompressed view of the entire sequence with fast training.
- The downside? When generating each new token, attention must be computed over the entire sequence so far, even over tokens we have already generated, so per-token cost grows with context length and generating a full sequence scales quadratically
- Mamba (discussed below) is closely related to RNNs (Recurrent Neural Networks), which scale linearly with sequence length
- When generating output, an RNN only needs the previous hidden state and the current input, so it avoids recomputing over all previous tokens the way a Transformer does (see the sketch after this list)
- Sequence models are transformations from input sequence to output sequence
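A minimal sketch of that constant-cost recurrence (my own NumPy illustration, not code from the interview): each new token touches only the previous hidden state and the current input, so per-token cost does not depend on how much context has already been generated.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                  # toy hidden-state size
W_h = rng.normal(size=(d, d)) * 0.1    # hidden-to-hidden weights
W_x = rng.normal(size=(d, d)) * 0.1    # input-to-hidden weights

def rnn_step(h, x):
    # One generation step: only the previous hidden state and the
    # current input are needed, so the per-token cost is O(d^2),
    # independent of how many tokens came before.
    return np.tanh(W_h @ h + W_x @ x)

h = np.zeros(d)
for t in range(1000):                  # sequence length never changes per-step cost
    x_t = rng.normal(size=d)           # stand-in for the current token embedding
    h = rnn_step(h, x_t)               # previous states are never revisited
```

A Transformer, by contrast, would attend over all 1000 previous positions to produce that same next output.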
A Brief History of SSMs
- HiPPO (the name is a nod to the hippocampus) took a very mathematical approach, which worked well only for certain modalities (audio and video, closer to raw signals and more continuous in nature) but not as well for language
- S4 (Structured State Space sequence model) - highly efficient and expressive
- Recurrent state: uses a recurrent state in its operations (at inference time).
- Training pass: during training, the recurrence is rewritten as a convolution.
- State concept: there is no explicit state during the training pass, but the convolutional form is mathematically equivalent to the recurrent one (see the sketch after this list).
- Mamba (a selective, time-varying type of SSM)
- Compresses context or information into a state - stripping out unnecessary things
- SSMs rely on intelligent compression, whereas attention must remember everything
- The formulation was gradually simplified until it could be implemented efficiently
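A small numeric sketch (my own illustration, not S4's code) of the equivalence mentioned above: unrolling the time-invariant recurrence x_k = A x_{k-1} + B u_k, y_k = C x_k gives y_k = Σ_j C A^j B u_{k-j}, i.e. a convolution of the input with the kernel K_j = C A^j B.

```python
import numpy as np

rng = np.random.default_rng(0)
N, L = 4, 16                              # toy state size and sequence length
A = np.diag(rng.uniform(0.1, 0.9, N))     # stable diagonal state matrix
B = rng.normal(size=(N, 1))
C = rng.normal(size=(1, N))
u = rng.normal(size=L)                    # 1-D input signal

# Recurrent view: step through time, carrying an explicit state x.
x = np.zeros((N, 1))
y_rec = []
for k in range(L):
    x = A @ x + B * u[k]
    y_rec.append((C @ x).item())

# Convolutional view: no explicit state, just a precomputed kernel K.
K = np.array([(C @ np.linalg.matrix_power(A, j) @ B).item() for j in range(L)])
y_conv = [sum(K[j] * u[k - j] for j in range(k + 1)) for k in range(L)]

assert np.allclose(y_rec, y_conv)         # the two views produce identical outputs
```

This only works because A, B, and C do not change over time; once they become input-dependent (as in Mamba), the convolutional shortcut no longer applies.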
Mamba 1 vs Mamba 2
- The goal has been to use the state as effectively as possible, focusing primarily on how information is put into the state and how it is read out
- It turns out a linear mechanism is nearly all you need, plus a little nonlinearity from the selection mechanism
- Older RNNs struggled because their recurrence was nonlinear: it made optimization harder and squashed the state in unexpected ways
- Initializing, defining, and parameterizing these models is much easier with linearity
- Associative recall: this is where recurrent models traditionally fall short of attention, since retrieving key-value pairs from a compressed state is hard; still, Mamba 1 is pretty good at it and Mamba 2 is extremely good at it
- Mamba 1
- At each step the same kind of linear update is applied to the state: the entire state vector is multiplied by a constant that shrinks (decays) everything in it
- It can skip over time steps if necessary (e.g., discard useless 'ums' in audio); this idea led to the selection mechanism (see the sketch below)
- The core SSM computation doesn't leverage tensor cores (the units all modern hardware is specialized for, a consequence of the hardware lottery), so the method needed a practical, hardware-aware implementation to be worthwhile, unless you want to spend years building your own hardware
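To make the decay and selection ideas above concrete, here is a toy NumPy sketch of an input-dependent (selective) recurrence in the spirit of Mamba 1. It is my own illustration, not the actual Mamba kernel, and the names (`W_dt`, `W_B`, `W_C`, `A_log`) are placeholders: a per-input step size controls how strongly the old state is decayed and how much of the current input gets written in, which is what lets the model effectively skip uninformative steps.

```python
import numpy as np

def selective_ssm(u, W_dt, W_B, W_C, A_log):
    """Toy selective-SSM scan (illustrative only, not the Mamba kernel)."""
    L, D = u.shape                          # sequence length, channels
    N = W_B.shape[1]                        # state size per channel
    A = -np.exp(A_log)                      # (D, N); negative => decaying state
    h = np.zeros((D, N))
    ys = []
    for t in range(L):
        # dt, B, C are computed FROM the current input: the selection mechanism.
        dt = np.log1p(np.exp(u[t] @ W_dt))[:, None]   # softplus step size, (D, 1)
        B_t = u[t] @ W_B                               # (N,)
        C_t = u[t] @ W_C                               # (N,)
        decay = np.exp(dt * A)                         # (D, N), values in (0, 1)
        # dt near 0  -> decay near 1 and a tiny write: the step is effectively skipped
        # dt large   -> the old state is forgotten and the current input dominates
        h = decay * h + dt * B_t[None, :] * u[t][:, None]
        ys.append(h @ C_t)                             # read out, (D,)
    return np.stack(ys)

rng = np.random.default_rng(0)
L, D, N = 32, 4, 8
y = selective_ssm(rng.normal(size=(L, D)),
                  rng.normal(size=(D, D)) * 0.1,      # W_dt
                  rng.normal(size=(D, N)) * 0.1,      # W_B
                  rng.normal(size=(D, N)) * 0.1,      # W_C
                  rng.normal(size=(D, N)))            # A_log
print(y.shape)                                        # (32, 4)
```

Because dt, B, and C depend on the input, this scan cannot be rewritten as a single convolution the way S4's can, which is why an efficient hardware-aware implementation mattered so much.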