Linearizing Large Language Models
May 14, 2024
Jean Mercat
Igor Vasiljevic
Sedrick Keh
Kushal Arora
Achal Dave
Adrien Gaidon
Thomas Kollar
Abstract
We propose Scalable UPtraining for Recurrent Attention (SUPRA), a method to uptrain existing large pre-trained transformers into Recurrent Neural Networks (RNNs) with minimal compute. This approach leverages the strong performance of pre-trained transformers while significantly reducing training costs. SUPRA achieves competitive performance on standard benchmarks but shows limitations on in-context learning and long-context tasks. Our code and models are publicly available.
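For readers unfamiliar with linearization, the sketch below illustrates the general idea behind turning attention into a recurrence: replacing softmax attention with a kernelized (linear) attention whose output at each step depends only on a fixed-size running state. This is a minimal illustration of linear attention in general, not the exact SUPRA uptraining recipe; the function name, shapes, and normalization are assumptions for exposition.

```python
import torch


def linear_attention_step(q_t, k_t, v_t, state, norm):
    """One recurrent step of kernelized (linear) attention.

    q_t, k_t, v_t: feature-mapped query/key/value at time t, shape (d,)
    state:         running sum of outer products k_s v_s^T, shape (d, d)
    norm:          running sum of keys k_s, shape (d,)
    Returns the output y_t and the updated recurrent state.
    """
    state = state + torch.outer(k_t, v_t)                 # S_t = S_{t-1} + k_t v_t^T
    norm = norm + k_t                                      # z_t = z_{t-1} + k_t
    y_t = (q_t @ state) / (q_t @ norm).clamp(min=1e-6)     # y_t = q_t S_t / (q_t z_t)
    return y_t, state, norm


# Usage: iterate over a sequence, carrying a constant-size state like an RNN.
d = 64
state, norm = torch.zeros(d, d), torch.zeros(d)
for q_t, k_t, v_t in zip(torch.rand(10, d), torch.rand(10, d), torch.rand(10, d)):
    y_t, state, norm = linear_attention_step(q_t, k_t, v_t, state, norm)
```

Because the state has fixed size regardless of sequence length, inference cost per token is constant, which is what makes uptraining a transformer into this recurrent form attractive.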
Publication
In COLM (under review)