VLA Foundry: A Unified Framework for Training Vision-Language-Action Models

Apr 21, 2026 · Jean Mercat, Sedrick Keh, Kushal Arora, Isabella Huang, Paarth Shah, Haruki Nishimura, Shun Iwase, Katherine Liu
VLA Foundry: an open-source framework for training vision-language-action models.
Abstract
We present VLA Foundry, an open-source framework that unifies LLM, VLM, and VLA training in a single codebase. Most open-source VLA efforts specialize in the action-training stage, often stitching together incompatible pretraining pipelines. VLA Foundry instead provides a shared training stack with end-to-end control, from language pretraining to action-expert fine-tuning, and supports both from-scratch training and pretrained backbones from Hugging Face. To demonstrate the utility of our framework, we train and release two types of models: the first trained fully from scratch through our LLM → VLM → VLA pipeline, and the second built on the pretrained Qwen3-VL backbone. We evaluate the closed-loop policy performance of both models on LBM Eval, an open-data, open-source simulator, and we contribute usability improvements to the simulator and the STEP analysis tools for easier public use. In the nominal evaluation setting, our fully open from-scratch model is on par with our prior closed-source work, and substituting in the Qwen3-VL backbone yields a strong multi-task tabletop manipulation policy that outperforms our baseline by a wide margin. The VLA Foundry codebase is available at , all multi-task model weights are released on , and additional qualitative videos are available on the project website .
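To make the backbone-swapping workflow described above concrete, the minimal sketch below loads a pretrained vision-language backbone from Hugging Face and attaches a small action head that predicts a chunk of continuous actions. This is an illustrative assumption, not VLA Foundry's actual API: the `SimpleVLA` class, the `ActionHead`-style MLP, the `action_dim`/`chunk_size` parameters, and the specific `Qwen/Qwen3-VL-8B-Instruct` checkpoint id are all hypothetical choices for the sketch.

```python
# Hypothetical sketch: wrap a pretrained Hugging Face VLM as a VLA policy.
# The checkpoint id, action-head design, and action dimensions are
# illustrative assumptions; VLA Foundry's real interfaces may differ.
import torch
import torch.nn as nn
from transformers import AutoModelForImageTextToText, AutoProcessor


class SimpleVLA(nn.Module):
    """A pretrained VLM backbone plus an MLP head predicting an action chunk."""

    def __init__(self, backbone_id: str, action_dim: int = 7, chunk_size: int = 8):
        super().__init__()
        self.backbone = AutoModelForImageTextToText.from_pretrained(
            backbone_id, torch_dtype=torch.bfloat16
        )
        # Multimodal configs typically nest the language model's config;
        # fall back to the top-level config if there is no text_config.
        cfg = getattr(self.backbone.config, "text_config", self.backbone.config)
        hidden = cfg.hidden_size
        self.action_head = nn.Sequential(
            nn.Linear(hidden, hidden),
            nn.GELU(),
            nn.Linear(hidden, action_dim * chunk_size),
        )
        self.action_dim, self.chunk_size = action_dim, chunk_size

    def forward(self, **inputs) -> torch.Tensor:
        # Use the final hidden state of the last token as a summary of the
        # (image, instruction) context, then decode a chunk of actions.
        out = self.backbone(**inputs, output_hidden_states=True)
        last_token = out.hidden_states[-1][:, -1].float()
        actions = self.action_head(last_token)
        return actions.view(-1, self.chunk_size, self.action_dim)


if __name__ == "__main__":
    # Downloads weights on first use; the checkpoint id is an assumption.
    processor = AutoProcessor.from_pretrained("Qwen/Qwen3-VL-8B-Instruct")
    policy = SimpleVLA("Qwen/Qwen3-VL-8B-Instruct")
```

In this reading, swapping backbones is a one-line change to the checkpoint id, which is the property the abstract attributes to a shared training stack; the from-scratch path would instead initialize the backbone randomly and pass it through the LLM → VLM → VLA stages before action-expert fine-tuning.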
Publication
Technical Report, Toyota Research Institute