ReactVLA: Fast and Lightweight Reactive Robot Manipulation via Improved Mean-Flow Action Generation

Abstract

While diffusion-based VLA policies excel at modeling expressive and multimodal action distributions, their reliance on iterative sampling introduces substantial inference latency, limiting their applicability to reactive closed-loop robot manipulation. To address this challenge, we propose ReactVLA, a low-latency and lightweight Vision-Language-Action framework for reactive robot manipulation. ReactVLA integrates two complementary components: (1) an improved Mean Flow (iMF) action generator that reduces expensive multi-step diffusion sampling to a one-to-two-step generation process, and (2) Attention Residuals (AttnRes), which replace uniform residual accumulation with dynamic depth-wise feature routing to better preserve multimodal representations.

We evaluate ReactVLA on simulation benchmarks (LIBERO and RoboIMI) and on real-world robotic manipulation tasks. On LIBERO, ReactVLA achieves a higher average success rate than VLA baselines such as SmolVLA and π₀ while using fewer parameters. On challenging precision manipulation tasks, ReactVLA achieves up to a 1.65× improvement in task performance and more than a 4× inference speedup over leading VLA models. These gains keep policy latency at or below 38.6 ms across our evaluations, enabling real-time reactive control for physical robot deployment.

Method Overview

Overview of the ReactVLA framework. Multimodal observations (visual inputs, language instructions, and proprioceptive states) are processed by an Attention Residual Transformer backbone with dynamic depth-wise feature routing. The proposed improved Mean-Flow (iMF) action head transports noisy actions directly toward the target action distribution in one to two generation steps, enabling ultra-low-latency reactive robot control.
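
The two ideas in the overview can be illustrated with a small NumPy sketch. It is not the ReactVLA implementation. `route_over_depth` is one plausible reading of "dynamic depth-wise feature routing": a softmax-weighted mix of earlier layer outputs, driven by a hypothetical learned `query` vector, in place of a plain residual sum. `mean_flow_sample` follows the standard mean-flow update z_r = z_t - (t - r) u(z_t, r, t), with t = 1 as noise and t = 0 as the action. The `make_oracle` helper is an assumption for illustration only: it stands in for the trained action head with the exact average velocity for a data distribution that contains a single target action.

```python
import numpy as np


def route_over_depth(layer_outputs, query):
    """Mix earlier layer outputs with softmax weights instead of a plain sum.

    layer_outputs: list of (tokens, dim) arrays, one per earlier layer.
    query: (dim,) vector, a hypothetical learned per-layer parameter.
    """
    stack = np.stack(layer_outputs)                       # (depth, tokens, dim)
    scores = stack @ query / np.sqrt(query.shape[0])      # (depth, tokens)
    scores = scores - scores.max(axis=0, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=0, keepdims=True)
    return (weights[..., None] * stack).sum(axis=0)       # (tokens, dim)


def mean_flow_sample(avg_velocity, noise, num_steps=1):
    """Transport noise (t=1) to an action (t=0) in `num_steps` jumps.

    Each jump applies z_r = z_t - (t - r) * u(z_t, r, t), where u is the
    average velocity over the interval [r, t].
    """
    times = np.linspace(1.0, 0.0, num_steps + 1)
    z = noise
    for t, r in zip(times[:-1], times[1:]):
        z = z - (t - r) * avg_velocity(z, r, t)
    return z


def make_oracle(target):
    """Exact average velocity when the data distribution is a single action.

    With the path z_t = (1 - t) * target + t * noise, every trajectory is a
    straight line into `target`, so u(z_t, r, t) = (z_t - target) / t.
    """
    return lambda z, r, t: (z - target) / t


rng = np.random.default_rng(0)
target_action = np.array([0.2, -0.5, 0.1, 0.0, 0.3, -0.1, 1.0])  # toy 7-D action
noise = rng.standard_normal(7)
one_step = mean_flow_sample(make_oracle(target_action), noise, num_steps=1)
two_step = mean_flow_sample(make_oracle(target_action), noise, num_steps=2)

features = [rng.standard_normal((4, 8)) for _ in range(3)]  # 3 layers, 4 tokens
routed = route_over_depth(features, query=rng.standard_normal(8))
```

With the oracle, both the one-step and the two-step sampler land on the target action, which is why a well-trained average-velocity head needs only one or two network evaluations. A learned head is only approximate, so a second step can help correct the first.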

Experimental Results

LIBERO Benchmark

Method	# Params	Spatial (%)	Object (%)	Goal (%)	Long (%)	Avg. (%)	Latency (ms)
Diffusion Policy	0.46B	78.3	92.5	68.3	50.5	72.4	178.82
OpenVLA	6.74B	84.7	88.4	79.2	53.7	76.5	115.0
π₀ (Paligemma-3B)	5.97B	87.0	63.0	89.0	48.0	71.8	94.3
π₀	4.03B	90.0	86.0	95.0	73.0	86.0	93.4
SmolVLA	0.45B	90.0	96.0	92.0	71.0	87.3	74.1
ReactVLA (Ours)	0.39B	93.0	95.0	92.0	72.0	88.0	18.3
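
The averages and latency ratios implied by the table can be recomputed with a few lines of Python. The numbers are copied from three rows of the table. The dictionary layout and helper names are illustrative, and the per-suite values are assumed to be success rates in percent.

```python
# Per-suite values and latency (ms) copied from the LIBERO table.
libero = {
    "SmolVLA": ([90.0, 96.0, 92.0, 71.0], 74.1),
    "pi0": ([90.0, 86.0, 95.0, 73.0], 93.4),
    "ReactVLA": ([93.0, 95.0, 92.0, 72.0], 18.3),
}


def average(values):
    """Unweighted mean over the four LIBERO suites."""
    return sum(values) / len(values)


def speedup(baseline_ms, ours_ms):
    """How many times faster `ours_ms` is than `baseline_ms`."""
    return baseline_ms / ours_ms


react_scores, react_ms = libero["ReactVLA"]
for name, (scores, latency_ms) in libero.items():
    print(f"{name}: avg {average(scores):.2f}, "
          f"{speedup(latency_ms, react_ms):.2f}x the ReactVLA latency")
```

The ratio against SmolVLA comes out slightly above 4, which matches the speedup quoted in the abstract.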

RoboIMI Simulation

Method	Peg-in-Socket Reward	Object Transfer Reward	Latency (ms)
ACT	289.60	115.64	4.5
Diffusion Policy	1019.39	319.20	398.0
ReactVLA (Ours)	1513.56	526.22	15.1
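
As a quick check of the relative gains, the sketch below divides the ReactVLA rewards by the Diffusion Policy rewards from the table. The variable names are illustrative, and reading the abstract's 1.65× figure as the object-transfer ratio is an inference from these numbers rather than something the page states.

```python
# Rewards and latency (ms) copied from the RoboIMI table.
roboimi = {
    "ACT": {"peg_in_socket": 289.60, "object_transfer": 115.64, "latency_ms": 4.5},
    "Diffusion Policy": {"peg_in_socket": 1019.39, "object_transfer": 319.20, "latency_ms": 398.0},
    "ReactVLA": {"peg_in_socket": 1513.56, "object_transfer": 526.22, "latency_ms": 15.1},
}

ours = roboimi["ReactVLA"]
baseline = roboimi["Diffusion Policy"]

# Reward ratio per task: values above 1 mean ReactVLA scores higher.
reward_ratio = {
    task: ours[task] / baseline[task]
    for task in ("peg_in_socket", "object_transfer")
}
# Latency ratio: values above 1 mean ReactVLA is faster.
latency_ratio = baseline["latency_ms"] / ours["latency_ms"]

for task, ratio in reward_ratio.items():
    print(f"{task}: {ratio:.2f}x Diffusion Policy reward")
print(f"latency: {latency_ratio:.1f}x faster than Diffusion Policy")
```

ACT remains the fastest method in the table at 4.5 ms, but its rewards are well below both other methods, so the comparison that matters here is reward at a latency that still permits reactive control.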

Real-World Experiments (Diana 7 Robot)

Method	Orange P&P (Success)	Block Stack (Success)	Latency (ms)
SmolVLA	95% (19/20)	75% (15/20)	82.2
ReactVLA (Ours)	95% (19/20)	90% (18/20)	38.6
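
To relate these latencies to closed-loop control, the sketch below converts each latency into an upper bound on how often the policy can be queried. It assumes that inference is the only per-step cost and that the policy is queried once per control step. Both are simplifications, since sensing, communication, and action chunking also affect the real control rate.

```python
def max_control_rate_hz(latency_ms):
    """Upper bound on policy queries per second if inference is the only cost."""
    return 1000.0 / latency_ms


# Block Stack (successes, trials) and latency (ms) copied from the table.
real_world = {
    "SmolVLA": {"block_stack": (15, 20), "latency_ms": 82.2},
    "ReactVLA": {"block_stack": (18, 20), "latency_ms": 38.6},
}

for name, row in real_world.items():
    successes, trials = row["block_stack"]
    print(f"{name}: {100 * successes / trials:.0f}% block stack, "
          f"up to {max_control_rate_hz(row['latency_ms']):.1f} Hz")
```

Under these assumptions ReactVLA can be queried roughly twice as often as SmolVLA. The Block Stack gap corresponds to three more successes out of 20 trials.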

Qualitative Results

Qualitative rollout comparison on LIBERO benchmark.

LIBERO Benchmark. ReactVLA generates smoother trajectories than SmolVLA and finishes tasks sooner when both policies run for the same wall-clock time.

Qualitative rollout comparison on RoboIMI dual-arm tasks.

RoboIMI Dual-Arm Tasks. ReactVLA coordinates both robotic arms more smoothly than Diffusion Policy during peg-in-socket insertion and block handover.

Real-world robot experiments on Diana 7.

Real-World Deployment. Orange Pick-and-Place and Block Stack tasks on the Diana 7 robotic arm with real-time reactive control.

BibTeX

@article{guo2025reactvla,
  title={ReactVLA: Fast and Lightweight Reactive Robot Manipulation via Improved Mean-Flow Action Generation},
  author={Guo, Yanzhao and Chen, Wenkai and Zhang, Jianwei},
  journal={arXiv preprint},
  year={2025}
}