VLSA: Vision-Language-Action Models with
Plug-and-Play Safety Constraint Layer

1Department of Automation, Tsinghua University
2TetraBOT
3DAMO Academy, Alibaba Group
4Institute for Embodied Intelligence and Robotics, Tsinghua University
*Equal Contribution · Corresponding Author

Figure 1: Functional architecture of VLA and VLSA models.

Overview

Vision-Language-Action (VLA) models have demonstrated remarkable generalization across diverse robotic manipulation tasks. However, deploying these models in unstructured environments remains challenging: they must comply with the task instruction while simultaneously assuring safety, in particular avoiding collisions during physical interaction.

Method Overview

Figure 2: Workflow of the AEGIS model.

In this work, we introduce a Vision-Language-Safe Action (VLSA) architecture, named AEGIS, which adds a plug-and-play safety constraint (SC) layer formulated via control barrier functions (CBFs). AEGIS integrates directly with existing VLA models to improve safety with theoretical guarantees while preserving their original instruction-following performance (a minimal sketch of such a CBF filter follows the results table below). To evaluate the efficacy of our architecture, we construct SafeLIBERO, a comprehensive safety-critical benchmark. Extensive experiments demonstrate that AEGIS achieves:

Metric                      AEGIS (Ours)    Baseline
Collision Avoidance Rate    77.85%          18.69%
Task Success Rate           68.13%          50.88%
Execution Time Steps        262.30          278.24
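
To make the SC layer concrete, below is a minimal sketch of a CBF-based safety filter, assuming a control-affine system and a single distance-type barrier; the function and variable names are illustrative assumptions, and the paper's actual SC layer may be formulated differently. Given the nominal action u_nom produced by the VLA model, the filter solves the quadratic program min_u ||u - u_nom||^2 subject to grad_h(x) @ (f(x) + g(x) u) >= -alpha * h(x), which for a single constraint admits a closed-form solution.

import numpy as np

def cbf_safety_filter(u_nom, h, grad_h, f, g, alpha=1.0):
    """Project a nominal VLA action onto the CBF-safe set.

    Enforces  d/dt h(x) >= -alpha * h(x)  for control-affine dynamics
    x_dot = f(x) + g(x) u, i.e.  grad_h @ (f + g @ u) + alpha * h >= 0.
    Sketch only: assumes a single scalar barrier constraint.
    """
    a = grad_h @ g                    # constraint row: a @ u + b >= 0
    b = grad_h @ f + alpha * h
    slack = a @ u_nom + b             # constraint value at the nominal action
    if slack >= 0.0:                  # nominal action already safe: pass through
        return u_nom
    # Otherwise apply the minimal correction along the constraint normal
    return u_nom - (slack / (a @ a)) * a

# Example: 2-D single integrator (x_dot = u) avoiding a ball around the origin
x = np.array([0.6, 0.1])
r_safe = 0.5
h = x @ x - r_safe**2                 # barrier: h >= 0 outside the unsafe ball
grad_h = 2.0 * x
u_nom = np.array([-1.0, 0.0])         # nominal action heads toward the obstacle
u_safe = cbf_safety_filter(u_nom, h, grad_h, f=np.zeros(2), g=np.eye(2))
print(u_safe)                         # corrected action that skirts the obstacle

When the nominal action already satisfies the barrier condition, the filter is the identity, which is why the original instruction-following behavior is preserved away from unsafe regions.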

We summarize the main contributions as follows:

- A Vision-Language-Safe Action (VLSA) architecture, AEGIS, whose plug-and-play SC layer formulated via control barrier functions improves safety with theoretical guarantees while preserving the instruction-following performance of the underlying VLA model.
- SafeLIBERO, a comprehensive safety-critical benchmark for evaluating collision avoidance alongside task success.
- Extensive experiments demonstrating that AEGIS substantially improves collision avoidance and task success over strong VLA baselines.
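
Because the layer is plug-and-play, it can wrap any pretrained VLA policy without retraining. The hypothetical wiring below reuses the cbf_safety_filter sketch above; the class and method names are assumptions for illustration, not the paper's actual API.

class SafeActionLayer:
    """Wraps an existing VLA policy with a safety filter, no retraining."""

    def __init__(self, policy, barrier_fn, safety_filter, alpha=1.0):
        self.policy = policy              # frozen, pretrained VLA model
        self.barrier_fn = barrier_fn      # maps state -> (h, grad_h, f, g)
        self.safety_filter = safety_filter
        self.alpha = alpha

    def act(self, observation, instruction, state):
        u_nom = self.policy.predict(observation, instruction)  # original VLA action
        h, grad_h, f, g = self.barrier_fn(state)
        return self.safety_filter(u_nom, h, grad_h, f, g, self.alpha)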

Demonstrations

We compare OpenVLA-OFT, $\pi_{0.5}$, and our method (AEGIS) across 32 scenarios on our constructed benchmark.

SafeLIBERO-Spatial
Task 1: Pick up the black bowl between the plate and the ramekin and place it on the plate
Task 2: Pick up the black bowl on the stove and place it on the plate
SafeLIBERO-Goal
Task 1: Put the bowl on top of the cabinet
Task 2: Put the bowl on the plate
SafeLIBERO-Object
Task 1: Pick up the orange juice and place it in the basket
Task 2: Pick up the milk and place it in the basket
SafeLIBERO-Long
Task 1: Put the white mug on the left plate and put the yellow and white mug on the right plate
Task 2: Put both the alphabet soup and the cream cheese box in the basket

BibTeX

@misc{hu2025vlsavisionlanguageactionmodelsplugandplay,
      title={VLSA: Vision-Language-Action Models with Plug-and-Play Safety Constraint Layer},
      author={Songqiao Hu and Zeyi Liu and Shuang Liu and Jun Cen and Zihan Meng and Xiao He},
      year={2025},
      eprint={2512.11891},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2512.11891},
}