VLSA: Vision-Language-Action Models with Plug-and-Play Safety Constraint Layer

1Department of Automation, Tsinghua University
2TetraBOT
3DAMO Academy, Alibaba Group
4Institute for Embodied Intelligence and Robotics, Tsinghua University
*Equal Contribution Corresponding Author
Method Overview

Figure 1: Functional architecture of VLA and VLSA models.

Overview

Vision-Language-Action (VLA) models have demonstrated remarkable capabilities in generalizing across diverse robotic manipulation tasks. However, deploying these models in unstructured environments remains challenging due to the critical need for simultaneous task compliance and safety assurance, particularly in preventing potential collisions during physical interactions.

Figure 2: Workflow of the AEGIS model.

In this work, we introduce a Vision-Language-Safe Action (VLSA) architecture, named AEGIS, which contains a plug-and-play safety constraint (SC) layer formulated via control barrier functions. AEGIS integrates directly with existing VLA models to improve safety with theoretical guarantees while maintaining their original instruction-following performance. To evaluate the efficacy of our architecture, we construct a comprehensive safety-critical benchmark, SafeLIBERO. Extensive experiments demonstrate that AEGIS achieves:

Collision Avoidance Rate
Translational Only: 77.9% (vs. Baseline 18.7%)
Full Action Space: 68.9% (vs. Baseline 18.1%)
Task Success Rate
Translational Only: 68.1% (vs. Baseline 50.9%)
Full Action Space: 67.5% (vs. Baseline 57.8%)
Execution Time Steps
Translational Only: 262.3 (vs. Baseline 278.2)
Full Action Space: 257.6 (vs. Baseline 260.8)
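
The SC layer is described above as being formulated via control barrier functions (CBFs). As a rough illustration of that general technique, and not of the AEGIS implementation itself, the sketch below filters a nominal velocity command through a single CBF constraint. It assumes single-integrator end-effector dynamics (position p, velocity command u), one spherical obstacle, and a linear gain `alpha`; the function name `cbf_safety_filter` and all of its parameters are hypothetical. With the barrier h(p) = ||p - c||^2 - r^2, the constraint grad h(p) . u + alpha * h(p) >= 0 is affine in u, so the minimum-deviation correction has a closed form.

```python
import numpy as np


def cbf_safety_filter(u_nom, p, center, radius, alpha=1.0):
    """Minimally modify u_nom so one CBF constraint holds (illustrative sketch).

    Assumptions (not taken from the source): single-integrator dynamics,
    a spherical obstacle, and a linear class-K function alpha * h.
    """
    u_nom = np.asarray(u_nom, dtype=float)
    d = np.asarray(p, dtype=float) - np.asarray(center, dtype=float)

    h = d @ d - radius ** 2   # barrier value; h >= 0 means outside the sphere
    a = 2.0 * d               # gradient of h with respect to p
    b = -alpha * h            # constraint reads: a @ u >= b

    slack = a @ u_nom - b
    if slack >= 0.0:
        return u_nom          # nominal command already satisfies the constraint

    norm_sq = a @ a
    if norm_sq < 1e-12:
        # Degenerate case at the obstacle center: no escape direction is defined.
        return np.zeros_like(u_nom)

    # Project onto the constraint boundary (smallest change to u_nom).
    return u_nom - (slack / norm_sq) * a


# A command heading straight at the obstacle is slowed, not reversed.
u_safe = cbf_safety_filter([-1.0, 0.0, 0.0], p=[0.5, 0.0, 0.0],
                           center=[0.0, 0.0, 0.0], radius=0.2)
```

In this example the approach speed is reduced from 1.0 to about 0.21, while a command moving away from the obstacle passes through unchanged. A layer of this kind only needs the nominal action and the obstacle geometry, which is what makes a plug-and-play design possible in principle.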

We summarize the main contributions as follows:

Simulation Demonstrations

We compare OpenVLA-OFT, $\pi_{0.5}$, and AEGIS (ours) across 32 scenarios on our SafeLIBERO benchmark.

Note: To provide a comprehensive evaluation, we conduct experiments in both translational-only and full action space settings. We specifically include the translational setting given that SafeLIBERO tasks primarily involve top-down manipulation. This setting reduces action redundancy, allowing for a focused evaluation of positional collision avoidance capabilities.
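
One plausible reading of the translational-only setting is sketched below. It is an assumption for illustration, not the authors' confirmed procedure: the action is taken to be a 7-D vector laid out as [dx, dy, dz, droll, dpitch, dyaw, gripper], and the rotational components are simply zeroed. The helper name `to_translational_only` and the index layout are hypothetical.

```python
import numpy as np


def to_translational_only(action):
    """Zero the rotational part of an assumed 7-D action.

    Assumed layout (hypothetical): [dx, dy, dz, droll, dpitch, dyaw, gripper].
    """
    a = np.asarray(action, dtype=float).copy()
    a[3:6] = 0.0  # drop rotation; keep translation and gripper command
    return a


masked = to_translational_only([0.1, -0.2, 0.3, 0.4, 0.5, 0.6, 1.0])
```

Under this reading, only the three positional components can move the end-effector, which matches the stated goal of isolating positional collision avoidance.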

SafeLIBERO-Spatial
Task 1: Pick up the black bowl between the plate and the ramekin and place it on the plate
Task 2: Pick up the black bowl on the stove and place it on the plate
SafeLIBERO-Goal
Task 1: Put the bowl on top of the cabinet
Task 2: Put the bowl on the plate
SafeLIBERO-Object
Task 1: Pick up the orange juice and place it in the basket
Task 2: Pick up the milk and place it in the basket
SafeLIBERO-Long
Task 1: Put the white mug on the left plate and put the yellow and white mug on the right plate
Task 2: Put both the alphabet soup and the cream cheese box in the basket

Real-World Demonstrations

In our real-world experiments, we employ pi05-DROID as the base VLA policy. The robotic platform and evaluation tasks are illustrated in the following figure.

Robot Platform

The platform consists of a 7-DoF Franka Emika Panda arm operated in joint velocity control mode and equipped with a Robotiq 2F-85 gripper. Perception is provided by an external ZED 2 stereo camera and a wrist-mounted ZED Mini stereo camera. The VLA policy runs inference at 15 Hz, while the low-level controller runs at 1 kHz. The experiments cover two distinct tasks, each evaluated at two levels with varying obstacles.
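
To make the two rates concrete, the sketch below shows how many 1 kHz control ticks fall under each 15 Hz policy output if the controller simply holds the most recent command (a zero-order hold). The hold strategy is an assumption for illustration; the source does not say how commands are bridged between policy steps, and the function name `zero_order_hold_schedule` is hypothetical.

```python
def zero_order_hold_schedule(policy_hz=15, control_hz=1000, duration_s=1.0):
    """For each control tick, return the index of the policy output being held.

    Assumes a zero-order hold between policy steps (not stated in the source).
    """
    n_ticks = int(round(duration_s * control_hz))
    # Integer arithmetic avoids floating-point drift in the tick-to-step mapping.
    return [(tick * policy_hz) // control_hz for tick in range(n_ticks)]


schedule = zero_order_hold_schedule()
ticks_on_first_command = schedule.count(0)
```

Over one second this gives 15 distinct policy outputs, each held for 66 or 67 control ticks.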

Baseline

Task: Put the cup in the bowl. Panels: (a) Level I, (b) Level II
Task: Put the apple in the basket. Panels: (c) Level I, (d) Level II
Each panel is labeled with Safety and Success indicators.

Ours (AEGIS)

Task: Put the cup in the bowl. Panels: (a) Level I, (b) Level II
Task: Put the apple in the basket. Panels: (c) Level I, (d) Level II
Each panel is labeled with Safety and Success indicators.

BibTeX

@misc{hu2025vlsavisionlanguageactionmodelsplugandplay,
      title={VLSA: Vision-Language-Action Models with Plug-and-Play Safety Constraint Layer},
      author={Songqiao Hu and Zeyi Liu and Shuang Liu and Jun Cen and Zihan Meng and Xiao He},
      year={2025},
      eprint={2512.11891},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2512.11891},
}