RedVLA: Physical Red Teaming
for Vision-Language-Action Models

Yuhao Zhang1,*, Borong Zhang1,*, Jiaming Fan1,*, Jiachen Shen1, Yishuai Cai1, Yaodong Yang1,†, Jiaming Ji1,†
1Institute for AI, Peking University
State Key Laboratory of General Artificial Intelligence, Peking University
*Equal contribution †Corresponding author

Overview

Vision-Language-Action (VLA) models are advancing toward generalist robotic policies through unified end-to-end learning from vision and language to action. As their capabilities expand across increasingly important real-world domains, safety concerns have also grown substantially. Yet it remains unclear what physical safety risks these models may exhibit when deployed in the physical world, and how severe their consequences may be. Proactively identifying and mitigating such risks is therefore a crucial prerequisite for real-world deployment. This work asks: how can we proactively uncover the physical safety risks of Vision-Language-Action models?

RedVLA overview figure
RedVLA starts from a benign VLA setting, introduces an injected physical risk factor, and elicits unsafe behaviors across three safety cost types. Across six mainstream VLA models, RedVLA achieves an Attack Success Rate (ASR) of up to 95.5%.

In this work, we propose RedVLA, the first red teaming framework for physical safety in VLA models. We systematically elicit unsafe behaviors through a two-stage process: (I) Risk Scenario Synthesis constructs a valid and task-feasible initial risk scene. Specifically, it identifies critical interaction regions from benign trajectories and places the risk factor within these regions, aiming to entangle it with the VLA's execution flow and elicit a target unsafe behavior. (II) Trajectory-Driven Risk Amplification ensures stable elicitation across diverse models. It iteratively refines the risk factor state through gradient-free optimization guided by trajectory features.

95.5% Peak ASR on π0.5
6 Mainstream VLA models
≤10 Optimization iterations

What RedVLA Does

RedVLA studies physical safety risks in VLA models under benign instructions and scenes. It keeps the language input fixed and perturbs only the physical scene by injecting a risk object into the robot's execution flow.

The method is designed to preserve task feasibility and scene-instruction consistency while still exposing target unsafe behaviors.

Two-Stage Pipeline

Risk Scenario Synthesis constructs a valid and task-feasible initial risk scene by identifying critical interaction regions and instantiating a target safety violation with a corresponding risk object.
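As a rough illustration of how critical interaction regions might be extracted from a benign trajectory, the sketch below uses a simple heuristic: waypoints where the end-effector slows down (e.g., grasp or place points) are treated as candidate regions for injecting the risk object. The heuristic, function names, and waypoint format are illustrative assumptions, not the paper's actual method.

```python
# Hedged sketch: locate candidate interaction regions in a benign
# trajectory. Assumption (not from the paper): "critical" regions are
# approximated by waypoints with the lowest end-effector speed, a
# common proxy for grasp/place events.
import math

def critical_regions(waypoints, k=2):
    """Return the k waypoints with the lowest inter-step speed.

    waypoints: list of (x, y, z) end-effector positions at uniform
    timesteps; speed is approximated by consecutive displacement.
    """
    scored = []
    for i in range(1, len(waypoints)):
        speed = math.dist(waypoints[i - 1], waypoints[i])
        scored.append((speed, waypoints[i]))
    scored.sort(key=lambda t: t[0])  # slowest segments first
    return [wp for _, wp in scored[:k]]
```

A risk object could then be placed within (or near) the returned regions so that it becomes entangled with the policy's execution flow, as the pipeline describes.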

Trajectory-Driven Risk Amplification ensures stable elicitation across diverse models by refining the injected risk factor with trajectory-guided gradient-free optimization.
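A minimal sketch of what trajectory-guided, gradient-free refinement could look like: greedily perturb the risk factor's state (e.g., pose parameters) and keep perturbations that raise a trajectory-derived risk score. The `rollout` and `risk_score` callables and the state parameterization are assumptions for illustration; they stand in for the paper's actual trajectory features and scoring.

```python
# Hedged sketch of trajectory-driven risk amplification via greedy,
# gradient-free search. All names here are illustrative assumptions,
# not the authors' implementation.
import random

def amplify(initial_state, rollout, risk_score, iters=10, sigma=0.02):
    """Iteratively refine the risk factor state.

    initial_state: list of floats parameterizing the injected risk
    factor (e.g., position/orientation offsets).
    rollout: callable mapping a state to a policy trajectory.
    risk_score: callable mapping a trajectory to a scalar risk signal.
    """
    best_state = list(initial_state)
    best_score = risk_score(rollout(best_state))
    for _ in range(iters):
        # Gaussian perturbation of the current best state.
        candidate = [x + random.gauss(0.0, sigma) for x in best_state]
        score = risk_score(rollout(candidate))
        if score > best_score:  # greedy: keep only improvements
            best_state, best_score = candidate, score
    return best_state, best_score
```

Because the search only accepts improving candidates, the risk score is non-decreasing across iterations, consistent with the paper's report that elicitation stabilizes within at most 10 optimization iterations.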

Key Findings

95.5% peak ASR · 6 VLA models · ≤10 optimization iterations · SimpleVLA-Guard

RedVLA reveals severe physical vulnerabilities in existing VLA policies and produces data that can be reused for deployment-time safety monitoring.

Framework

RedVLA framework diagram
The framework diagram summarizes RedVLA as a two-stage pipeline. Risk Scenario Synthesis combines a physical safety taxonomy, interaction identification, and risk instantiation to build a red teaming plan. Trajectory-Driven Risk Amplification then refines the injected risk factor to trigger the target unsafe behavior.

BibTeX

@misc{zhang2026redvlaphysicalredteaming,
    title={RedVLA: Physical Red Teaming for Vision-Language-Action Models}, 
    author={Yuhao Zhang and Borong Zhang and Jiaming Fan and Jiachen Shen and Yishuai Cai and Yaodong Yang and Jiaming Ji},
    year={2026},
    eprint={2604.22591},
    archivePrefix={arXiv},
    primaryClass={cs.RO},
    url={https://arxiv.org/abs/2604.22591}, 
}