SABER: A Stealthy Agentic Black-Box Attack Framework for Vision-Language-Action Models

1University of Maryland, College Park, 2University of Southern California, 3University of Central Florida
SABER teaser figure

SABER applies stealthy, bounded instruction perturbations through a ReAct-style tool-calling agent to induce targeted behavioral degradation in VLA-driven robots, including task failure, action inflation, and constraint violations.

Abstract

Vision-language-action (VLA) models enable robots to follow natural-language instructions grounded in visual observations, but the instruction channel also introduces a critical vulnerability: small textual perturbations can alter downstream robot behavior. Systematic robustness evaluation therefore requires a black-box attacker that can generate minimal yet effective instruction edits across diverse VLA models.

To this end, we present SABER, an agent-centric approach for automatically generating instruction-based adversarial attacks on VLA models under bounded edit budgets. SABER uses a GRPO-trained ReAct attacker that composes character-, token-, and prompt-level tools to produce small, plausible adversarial instruction edits within the budget, inducing targeted behavioral degradation: task failure, unnecessarily long execution, and increased constraint violations.

On the LIBERO benchmark across six state-of-the-art VLA models, SABER reduces task success by 20.6%, increases action-sequence length by 55%, and raises constraint violations by 33%, while requiring 21.1% fewer tool calls and 54.7% fewer character edits than strong GPT-based baselines. These results show that small, plausible instruction edits are sufficient to substantially degrade robot execution, and that an agentic black-box pipeline offers a practical, scalable, and adaptive approach for red-teaming robotic foundation models.

Method Overview

For each LIBERO task, we maintain two contrastive rollouts under a frozen target VLA. A clean baseline rollout is first executed and cached as a reference. For the attack rollout, the instruction is passed to a red-team agent, which uses an LLM backbone to reason over the instruction and the available tools, then performs multi-turn FIND→APPLY edits in a ReAct-style loop. The perturbation toolbox returns edited instructions based on the selected target positions and their local context. The target VLA then executes the perturbed instruction to produce the attack rollout. The reward function compares the clean and attack rollouts, together with the agent's tool-use traces, and computes rewards from task outcome, action inflation, constraint violations, and stealth signals.
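The clean-vs-attack reward comparison can be sketched as follows. This is a minimal illustration, not the paper's implementation: `Rollout`, `attack_reward`, the character budget, and the weighting are all assumed stand-ins for the actual reward terms.

```python
from dataclasses import dataclass

@dataclass
class Rollout:
    success: bool       # task outcome under the frozen target VLA
    num_actions: int    # length of the executed action sequence
    violations: float   # accumulated constraint-violation score

def attack_reward(clean: Rollout, attack: Rollout,
                  chars_edited: int, char_budget: int = 20,
                  objective: str = "task_failure") -> float:
    """Contrast the cached clean rollout with the attack rollout,
    then subtract a stealth penalty on oversized edits (all terms
    and weights here are illustrative assumptions)."""
    if objective == "task_failure":
        # reward flipping a clean success into a failure
        behavior = float(clean.success and not attack.success)
    elif objective == "action_inflation":
        # reward inflating the action sequence relative to baseline
        behavior = max(0.0, attack.num_actions / max(clean.num_actions, 1) - 1.0)
    else:  # constraint_violation
        behavior = max(0.0, attack.violations - clean.violations)
    # stealth signal: only edits beyond the bounded budget are penalized
    stealth = -max(0, chars_edited - char_budget) / char_budget
    return behavior + stealth
```

In this sketch the stealth term is zero while the edit stays within budget, so small plausible edits are never punished; only the behavioral contrast drives the reward.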

SABER pipeline overview

Perturbation Tool Families

SABER models adversarial instruction generation as multi-turn tool use over three complementary perturbation families, each following a two-stage FIND→APPLY protocol:

 Token-Level

Edit words or subwords. FIND returns a tokenized sequence and a brief prompt for selecting the target token and edit type (replace, remove, add, or attribute swap); APPLY performs the edit using token index(es) and replacement text.
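A minimal sketch of the two-stage protocol for this family. Whitespace tokenization and the helper names are assumptions; the real toolbox uses the agent's tokenizer and richer edit types such as attribute swaps.

```python
def find_tokens(instruction: str) -> list[str]:
    """FIND stage: return the tokenized instruction so the agent
    can choose a target index and an edit type."""
    return instruction.split()

def apply_token_edit(instruction: str, index: int, edit: str,
                     replacement: str = "") -> str:
    """APPLY stage: perform the chosen edit at the given token index."""
    tokens = instruction.split()
    if edit == "replace":
        tokens[index] = replacement
    elif edit == "remove":
        del tokens[index]
    elif edit == "add":
        tokens.insert(index, replacement)
    return " ".join(tokens)
```

For example, an attribute swap on "pick up the black bowl" replaces the token at index 3 with "white", yielding a plausible but behavior-altering instruction.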

 Character-Level

Apply typo-style edits within a word (insertion, deletion, substitution, transposition, case flip). Captures subword and OCR-like perturbations, e.g., pick → plck or mug → rnug.
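The five typo-style edits can be sketched as a single helper; the function name and argument layout are assumptions for illustration.

```python
def apply_char_edit(word: str, edit: str, pos: int, ch: str = "") -> str:
    """Typo-style perturbation within a single word (sketch)."""
    if edit == "insert":
        return word[:pos] + ch + word[pos:]
    if edit == "delete":
        return word[:pos] + word[pos + 1:]
    if edit == "substitute":
        # multi-character ch covers OCR-like confusions, e.g. m -> rn
        return word[:pos] + ch + word[pos + 1:]
    if edit == "transpose":
        return word[:pos] + word[pos + 1] + word[pos] + word[pos + 2:]
    if edit == "case_flip":
        return word[:pos] + word[pos].swapcase() + word[pos + 1:]
    return word
```

Both examples from the text are substitutions: "pick" → "plck" swaps one letter, while "mug" → "rnug" replaces "m" with the visually similar "rn".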

 Prompt-Level

Inject clauses or sentences, such as verification wraps, decomposition steps, uncertainty clauses, extra constraints, or objective injections. APPLY inserts the clause under a maximum added-token budget.
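A minimal sketch of budget-checked clause insertion; the helper name and whitespace token counting are assumptions, and the real APPLY stage likely supports more insertion positions.

```python
def apply_prompt_injection(instruction: str, clause: str,
                           position: str = "suffix",
                           max_added_tokens: int = 12) -> str:
    """Insert a clause (verification wrap, extra constraint, ...)
    only if it fits within the maximum added-token budget."""
    if len(clause.split()) > max_added_tokens:
        return instruction  # over budget: leave the instruction unchanged
    if position == "prefix":
        return f"{clause} {instruction}"
    return f"{instruction} {clause}"
```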

Two-Stage Training Procedure

We cold-start by caching clean baseline rollouts from target VLAs and collecting initial attack trajectories with a frozen red-team agent via lightweight random exploration over tool-calling chains. These rollouts form the cold-start dataset for SFT before GRPO training. We then perform agentic RL in interactive scenarios, where the red-team agent attacks target VLAs through tool calling and learns from reward feedback computed by comparing clean and attack rollouts, together with the agent's tool-use traces, under different attack objectives.
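The cold-start exploration over tool-calling chains might look like the following sketch; the tool names and sampling scheme are hypothetical, chosen only to show how random FIND→APPLY chains could seed the SFT dataset.

```python
import random

# hypothetical tool names for the three perturbation families
TOOLS = ["char_find", "char_apply", "token_find", "token_apply",
         "prompt_find", "prompt_apply"]

def random_tool_chain(max_calls: int = 4, seed: int = 0) -> list[str]:
    """Lightweight random exploration: sample a short chain of
    two-stage FIND->APPLY calls; chains whose rollouts degrade the
    target VLA would be kept as cold-start SFT trajectories."""
    rng = random.Random(seed)
    chain = []
    for _ in range(rng.randint(1, max_calls)):
        family = rng.choice(["char", "token", "prompt"])
        chain += [f"{family}_find", f"{family}_apply"]  # two-stage protocol
    return chain
```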

SABER two-stage training

Pretrained Checkpoints

We release the GRPO-trained LoRA adapters for all three attack objectives on HuggingFace. Each is a LoRA adapter (rank 8, ~75 MB) loadable with peft.

| Objective | HuggingFace | Base Model |
|---|---|---|
| task_failure | IntelligenceLab/saber-attack-agent-task-failure | Qwen/Qwen2.5-3B-Instruct |
| action_inflation | IntelligenceLab/saber-attack-agent-action-inflation | Qwen/Qwen2.5-3B-Instruct |
| constraint_violation | IntelligenceLab/saber-attack-agent-constraint-violation | Qwen/Qwen2.5-3B-Instruct |
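Loading one of the released adapters with peft might look like this sketch. The repo IDs and base model come from the table above; the helper names are ours, and `load_attacker` keeps its heavy imports local so the mapping helper stays dependency-free (calling it downloads the ~3B base model).

```python
def adapter_repo(objective: str) -> str:
    """Map an attack objective to its released LoRA adapter repo."""
    repos = {
        "task_failure": "IntelligenceLab/saber-attack-agent-task-failure",
        "action_inflation": "IntelligenceLab/saber-attack-agent-action-inflation",
        "constraint_violation": "IntelligenceLab/saber-attack-agent-constraint-violation",
    }
    return repos[objective]

def load_attacker(objective: str):
    """Attach the rank-8 LoRA adapter to the base model with peft."""
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import PeftModel
    base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-3B-Instruct")
    tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B-Instruct")
    model = PeftModel.from_pretrained(base, adapter_repo(objective))
    return model, tokenizer
```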

Key Highlights

20.6%

Task Success Reduction

55%

Action Sequence Inflation

33%

Constraint Violation Increase

54.7%

Fewer Character Edits

Evaluated on the LIBERO manipulation benchmark across six state-of-the-art VLA models: π0-LIBERO, π0.5, GR00T-N1.5, X-VLA, InternVLA-M1, and DeepThinkVLA. SABER achieves stronger behavior-level attacks at lower cost than GPT-based baselines, requiring 21.1% fewer tool calls and 54.7% fewer character edits.

Experimental Results

Task Failure (ASR)

Attack success rate for task failure, computed as Base TER − Attack TER (%).

Each cell reports Base TER↑ / Atk TER↓ / ASR↑ (%).

| Victim VLA | LIBERO-Spatial | LIBERO-Object | LIBERO-Goal | LIBERO-Long | Overall |
|---|---|---|---|---|---|
| π0-LIBERO | 100.0 / 86.7 / 13.3 | 100.0 / 80.0 / 20.0 | 100.0 / 53.3 / 46.7 | 66.7 / 40.0 / 26.7 | 91.7 / 65.0 / 26.7 |
| π0.5 | 100.0 / 93.3 / 6.7 | 93.3 / 93.3 / 0.0 | 100.0 / 53.3 / 46.7 | 93.3 / 80.0 / 13.3 | 96.7 / 80.0 / 16.7 |
| GR00T-N1.5 | 100.0 / 93.3 / 6.7 | 100.0 / 100.0 / 0.0 | 100.0 / 53.3 / 46.7 | 93.3 / 86.7 / 6.6 | 98.3 / 83.3 / 15.0 |
| X-VLA | 93.3 / 80.0 / 13.3 | 93.3 / 73.3 / 20.0 | 100.0 / 66.7 / 33.3 | 60.0 / 46.7 / 13.3 | 86.7 / 66.7 / 20.0 |
| InternVLA-M1 | 93.3 / 86.7 / 6.6 | 100.0 / 93.3 / 6.7 | 100.0 / 46.7 / 53.3 | 86.7 / 73.3 / 13.4 | 95.0 / 75.0 / 20.0 |
| DeepThinkVLA | 86.7 / 80.0 / 6.7 | 100.0 / 93.3 / 6.7 | 100.0 / 33.3 / 66.7 | 93.3 / 73.3 / 20.0 | 95.0 / 70.0 / 25.0 |
| Average | 95.6 / 86.7 / 8.9 | 97.8 / 88.9 / 8.9 | 100.0 / 51.1 / 48.9 | 82.2 / 66.7 / 15.5 | 93.9 / 73.3 / 20.6 |

Action Inflation (AIR)

Action inflation ratio Δ|a| = |a_attack| / |a_base|.

Each cell reports Base |a| / Atk |a| / AIR↑.

| Victim VLA | LIBERO-Spatial | LIBERO-Object | LIBERO-Goal | LIBERO-Long | Overall |
|---|---|---|---|---|---|
| π0-LIBERO | 119.3 / 220.7 / 1.85 | 139.2 / 233.9 / 1.68 | 101.3 / 230.0 / 2.27 | 363.0 / 457.4 / 1.26 | 180.7 / 285.5 / 1.58 |
| π0.5 | 112.2 / 173.9 / 1.55 | 151.1 / 226.7 / 1.50 | 105.1 / 196.5 / 1.87 | 346.5 / 391.5 / 1.13 | 178.7 / 247.2 / 1.38 |
| GR00T-N1.5 | 133.7 / 514.7 / 3.85 | 129.7 / 220.5 / 1.70 | 98.3 / 378.5 / 3.85 | 343.5 / 346.9 / 1.01 | 176.3 / 365.2 / 2.07 |
| X-VLA | 157.8 / 189.4 / 1.20 | 189.7 / 187.8 / 0.99 | 126.5 / 261.9 / 2.07 | 431.3 / 504.6 / 1.17 | 226.3 / 285.9 / 1.26 |
| InternVLA-M1 | 114.3 / 192.0 / 1.68 | 143.8 / 204.2 / 1.42 | 95.1 / 255.8 / 2.69 | 320.9 / 327.3 / 1.02 | 168.5 / 244.8 / 1.45 |
| DeepThinkVLA | 125.0 / 197.5 / 1.58 | 137.4 / 186.9 / 1.36 | 98.1 / 255.1 / 2.60 | 326.3 / 421.9 / 1.29 | 171.7 / 265.4 / 1.55 |
| Average | 127.0 / 248.0 / 1.95 | 148.5 / 210.0 / 1.44 | 104.1 / 263.0 / 2.56 | 355.2 / 408.3 / 1.15 | 183.7 / 282.3 / 1.55 |

Constraint Violation (CVI)

Constraint violation inflation ΔCV = CV_attack / CV_base.

Each cell reports Base CV / Atk CV / CVI↑.

| Victim VLA | LIBERO-Spatial | LIBERO-Object | LIBERO-Goal | LIBERO-Long | Overall |
|---|---|---|---|---|---|
| π0-LIBERO | 550.7 / 326.2 / 0.59 | 711.5 / 838.3 / 1.18 | 309.6 / 624.5 / 2.02 | 1039.6 / 1269.1 / 1.22 | 652.9 / 764.5 / 1.17 |
| π0.5 | 570.9 / 549.8 / 0.96 | 681.4 / 699.0 / 1.03 | 260.3 / 439.1 / 1.69 | 863.9 / 1336.2 / 1.55 | 594.1 / 756.0 / 1.27 |
| GR00T-N1.5 | 599.9 / 702.7 / 1.17 | 644.1 / 595.7 / 0.92 | 258.9 / 862.3 / 3.33 | 918.8 / 778.1 / 0.85 | 605.4 / 734.7 / 1.21 |
| X-VLA | 838.3 / 1356.3 / 1.62 | 828.3 / 725.7 / 0.88 | 347.1 / 885.0 / 2.55 | 1145.3 / 1171.0 / 1.02 | 789.8 / 1034.5 / 1.31 |
| InternVLA-M1 | 475.9 / 130.3 / 0.27 | 639.3 / 493.0 / 0.77 | 232.9 / 495.3 / 2.13 | 681.9 / 1550.3 / 2.27 | 507.5 / 667.2 / 1.31 |
| DeepThinkVLA | 572.4 / 759.7 / 1.33 | 607.9 / 729.0 / 1.20 | 220.2 / 915.7 / 4.16 | 827.1 / 1509.7 / 1.83 | 556.9 / 978.5 / 1.76 |
| Average | 601.4 / 637.5 / 1.06 | 685.4 / 680.1 / 0.99 | 271.5 / 703.7 / 2.59 | 912.8 / 1269.1 / 1.39 | 617.8 / 822.6 / 1.33 |
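The three behavior-level metrics reduce to simple computations over paired clean/attack statistics; this sketch uses assumed helper names for the definitions given above.

```python
def asr(base_ter: float, atk_ter: float) -> float:
    """Attack success rate: drop in task execution rate, in points."""
    return base_ter - atk_ter

def air(base_len: float, atk_len: float) -> float:
    """Action inflation ratio: attack / clean action-sequence length."""
    return atk_len / base_len

def cvi(base_cv: float, atk_cv: float) -> float:
    """Constraint violation inflation: attack / clean violation score."""
    return atk_cv / base_cv
```

For instance, the overall ASR row follows directly from the averaged rates: asr(93.9, 73.3) gives the reported 20.6 points.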

Comparison and Ablation

Compared with a frozen GPT-5 mini attacker using the same tool-calling interface, SABER achieves consistent gains on objective-aligned metrics while being substantially more efficient and stealthy. RL training not only improves attack effectiveness but also discovers higher-leverage perturbation strategies.

SABER vs. GPT-5 mini

| Method | Tool Calls↓ | Char Edits↓ | ASR↑ | AIR↑ | CVI↑ |
|---|---|---|---|---|---|
| GPT-5 mini | 3.93 | 168.81 | 14.5 | 1.37 | 1.25 |
| SABER | 3.10 | 76.46 | 16.7 | 1.38 | 1.27 |

Cold-Start Ablation

| Training | Tool Calls↓ | Char Edits↓ | Base TER | Atk TER | ASR↑ |
|---|---|---|---|---|---|
| GRPO Only | 2.76 | 11.78 | 96.7 | 88.0 | 8.7 |
| SFT + GRPO | 3.10 | 76.46 | 96.7 | 80.0 | 16.7 |

Tool Usage by Objective

Mean tool calls and character edits per episode across LIBERO suites.

Each cell reports Calls / Edits.

| Objective | Spatial | Object | Goal | Long | Overall |
|---|---|---|---|---|---|
| Task Failure | 2.97 / 13.4 | 3.34 / 13.2 | 2.80 / 10.3 | 2.97 / 15.1 | 3.02 / 13.0 |
| Action Inflation | 2.98 / 126.7 | 3.26 / 114.0 | 3.36 / 130.3 | 3.05 / 117.3 | 3.16 / 122.1 |
| Constraint Violation | 3.34 / 89.3 | 2.34 / 65.0 | 1.67 / 50.0 | 2.34 / 51.7 | 2.42 / 64.0 |

Contributions

  1. Problem formulation: We identify the need for a general-purpose automated attacker for VLA systems and formulate instruction-only black-box attacks as a constrained optimization problem over robot behavioral objectives under bounded edit budgets.
  2. Agentic attack methodology: We propose SABER, where a single GRPO-trained ReAct agent adaptively composes character-, token-, and prompt-level perturbations without gradient access to the target model or model-specific redesign.
  3. Comprehensive evaluation: We evaluate on the LIBERO manipulation benchmark across six state-of-the-art VLA models and three attack objectives, showing average degradation of 20.6% in task success, a 55% increase in action-sequence length, and a 33% increase in constraint violations.
  4. Efficiency over baselines: Compared with GPT-5 mini baselines, SABER achieves stronger behavior-level attacks at lower cost, requiring 21.1% fewer tool calls and 54.7% fewer character edits.

BibTeX

@article{wu2026saber,
  title={SABER: A Stealthy Agentic Black-Box Attack Framework for Vision-Language-Action Models},
  author={Wu, Xiyang and Shi, Guangyao and Wang, Qingzi and Li, Zongxia and Bedi, Amrit Singh and Manocha, Dinesh},
  journal={arXiv preprint arXiv:2603.24935},
  year={2026}
}