LabVLA

Grounding Vision–Language–Action models in scientific laboratories

Baochang Ren¹, Xinjie Liu¹, Xi Chen¹, Yanshuo Liu¹, Chenxi Li², Daqi Gao¹, Zeqin Su¹, Jintao Xing¹, Zirui Xue¹, Rui Li², Xiangyu Zhao¹, Shuofei Qiao¹†, Minting Pan², Wangmeng Zuo³, Lei Bai², Dongzhan Zhou²†, Ningyu Zhang¹†, Huajun Chen¹

¹Zhejiang University ²Shanghai AI Laboratory ³Harbin Institute of Technology


Highlights

What sets LabVLA apart: data and policy

Building on the LabEmbodied-Data corpus together with broad real-robot pre-training data, we present LabVLA, a VLA pipeline for connecting written laboratory protocols to embodied robot execution in simulated scientific workspaces. LabVLA pairs protocol-conditioned data synthesis with FAST action-token pre-training and flow-matching post-training under a shared cross-embodiment schema.

LabVLA at a glance: external corpora warm-start the VLM, RoboGenesis synthesizes LabEmbodied-Data, and the policy is evaluated on four task families

01 · Data · RoboGenesis: A Programmable Workflow and Data Engine

We introduce RoboGenesis, a simulation-based workflow and data engine that links environment construction, configured workflow generation, domain randomization, and success-filtered export to produce laboratory demonstrations that existing robot corpora rarely cover. We use this engine to synthesize LabEmbodied-Data, a corpus of multi-camera observations, language instructions, robot states, action trajectories, and structured annotations under a shared cross-embodiment schema.
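As a rough illustration of what a shared episode record and a success-filtered export step could look like, here is a minimal sketch. The `Episode` class, its field names, and `export_successful` are hypothetical and are not taken from the RoboGenesis code; they only mirror the data streams listed above.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Episode:
    """One simulated demonstration; every field name here is illustrative."""
    embodiment: str                      # e.g. "Franka" or "UR5e"
    instruction: str                     # language instruction for the task
    camera_frames: Dict[str, List[str]]  # camera name -> frame identifiers
    robot_states: List[List[float]]      # one state vector per timestep
    actions: List[List[float]]           # one action vector per timestep
    annotations: Dict[str, str] = field(default_factory=dict)
    success: bool = False                # set by the simulator's task check


def export_successful(episodes: List[Episode]) -> List[Episode]:
    """Keep only non-empty episodes that succeeded and have aligned streams."""
    kept = []
    for ep in episodes:
        aligned = len(ep.robot_states) == len(ep.actions)
        if ep.success and aligned and ep.actions:
            kept.append(ep)
    return kept


demo = [
    Episode("Franka", "Pour the liquid into the beaker.",
            {"wrist": ["f0", "f1"]}, [[0.0], [0.1]], [[0.1], [0.0]],
            {"subtask": "pour"}, success=True),
    Episode("UR5e", "Open the door.",
            {"front": ["f0"]}, [[0.0]], [[0.2]], success=False),
]
exported = export_successful(demo)
print([ep.embodiment for ep in exported])
```

Using one record type for every robot is what lets single-arm and dual-arm demonstrations share a loader; only the length of the state and action vectors would differ.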

02 · Training · LabVLA Training Recipe

LabVLA adapts a Qwen3-VL backbone to map visual observations, robot state, and language instructions into continuous action chunks through a DiT action expert. The model is trained in two stages: FAST action tokens first align the visual-language prefix with action semantics during VLM pre-training, and flow matching then predicts continuous robot actions during post-training. A knowledge-insulation design reduces interference between language-grounded VLM representations and the continuous-action expert during post-training.


RoboGenesis

A Programmable Workflow and Data Engine

An end-to-end multi-arm data engine — spanning tasks, workflows, randomization, assets, and scene generation.

RoboGenesis covers 16 robot embodiments: 13 single-arm and 3 dual-arm.

Atomic Skills

Multi-task capabilities

A suite of atomic manipulation tasks can be demonstrated independently or composed into complete lab workflows.

Skill	Robot
Open Door	UR5e
Close Door	UR16e
Pick	Festo
Stir	UR5e
Place	Rizon 4
Heat Liquid	FR3
Shake	Franka
Pour	Franka

Long-horizon

Workflows & dual-arm showcase

Atomic skills compose into long-horizon lab procedures across dual-arm manipulation, mobile navigation, and multi-step liquid handling.
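The composition idea can be sketched as an ordered list of skills that share a state, where execution stops at the first failed step. This is a toy sketch: the skill functions, the `State` dictionary, and `run_workflow` are hypothetical names, not part of RoboGenesis.

```python
from typing import Callable, Dict, List, Tuple

State = Dict[str, bool]
Skill = Callable[[State], bool]  # returns True when the skill succeeds


def pick(state: State) -> bool:
    state["holding"] = True
    return True


def pour(state: State) -> bool:
    if not state.get("holding"):
        return False  # cannot pour without a grasped vessel
    state["poured"] = True
    return True


def stir(state: State) -> bool:
    if not state.get("poured"):
        return False  # nothing to stir yet
    state["stirred"] = True
    return True


def run_workflow(steps: List[Tuple[str, Skill]],
                 state: State) -> Tuple[bool, List[str]]:
    """Run skills in order and stop at the first failure."""
    completed: List[str] = []
    for name, skill in steps:
        if not skill(state):
            return False, completed
        completed.append(name)
    return True, completed


ok, done = run_workflow([("Pick", pick), ("Pour", pour), ("Stir", stir)], {})
bad, partial = run_workflow([("Pour", pour), ("Stir", stir)], {})
print(ok, done)
print(bad, partial)
```

The second call skips the pick step, so the pour precondition fails and nothing after it runs. Long-horizon workflows are harder than isolated skills for this reason: one early failure ends the whole procedure.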

Lift2

Split Aloha

Franka Mobile Navigation

Mobile

Franka 16-step Workflow

16-step

Robustness

Domain randomization

We randomize scene layout, visual clutter, camera viewpoint, object appearance, lighting, and spatial placement to improve robustness.
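A minimal sketch of per-episode sampling over these six axes follows. The parameter names and numeric ranges are placeholder assumptions chosen for illustration; the source does not give the actual RoboGenesis ranges.

```python
import random
from typing import Dict


def sample_randomization(rng: random.Random) -> Dict[str, object]:
    """Draw one scene configuration; all ranges are placeholder assumptions."""
    return {
        "layout_id": rng.randrange(4),               # scene layout variant
        "num_distractors": rng.randint(0, 5),        # visual clutter
        "camera_yaw_deg": rng.uniform(-15.0, 15.0),  # camera viewpoint
        "object_texture_id": rng.randrange(10),      # object appearance
        "light_intensity": rng.uniform(0.5, 1.5),    # lighting scale
        "object_xy_offset_m": (rng.uniform(-0.05, 0.05),
                               rng.uniform(-0.05, 0.05)),  # spatial placement
    }


rng = random.Random(0)  # fixed seed so an episode can be regenerated
configs = [sample_randomization(rng) for _ in range(3)]
for cfg in configs:
    print(cfg)
```

Seeding the generator per episode keeps each randomized scene reproducible, which helps when a failed rollout needs to be replayed.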

Reusable Assets

Asset generation

Generate reusable objects, containers, tools, and scene props for composing lab environments.


2947 total assets · 16 categories · 288 fine classes

Example generated assets: Fume Hoods 01 to 04, Ball Mills 01 to 04, Critical Point Dryers 01 to 04.

Asset category distribution

Category	Assets
Analytical Instruments	476
Measuring	326
Wet Chemistry Glassware	319
Synthesis	251
Furniture	234
Biochemistry	219
Heating	202
Centrifugation	178
Plastics	166
Electrochemical	161
Safety	117
Metal Supports	109
Gas Vacuum	79
Ceramics	67
Lab Supplies	29
Tubing Connectors	14

Top fine-grained asset classes

Class	Assets
lab benches	18
corrosive cabinets	18
chemical cabinets	18
fume hoods	18
flammable cabinets	18
wall shelves	17
balance tables	17
ion chromatography	17

Composable Scenes

Scene generation


From an empty room to a complete lab

The same viewpoint shows how room structure, furniture, equipment, assets, materials, safety cues, and task execution elements are added step by step.


Training Recipe

Two stages, one cross-embodiment policy

LabVLA first adapts a Qwen3-VL backbone with FAST action-token pre-training, then attaches a flow-matching DiT action expert. A stop-gradient between the two keeps language grounding intact.

LabVLA training recipe: grounded data pre-training with Qwen3-VL-4B-Instruct (VQA, language subtasks, FAST action tokens) and post-training with DiT action expert and knowledge insulation on LabEmbodied-Data

01 · VLM Pre-training

We first tokenize continuous actions with FAST and train the VLM with next-token supervision, so the visual-language prefix learns to predict action tokens before the DiT is attached. The DiT action expert is not instantiated in this stage.
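To make "tokenizing continuous actions" concrete, here is a simplified sketch of the frequency-space idea behind FAST, which its authors describe as a discrete cosine transform followed by byte-pair encoding. The sketch covers only the transform-and-quantize step for one action dimension and omits the byte-pair compression. It is not the FAST implementation, and the `scale` value is an arbitrary assumption.

```python
import math
from typing import List


def _weight(k: int, n: int) -> float:
    """Orthonormal DCT-II scaling factor for coefficient k."""
    return math.sqrt((1.0 if k == 0 else 2.0) / n)


def dct(x: List[float]) -> List[float]:
    """Orthonormal DCT-II of a 1-D signal."""
    n = len(x)
    return [
        _weight(k, n) * sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n)
                            for i in range(n))
        for k in range(n)
    ]


def idct(c: List[float]) -> List[float]:
    """Inverse of dct (orthonormal DCT-III)."""
    n = len(c)
    return [
        sum(_weight(k, n) * c[k] * math.cos(math.pi * (i + 0.5) * k / n)
            for k in range(n))
        for i in range(n)
    ]


def encode(chunk: List[float], scale: float = 10.0) -> List[int]:
    """Map one action dimension of a chunk to integer tokens."""
    return [round(scale * coeff) for coeff in dct(chunk)]


def decode(tokens: List[int], scale: float = 10.0) -> List[float]:
    """Approximately recover the action values from integer tokens."""
    return idct([t / scale for t in tokens])


chunk = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]  # smooth toy trajectory
tokens = encode(chunk)
recovered = decode(tokens)
max_error = max(abs(a - b) for a, b in zip(chunk, recovered))
print(tokens, max_error)
```

Smooth trajectories concentrate their energy in the first few coefficients, so most tokens are zero or small. That makes the resulting integer sequence compact enough to supervise with an ordinary next-token loss.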

02 · Flow-Matching Post-training

The second stage loads the pre-trained VLM checkpoint, attaches the DiT action expert, and trains it with a flow-matching objective that maps Gaussian noise to a clean action chunk through a deterministic vector field. At sampling time, integrating this vector field reaches a usable trajectory in only N=10 Euler steps, well below the hundreds of denoising steps that diffusion policies can require, and fast enough for closed-loop laboratory control.
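The sampling loop can be sketched as fixed-step Euler integration of a velocity field from t=0 (noise) to t=1 (action chunk). The time convention, the function names, and the toy velocity field below are assumptions for illustration; in LabVLA the velocity would come from the trained DiT action expert.

```python
import random
from typing import Callable, List

Vector = List[float]


def euler_sample(velocity: Callable[[Vector, float], Vector],
                 noise: Vector, num_steps: int = 10) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with Euler steps."""
    x = list(noise)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy stand-in for the learned expert: with a single target chunk and a
# straight-line path, the exact velocity field is (target - x) / (1 - t).
TARGET = [0.2, -0.1, 0.4]


def toy_velocity(x: Vector, t: float) -> Vector:
    return [(a - xi) / (1.0 - t) for a, xi in zip(TARGET, x)]


rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in TARGET]
action = euler_sample(toy_velocity, noise, num_steps=10)
print(action)
```

Each step costs one forward pass of the action expert, so N=10 means ten passes per action chunk. The toy field reaches the target exactly because its path is a straight line; a learned field is only approximately straight, which is why a small but non-trivial number of steps is used.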

03 · Knowledge Insulation

Knowledge insulation is a training-time mechanism that blocks flow-matching gradients from reaching the VLM prefix while the FAST and annotation token losses remain active. The prefix therefore keeps learning from cross-entropy supervision without receiving velocity-space gradients from the action expert.
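One way to write the combined objective, assuming the common linear-interpolation form of flow matching. The symbols are introduced here for illustration and are not taken from the source: θ denotes the VLM parameters, φ the DiT parameters, h_θ(o, ℓ) the prefix features for observation o and instruction ℓ, a the clean action chunk, ε Gaussian noise, and sg[·] the stop-gradient operator.

```latex
\mathcal{L}(\theta, \phi)
  = \mathcal{L}_{\mathrm{CE}}(\theta)
  + \mathbb{E}_{t,\,\epsilon}
    \Bigl\| v_\phi\bigl(a_t,\; t,\; \operatorname{sg}[h_\theta(o, \ell)]\bigr)
            - (a - \epsilon) \Bigr\|^2,
\qquad
a_t = (1 - t)\,\epsilon + t\,a,
\quad \epsilon \sim \mathcal{N}(0, I).
```

Because the prefix features enter the second term only through sg[·], that term contributes no gradient with respect to θ. The VLM is updated by the cross-entropy term (FAST and annotation tokens) alone, while φ is updated by the flow-matching term.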

Results

State-of-the-art on the LabUtopia benchmark

Six laboratory operations under in-distribution (ID) and out-of-distribution (OOD) settings, compared against representative VLA baselines on LabUtopia.

TABLE 2 LabUtopia benchmark

Method	Size	Pick Up	Press Button	Open Door	Pour Liquid	Heat Beaker	Transport Beaker	Avg.
In-Distribution
SmolVLA	<1B	15.8	97.5	16.7	0.8	96.7	85.8	52.2
X-VLA	<1B	27.5	98.3	65.0	45.0	25.8	83.3	57.5
GR00T N1.5	3B	40.8	99.2	6.7	0	99.2	69.2	52.5
π0	3B	21.7	92.5	51.6	37.5	90.0	86.7	63.3
π0.5	3B	38.3	60.0	55.8	29.2	40.8	90.0	52.4
π0-FAST	3B	16.7	37.5	17.5	5.8	3.3	20.8	16.9
InternVLA-A1	3B	25.8	93.3	38.3	2.5	82.5	67.5	51.7
Wall-oss-flow	4B	11.7	54.2	0.83	0	0	29.2	16.0
LabVLA	4B	49.2	100	65.0	43.3	83.3	85.8	71.1
Out-of-Distribution
SmolVLA	<1B	11.7	99.2	18.3	1.67	98.3	89.2	53.1
X-VLA	<1B	27.5	99.2	59.2	25.0	39.2	67.5	52.9
GR00T N1.5	3B	33.3	92.5	8.3	0	99.2	66.7	50.0
π0	3B	19.2	89.1	53.3	38.3	90.8	88.3	63.2
π0.5	3B	30.0	68.3	59.2	29.2	40.0	85.8	52.1
π0-FAST	3B	14.2	45.0	15.8	7.5	11.7	24.2	19.7
InternVLA-A1	3B	19.2	95.8	63.3	0.83	84.2	57.5	53.5
Wall-oss-flow	4B	7.5	61.7	0	0	0	26.7	16.0
LabVLA	4B	48.3	98.3	65.8	34.2	87.5	85.8	70.0

Average success · ID: 71.1

Average success · OOD: 70.0

Gain over π0 (ID): +7.8

Gain over π0 (OOD): +6.8

Analysis

The data transfers, lifting external policies too

A study beyond LabUtopia: an external X-VLA baseline also benefits from fine-tuning on LabEmbodied-Data — the supervision is not tied to the LabVLA architecture.

5-task avg gain · ID: +15.0

5-task avg gain · OOD: +19.3

Heat Beaker gain · ID: +42.5

Pour Liquid gain · OOD: +40.0

TABLE 3 LabEmbodied-Data transferability

Method	Size	Pick Up	Open Door	Pour Liquid	Heat Beaker	Transport Beaker	Avg.	Δ
In-Distribution
X-VLA	<1B	27.5	65.0	45.0	25.8	83.3	49.3	—
X-VLA + LabEmbodied	<1B	26.7	69.2	59.2	68.3	98.3	64.3	+15.0
Out-of-Distribution
X-VLA	<1B	27.5	59.2	25.0	39.2	67.5	43.7	—
X-VLA + LabEmbodied	<1B	31.7	63.3	65.0	65.0	90.0	63.0	+19.3

Five non-saturated LabUtopia tasks (Press Button excluded as near-saturated for all baselines). Δ is the change in five-task average from adding LabEmbodied-Data.
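The averages and Δ values in Table 3 can be recomputed from the per-task numbers. The snippet below is a small check using the table's own values; the variable and function names are illustrative.

```python
from statistics import mean

# Per-task success rates copied from Table 3, in this task order:
# Pick Up, Open Door, Pour Liquid, Heat Beaker, Transport Beaker.
RESULTS = {
    "ID": {
        "X-VLA": [27.5, 65.0, 45.0, 25.8, 83.3],
        "X-VLA + LabEmbodied": [26.7, 69.2, 59.2, 68.3, 98.3],
    },
    "OOD": {
        "X-VLA": [27.5, 59.2, 25.0, 39.2, 67.5],
        "X-VLA + LabEmbodied": [31.7, 63.3, 65.0, 65.0, 90.0],
    },
}


def five_task_delta(setting: str):
    """Return (baseline average, fine-tuned average, delta), rounded to 0.1."""
    base = mean(RESULTS[setting]["X-VLA"])
    tuned = mean(RESULTS[setting]["X-VLA + LabEmbodied"])
    return round(base, 1), round(tuned, 1), round(tuned - base, 1)


for setting in ("ID", "OOD"):
    print(setting, five_task_delta(setting))
```

The recomputed values match the Avg. and Δ columns: 49.3 to 64.3 (+15.0) in-distribution and 43.7 to 63.0 (+19.3) out-of-distribution.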

Real-World Validation

Real Franka experiments

Avg success · clean, in-domain: 86.5

Avg success · clean, OOD (best): 80.0

Real-robot tasks: 4

Rollouts per condition: 50

FIG. 03 Real-world setup

TABLE 4 Real-robot evaluation · Franka

Task	Setting	LabVLA (Ours)	DreamZero	π0.5
Shake Liquid	In-domain · Clean	92	90	92
	In-domain · Cluttered	86	84	80
	Out-of-domain · Clean	84	84	82
	Out-of-domain · Cluttered	80	80	78
Pour Liquid	In-domain · Clean	86	88	82
	In-domain · Cluttered	78	80	74
	Out-of-domain · Clean	76	72	74
	Out-of-domain · Cluttered	72	70	68
Magnetic Stir	In-domain · Clean	88	86	88
	In-domain · Cluttered	80	84	80
	Out-of-domain · Clean	80	78	82
	Out-of-domain · Cluttered	74	80	76
Funnel Plug/Unplug	In-domain · Clean	80	84	78
	In-domain · Cluttered	76	76	72
	Out-of-domain · Clean	80	78	70
	Out-of-domain · Cluttered	70	72	64
Average	In-domain · Clean	86.5	87.0	85.0
	In-domain · Cluttered	80.0	81.0	76.5
	Out-of-domain · Clean	80.0	78.0	77.0
	Out-of-domain · Cluttered	74.0	75.5	71.5

Success rate (%) over 50 rollouts per setting; bold = per-row best. LabVLA leads the clean out-of-domain average.

Field Rollouts

Uncut Franka execution clips from the real laboratory — one per evaluation task, sim-trained and deployed zero-shot.

Shake Liquid

92% · ID Clean

Pour Liquid

86% · ID Clean

Magnetic Stir

88% · ID Clean

Funnel Plug/Unplug

80% · ID Clean

Levels of embodied laboratory competence

From apprentice to scientist

Rather than a single aggregate score, laboratory manipulation is better viewed through four levels of competence modeled on real laboratory roles. We position LabVLA at Level 2 (Technician), with RoboGenesis infrastructure that begins to support Level 3.

L1 Apprentice

Level 1 (Apprentice) covers single-step interactions with laboratory objects: grasping labware, pressing a button, opening a door, or placing a container.

L2 Technician

Level 2 (Technician) requires following a written multistep protocol through physical state changes such as pouring, heating, stirring, shaking, or transporting a vessel, where a failed earlier step cascades through the rest of the procedure.

LabVLA at Level 2 (Technician)

L3 Specialist

Level 3 (Specialist) adds operation of precision instruments (pipettes, centrifuges, thermal cyclers, microscopes) in longer workflows with measurement logging and safety constraints.

L4 Scientist

Level 4 (Scientist) modifies the procedure in response to observations or measurements: adjusting concentrations, branching to alternative protocols, or deciding when an experimental objective has been met.

LabVLA does not yet demonstrate the instrument competence, measurement awareness, or scientific judgment that Level 3 and Level 4 require.

Affiliations

Institutions

The institutions behind LabVLA.

This work is jointly conducted by Zhejiang University, Shanghai AI Laboratory, and Harbin Institute of Technology.
