LabVLA

Grounding Vision–Language–Action models in scientific laboratories.

Baochang Ren1, Xinjie Liu1, Xi Chen1, Yanshuo Liu1, Chenxi Li2, Daqi Gao1, Zeqin Su1, Jintao Xing1, Zirui Xue1, Rui Li2, Xiangyu Zhao1, Shuofei Qiao1†, Minting Pan2, Wangmeng Zuo3, Lei Bai2, Dongzhan Zhou2†, Ningyu Zhang1†, Huajun Chen1

1Zhejiang University 2Shanghai AI Laboratory 3Harbin Institute of Technology

Highlights

What sets LabVLA apart: data and policy

Building on the LabEmbodied-Data corpus together with broad real-robot pre-training data, we present LabVLA, a VLA pipeline for connecting written laboratory protocols to embodied robot execution in simulated scientific workspaces. LabVLA pairs protocol-conditioned data synthesis with FAST action-token pre-training and flow-matching post-training under a shared cross-embodiment schema.

FIG. 01 System overview
LabVLA at a glance: external corpora warm-start the VLM, RoboGenesis synthesizes LabEmbodied-Data, and the policy is evaluated on four task families.

01 · Data
RoboGenesis: A Programmable Workflow and Data Engine

We introduce RoboGenesis, a simulation-based workflow and data engine that links environment construction, configured workflow generation, domain randomization, and success-filtered export to produce laboratory demonstrations that existing robot corpora rarely cover. We use this engine to synthesize LabEmbodied-Data, a corpus of multi-camera observations, language instructions, robot states, action trajectories, and structured annotations under a shared cross-embodiment schema.
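As a rough illustration of the success-filtered export step, the sketch below runs stubbed rollouts and keeps only the successful ones. Everything here is an assumption made for illustration: the `Episode` fields, the `rollout` stub (which replaces a real simulator with random actions and a random success flag), and the `generate_dataset` helper are hypothetical and are not the RoboGenesis API or the LabEmbodied-Data schema.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Episode:
    """One simulated rollout. Field names are illustrative only."""
    instruction: str
    embodiment: str
    seed: int
    actions: list = field(default_factory=list)
    success: bool = False


def rollout(instruction, embodiment, seed):
    """Stand-in for a simulator rollout: random actions, random success flag."""
    rng = random.Random(seed)
    actions = [[rng.uniform(-1.0, 1.0) for _ in range(7)] for _ in range(5)]
    return Episode(instruction, embodiment, seed, actions, success=rng.random() < 0.6)


def generate_dataset(instruction, embodiments, episodes_per_embodiment):
    """Run rollouts for every embodiment and export only the successful ones."""
    kept = []
    attempted = 0
    for embodiment in embodiments:
        for _ in range(episodes_per_embodiment):
            attempted += 1
            episode = rollout(instruction, embodiment, seed=attempted)
            if episode.success:  # success filter applied before export
                kept.append(episode)
    return kept, attempted


dataset, attempted = generate_dataset("pour liquid into the beaker", ["Franka", "UR5e"], 20)
print(f"kept {len(dataset)} of {attempted} rollouts")
```

Filtering at export time means failed rollouts never reach the training set, at the cost of extra simulation for tasks with low success rates.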

02 · Policy
LabVLA Training Recipe

LabVLA adapts a Qwen3-VL backbone to map visual observations, robot state, and language instructions into continuous action chunks through a DiT action expert. The model is trained in two stages: FAST action tokens first align the visual-language prefix with action semantics during VLM pre-training, and flow matching then predicts continuous robot actions during post-training. A knowledge-insulation design reduces interference between language-grounded VLM representations and the continuous-action expert during post-training.

RoboGenesis

Multi-Arm Data Engine

An end-to-end multi-arm data engine spanning tasks, workflows, randomization, assets, and scene generation.

RoboGenesis: 16 robot embodiments (13 single-arm and 3 dual-arm)

Atomic Skills

Multi-task capabilities

A suite of atomic manipulation tasks can be demonstrated independently or composed into complete lab workflows.

Open Door (UR5e)
Close Door (UR16e)
Pick (Festo)
Stir (UR5e)
Place (Rizon 4)
Heat Liquid (FR3)
Shake (Franka)
Pour (Franka)

Long-horizon

Workflows & dual-arm showcase

Atomic skills compose into long-horizon lab procedures across dual-arm manipulation, mobile navigation, and multi-step liquid handling.

Lift2
Split Aloha
Franka Mobile Navigation
Franka 16-step Workflow

Robustness

Domain randomization

We randomize scene appearance, camera viewpoint, object layout, obstacles, and tabletop conditions to improve robustness.
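To make the randomization axes concrete, here is a minimal sketch of a per-episode sampler. The parameter names, value ranges, and texture list are placeholders chosen for illustration and are not RoboGenesis settings.

```python
import random


def sample_randomization(rng):
    """Draw one scene configuration. All ranges below are hypothetical."""
    return {
        "light_intensity": rng.uniform(0.5, 1.5),          # scene appearance
        "camera_yaw_deg": rng.uniform(-15.0, 15.0),         # camera viewpoint
        "object_xy_m": (rng.uniform(-0.2, 0.2),             # object layout
                        rng.uniform(-0.2, 0.2)),
        "num_obstacles": rng.randint(0, 3),                 # obstacles
        "table_texture": rng.choice(["wood", "steel", "epoxy"]),  # tabletop
    }


rng = random.Random(0)  # fixed seed so the sampled scenes are reproducible
configs = [sample_randomization(rng) for _ in range(4)]
for config in configs:
    print(config)
```

Seeding the generator per episode keeps each randomized scene reproducible, which helps when a failed rollout needs to be replayed.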

Reusable Assets

Asset generation

Generate reusable objects, containers, tools, and scene props for composing lab environments.

2947 total assets
16 categories
288 fine classes
Example generated assets: Fume Hoods 01 to 04, Ball Mills 01 to 04, Critical Point Dryers 01 to 04.

Asset category distribution

Analytical Instruments: 476
Measuring: 326
Wet Chemistry Glassware: 319
Synthesis: 251
Furniture: 234
Biochemistry: 219
Heating: 202
Centrifugation: 178
Plastics: 166
Electrochemical: 161
Safety: 117
Metal Supports: 109
Gas Vacuum: 79
Ceramics: 67
Lab Supplies: 29
Tubing Connectors: 14

Top fine-grained asset classes

lab benches (18), corrosive cabinets (18), chemical cabinets (18), fume hoods (18), flammable cabinets (18), wall shelves (17), balance tables (17), ion chromatography (17)

Composable Scenes

Scene generation


From an empty room to a complete lab

The same viewpoint shows how room structure, furniture, equipment, assets, materials, safety cues, and task execution elements are added step by step.

Training Recipe

Two stages, one cross-embodiment policy

LabVLA adapts a Qwen3-VL backbone with FAST action-token pre-training, then adds a flow-matching DiT action expert, coupled by a stop-gradient that keeps language grounding intact.

FIG. 02 Training recipe
LabVLA training recipe: grounded data pre-training with Qwen3-VL-4B-Instruct (VQA, language subtasks, FAST action tokens) and post-training with DiT action expert and knowledge insulation on LabEmbodied-Data.

01 VLM Pre-training

We first tokenize continuous actions with FAST and train the VLM under next-token supervision, so the prefix learns to predict action tokens before the DiT is attached. The DiT is not instantiated in this stage.
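The sketch below shows the general idea behind frequency-space action tokenization: a discrete cosine transform over each action dimension of a chunk, followed by scaling and rounding to integers. It is a simplified illustration, not the FAST tokenizer itself: the compression step that turns the integers into a compact vocabulary is omitted, the coefficients are flattened per dimension for simplicity, and the `scale` value and function names are assumptions.

```python
import math


def dct(signal):
    """Orthonormal DCT-II of a 1-D sequence."""
    n = len(signal)
    out = []
    for k in range(n):
        norm = math.sqrt((1.0 if k == 0 else 2.0) / n)
        out.append(norm * sum(x * math.cos(math.pi * (i + 0.5) * k / n)
                              for i, x in enumerate(signal)))
    return out


def idct(coeffs):
    """Inverse of the orthonormal DCT-II above (a DCT-III)."""
    n = len(coeffs)
    out = []
    for i in range(n):
        total = 0.0
        for k, c in enumerate(coeffs):
            norm = math.sqrt((1.0 if k == 0 else 2.0) / n)
            total += norm * c * math.cos(math.pi * (i + 0.5) * k / n)
        out.append(total)
    return out


def encode(chunk, scale=10.0):
    """chunk[t][d] -> flat list of integer codes, one block of codes per dimension."""
    per_dim = zip(*chunk)  # regroup the chunk as one time series per action dimension
    return [round(scale * c) for series in per_dim for c in dct(list(series))]


def decode(tokens, horizon, scale=10.0):
    """Invert encode(): integer codes -> approximate chunk[t][d]."""
    blocks = [tokens[i:i + horizon] for i in range(0, len(tokens), horizon)]
    per_dim = [idct([t / scale for t in block]) for block in blocks]
    return [list(step) for step in zip(*per_dim)]


chunk = [[math.sin(0.3 * t), 0.1 * t] for t in range(8)]  # 8 steps, 2 action dims
tokens = encode(chunk)
recon = decode(tokens, horizon=8)
max_err = max(abs(a - b) for ra, rb in zip(chunk, recon) for a, b in zip(ra, rb))
print(tokens, max_err)
```

A larger `scale` lowers the reconstruction error but widens the range of integer codes the VLM would have to predict.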

02 Flow Matching Post-training

The second stage loads the pre-trained VLM checkpoint, attaches the DiT action expert, and trains it with a flow-matching objective that maps Gaussian noise to a clean action chunk through a deterministic vector field. At sampling time, this vector field reaches a usable trajectory in only N = 10 Euler steps, well below the hundreds of denoising steps that diffusion policies typically need, and fast enough for closed-loop laboratory control.
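A minimal sketch of fixed-step Euler sampling is shown here. It assumes a time convention in which t = 0 is pure noise and t = 1 is the clean action, which the page does not state. The `toy_velocity` function is a hypothetical stand-in for the DiT action expert: it is the exact field of a straight path from noise to a known target, so the sampler should land on that target.

```python
import random


def euler_sample(velocity, noise, num_steps=10):
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with fixed-step Euler."""
    x = list(noise)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


target = [0.2, -0.4, 0.1]  # toy "clean action"; a real chunk spans many steps and joints


def toy_velocity(x, t):
    """Stand-in for the learned field: points from x straight at the target."""
    return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]


rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in target]
action = euler_sample(toy_velocity, noise, num_steps=10)
print(action)
```

Each Euler step costs one forward pass of the action expert, so the step count directly sets the per-chunk inference latency.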

03 Knowledge Insulation

Knowledge insulation is a training-time mechanism that blocks flow-matching gradients from reaching the VLM prefix while the FAST and annotation token losses remain active. The prefix can therefore still learn from cross-entropy supervision without receiving velocity-space gradients from the action expert.
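The effect of the stop-gradient can be seen in a scalar toy model with hand-written gradients. This is an illustration under simplifying assumptions, not the LabVLA training code: the prefix is a single weight `w`, the action expert is a single weight `v`, and squared errors stand in for both the cross-entropy token loss and the flow-matching loss.

```python
def gradients(w, v, x, token_target, flow_target, insulate=True):
    """Return (dL/dw, dL/dv) for L = (h - token_target)^2 + (v*h - flow_target)^2, h = w*x.

    With insulate=True the flow term treats h as a constant (stop-gradient),
    so only the token term contributes to the prefix gradient dL/dw.
    """
    h = w * x                                   # prefix feature
    flow_residual = v * h - flow_target
    grad_w = 2.0 * (h - token_target) * x       # token-loss path into the prefix
    if not insulate:
        grad_w += 2.0 * flow_residual * v * x   # flow-loss path into the prefix
    grad_v = 2.0 * flow_residual * h            # the expert always trains on the flow loss
    return grad_w, grad_v


# The token target equals h here, so any prefix gradient must come from the flow loss.
insulated = gradients(w=0.5, v=3.0, x=2.0, token_target=1.0, flow_target=1.0, insulate=True)
coupled = gradients(w=0.5, v=3.0, x=2.0, token_target=1.0, flow_target=1.0, insulate=False)
print(insulated, coupled)
```

In an autodiff framework the same effect is usually obtained by detaching the prefix features before they enter the action expert.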

Results

State-of-the-art on the LabUtopia benchmark

Six laboratory operations under in-distribution (ID) and out-of-distribution (OOD) settings, compared against representative VLA baselines on LabUtopia.

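TABLE 2 LabUtopia benchmark

In-Distribution

| Method | Size | Pick Up | Press Button | Open Door | Pour Liquid | Heat Beaker | Transport Beaker | Avg. |
|---|---|---|---|---|---|---|---|---|
| SmolVLA | <1B | 15.8 | 97.5 | 16.7 | 0.8 | 96.7 | 85.8 | 52.2 |
| X-VLA | <1B | 27.5 | 98.3 | 65.0 | 45.0 | 25.8 | 83.3 | 57.5 |
| GR00T N1.5 | 3B | 40.8 | 99.2 | 6.7 | 0 | 99.2 | 69.2 | 52.5 |
| π0 | 3B | 21.7 | 92.5 | 51.6 | 37.5 | 90.0 | 86.7 | 63.3 |
| π0.5 | 3B | 38.0 | 60.0 | 55.8 | 29.2 | 40.8 | 90.0 | 52.3 |
| π0-FAST | 3B | 16.7 | 37.5 | 17.5 | 5.8 | 3.3 | 20.8 | 16.9 |
| InternVLA-A1 | 3B | 25.8 | 93.3 | 38.3 | 2.5 | 82.5 | 67.5 | 51.7 |
| Wall-oss-flow | 4B | 11.7 | 54.2 | 0.83 | 0 | 0 | 29.2 | 16.0 |
| LabVLA | 4B | 49.4 | 100 | 65.0 | 43.3 | 83.3 | 85.8 | 71.1 |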
Analysis

The data transfers, lifting external policies too

A study beyond LabUtopia: an external X-VLA baseline also benefits from fine-tuning on LabEmbodied-Data, which indicates that the supervision is not tied to the LabVLA architecture.

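TABLE 3 LabEmbodied-Data transferability

| Setting | Method | Size | Pick Up | Open Door | Pour Liquid | Heat Beaker | Transport Beaker | Avg. | Δ |
|---|---|---|---|---|---|---|---|---|---|
| In-Distribution | X-VLA | <1B | 27.5 | 65.0 | 45.0 | 25.8 | 83.3 | 49.3 | |
| In-Distribution | X-VLA + LabEmbodied | <1B | 26.7 | 69.2 | 59.2 | 68.3 | 98.3 | 64.3 | +15.0 |
| Out-of-Distribution | X-VLA | <1B | 27.5 | 59.2 | 25.0 | 39.2 | 67.5 | 43.7 | |
| Out-of-Distribution | X-VLA + LabEmbodied | <1B | 31.7 | 63.3 | 65.0 | 65.0 | 90.0 | 63.0 | +19.3 |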

Five non-saturated LabUtopia tasks (Press Button excluded as near-saturated for all baselines). Δ is the change in five-task average from adding LabEmbodied-Data.
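The Δ column can be reproduced directly from the per-task numbers in Table 3. The short check below recomputes both five-task averages and their difference; the dictionary layout and variable names are only a convenience for this illustration.

```python
# Per-task success rates from Table 3, in the order:
# Pick Up, Open Door, Pour Liquid, Heat Beaker, Transport Beaker.
table3 = {
    "ID": {
        "X-VLA": [27.5, 65.0, 45.0, 25.8, 83.3],
        "X-VLA + LabEmbodied": [26.7, 69.2, 59.2, 68.3, 98.3],
    },
    "OOD": {
        "X-VLA": [27.5, 59.2, 25.0, 39.2, 67.5],
        "X-VLA + LabEmbodied": [31.7, 63.3, 65.0, 65.0, 90.0],
    },
}


def mean(values):
    return sum(values) / len(values)


deltas = {}
for setting, rows in table3.items():
    base = mean(rows["X-VLA"])
    tuned = mean(rows["X-VLA + LabEmbodied"])
    deltas[setting] = round(tuned - base, 1)  # change in five-task average
    print(f"{setting}: {base:.1f} -> {tuned:.1f} (delta {deltas[setting]:+.1f})")
```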

Real-World Validation

Simulation-trained, deployed on a real Franka

FIG. 03 Real-world setup
Real-world Franka laboratory setup with beakers, flasks, a magnetic stirrer, and a heating plate.
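TABLE 4 Real-robot evaluation · Franka

| Task | Setting | LabVLA (Ours) | DreamZero | π0.5 |
|---|---|---|---|---|
| Shake Liquid | In-domain · Clean | 92 | 90 | 92 |
| Shake Liquid | In-domain · Cluttered | 86 | 84 | 80 |
| Shake Liquid | Out-of-domain · Clean | 84 | 84 | 82 |
| Shake Liquid | Out-of-domain · Cluttered | 80 | 80 | 78 |
| Pour Liquid | In-domain · Clean | 86 | 88 | 82 |
| Pour Liquid | In-domain · Cluttered | 78 | 80 | 74 |
| Pour Liquid | Out-of-domain · Clean | 76 | 72 | 74 |
| Pour Liquid | Out-of-domain · Cluttered | 72 | 70 | 68 |
| Magnetic Stir | In-domain · Clean | 88 | 86 | 88 |
| Magnetic Stir | In-domain · Cluttered | 80 | 84 | 80 |
| Magnetic Stir | Out-of-domain · Clean | 80 | 78 | 82 |
| Magnetic Stir | Out-of-domain · Cluttered | 74 | 80 | 76 |
| Stopper Plug / Unplug | In-domain · Clean | 80 | 84 | 78 |
| Stopper Plug / Unplug | In-domain · Cluttered | 76 | 76 | 72 |
| Stopper Plug / Unplug | Out-of-domain · Clean | 80 | 78 | 70 |
| Stopper Plug / Unplug | Out-of-domain · Cluttered | 70 | 72 | 64 |
| Average | In-domain · Clean | 86.5 | 87.0 | 85.0 |
| Average | In-domain · Cluttered | 80.0 | 81.0 | 76.5 |
| Average | Out-of-domain · Clean | 80.0 | 78.0 | 77.0 |
| Average | Out-of-domain · Cluttered | 74.0 | 75.5 | 71.5 |

Success rate (%) over 50 rollouts per setting. LabVLA leads the clean out-of-domain average.
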
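The Average rows of Table 4 follow from the four per-task rates in each setting. The check below recomputes them and reports which method has the highest average per setting; the short keys (`id_clean`, `pi0.5`, and so on) are labels made up for this illustration.

```python
# Success rates (%) from Table 4, per setting and method, in the task order:
# Shake Liquid, Pour Liquid, Magnetic Stir, Stopper Plug / Unplug.
results = {
    "id_clean": {"LabVLA": [92, 86, 88, 80], "DreamZero": [90, 88, 86, 84], "pi0.5": [92, 82, 88, 78]},
    "id_cluttered": {"LabVLA": [86, 78, 80, 76], "DreamZero": [84, 80, 84, 76], "pi0.5": [80, 74, 80, 72]},
    "ood_clean": {"LabVLA": [84, 76, 80, 80], "DreamZero": [84, 72, 78, 78], "pi0.5": [82, 74, 82, 70]},
    "ood_cluttered": {"LabVLA": [80, 72, 74, 70], "DreamZero": [80, 70, 80, 72], "pi0.5": [78, 68, 76, 64]},
}

# Mean success rate over the four tasks for every (setting, method) pair.
averages = {
    setting: {method: sum(rates) / len(rates) for method, rates in methods.items()}
    for setting, methods in results.items()
}

# Method with the highest average in each setting.
best = {setting: max(avg, key=avg.get) for setting, avg in averages.items()}

for setting, avg in averages.items():
    print(setting, avg, "best:", best[setting])
```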
Levels of embodied laboratory competence

From apprentice to scientist

Rather than a single aggregate score, laboratory manipulation is better viewed through four levels of competence modeled on real laboratory roles. We position LabVLA at Level 2 (Technician), with RoboGenesis infrastructure that begins to support Level 3.

L1 Apprentice

Level 1 (Apprentice) covers single-step interactions with laboratory objects: grasping labware, pressing a button, opening a door, or placing a container.

L2 Technician

Level 2 (Technician) requires following a written multistep protocol through physical state changes such as pouring, heating, stirring, shaking, or transporting a vessel, where a failed earlier step cascades through the rest of the procedure.

LabVLA at Level 2 (Technician)

L3 Specialist

Level 3 (Specialist) adds operation of precision instruments (pipettes, centrifuges, thermal cyclers, microscopes) in longer workflows with measurement logging and safety constraints.

L4 Scientist

Level 4 (Scientist) modifies the procedure in response to observations or measurements: adjusting concentrations, branching to alternative protocols, or deciding when an experimental objective has been met.

LabVLA does not yet demonstrate the instrument competence, measurement awareness, or scientific judgment that Level 3 and Level 4 require.

Affiliations

Institutions

The institutions behind LabVLA.

This work is jointly conducted by Zhejiang University, Shanghai AI Laboratory, and Harbin Institute of Technology.