What 1,228 Vision-Language-Action Papers Say About the Robot Data Problem

Physical AI June 22, 2026

Vision-Language-Action models are having their moment. A single model that sees a scene, understands an instruction, and outputs robot actions has attracted labs from every corner of AI. In our analysis of robotics research, VLA publication volume grew 5× year-over-year, making it the fastest-growing subfield in a field that itself grew 127% over the same period. The next-fastest area, manipulation and grasping, grew 1.6×.

We wanted to know what the VLA literature says about data: where it comes from, what it costs, what's annotated, and what's missing. We analyzed 1,228 VLA papers from arXiv's robotics category, cs.RO (every paper with an explicit vision-language-action framing in its title or abstract), spanning February 2023 through June 2026. We split the corpus into ten chunks and ran parallel analysis passes over each, pulling out named datasets and their scales, collection methods, annotation pipelines, verbatim bottleneck statements, synthetic data claims, and scaling results. Then we synthesized the findings across all ten.
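A minimal sketch of this kind of fan-out, assuming a near-equal contiguous split. The function names, the placeholder paper IDs, and the stubbed analysis pass are all illustrative assumptions, not the pipeline actually used for this analysis.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, n_chunks):
    """Split items into n_chunks contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        size = base + (1 if i < extra else 0)  # the first `extra` chunks get one more item
        chunks.append(items[start:start + size])
        start += size
    return chunks


def analyze_chunk(chunk):
    """Hypothetical stand-in for one analysis pass; here it only counts papers."""
    return {"n_papers": len(chunk)}


# Placeholder identifiers, not real arXiv IDs.
paper_ids = [f"paper-{i:04d}" for i in range(1228)]
chunks = split_into_chunks(paper_ids, 10)

# Run the (stubbed) passes in parallel and collect results in chunk order.
with ThreadPoolExecutor(max_workers=10) as pool:
    results = list(pool.map(analyze_chunk, chunks))

chunk_sizes = [r["n_papers"] for r in results]
print(chunk_sizes)  # eight chunks of 123 papers, then two of 122
```

With 1,228 papers and ten chunks, a near-equal split yields chunks of 122 or 123 papers each.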

The headline finding is one the field states itself hundreds of times: data, not architecture or compute, is the binding constraint on VLA progress.

The more interesting finding is what kind of data problem it is. The field's self-diagnosis has shifted. The bottleneck is no longer "we don't have enough robot data." It's "the data we have is poorly annotated, dangerously homogeneous, untested, and unverified." That shift changes what the fix looks like.

High-quality robot data is constrained by cost, workforce operations, and scalability

Read enough VLA papers and you notice that most of them open the same way. Some variant of this sentence appears as the motivating claim in an estimated 150 to 200 or more of the 1,228 papers we analyzed: performance is fundamentally constrained by the availability of high-quality robot trajectory data, whose collection on real robots is costly, labor-intensive, and difficult to scale. That sentence is effectively the standard opening of the entire subfield.

The cost framing is everywhere. GigaBrain-0 (arXiv:2510.19430) argues that the inefficiency of physical data collection severely limits the scalability of current VLA systems. EgoVLA (arXiv:2507.12440) notes that the requirement for robot hardware fundamentally constrains data scale. Real2Render2Real (arXiv:2505.11917) calls teleoperation, still the prevailing collection paradigm, costly and constrained by manual effort and physical robot access.

Two survey papers go further. One, a 2026 survey of VLA datasets, benchmarks, and data engines (arXiv:2604.23001), argues that the central underexamined bottleneck is data infrastructure itself, and that future advances will depend less on model architecture and more on the co-design of high-fidelity data engines and structured evaluation protocols. Its authors call for treating data infrastructure as a first-class research problem rather than a background concern.

Simply "collecting more" is no longer the answer

The most sophisticated recent papers don't argue for more data. They argue that naive scaling has stopped working.

A 2026 study on rethinking VLA scaling (arXiv:2602.09722) found that pooling heterogeneous robot datasets often induces negative transfer: adding data from other robots can make your model worse. MPVI (arXiv:2606.00985) documents long-horizon failures that persist despite finetuning on large teleoperated datasets, concluding that more data alone may not resolve the problem. A data-distillation result (arXiv:2511.16233) shows a curated 5% coreset recovering 85–90% of full-dataset performance, meaning most of the volume in current corpora is doing very little work.
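To make the coreset idea concrete, here is a minimal sketch of one common selection heuristic, k-center greedy (farthest-point sampling), applied to toy 2-D "embeddings". This is a swapped-in illustration: the selection method used in arXiv:2511.16233 is not described here, and the embeddings, cluster layout, and function name are assumptions.

```python
import math


def k_center_greedy(points, budget):
    """Pick `budget` indices so that every point lies close to some selected point."""
    if not 0 < budget <= len(points):
        raise ValueError("budget must be between 1 and len(points)")
    selected = [0]  # arbitrary starting point
    # Distance from each point to its nearest selected point so far.
    min_dist = [math.dist(p, points[0]) for p in points]
    while len(selected) < budget:
        # Add the point that is currently worst covered.
        farthest = max(range(len(points)), key=min_dist.__getitem__)
        selected.append(farthest)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[farthest]))
    return selected


# Toy data: 100 trajectories whose embeddings fall into five tight clusters.
points = [((i % 5) * 10.0 + (i // 5) * 0.01, 0.0) for i in range(100)]

coreset = k_center_greedy(points, budget=5)  # a 5% coreset
covered_clusters = {i % 5 for i in coreset}
print(sorted(covered_clusters))  # one representative from each of the five clusters
```

In this toy case the 5% budget lands one sample in every cluster, which is the intuition behind a small coreset standing in for a redundant corpus. Real trajectory embeddings would come from a learned encoder and would not be this cleanly separated.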

The most practical result in this cluster comes from SEVO (arXiv:2605.11114), which found that a diversified collection protocol (varying lighting, backgrounds, and distractors during teleoperation) was the single most important factor for generalization, while in-distribution-only data produced near-zero transfer to new environments.
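A diversified protocol of this kind can be as simple as randomizing scene conditions before each teleoperated episode. The sketch below is a hypothetical illustration, not SEVO's protocol: the condition labels, the distractor cap, and the function name are all assumptions.

```python
import random

# Hypothetical condition labels; a real protocol would define its own.
LIGHTING = ["dim", "ambient", "bright"]
BACKGROUNDS = ["plain", "cluttered", "patterned"]
MAX_DISTRACTORS = 3


def sample_conditions(n_episodes, seed=0):
    """Draw one randomized scene configuration per episode, reproducibly."""
    rng = random.Random(seed)  # local RNG so the schedule can be regenerated
    return [
        {
            "lighting": rng.choice(LIGHTING),
            "background": rng.choice(BACKGROUNDS),
            "n_distractors": rng.randint(0, MAX_DISTRACTORS),  # inclusive bounds
        }
        for _ in range(n_episodes)
    ]


schedule = sample_conditions(50, seed=42)
print(schedule[0])
```

Seeding the schedule means the operator's setup instructions can be regenerated later and audited against what was actually recorded.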

The pattern across these papers is consistent. Scale helps for broad pretraining priors, but for a specific deployed capability, what matters is the composition of the data: its diversity, its annotation quality, and its coverage of the hard cases. The field has exhausted the easy gains from volume.

Three data gaps highlighted by the research

We aggregated what papers identify as missing. Three gaps recurred across every chunk of the corpus. Each gap is stated plainly, and in two cases the papers quantify the cost of the gap and the payoff for closing it.

1. The language in vision-language-action is impoverished. Multiple independent papers diagnose the same disease: the language annotations in today's robot datasets are repetitive, template-like commands with limited structural variation (arXiv:2601.03136). One paper terms the resulting pathology "modality imbalance": language diversity is far lower than visual and action diversity, biasing models toward visual shortcuts (arXiv:2512.11218). Another calls it "Information Collapse": because instructions are predictable from the visual scene alone, the mutual information between instruction and action vanishes (arXiv:2601.15197).

The consequence is documented: state-of-the-art VLA models ignore their language input. LIBERO-PRO (arXiv:2510.03827) showed model outputs holding steady even when researchers corrupted instructions or replaced them with meaningless tokens.

The payoff for fixing it is documented just as clearly. LangGap (arXiv:2603.00592) showed targeted, diverse instruction augmentation moving single-task success from 0% to 90%. CAST (arXiv:2508.13446) added 27 percentage points on navigation through counterfactual language relabeling, with no new data collection at all. Relabeling existing trajectories with diverse, fine-grained language is one of the cheapest interventions available in robotics, and it delivers some of the largest measured gains.
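One way to make the "Information Collapse" diagnosis concrete is to measure how predictable instructions are from the scene. By a standard information-theoretic bound, the mutual information between instruction and action given the scene cannot exceed the conditional entropy of the instruction given the scene, I(L; A | V) ≤ H(L | V). When H(L | V) is zero, the instruction carries no signal beyond what the scene already provides. The sketch below estimates H(L | V) from counts; the scene labels, toy instructions, and function name are illustrative assumptions, not the cited papers' method.

```python
import math
from collections import Counter, defaultdict


def conditional_entropy_bits(pairs):
    """Estimate H(instruction | scene) in bits from (scene, instruction) pairs."""
    by_scene = defaultdict(Counter)
    for scene, instruction in pairs:
        by_scene[scene][instruction] += 1
    total = sum(sum(counts.values()) for counts in by_scene.values())
    entropy = 0.0
    for counts in by_scene.values():
        n_scene = sum(counts.values())
        for n in counts.values():
            p_joint = n / total    # p(scene, instruction)
            p_cond = n / n_scene   # p(instruction | scene)
            entropy -= p_joint * math.log2(p_cond)
    return entropy


# Templated data: each scene always comes with the same instruction.
templated = [("kitchen", "pick up the cup")] * 4 + [("desk", "open the drawer")] * 4

# Relabeled data: each scene now has two equally frequent instructions.
relabeled = (
    [("kitchen", "pick up the cup"), ("kitchen", "slide the cup to the left")] * 2
    + [("desk", "open the drawer"), ("desk", "close the top drawer")] * 2
)

h_templated = conditional_entropy_bits(templated)
h_relabeled = conditional_entropy_bits(relabeled)
print(h_templated, h_relabeled)
```

The templated set scores 0 bits, so a model can recover the instruction from pixels alone. The relabeled set scores 1 bit per episode, which is the headroom a policy needs before attending to language can pay off.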

2. Failure data gets collected, then thrown away. Researchers train VLA models on successful demonstrations and discard the rest. The failed attempts that occur during every collection session (slips, collisions, mis-grasps) never reach training. As VINE (arXiv:2512.03913) puts it, those failures encode where and how policies are fragile. The result: models with no corrective signal, where minor execution errors compound into unrecoverable, out-of-distribution states (arXiv:2605.08434).

The demand for failure-centric data is visible in the datasets now being built to supply it: RoboFAC ships 9,440 erroneous trajectories with 78,623 QA pairs (arXiv:2505.12224); FailSafe pairs failures with executable recovery actions (arXiv:2510.01642); ViFailback provides 58,000 failure-diagnosis VQA pairs (arXiv:2512.02787). These are rare exceptions against a success-only norm, which is why each one lands as a headline contribution.
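Keeping failures is mostly a bookkeeping decision made at collection time. A minimal sketch, assuming each episode is logged with an outcome flag and an optional failure label; the `Episode` schema, field names, and failure-mode strings are hypothetical and not taken from any of the datasets above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Episode:
    episode_id: str
    success: bool
    failure_mode: Optional[str] = None  # e.g. "slip", "collision", "mis-grasp"


def split_by_outcome(episodes):
    """Partition episodes instead of discarding failures; group failures by mode."""
    successes = [e for e in episodes if e.success]
    failures = {}
    for e in episodes:
        if not e.success:
            failures.setdefault(e.failure_mode or "unlabeled", []).append(e)
    return successes, failures


session = [
    Episode("ep-001", success=True),
    Episode("ep-002", success=False, failure_mode="slip"),
    Episode("ep-003", success=False, failure_mode="collision"),
    Episode("ep-004", success=False),  # failed, but nobody recorded why
    Episode("ep-005", success=False, failure_mode="slip"),
]

successes, failures = split_by_outcome(session)
print(len(successes), {mode: len(eps) for mode, eps in failures.items()})
```

The "unlabeled" bucket is the useful part of the design: it surfaces failures that were kept but never annotated, which is where annotation effort would go next.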

3. Entire modalities are greenfield. The corpus is overwhelmingly RGB plus language plus proprioception. Tactile and force sensing, both essential for contact-rich manipulation, are nearly absent: one full chunk of 123 papers in our analysis contained zero tactile-centric work. Papers that do tackle it cite the same root cause: scarce aligned vision-tactile-language data (arXiv:2605.27886) and the absence of large multimodal datasets (arXiv:2507.17294). Audio is thinner still: three efforts in 1,228 papers. The few datasets that exist in these modalities (HapTile, ForceVLA, OmniVTLA, RoboOmni) get flagged as flagship contributions for one reason: the data doesn't otherwise exist.

The evaluation crisis underneath it all

A quieter crisis runs alongside the scarcity story: the field's benchmarks mislead systematically, and researchers are starting to admit it.

LIBERO is the universal evaluation substrate: by our count it appears in over 300 of the 1,228 papers. But a wave of diagnostic work shows that high LIBERO scores reflect memorization. LIBERO-Plus (arXiv:2510.13626) found success rates dropping from ~95% to below 30% under modest perturbations. LIBERO-PRO (arXiv:2510.03827) documented scores collapsing from 90%+ to 0.0%, attributing the gap to rote memorization of action sequences and environment layouts. Mechanistic evidence backs this up: sparse-autoencoder analysis (arXiv:2603.19183) found that the majority of features learned by VLA models fine-tuned on small datasets correspond to memorized training sequences.

Even the headline metric is suspect. MetaFine (arXiv:2605.19986) argues that collapsing capability into binary success rates inflates reported performance by up to 70%. Real-world evaluations typically rest on 25 or fewer rollouts without confidence intervals (arXiv:2605.29710), sample sizes too small to resolve the comparisons papers claim to make.
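To see why 25 rollouts is so little, it helps to compute an interval for a typical result. The sketch below uses the Wilson score interval, a standard choice for binomial proportions, at the 95% level (z = 1.96). The 20-of-25 example is a hypothetical result, not a number from any cited paper.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial success rate (z = 1.96 for about 95%)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half_width, center + half_width


# Hypothetical evaluation: 20 successes in 25 real-world rollouts (80%).
low, high = wilson_interval(20, 25)
print(f"{low:.2f} to {high:.2f}")  # roughly 0.61 to 0.91
```

An 80% score on 25 rollouts is compatible with a true success rate anywhere from about 61% to about 91%. A competing policy that scores 72% on the same protocol falls well inside that range, so the two cannot be separated at this sample size.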

Security adds another layer. Two 2025 papers demonstrated data-poisoning attacks on VLA training: one achieved 98–99% backdoor success by corrupting just 0.31% of training episodes (arXiv:2510.10932). Uncurated data is a quality liability and an attack surface.

Where the field is heading

Faced with the bottleneck, the literature has split into three identifiable escape routes. Massive human-video corpora (HumanNet's one million hours, arXiv:2605.06747; Being-H0.5's 35,000 hours, arXiv:2601.12993) bet that abundant egocentric video can substitute for scarce robot data. Synthetic data engines are scaling fast: InternData-A1 (arXiv:2511.16651) claims the first synthetic-only corpus to match a strong real-robot dataset. Teleoperation is getting cheaper through commodity hardware and fleet-scale collection in which humans correct trajectories as they're recorded.

Each route is promising. Each one also converts the data problem rather than solving it. Human video is action-free and unstructured: someone has to segment, ground, and align it before it trains anything, and papers cite the absence of task-aligned annotations as the blocker (arXiv:2606.00054). Synthetic engines generate volume but lack failure coverage and produce "silent failures" that require verification. The strongest pipelines now build in VLM-based critics to filter their own output (arXiv:2604.09036), and those critics need human-verified ground truth to be trusted. The quality-over-quantity findings apply to all of it.

The trajectory of the field is clear. The frontier of VLA research is no longer "collect more demonstrations." It's making the data we have (and the data we generate) diverse, well-annotated, failure-aware, multimodal, and verified. Every one of those adjectives is an annotation and curation problem. The field's own surveys have started saying it plainly: data infrastructure is now a first-class research problem.

That's a conclusion drawn from 1,228 papers' worth of evidence, and it deserves more attention than it gets. Over the coming weeks we'll publish closer analyses of each gap, starting with the language-annotation problem, which has the strongest evidence and the cheapest fix in the entire corpus.

Methodology note: This analysis covers 1,228 papers with an explicit vision-language-action framing in title or abstract, drawn from arXiv cs.RO between February 2023 and June 2026. We identified the VLA corpus within a broader scan of 53,800 robotics papers, which is also the source of the field-level growth figures cited above. VLA wasn't cherry-picked; it emerged from a full-field analysis as the area growing fastest. We analyzed papers at the abstract level in ten parallel passes and synthesized findings centrally. Frequency estimates are conservative aggregations; dataset scales are as claimed by their authors and not independently verified. The central findings recur across all ten analysis chunks, which is why we consider them robust despite the abstract-level scope.