Learning Multi-View Spatial Reasoning
from Cross-View Relations

1KAIST, 2Config, 3Hanyang University, 4Yonsei University, 5Seoul National University
*Equal contribution
XVR Teaser Figure

XVR is a large-scale dataset of 100K multi-view VQA samples designed to teach VLMs spatial reasoning across multiple views, spanning three fundamental tasks: Correspondence, Verification, and Localization.

Abstract

Vision-language models (VLMs) have achieved impressive results on single-view vision tasks, but lack the multi-view spatial reasoning capabilities essential for embodied AI systems to understand 3D environments and manipulate objects across different viewpoints. In this work, we introduce Cross-View Relations (XVR), a large-scale dataset designed to teach VLMs spatial reasoning across multiple views. XVR comprises 100K visual question-answer (VQA) samples derived from 18K diverse 3D scenes and 70K robotic manipulation trajectories, spanning three fundamental spatial reasoning tasks: Correspondence (matching objects across views), Verification (validating spatial relationships), and Localization (identifying object positions). VLMs fine-tuned on XVR achieve substantial improvements on established multi-view and robotic spatial reasoning benchmarks (MindCube and RoboSpatial). When integrated as backbones in Vision-Language-Action models, XVR-trained representations improve success rates on RoboCasa. Our results demonstrate that explicit training on cross-view spatial relations significantly enhances multi-view reasoning and transfers effectively to real-world robotic manipulation.

Cross-View Relation Tasks

We organize XVR into three task categories inspired by Structure-from-Motion (SfM), operationalized through eight specific tasks.

Correspondence identifies matching elements across views that represent the same physical entity; Verification checks whether multi-view observations are geometrically or temporally consistent; and Localization determines relative camera positions and which viewpoint corresponds to specific spatial conditions.

The eight tasks are: Point, Directional, Spatial, Temporal, Viewpoint, Directional View, Cross-Scenario, and Language-Conditioned.
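To make the sample format concrete, here is a minimal sketch of what a multi-view QA record in this style could look like. The field names and example values are illustrative assumptions, not the dataset's actual schema:

```python
from dataclasses import dataclass


@dataclass
class MultiViewQASample:
    """Illustrative multi-view VQA record (field names are assumptions)."""
    image_paths: list   # one entry per view; XVR averages ~4.3 views per sample
    category: str       # "Correspondence", "Verification", or "Localization"
    question: str
    choices: list       # multiple-choice options
    answer: str


# A hypothetical Correspondence-style sample spanning three views.
sample = MultiViewQASample(
    image_paths=["view_0.jpg", "view_1.jpg", "view_2.jpg"],
    category="Correspondence",
    question="Which object in view 1 matches the marked object in view 0?",
    choices=["A", "B", "C", "D"],
    answer="B",
)
assert sample.answer in sample.choices
```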

Dataset Overview

XVR provides the highest mean images per sample (4.32) among training datasets, with supervision spanning both general and robotic domains.

| Dataset | Split | Imgs/sample (mean) | Domain | # Images | # QAs |
|---|---|---|---|---|---|
| SpatialVLM | Train, Eval | 1.00 | General | 10M | 2B |
| RoboSpatial | Train, Eval | 1.00 | General | 1M | 3M |
| MindCube | Train, Eval | 3.37 | General | 3.2K | 21K |
| MultiSPA | Train, Eval | 1.85 | General | 1.1M | 27M |
| XVR (Ours) | Train, Eval | 4.32 | General, Robotic | 447K | 103K |
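As a rough sanity check, the mean-views figure follows from dividing the image count by the QA count; with the rounded table entries (447K and 103K) this gives approximately 4.34, consistent with the reported 4.32 that presumably comes from unrounded counts:

```python
# Rounded counts from the table above; the exact 4.32 figure would
# come from the unrounded image and QA counts.
num_images = 447_000
num_qas = 103_000

mean_views = num_images / num_qas
print(round(mean_views, 2))  # ≈ 4.34 with rounded counts
```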

Experimental Results

XVR-Eval Benchmark

Qwen3-VL-2B-XVR achieves a 1.8× improvement over its base model and ranks first among all evaluated models, surpassing both open-source and closed-source alternatives. Notably, it exceeds human performance on Point Correspondence (94.32% vs. 92.31%, a +2.01-point margin).

XVR-Eval performance comparison

Generalization to External Benchmarks

Training on XVR improves Qwen3-VL-2B across all tasks on MindCube-Tiny and RoboSpatial-Home, with the largest gains on the Compatibility (+7.6%) and Among (+7.0%) subtasks. These improvements occur despite substantial distribution shifts, validating that cross-view relation reasoning captures general spatial principles.

MindCube-Tiny and RoboSpatial-Home results

Transfer to Embodied Tasks

We extend XVR-trained VLMs into Vision-Language-Action (VLA) models and evaluate on three RoboCasa manipulation tasks. XVR-trained models consistently improve manipulation performance: CoffeePressButton (83.1% → 89.3%, +6.2%), TurnOffMicrowave (45.7% → 72.7%, +27.0%), and PnPCabToCounter (31.1% → 38.8%, +7.7%). The largest gains occur on TurnOffMicrowave, where cross-view spatial disambiguation is most critical.

Three manipulation tasks and camera-view configurations
RoboCasa VLA Performance

BibTeX

TBD