MVP-LAM: Learning Action-Centric Latent Action via Cross-Viewpoint Reconstruction

1Seoul National University, 2Microsoft Research Asia, 3Konkuk University, 4HodooAI Labs
Concept figure

Cross-viewpoint reconstruction trains a latent action inferred from one view to explain the future in another view.

Abstract

Learning latent actions from diverse human videos enables scaling robot learning beyond embodiment-specific robot datasets, and these latent actions have been used as pseudo-action labels for vision-language-action (VLA) model pretraining. To make VLA pretraining effective, latent actions should contain information about the underlying agent actions despite missing ground-truth labels. We propose Multi-ViewPoint Latent Action Model (MVP-LAM), which learns discrete latent actions from time-synchronized multi-view videos. MVP-LAM trains latent actions with a cross-viewpoint reconstruction objective so that a latent action inferred from one view must explain the future in another view, reducing reliance on viewpoint-specific cues. On Bridge V2, MVP-LAM improves action-centricity, achieving higher mutual information with ground-truth actions and improved action prediction, including under out-of-distribution evaluation. Pretraining VLAs with MVP-LAM latent actions improves downstream manipulation performance on the SIMPLER and LIBERO-Long benchmarks.

Method

MVP-LAM learns action-centric latent actions by training on time-synchronized multi-view videos with a cross-viewpoint reconstruction objective. Self-viewpoint reconstruction predicts $o_{t+1}^{v}$ from $(o_t^{v}, z_t^{v})$. Cross-viewpoint reconstruction swaps latent actions across synchronized views and predicts $o_{t+1}^{v}$ from $(o_t^{v}, z_t^{\tilde v})$ for $v \neq \tilde v$.
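The self- and cross-viewpoint objectives can be sketched as a single reconstruction loop over synchronized views. The sketch below is a minimal numpy illustration, not the paper's implementation: the linear `encode`/`decode` maps stand in for the real encoder and decoder networks, and all dimensions and names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # observation feature dim (illustrative)
K = 8   # latent action dim (illustrative)

# Hypothetical linear stand-ins for the encoder and decoder networks.
W_enc = rng.normal(size=(2 * D, K)) * 0.1
W_dec = rng.normal(size=(D + K, D)) * 0.1

def encode(o_t, o_next):
    """Infer a latent action z_t^v from a frame transition in one view."""
    return np.concatenate([o_t, o_next]) @ W_enc

def decode(o_t, z):
    """Predict o_{t+1}^v from the current frame and a latent action."""
    return np.concatenate([o_t, z]) @ W_dec

def mvp_lam_loss(views):
    """Self- plus cross-viewpoint reconstruction over synchronized views.

    views: list of (o_t, o_{t+1}) pairs, one per camera, time-synchronized.
    """
    zs = [encode(o_t, o_next) for o_t, o_next in views]
    loss, n_terms = 0.0, 0
    for o_t, o_next in views:
        for z in zs:
            # When z comes from the same view this is the self-viewpoint
            # term; otherwise the latent action inferred from another view
            # must explain this view's future (the cross-viewpoint term).
            pred = decode(o_t, z)
            loss += np.mean((pred - o_next) ** 2)
            n_terms += 1
    return loss / n_terms

views = [(rng.normal(size=D), rng.normal(size=D)) for _ in range(2)]
print(mvp_lam_loss(views))
```

Because the swapped latent $z_t^{\tilde v}$ must reconstruct a view it was not inferred from, viewpoint-specific cues in $z$ stop being useful, which is the mechanism pushing the latent toward action content.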

Architecture figure

Experiments

RQ1. Are MVP-LAM latent actions more action-centric?

We measure action-centricity with mutual information between latent actions and ground-truth actions and with a linear probe that predicts actions from latent actions, reporting NMSE. MVP-LAM achieves the highest estimated $\mathcal{I}(Z;A)$ across estimators and the lowest NMSE on Bridge V2.
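The linear-probe NMSE metric can be illustrated with a small least-squares sketch. The synthetic data below stands in for real Bridge V2 latents and actions; the dimensions and the dependence of actions on latents are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
N, K, A = 500, 8, 7  # samples, latent dim, action dim (illustrative)

# Synthetic stand-ins: latent actions Z and ground-truth actions that
# partially depend on Z through an unknown linear map plus noise.
Z = rng.normal(size=(N, K))
W_true = rng.normal(size=(K, A))
actions = Z @ W_true + 0.1 * rng.normal(size=(N, A))

# Linear probe: least-squares fit from latent actions to actions.
Z1 = np.hstack([Z, np.ones((N, 1))])  # add a bias column
W, *_ = np.linalg.lstsq(Z1, actions, rcond=None)
pred = Z1 @ W

# Normalized MSE: residual error over action variance. Lower NMSE means
# the probe recovers more of the action from the latent, i.e. the
# latent is more action-centric.
nmse = np.mean((pred - actions) ** 2) / np.mean((actions - actions.mean(0)) ** 2)
print(float(nmse))
```

Mutual information $\mathcal{I}(Z;A)$ complements this probe: the NMSE only captures linearly decodable action content, while MI estimators also credit nonlinear dependence.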

Mutual information and NMSE figure

RQ2. Is MVP-LAM effective for manipulation?

Pretraining with MVP-LAM latent actions improves downstream manipulation. The average success rate increases from 39.6 percent to 60.4 percent on SIMPLER. On LIBERO-Long, MVP-LAM reaches 90.8 percent success, improving over UniVLA pretrained on Bridge V2 at 79.4 percent.

SIMPLER benchmark

Success rate in percent.

| Task | MVP-LAM | UniVLA | LAPA | OpenVLA | Octo-Small | Octo-Base | $\pi_0$ |
|---|---|---|---|---|---|---|---|
| StackG2Y | 33.3 | 16.7 | 54.2 | 41.6 | 8.3 | 0.0 | 37.5 |
| Carrot2Plate | 66.7 | 20.8 | 45.8 | 50.0 | 33.3 | 37.5 | 33.3 |
| Spoon2Towel | 66.7 | 54.2 | 70.8 | 37.5 | 25.0 | 12.5 | 29.2 |
| Eggplant2Bask | 75.0 | 66.7 | 58.3 | 16.7 | 12.5 | 20.8 | 45.8 |
| AVG | 60.4 | 39.6 | 57.3 | 36.4 | 19.8 | 17.7 | 36.5 |

LIBERO-Long

| MVP-LAM | UniVLA (Bridge) | OpenVLA | $\pi_0$ | UniVLA (OXE) |
|---|---|---|---|---|
| 90.8 | 79.4 | 53.7 | 85.2 | 92.0 |

Visualization

We show example discrete codes selected for representative frame transitions. Similar motion patterns tend to activate similar codes across different data sources.

Latent action visualization

Rollouts

Stack green cube on yellow block

MVP-LAM: Success
UniVLA: Fail
Octo-B: Fail
$\pi_0$: Fail

Place carrot on plate

MVP-LAM: Success
UniVLA: Fail
Octo-B: Success
$\pi_0$: Success

Place spoon on towel

MVP-LAM: Success
UniVLA: Success
Octo-B: Success
$\pi_0$: Fail

Place eggplant in basket

MVP-LAM: Success
UniVLA: Success
Octo-B: Fail
$\pi_0$: Success

Put the black bowl in the bottom drawer of the cabinet and close it

MVP-LAM: Success
UniVLA: Success
$\pi_0$: Fail

Put both moka pots on the stove

MVP-LAM: Success
UniVLA: Fail
$\pi_0$: Success

Put the yellow and white mug in the microwave and close it

MVP-LAM: Success
UniVLA: Fail
$\pi_0$: Success

BibTeX

@misc{lee2026mvplamlearningactioncentriclatent,
  title     = {MVP-LAM: Learning Action-Centric Latent Action via Cross-Viewpoint Reconstruction},
  author    = {Jung Min Lee and Dohyeok Lee and Seokhun Ju and Taehyun Cho and Jin Woo Koo and Li Zhao and Sangwoo Hong and Jungwoo Lee},
  year      = {2026},
  eprint    = {2602.03668},
  archivePrefix = {arXiv},
  primaryClass  = {cs.RO},
  url       = {https://arxiv.org/abs/2602.03668}
}