MVP-LAM: Learning Action-Centric Latent Action via Cross-Viewpoint Reconstruction

Learning latent actions from diverse human videos enables scaling robot learning beyond embodiment-specific robot datasets, and these latent actions have been used as pseudo-action labels for vision-language-action (VLA) model pretraining. To make VLA pretraining effective, latent actions should contain information about the underlying agent actions despite missing ground-truth labels. We propose Multi-ViewPoint Latent Action Model (MVP-LAM), which learns discrete latent actions from time-synchronized multi-view videos. MVP-LAM trains latent actions with a cross-viewpoint reconstruction objective so that a latent action inferred from one view must explain the future in another view, reducing reliance on viewpoint-specific cues. On Bridge V2, MVP-LAM improves action-centricity, achieving higher mutual information with ground-truth actions and improved action prediction, including under out-of-distribution evaluation. Pretraining VLAs with MVP-LAM latent actions improves downstream manipulation performance on the SIMPLER and LIBERO-Long benchmarks.

MVP-LAM learns action-centric latent actions by training on time-synchronized multi-view videos with a cross-viewpoint reconstruction objective. Here $o_t^{v}$ denotes the observation at time $t$ from viewpoint $v$, and $z_t^{v}$ denotes the latent action inferred from that viewpoint. Self-viewpoint reconstruction predicts $o_{t+1}^{v}$ from $(o_t^{v}, z_t^{v})$. Cross-viewpoint reconstruction swaps latent actions across synchronized views and predicts $o_{t+1}^{v}$ from $(o_t^{v}, z_t^{\tilde v})$ for $v \neq \tilde v$.
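As a rough illustration of the two objectives, the sketch below computes a self-viewpoint and a cross-viewpoint reconstruction loss for a toy two-camera example. The additive `toy_decoder`, the mean-squared-error loss, the scalar latents, and the camera names are all assumptions made for illustration; the text above does not specify the decoder architecture or the exact loss.

```python
from itertools import permutations


def mse(pred, target):
    """Mean squared error between two equal-length feature lists."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def toy_decoder(obs, z):
    """Hypothetical stand-in for a learned decoder: shifts every feature by z."""
    return [x + z for x in obs]


def reconstruction_losses(obs_t, obs_next, latents, decoder=toy_decoder):
    """Return (self_loss, cross_loss) averaged over views and ordered view pairs.

    obs_t, obs_next: dict mapping view name -> feature list at t and t+1
    latents:         dict mapping view name -> latent action inferred from that view
    """
    views = sorted(obs_t)
    # Self-viewpoint: predict o_{t+1}^v from (o_t^v, z_t^v).
    self_loss = sum(
        mse(decoder(obs_t[v], latents[v]), obs_next[v]) for v in views
    ) / len(views)
    # Cross-viewpoint: predict o_{t+1}^v from (o_t^v, z_t^u) with u != v.
    pairs = list(permutations(views, 2))
    cross_loss = sum(
        mse(decoder(obs_t[v], latents[u]), obs_next[v]) for v, u in pairs
    ) / len(pairs)
    return self_loss, cross_loss


# Toy data: the same motion appears as a +0.5 shift in both cameras.
obs_t = {"cam0": [0.0, 1.0], "cam1": [2.0, 3.0]}
obs_next = {"cam0": [0.5, 1.5], "cam1": [2.5, 3.5]}

shared = {"cam0": 0.5, "cam1": 0.5}       # both views agree on the latent
mismatched = {"cam0": 0.5, "cam1": 0.75}  # cam1's latent has drifted

print(reconstruction_losses(obs_t, obs_next, shared))
print(reconstruction_losses(obs_t, obs_next, mismatched))
```

In this toy setting, latents that agree across views drive both losses to zero, while a latent that is off in one view is penalized when it is swapped into the other view.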

We measure action-centricity in two ways: the mutual information between latent actions and ground-truth actions, and the normalized mean squared error (NMSE) of a linear probe that predicts actions from latent actions. MVP-LAM achieves the highest estimated $\mathcal{I}(Z;A)$ across estimators and the lowest NMSE on Bridge V2.

Pretraining with MVP-LAM latent actions improves downstream manipulation. On SIMPLER, the average success rate rises from 39.6 percent for UniVLA to 60.4 percent for MVP-LAM. On LIBERO-Long, MVP-LAM reaches 90.8 percent success, up from 79.4 percent for UniVLA pretrained on Bridge V2.

Success rate and grasping rate in percent. Best is bolded and second best is underlined.

Success Rate	MVP-LAM	UniVLA	LAPA	OpenVLA	Octo-Small	Octo-Base	$\pi_0$
StackG2Y	33.3	16.7	54.2	41.6	8.3	0.0	37.5
Carrot2Plate	66.7	20.8	45.8	50.0	33.3	37.5	33.3
Spoon2Towel	66.7	54.2	70.8	37.5	25.0	12.5	29.2
Eggplant2Bask	75.0	66.7	58.3	16.7	12.5	20.8	45.8
AVG	60.4	39.6	57.3	36.4	19.8	17.7	36.5
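The AVG row is consistent with an unweighted mean of the four per-task success rates. The short check below recomputes it for three of the columns; the dictionary layout and variable names are illustrative only.

```python
from statistics import fmean

# Per-task SIMPLER success rates (percent), copied from the table above.
# Task order: StackG2Y, Carrot2Plate, Spoon2Towel, Eggplant2Bask.
success = {
    "MVP-LAM": [33.3, 66.7, 66.7, 75.0],
    "UniVLA": [16.7, 20.8, 54.2, 66.7],
    "LAPA": [54.2, 45.8, 70.8, 58.3],
}

# Unweighted mean over tasks, rounded to one decimal as in the table.
average = {model: round(fmean(rates), 1) for model, rates in success.items()}

# Absolute gain of MVP-LAM over UniVLA in percentage points.
gain = round(average["MVP-LAM"] - average["UniVLA"], 1)

print(average)
print(gain)
```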

LIBERO-Long success rate in percent.

MVP-LAM	UniVLA (Bridge)	OpenVLA	$\pi_0$	UniVLA (OXE)
90.8	79.4	53.7	85.2	92.0

We show example discrete codes selected for representative frame transitions. Similar motion patterns tend to activate similar codes across sources.

BibTeX

@misc{lee2026mvplamlearningactioncentriclatent,
  title     = {MVP-LAM: Learning Action-Centric Latent Action via Cross-Viewpoint Reconstruction},
  author    = {Jung Min Lee and Dohyeok Lee and Seokhun Ju and Taehyun Cho and Jin Woo Koo and Li Zhao and Sangwoo Hong and Jungwoo Lee},
  year      = {2026},
  eprint    = {2602.03668},
  archivePrefix = {arXiv},
  primaryClass  = {cs.RO},
  url       = {https://arxiv.org/abs/2602.03668}
}


Cross-viewpoint reconstruction trains a latent action inferred from one view to explain the future in another view.

Abstract

Method

Experiments

RQ1. Are MVP-LAM latent actions more action-centric?

RQ2. Is MVP-LAM effective for manipulation?

SIMPLER benchmark

LIBERO-Long

Visualization

Rollouts

Stack green cube on yellow block

Place carrot on plate

Place spoon on towel

Place eggplant in basket

Put the black bowl in the bottom drawer of the cabinet and close it

Put both moka pots on the stove

Put the yellow and white mug in the microwave and close it
