Real-Time Video World Modeling with Panoramic Gaussian Scaffold
Orange Team, Youku Moku-Lab, HUJING Digital Media & Entertainment Group
Given a single narrow-field-of-view image, MoVerse separates world construction from observation rendering. Stages I and II build a reusable panoramic 3D Gaussian scaffold offline; Stage III translates scaffold renderings along user-specified camera trajectories into photorealistic video at 8 FPS on a single RTX 4090.
MoVerse turns a single input photograph into a free-roaming video walkthrough. The camera trajectory is user-controlled; the scaffold keeps geometry consistent across revisits, while the causal renderer streams temporally coherent frames in real time.
Stage I expands the input image into a gravity-aligned, horizontally periodic 360° panorama with topology-aware latent diffusion. The resulting panorama is the omnidirectional evidence that the 3D scaffold lifts.
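To make "horizontally periodic" concrete, here is a minimal NumPy sketch of treating the longitude axis of an equirectangular panorama as circular. It is an illustration, not the MoVerse implementation: the helper name `wrap_pad_longitude` and the (H, W, C) array layout are assumptions, and the sketch does not model the topology-aware latent diffusion itself.

```python
import numpy as np

def wrap_pad_longitude(pano: np.ndarray, pad: int) -> np.ndarray:
    """Circularly pad an equirectangular panorama along its width.

    pano: array of shape (H, W, C); the width axis spans 360 degrees of
          longitude, so column 0 and column W-1 are neighbours on the sphere.
    pad:  number of columns to borrow from the opposite edge on each side.
    """
    # mode="wrap" copies columns from the opposite edge, so a filter that
    # slides over the padded array sees a continuous 360-degree signal.
    return np.pad(pano, ((0, 0), (pad, pad), (0, 0)), mode="wrap")

# Tiny example: a 2 x 4 single-channel "panorama".
pano = np.arange(8, dtype=np.float32).reshape(2, 4, 1)
padded = wrap_pad_longitude(pano, pad=1)
# The new left-most column is a copy of the original right-most column,
# and the new right-most column is a copy of the original left-most column.
```

Padding this way before a convolution, or an equivalent circular operation in latent space, is one common way to keep the left and right edges of a panorama consistent. Whether Stage I uses this exact mechanism is not stated here.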
Stage II lifts the panorama into a panoramic 3D Gaussian scaffold using feed-forward residual prediction in angular–inverse-depth space. The scaffold is a persistent, splattable scene asset and is what the video renderer in Stage III conditions on along the user-specified trajectory.
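As a rough illustration of what an angular–inverse-depth parameterization looks like, the sketch below places one Gaussian center per panorama pixel from its longitude, latitude, and inverse depth, with an optional per-pixel residual added in that same space before converting to Cartesian coordinates. Everything specific is an assumption for illustration: the function names, the y-up axis convention, the pixel-center angle mapping, and the residual layout `(d_lon, d_lat, d_inv_depth)`. The feed-forward network that would predict the residuals is not modeled.

```python
import numpy as np

def pixel_angles(height: int, width: int):
    """Longitude/latitude at equirectangular pixel centers, each of shape (H, W)."""
    u = (np.arange(width) + 0.5) / width      # normalized column in (0, 1)
    v = (np.arange(height) + 0.5) / height    # normalized row in (0, 1)
    lon = (u - 0.5) * 2.0 * np.pi             # longitude in (-pi, pi)
    lat = (0.5 - v) * np.pi                   # latitude from +pi/2 (top) to -pi/2
    lon_grid, lat_grid = np.meshgrid(lon, lat)
    return lon_grid, lat_grid

def lift_to_centers(inv_depth, residual=None, eps=1e-6):
    """Convert per-pixel inverse depth (H, W) into 3D centers (H, W, 3).

    residual: optional (H, W, 3) array of hypothetical offsets
              (d_lon, d_lat, d_inv_depth) applied before conversion.
    """
    h, w = inv_depth.shape
    lon, lat = pixel_angles(h, w)
    rho = inv_depth
    if residual is not None:
        lon = lon + residual[..., 0]
        lat = lat + residual[..., 1]
        rho = rho + residual[..., 2]
    # Clamp inverse depth so very distant points stay finite.
    radius = 1.0 / np.maximum(rho, eps)
    # Unit viewing direction, assuming y is up and +z is longitude zero.
    direction = np.stack(
        [np.cos(lat) * np.sin(lon), np.sin(lat), np.cos(lat) * np.cos(lon)],
        axis=-1,
    )
    return direction * radius[..., None]

# Example: constant inverse depth 0.5 places every center on a sphere of radius 2.
centers = lift_to_centers(np.full((4, 8), 0.5))
```

Working in inverse depth keeps far-away content in a bounded numeric range (inverse depth approaches zero rather than depth approaching infinity), which is a common motivation for this kind of parameterization.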
@article{moverse2026,
title = {MoVerse: Real-Time Video World Modeling with Panoramic Gaussian Scaffold},
author = {Yang Zhou and Ziheng Wang and Yuqin Lu and Haofeng Liu and Jun Liang and Shengfeng He and Jing Li},
journal = {arXiv preprint arXiv:2606.13376},
year = {2026}
}