MoCha generates high-fidelity videos when conditioned on cartoon character reference images.
MoCha performs high-quality character replacement in source videos, maintaining strong consistency of lighting, motion, facial expression, and character identity in the output.
Replacing a character in a video with a user-provided one in a controllable way remains a challenging problem due to the lack of qualified paired video data. Prior works have predominantly adopted a reconstruction-based paradigm that relies on per-frame masks and explicit structural guidance (e.g., pose, depth). This reliance, however, renders them fragile in complex scenarios involving occlusions, rare poses, character-object interactions, or complex illumination, often resulting in visual artifacts and temporal discontinuities. In this paper, we propose MoCha, a novel framework that bypasses these limitations: it requires only a single first-frame mask and re-renders the character by unifying the different conditions into a single token stream. Furthermore, MoCha adopts a condition-aware RoPE to support multiple reference images and variable-length video generation. To overcome the data bottleneck, we construct a comprehensive data-synthesis pipeline to collect qualified paired training videos. Extensive experiments show that our method substantially outperforms existing state-of-the-art approaches.
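The abstract does not spell out how the condition-aware RoPE works. A minimal sketch of one plausible interpretation, in which reference-image tokens share a single token stream with video tokens but are assigned their own positional ranges so the model can tell the conditions apart, might look like the following (the function names and the fixed-offset scheme are assumptions for illustration, not the paper's actual design):

```python
import numpy as np

def rope_rotate(x, positions, base=10000.0):
    """Apply 1D rotary position embedding (RoPE) to token features.

    x: (num_tokens, dim) array with even dim; positions: (num_tokens,) indices.
    """
    num_tokens, dim = x.shape
    half = dim // 2
    freqs = 1.0 / (base ** (np.arange(half) / half))   # per-channel frequencies
    angles = positions[:, None] * freqs[None, :]       # (num_tokens, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    # Standard rotation of paired feature channels.
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def condition_aware_positions(num_video, num_refs, ref_len, offset=10000):
    """Assign each condition its own positional range in the unified stream.

    Video tokens occupy positions [0, num_video); the i-th reference image's
    tokens start at offset * (i + 1), far from the video range and from each
    other, so attention can distinguish the conditions by position alone.
    """
    video_pos = np.arange(num_video)
    ref_pos = [offset * (i + 1) + np.arange(ref_len) for i in range(num_refs)]
    return np.concatenate([video_pos] + ref_pos)
```

Because positions are computed per condition rather than per fixed grid, the same scheme extends naturally to a variable number of reference images and to variable-length video, consistent with the capabilities the abstract claims.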
To start your own character replacement with MoCha, the following three inputs are required:
1. A source video containing the character to be replaced.
2. A mask of that character in the first frame of the video.
3. One or more reference images of the replacement character.
MoCha also performs well in replacing real-person characters in source videos.
Compared with existing works, MoCha can better preserve the lighting and color tone of the original video, making the character more naturally integrated into the new environment. Furthermore, MoCha can handle complex lighting conditions, such as shaking lights and strong backlighting.
MoCha can accurately replicate the actions and expressions of the original video, even in complex scenarios involving fast movements and object interactions. This ensures that the generated character video maintains high fidelity to the source performance.
@misc{orange2025mocha,
  title={MoCha: End-to-End Video Character Replacement without Structural Guidance},
  author={Orange Team},
  year={2025},
  url={https://orange-3dv-team.github.io/MoCha},
}