Replacing a character in a video with a user-provided one in a controllable manner remains a challenging problem due to the lack of qualified paired video data. Prior works have predominantly adopted a reconstruction-based paradigm that relies on per-frame masks and explicit structural guidance (e.g., pose, depth). This reliance, however, renders them fragile in complex scenarios involving occlusions, rare poses, character-object interactions, or complex illumination, often resulting in visual artifacts and temporal discontinuities. In this paper, we propose MoCha, a novel framework that bypasses these limitations: it requires only a single first-frame mask and re-renders the character by unifying the different conditions into a single token stream. Furthermore, MoCha adopts a condition-aware RoPE to support multiple reference images and variable-length video generation. To overcome the data bottleneck, we construct a comprehensive data synthesis pipeline to collect qualified paired training videos. Extensive experiments show that our method substantially outperforms existing state-of-the-art approaches.
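To make the condition-aware RoPE idea concrete, the sketch below shows one plausible way to assign rotary positions over a single concatenated token stream: video tokens occupy the ordinary temporal range, while each reference image is placed in its own disjoint position range so that any number of references and any video length can coexist. The function names, the fixed offset scheme, and the 1D formulation are illustrative assumptions for exposition, not the paper's exact design.

```python
import torch

def rope_angles(positions: torch.Tensor, dim: int, base: float = 10000.0) -> torch.Tensor:
    """Per-position rotation angles for standard 1D rotary embeddings."""
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    return positions.float()[:, None] * inv_freq[None, :]          # (num_tokens, dim/2)

def apply_rope(x: torch.Tensor, angles: torch.Tensor) -> torch.Tensor:
    """Rotate query/key features x of shape (num_tokens, dim) by the given angles."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = angles.cos(), angles.sin()
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def condition_aware_positions(num_video_tokens: int,
                              ref_token_counts: list[int],
                              ref_offset: int = 10_000) -> torch.Tensor:
    """Hypothetical position assignment: video tokens get positions [0, T),
    while the i-th reference image is shifted to its own range starting at
    ref_offset * (i + 1), keeping conditions separable in one token stream."""
    pos = [torch.arange(num_video_tokens)]
    for i, n in enumerate(ref_token_counts):
        start = ref_offset * (i + 1)
        pos.append(torch.arange(start, start + n))
    return torch.cat(pos)

# Example: a 4096-token video plus two 256-token reference images share one
# stream; queries/keys for all of them are rotated with the same machinery.
positions = condition_aware_positions(4096, [256, 256])
q = torch.randn(positions.numel(), 64)
q_rot = apply_rope(q, rope_angles(positions, dim=64))
```

Under this kind of scheme, adding or removing reference images or changing the clip length only changes the position bookkeeping, not the model architecture, which is the property the abstract attributes to the condition-aware RoPE.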