The narrative quality of a video fundamentally determines its perceptual value. Although existing video generation methods can produce visually appealing content, they predominantly rely on sparse conditioning signals such as text prompts or first/last frames, which limits precise control over narrative structure and temporal pacing.
In this paper, we propose SmartDirector, a framework that enhances the narrative capacity of video generation models through multiple keyframes. SmartDirector supports flexible generation scenarios including single-shot generation, multi-shot narrative synthesis, and video extension. The framework operates in two stages: Director-Gen generates a low-resolution video conditioned on the provided keyframes, and Director-SR refines the output by exploiting high-resolution keyframes as semantic anchors to recover fine-grained details. To enable robust multi-keyframe training, we construct a data pipeline that curates single-shot and multi-shot sequences from movies. Extensive experiments demonstrate that SmartDirector substantially outperforms existing state-of-the-art approaches. We will release the code to facilitate further research.
Given multiple keyframes as conditions, SmartDirector generates coherent videos with smooth transitions and consistent narratives across shots.
Given a single keyframe at any temporal position, SmartDirector generates a complete video. The keyframe thumbnail position indicates where it falls in the generated video timeline.
SmartDirector supports video-conditioned generation — forward continuation, backward generation, and in-between interpolation.
Using the same set of keyframes, SmartDirector can generate videos with different narrative pacing styles — from slow, suspenseful tension to fast-paced action sequences.
t=0.0s
t=3.0s
t=4.2s
t=0.0s
t=1.8s
t=3.4s
t=0.0s
t=1.0s
t=2.2sBy leveraging keyframe-conditioned generation, SmartDirector anchors identity information from reference frames during super-resolution, enabling identity-consistent restoration of degraded facial details and corrupted text — a capability beyond conventional SR methods.
The insertion condition images and videos used in these examples are sourced from publicly available channels or generated by models, and are intended solely to demonstrate the capabilities of this research. If there are any concerns, please contact us (wuhaoxue.whx@alibaba-inc.com) and we will remove the relevant examples in time.