SmartDirector

Keyframe-Conditioned Cinematic Video Generation with Narrative Pacing Control

Zhida Zhang¹, Jie Ma², Zhan Peng³, Yang Han², Haoxue Wu², Jun Liang², Jie Cao¹, Jing Li²
¹ New Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences
² Youku Moku-Lab
³ Huazhong University of Science and Technology

ABSTRACT

The narrative quality of a video fundamentally determines its perceptual value. Although existing video generation methods can produce visually appealing content, they predominantly rely on sparse conditioning signals such as text prompts or first/last frames, which limits precise control over narrative structure and temporal pacing.

In this paper, we propose SmartDirector, a framework that enhances the narrative capacity of video generation models through multiple keyframes. SmartDirector supports flexible generation scenarios including single-shot generation, multi-shot narrative synthesis, and video extension. The framework operates in two stages: Director-Gen generates a low-resolution video conditioned on the provided keyframes, and Director-SR refines the output by exploiting high-resolution keyframes as semantic anchors to recover fine-grained details. To enable robust multi-keyframe training, we construct a data pipeline that curates single-shot and multi-shot sequences from movies. Extensive experiments demonstrate that SmartDirector substantially outperforms existing state-of-the-art approaches. We will release the code to facilitate further research.

MULTI-KEYFRAME GENERATION

Given multiple keyframes as conditions, SmartDirector generates coherent videos with smooth transitions and consistent narratives across shots.

kf1 kf2 kf3 kf4 kf5 kf6 kf7
In a heartwarming stop-motion style, the video follows a straw-woven raccoon baker enjoying a cozy morning of baking. The narrative captures the process from kneading and cutting dough to placing it in an oven glowing with magical golden light.
kf1 kf2 kf3 kf4 kf5 kf6 kf7 kf8 kf9
3D claymation. The boy intensely plays games, and the dinosaur sweats nervously. Suddenly they win and jump up on the couch to celebrate. Potato chip crumbs explode and rain down like golden confetti.
kf1 kf2 kf3 kf4 kf5 kf6 kf7
Rendered in a delicate 2D fantasy style blending Studio Ghibli and Makoto Shinkai, the video depicts a boy falling through a cyan futuristic vertical city who touches a golden orb and is instantly engulfed by an explosive burst of light.
kf1 kf2 kf3 kf4 kf5 kf6 kf7 kf8
This stop-motion-inspired cinematic sequence depicts a heartwarming interaction between a red-hatted felt girl and a blue felt orca, progressing from a curious peep through a fisheye door viewer to the joyful exchange of a gift.
kf1 kf2 kf3 kf4 kf5
A young East Asian boy plays a first-person shooter game at home, moving from the living room to his bedroom while a female figure shifts from routine chores to silently observing him.
kf1 kf2 kf3 kf4
Pixar-style 3D animation, golden-hour outdoor scene: a hungry orange kitten hides in a tree, curiously spying on a house—first spotting pizza on a table through one window, then noticing a sleeping puppy through another.

SINGLE-FRAME GENERATION

Given a single keyframe at any temporal position, SmartDirector generates a complete video. The position of each keyframe thumbnail indicates where that keyframe falls in the timeline of the generated video.

keyframe
The video features 3 shots. Shot 1: A low-angle tracking shot follows a husky running. Shot 2: A hard cut to a side or rear chase angle. Shot 3: A hard cut to a long shot featuring a herd of elk in the foreground.
keyframe
The video features a single shot depicting a futuristic motorcycle performing a high-speed drift to evade pursuers on a rainy-night elevated highway in a futuristic city.
keyframe
The video features a single shot depicting a wandering explorer climbing a cliff, removing his hood, and gazing at ancient ruins at sunset in a desolate desert canyon.
keyframe
The video features a single shot depicting a white-robed scholar playing a bamboo flute on a raft in an ink-wash landscape, poetically attracting birds.
keyframe
The video features a single shot depicting a floating girl touching a magic book and awakening paper butterflies in a weightless, antique study.
keyframe
The video features a single shot depicting the cozy, healing atmosphere inside a moving pixel-art train, where snowy scenery rushes past the window and warm steam rises from a teacup.

VIDEO & MIXED-MODAL CONDITIONED GENERATION

SmartDirector supports video-conditioned generation in three modes: forward continuation, backward generation, and in-between interpolation.

Forward: Video → Future
Backward: Video → Past
In-Between Interpolation
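The three video-conditioned modes differ mainly in which positions of the output timeline are supplied by the input and which must be generated. The sketch below illustrates only that bookkeeping. The function name `known_frame_mask`, its parameters, and the assumption that in-between interpolation takes one clip at each end of the timeline are hypothetical and are not taken from the SmartDirector implementation.

```python
def known_frame_mask(total_frames, clip_frames, mode):
    """Return a list of booleans over the output timeline.

    True  -> frame is supplied by the conditioning video.
    False -> frame has to be generated.
    All names and the in-between layout are illustrative assumptions.
    """
    if clip_frames < 0 or clip_frames > total_frames:
        raise ValueError("clip_frames must be between 0 and total_frames")

    if mode == "forward":
        # Conditioning clip sits at the start; the future is generated.
        known = list(range(clip_frames))
    elif mode == "backward":
        # Conditioning clip sits at the end; the past is generated.
        known = list(range(total_frames - clip_frames, total_frames))
    elif mode == "in_between":
        # Assumed layout: one clip at each end, the middle is generated.
        if 2 * clip_frames > total_frames:
            raise ValueError("two clips do not fit in the timeline")
        known = list(range(clip_frames)) + list(
            range(total_frames - clip_frames, total_frames)
        )
    else:
        raise ValueError(f"unknown mode: {mode}")

    mask = [False] * total_frames
    for i in known:
        mask[i] = True
    return mask


def render(mask):
    """Draw the mask as text: '#' for given frames, '.' for generated ones."""
    return "".join("#" if is_known else "." for is_known in mask)


for mode in ("forward", "backward", "in_between"):
    print(f"{mode:>10}: {render(known_frame_mask(12, 3, mode))}")
```

With a 12-frame timeline and 3-frame clips, the forward mask marks the first three frames as given, the backward mask marks the last three, and the in-between mask marks both ends.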

NARRATIVE PACING CONTROL

Using the same set of keyframes, SmartDirector can generate videos with different narrative pacing styles — from slow, suspenseful tension to fast-paced action sequences.

Shared input keyframes: kf1, kf2, kf3
SUSPENSE
This cyberpunk noir prompt depicts a tense rainy alley scene where slow, suffocating dolly-ins and hesitant push-ins amplify psychological dread as a woman's vigilance hardens into resolve.
kf1: t = 0.0 s
kf2: t = 3.0 s
kf3: t = 4.2 s
DOCUMENTARY
Adopting a neutral cyberpunk noir aesthetic, this prompt uses steady tracking and arcing shots to objectively observe a woman's shift from alertness to determination in a rainy alley.
kf1: t = 0.0 s
kf2: t = 1.8 s
kf3: t = 3.4 s
ACTION
This high-intensity cyberpunk noir prompt captures a woman's aggressive movement through a rainy alley using rapid zooms, high-frequency camera shake, and sharp audio cues to maximize kinetic energy.
kf1: t = 0.0 s
kf2: t = 1.0 s
kf3: t = 2.2 s
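The keyframe timestamps listed for the three pacing styles can be compared directly by converting them to frame positions and inter-keyframe gaps. The following is a minimal sketch of that arithmetic. The timestamps come from the examples above, while the frame rate of 24 fps and the helper names `to_frame_indices` and `gaps` are assumptions made for illustration, not values or functions from SmartDirector.

```python
# Keyframe timestamps in seconds, as listed for each pacing style.
PACING_SCHEDULES = {
    "suspense": [0.0, 3.0, 4.2],
    "documentary": [0.0, 1.8, 3.4],
    "action": [0.0, 1.0, 2.2],
}


def to_frame_indices(timestamps, fps=24):
    """Map timestamps to the nearest frame index.

    fps=24 is an assumed frame rate; the source does not state one.
    """
    return [round(t * fps) for t in timestamps]


def gaps(timestamps):
    """Seconds elapsed between consecutive keyframes."""
    return [round(b - a, 3) for a, b in zip(timestamps, timestamps[1:])]


for style, ts in PACING_SCHEDULES.items():
    print(f"{style:>12}: frames={to_frame_indices(ts)} gaps={gaps(ts)}")
```

Under these assumptions, the same second keyframe lands at frame 72 in the suspense schedule but at frame 24 in the action schedule, which is how identical keyframes yield different pacing.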

VIDEO SUPER-RESOLUTION

By leveraging keyframe-conditioned generation, SmartDirector anchors identity information from reference frames during super-resolution, enabling identity-consistent restoration of degraded facial details and corrupted text — a capability beyond conventional SR methods.

Two interactive comparisons: Original vs. Ours.

ETHICAL CONSIDERATIONS

The conditioning images and videos used in these examples are sourced from publicly available channels or generated by models, and are intended solely to demonstrate the capabilities of this research. If you have any concerns, please contact us (wuhaoxue.whx@alibaba-inc.com) and we will promptly remove the relevant examples.

CITATION

@article{zhang2026smartdirector,
  title   = {SmartDirector: Keyframe-Conditioned Cinematic Video Generation with Narrative Pacing Control},
  author  = {Zhang, Zhida and Ma, Jie and Peng, Zhan and Wu, Haoxue and Han, Yang and Liang, Jun and Cao, Jie and Li, Jing},
  journal = {arXiv preprint arXiv:2605.27891},
  year    = {2026}
}