Efficient Long Video Generation via Next-Frame-Rate Prediction

1Fudan University, 2TeleAI
TempoMaster first generates a video sequence at a coarse, low frame rate to establish the global dynamics and semantic structure, and then refines it by predicting frames at higher rates, which improves temporal smoothness and detail.

Abstract

We present TempoMaster, a novel framework that formulates long video generation as next-frame-rate prediction. Specifically, we first generate a low-frame-rate clip that serves as a coarse blueprint of the entire video sequence, and then progressively increase the frame rate to refine visual details and motion continuity. During generation, TempoMaster employs bidirectional attention within each frame-rate level while performing autoregression across frame rates, thus achieving long-range temporal coherence while enabling efficient and parallel synthesis. Extensive experiments demonstrate that TempoMaster establishes a new state-of-the-art in long video generation, excelling in both visual and temporal quality.

Next-Frame-Rate Prediction

Let's consider a sequence of T frames.

A typical bidirectional model encodes and denoises all frames jointly.
It delivers high temporal consistency at the cost of computational efficiency.

Autoregressive models predict frames sequentially, each conditioned on the frames generated before it.
They are efficient but suffer from error accumulation and lack long-range temporal coherence.
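
To make the contrast between these two baselines concrete, here is a minimal sketch of their frame-level attention patterns. It is an illustration under simplifying assumptions, not TempoMaster's implementation: attention is modeled at whole-frame granularity, and all function names are hypothetical.

```python
def bidirectional_mask(num_frames):
    """Every frame may attend to every frame, past or future."""
    return [[True] * num_frames for _ in range(num_frames)]


def causal_mask(num_frames):
    """A query frame may attend only to itself and earlier frames."""
    return [[key <= query for key in range(num_frames)]
            for query in range(num_frames)]


def attended_pairs(mask):
    """Count the (query, key) frame pairs the mask allows."""
    return sum(sum(row) for row in mask)


T = 4
print(attended_pairs(bidirectional_mask(T)))  # 16 = T * T
print(attended_pairs(causal_mask(T)))         # 10 = T * (T + 1) / 2
```

The bidirectional pattern lets every frame see the whole sequence, so the number of pairs grows as T squared. The causal pattern only looks backward, so a mistake in an early frame cannot be corrected by later ones.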

TempoMaster decouples global content planning from local detail refinement.
In this way, a coarse (low-frame-rate) frame sequence plans a consistent temporal structure.
The fine-grained details are then refined efficiently by predicting frames at higher rates in parallel.
The generated low-fps frames can be partitioned into multiple segments to enable parallel generation.
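
The coarse-to-fine plan described above can be sketched as an index schedule. This is a sketch under stated assumptions rather than the actual implementation: it assumes that each lower-rate level is a uniform subsample of the final frame grid, that frames generated at earlier levels are kept, and that neighboring segments share one boundary keyframe. The function names and the segment size are hypothetical.

```python
def frame_rate_schedule(num_frames, target_fps, stage_fps):
    """For each stage, list the target-grid indices newly generated there."""
    covered = set()
    schedule = []
    for fps in stage_fps:
        if target_fps % fps != 0:
            raise ValueError("each stage rate must divide the target rate")
        stride = target_fps // fps
        new = [i for i in range(0, num_frames, stride) if i not in covered]
        covered.update(new)
        schedule.append((fps, new))
    return schedule


def partition_segments(keyframes, keyframes_per_segment):
    """Split coarse keyframes into segments that share boundary keyframes.

    Assumption: each segment can then be refined independently, in parallel.
    """
    if keyframes_per_segment < 2:
        raise ValueError("a segment needs at least two keyframes")
    step = keyframes_per_segment - 1
    return [keyframes[start:start + keyframes_per_segment]
            for start in range(0, len(keyframes) - 1, step)]


for fps, indices in frame_rate_schedule(9, 24, [6, 12, 24]):
    print(fps, indices)
# 6 [0, 4, 8]
# 12 [2, 6]
# 24 [1, 3, 5, 7]

print(partition_segments([0, 4, 8, 12, 16], 3))
# [[0, 4, 8], [8, 12, 16]]
```

In this toy schedule the 6fps level fixes the global layout with three frames, and every later level only fills gaps between frames that already exist.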

Let's Generate from 6fps to 24fps!

Stage I: 6fps. The camera follows the F1 car as it accelerates through the track.

Stage II: 12fps. The camera follows the F1 car as it accelerates through the track.

Stage III: 24fps. The camera follows the F1 car as it accelerates through the track.

Or we can skip the 12fps stage to speed up ...

Stage I: 6fps. A couple of horses are running in the dirt.

Stage II: 24fps. A couple of horses are running in the dirt.
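
As a rough illustration of what skipping a stage changes, the sketch below counts how many new frames each stage has to produce. It assumes every stage is a uniform subsample of the final 24fps grid and uses a 500-frame clip purely as an example length. The function name is hypothetical, and the counts say nothing about actual runtime.

```python
def new_frames_per_stage(num_frames, target_fps, stage_fps):
    """Count the frames each stage adds on top of the previous stages."""
    counts = []
    previous_total = 0
    for fps in stage_fps:
        stride = target_fps // fps
        # Frames at indices 0, stride, 2 * stride, ... below num_frames.
        total = -(-num_frames // stride)  # ceiling division
        counts.append((fps, total - previous_total))
        previous_total = total
    return counts


print(new_frames_per_stage(500, 24, [6, 12, 24]))
# [(6, 125), (12, 125), (24, 250)]
print(new_frames_per_stage(500, 24, [6, 24]))
# [(6, 125), (24, 375)]
```

Under these assumptions, the two-stage schedule saves one pass but asks the final stage to fill in three of every four frames instead of every other frame.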

More I2V results: (24fps, 500 frames)

A pink car drives along a winding snowy road, passing fluffy houses and cotton-candy trees as snowflakes gently fall. The camera follows the car from behind, panning slightly to reveal the whimsical landscape, with soft clouds drifting overhead and a warm glow illuminating the scene.

A hamster wearing sunglasses floats calmly in an orange life preserver on the ocean, its paws resting on the rim as gentle waves ripple around. The camera circles slowly, capturing the hamster’s relaxed expression and the shimmering water under a bright sky. Sunlight glints off the waves, highlighting the playful, serene moment as the hamster basks in the sun.

The video shows a woman in a black shirt and patterned apron standing at her kitchen stove, stirring a colorful dish of rice with vegetables in a frying pan. Her hands move steadily as she uses a wooden spoon to toss the ingredients, ensuring they are evenly mixed. The warm glow of the stovetop highlights the vibrant colors of the food, creating an inviting and homely cooking scene.

In a tranquil moment during the European Middle Ages, a man sits by the window, engrossed in playing his guitar. A few spectators around him quietly enjoy his performance, as the air is filled with the enchanting charm of music. The man sits on a stool at the window, elegantly resting one leg on the guitar, his eyes slightly closed, absorbed in the melody of the strings.

A child making a grimace walks eerily toward the camera, then bends down, squats, leans their head close to the frame, and locks eyes with the lens.

The animation style of Spider-Man: Into the Spider-Verse; in an office, a woman sits at her office desk, typing rapidly on a computer with a slight look of annoyance, with occasional glitch and flicker effects appearing on the screen.

The old man sits quietly at his desk, the letter in his hands trembling slightly. Sunlight filters through the shoji screen, illuminating the inked words as he gazes down, calm and composed.

The camera slowly zooms in, in front of the large window of a cyberpunk city on a rainy night, the woman sits with her head down and knees hugged, slowly raises her right hand, and her fingertips lightly touch her cheek. Neon lights outside the window flicker in the rain, and light and shadow are reflected on the ground.

A vintage blue car speeds along a dusty dirt road, kicking up a cloud of dust as it moves. The camera follows closely, capturing the motion and the warm glow of the setting sun. Train tracks and power lines stretch into the distance, adding an industrial backdrop to the rural scene. Trees line the road, framing the car as it drives forward, creating a dynamic and nostalgic atmosphere.

A woman with curly hair leans against a wooden railing, sipping from a green Heineken bottle. She smiles casually, her white crop top and black fanny pack adding to her relaxed style. The camera remains steady, capturing her in a warm, sunlit urban setting.

BibTeX

@misc{tempomaster2025,
      title={TempoMaster: Efficient Long Video Generation via Next-Frame-Rate Prediction}, 
      author={Yukuo Ma and Cong Liu and Junke Wang and Junqi Liu and Haibin Huang and Zuxuan Wu and Chi Zhang and Xuelong Li},
      year={2025},
      eprint={2511.12578},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2511.12578}, 
}