TL;DR: A DINO-conditioned ControlNet for image-to-video diffusion that is appearance-invariant, offering semantic and structural control. We show that it is a versatile tool for transfer and generative-rendering tasks from 2D or 3D data, trading spatial resolution for feature dimensionality.


Appearance Decoupling

Naively training a ControlNet on DINO features leads to overfitting: the model learns to reproduce training-domain appearance rather than follow the structural guidance. We probe this by dropping first-frame conditioning and relying solely on text prompts. Setting the prompt to "blue" reveals the bias: baseline models ignore the prompt and reconstruct the original colors, while our appearance-augmented training significantly reduces this domain overfitting.
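The "blue"-prompt probe above can be quantified with a simple color statistic over the generated frames. A minimal sketch, assuming generated clips are available as uint8 arrays; the metric name and the dominance criterion are ours, for illustration only:

```python
import numpy as np

def blue_bias_score(frames: np.ndarray) -> float:
    """Fraction of pixels where the blue channel dominates.

    frames: (T, H, W, 3) uint8 clip generated with the prompt "blue"
    and no first-frame conditioning. A low score means the model
    ignored the prompt and reproduced training-domain colors.
    (Illustrative metric, not the paper's evaluation protocol.)
    """
    f = frames.astype(np.float32)
    r, g, b = f[..., 0], f[..., 1], f[..., 2]
    blue_dominant = (b > r) & (b > g)
    return float(blue_dominant.mean())

# Synthetic sanity check: a solid-blue clip vs. a solid-red clip.
blue_clip = np.zeros((4, 8, 8, 3), np.uint8); blue_clip[..., 2] = 255
red_clip = np.zeros((4, 8, 8, 3), np.uint8); red_clip[..., 0] = 255
```

An overfit baseline would score near the red clip on this probe; an appearance-decoupled model should score high.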

Reconstruction comparison

Appearance Invariant DINO ControlNet

We train on simple photometric and neural style augmentations, yet the model generalizes to unseen styles and relighting conditions, bounded only by the base model's generative capacity. At inference, a style-transferred first frame (via IC-Light, InstantStyle, Krea, etc.) anchors appearance while the DINO features guide structure. We found experimentally that also training the base diffusion model's parameters improves quality, but makes the model harder to combine with other ControlNets (e.g. point clouds or other 3D memory). We release a checkpoint with a Control-LoRA mechanism in the codebase.
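The key to the decoupling is that the appearance target is augmented while the DINO conditioning is computed from the clean frame, so the ControlNet cannot explain color from its structural input. A minimal sketch of such a training pair, assuming frames in [0, 1]; the augmentation ranges and the `dino_extract` callable are illustrative placeholders, not the released configuration:

```python
import numpy as np

def photometric_augment(frame: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Random per-channel gain plus a global brightness shift on an RGB frame.

    Ranges are illustrative; the released training uses photometric and
    neural style augmentations.
    """
    gains = rng.uniform(0.6, 1.4, size=3)   # per-channel color cast
    shift = rng.uniform(-0.2, 0.2)          # global brightness offset
    return np.clip(frame * gains + shift, 0.0, 1.0)

def make_training_pair(frame, dino_extract, rng):
    # Structure signal from the CLEAN frame, appearance target augmented:
    # the ControlNet sees features that carry no augmentation-specific color.
    features = dino_extract(frame)
    target = photometric_augment(frame, rng)
    return features, target

rng = np.random.default_rng(0)
frame = np.full((4, 4, 3), 0.5, np.float32)
feats, target = make_training_pair(frame, lambda x: x.mean(axis=-1), rng)
```

At inference the same asymmetry applies: the style-transferred first frame supplies appearance, the DINO features of the source supply structure.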

Training
Inference

Controlled Video Diffusion as Generative Rendering

We explore a range of 3D representations as conditioning sources for video generation. Representations such as 3DGS capture the feature field more faithfully when rendered (despite minor artifacts) and yield more accurate reconstructions. Concerto, on the other hand, provides very coarse features, so the resulting generations approximately match the material (wood/cloth) but offer little control beyond that, and the geometry can change significantly. We find meshes and voxels to be of similar quality: the former does not provide very meaningful features, and the latter suffers from aggregation artifacts. All checkpoints have been finetuned from the base 2D model described above.
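Conditioning from a 3D representation amounts to rendering its per-primitive features into the 2D feature grid the ControlNet expects. A minimal stand-in, assuming a point-based feature field and a pinhole camera; real renderers (e.g. 3DGS) blend contributions, whereas this sketch lets the nearest point win via a z-buffer:

```python
import numpy as np

def splat_point_features(xyz: np.ndarray, feats: np.ndarray,
                         K: np.ndarray, H: int, W: int) -> np.ndarray:
    """Project per-point features into an (H, W, C) conditioning map.

    xyz:   (N, 3) points in camera coordinates, z > 0
    feats: (N, C) feature vectors attached to the points
    K:     3x3 pinhole intrinsics
    Nearest point per pixel wins (z-buffer); empty pixels stay zero.
    Illustrative sketch, not the released rasterizer.
    """
    C = feats.shape[1]
    out = np.zeros((H, W, C), np.float32)
    zbuf = np.full((H, W), np.inf, np.float32)
    uvw = (K @ xyz.T).T                          # perspective projection
    u = (uvw[:, 0] / uvw[:, 2]).round().astype(int)
    v = (uvw[:, 1] / uvw[:, 2]).round().astype(int)
    for i in range(len(xyz)):
        if 0 <= u[i] < W and 0 <= v[i] < H and xyz[i, 2] < zbuf[v[i], u[i]]:
            zbuf[v[i], u[i]] = xyz[i, 2]
            out[v[i], u[i]] = feats[i]
    return out
```

The coarseness trade-off discussed above shows up here directly: the finer and more view-consistent the rendered feature map, the tighter the structural control on the generated video.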

Generative Rendering

Interactive Results


Additional Results