Yesnt: Are Diffusion Relighting Models Ready for Capture Stage Compositing? A Hybrid Alternative to Bridge the Gap

University of Bonn


Abstract

Volumetric video relighting is essential for bringing captured performances into virtual worlds, but current approaches struggle to deliver temporally stable, production-ready results. Diffusion-based intrinsic decomposition methods show promise for single frames, yet suffer from stochastic noise and instability when extended to sequences, while video diffusion models remain constrained by memory and scale.

We propose a hybrid relighting framework that combines diffusion-derived material priors with temporal regularization and physically motivated rendering. Our method aggregates multiple stochastic estimates of per-frame material properties into temporally consistent shading components, using optical-flow-guided regularization. For indirect effects such as shadows and reflections, we extract a mesh proxy from Gaussian Opacity Fields and render it within a standard graphics pipeline. Experiments on real and synthetic captures show that this hybrid strategy achieves substantially more stable relighting across sequences than diffusion-only baselines, while scaling beyond the clip lengths feasible for video diffusion. These results indicate that hybrid approaches, which balance learned priors with physically grounded constraints, are a practical step toward production-ready volumetric video relighting.

Method Overview


We optimize a Gaussian Opacity Field from multi-view captures to render RGB, depth, and normal maps for novel views and to extract a proxy mesh (left). Using a diffusion decomposition model, we extract roughness and metallic maps, which we smooth with an optical-flow-guided temporal regularization (top). We render the proxy geometry as a shadow caster in the 3D scene (bottom) and blend it with our screen-space rendered image (right).
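
The per-frame roughness and metallic estimates coming out of the diffusion model are stochastic, so several runs are fused before they are smoothed over time. Below is a minimal sketch of such a per-pixel fusion, assuming a simple median reducer; the function name and the choice of median are illustrative, not necessarily the exact aggregation used in the paper.

import numpy as np

def fuse_material_samples(samples):
    """Fuse several stochastic diffusion estimates of one material map
    (e.g. roughness or metallic) for a single frame into one estimate.

    samples: list of H x W (or H x W x C) float arrays in [0, 1], one per
             diffusion run with a different random seed.
    Returns the pixel-wise median, which is robust to outlier samples.
    """
    stack = np.stack(samples, axis=0)   # S x H x W (x C)
    return np.median(stack, axis=0)     # H x W (x C)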

From right to left: rendered RGB frames, estimated albedo, normals, roughness, metallic, and our final relit output.
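
The final frames composite the screen-space relit image with the shadows cast by the proxy mesh. A minimal sketch of one way such a blend could look, assuming a shadow visibility map rendered from the proxy geometry; the names and the multiplicative attenuation are illustrative assumptions, not the paper's exact compositing.

import numpy as np

def composite_with_proxy_shadows(relit_rgb, shadow_vis, shadow_strength=0.7):
    """Illustrative blend of the screen-space relit image with shadows cast
    by the proxy mesh.

    relit_rgb:  H x W x 3 float array in [0, 1], the screen-space relit frame.
    shadow_vis: H x W float array in [0, 1] rendered from the proxy geometry
                (1 = fully lit, 0 = fully shadowed).
    shadow_strength: how dark fully shadowed pixels become.
    """
    attenuation = 1.0 - shadow_strength * (1.0 - shadow_vis)
    return relit_rgb * attenuation[..., None]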

Temporal Optimization

Our method provides temporal regularization for dynamic relighting that mitigates flickering and appearance instability, preserving both fine details and smooth lighting transitions over time.

Comparison of roughness estimates with temporal optimization (left) and without (right).
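
Below is a minimal sketch of the kind of optical-flow-guided temporal smoothing described above, using OpenCV's Farnebäck flow and an exponential moving average; the flow estimator, its parameters, and the blending weight are assumptions for illustration, not the exact regularization used in the paper.

import cv2
import numpy as np

def temporally_smooth(maps, frames, alpha=0.8):
    """Illustrative optical-flow-guided smoothing of per-frame material maps.

    maps:   list of H x W float32 material estimates (e.g. roughness).
    frames: list of H x W uint8 grayscale frames used to compute optical flow,
            aligned with `maps`.
    alpha:  how strongly the flow-warped history is trusted.
    """
    h, w = maps[0].shape[:2]
    grid_x, grid_y = np.meshgrid(np.arange(w, dtype=np.float32),
                                 np.arange(h, dtype=np.float32))
    smoothed = [maps[0]]
    for t in range(1, len(maps)):
        # Backward flow: where each pixel of frame t comes from in frame t-1.
        flow = cv2.calcOpticalFlowFarneback(frames[t], frames[t - 1], None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        map_x = grid_x + flow[..., 0]
        map_y = grid_y + flow[..., 1]
        # Warp the previous smoothed estimate into the current frame.
        warped = cv2.remap(smoothed[-1], map_x, map_y, cv2.INTER_LINEAR,
                           borderMode=cv2.BORDER_REPLICATE)
        smoothed.append(alpha * warped + (1.0 - alpha) * maps[t])
    return smoothed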


Results

Relighting in Diverse Scenes

Our method effectively relights a variety of scenes, from indoor environments to outdoor settings, demonstrating its versatility and robustness across different lighting conditions and materials.

Relighting results on various scenes showcasing different lighting conditions and materials.


Comparison with Diffusion Models

Other methods (e.g. NVIDIA Diffusion Renderer, right) produce temporally inconsistent results, while our method (left) maintains temporal coherence and preserves fine details.

Comparison videos showing our method (left) against NVIDIA Diffusion Renderer (right).



Citation

@misc{jüttner2025yesntdiffusionrelightingmodels,
      title={Yesnt: Are Diffusion Relighting Models Ready for Capture Stage Compositing? A Hybrid Alternative to Bridge the Gap}, 
      author={Elisabeth Jüttner and Janelle Pfeifer and Leona Krath and Stefan Korfhage and Hannah Dröge and Matthias B. Hullin and Markus Plack},
      year={2025},
      eprint={2510.23494},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.23494}, 
}