Dynamic Scene Reconstruction from Single Landscape Image Using 4D Gaussian in the Wild
In-Hwan Jin*, Haesoo Choo*, Seong-Hun Jeong, Heemoon Park, Junghwan Kim, Oh-joon Kwon, Kyeongbo Kong
*: Equal Contribution
Pukyong National University, Pusan National University, DMStudio
Overview video
Dynamic Scene Video
Recently, a field known as dynamic scene video has emerged, which creates videos with natural animation from specific camera perspectives by combining single-image animation with 3D photography. These methods use Layered Depth Images (LDIs), created by dividing a single image into multiple layers based on depth, to represent a pseudo-3D space. However, most elements of a continuous landscape, including fluids, cannot be cleanly separated into discrete layers, so the 3D space cannot be fully represented this way. Complete 4D-space virtualization therefore requires an explicit representation, and we propose such an approach for the first time.
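For intuition, the depth-layering idea behind LDIs can be sketched as below. This is a minimal illustration only, not any specific method's implementation; the uniform depth bins, array shapes, and function name are our own assumptions.

    import numpy as np

    def split_into_depth_layers(image, depth, num_layers=3):
        """Split an RGB image into depth-ordered RGBA layers (a simplified LDI-style stack).

        image: (H, W, 3) float array; depth: (H, W) array normalized to [0, 1].
        Pixels outside a layer's depth range become transparent in that layer.
        """
        edges = np.linspace(0.0, 1.0 + 1e-6, num_layers + 1)  # uniform bins (illustrative choice)
        layers = []
        for near, far in zip(edges[:-1], edges[1:]):
            mask = (depth >= near) & (depth < far)
            alpha = mask[..., None].astype(image.dtype)
            layers.append(np.concatenate([image, alpha], axis=-1))
        # Continuous elements such as fluids are inevitably cut across layer boundaries here,
        # which is the source of the artifacts discussed above.
        return layers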
Abstract
Based on the outstanding performance of 3D Gaussian splatting, recent multi-view 3D modeling studies have expanded to 4D Gaussians. By jointly learning the temporal axis with 3D Gaussians, it is possible to reconstruct more realistic and immersive 4D scenes from multi-view landscape images. However, obtaining multi-view images that accurately reflect the overall motion in the wild is extremely challenging. In the dynamic scene video field, pseudo-3D representations are combined with single-image animation techniques, which allows elements to move and new scenes to be rendered from different camera perspectives. Layered Depth Images (LDIs), a simplified 3D representation that separates a single image into depth-based layers, have limitations in reconstructing complex scenes, and artifacts can occur when continuous elements like fluids are separated into layers. Furthermore, because they model 3D space only implicitly, the output may be limited to videos in the 2D domain, reducing its versatility. This paper proposes representing a complete 3D space for dynamic scene videos by modeling explicit representations, specifically 4D Gaussians, from a single image. The framework focuses on optimizing 3D Gaussians by generating multi-view images from a single image and on creating 3D motion to optimize 4D Gaussians. A key component is consistent 3D motion estimation, which aligns common motion across multi-view images to bring the motion in 3D space closer to the actual motion. Various experiments and metrics show that our model delivers realistic immersion for in-the-wild landscape images.
3D-MRM Framework
Overview of our pipeline: Our goal is to refine 4D Gaussians to represent a complete 3D space, including animation, from a single image. (a) A depth map is estimated from the given single image and converted into a point cloud. To optimize the 3D Gaussians, multi-view RGB images are rendered along a defined camera trajectory. (b) Similarly, multi-view motion masks are rendered from the input motion mask. These are used, together with the rendered RGB images, to estimate multi-view 2D motion maps. 3D motion is obtained by unprojecting the estimated 2D motion into the 3D domain. Here, the proposed 3D Motion Refinement Module (3D-MRM) ensures consistent 3D motion across the multiple views. (c) Using the refined 3D Gaussians and the generated 3D motion, the 4D Gaussians are refined for changes in position, rotation, and scaling over time.
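As a rough sketch of step (a), the estimated depth map can be lifted into a point cloud with a standard pinhole unprojection. The function below is a minimal illustration assuming known camera intrinsics K; it is not the authors' exact implementation.

    import torch

    def unproject_depth(depth, K):
        """Lift a depth map (H, W) into a point cloud (H*W, 3) in camera coordinates.

        Assumes a pinhole camera with intrinsics K = [[fx, 0, cx], [0, fy, cy], [0, 0, 1]].
        """
        H, W = depth.shape
        v, u = torch.meshgrid(
            torch.arange(H, dtype=depth.dtype),
            torch.arange(W, dtype=depth.dtype),
            indexing="ij",
        )
        z = depth
        x = (u - K[0, 2]) / K[0, 0] * z
        y = (v - K[1, 2]) / K[1, 1] * z
        return torch.stack([x, y, z], dim=-1).reshape(-1, 3)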
3D Motion Refinement Module. To maintain motion consistency across multiple views, 3D motion is defined on the point cloud and projected into the 2D images using the camera parameters. The L1 loss between the projected motion and the estimated 2D motion map, which serves as the ground truth, is computed, and the sum of the losses over all views is minimized to refine the 3D motion.
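The sketch below illustrates this multi-view objective. It is our own simplified rendition under several assumptions (world-to-camera extrinsics (R, t), nearest-neighbor sampling of the 2D motion maps, and an Adam optimizer); it is not the authors' code.

    import torch

    def refine_3d_motion(points, motion_3d, cameras, motion_maps_2d, masks, steps=500, lr=1e-2):
        """Refine per-point 3D motion so its 2D projections match the estimated motion maps.

        points:         (N, 3) point-cloud positions
        motion_3d:      (N, 3) initial per-point 3D motion (learnable)
        cameras:        list of (K, R, t) per view; K (3, 3), R (3, 3), t (3,)
        motion_maps_2d: list of (H, W, 2) estimated 2D motion maps (treated as ground truth)
        masks:          list of (H, W) motion masks selecting the animated region
        """
        motion_3d = motion_3d.clone().requires_grad_(True)
        opt = torch.optim.Adam([motion_3d], lr=lr)

        def project(pts, K, R, t):
            cam = pts @ R.T + t               # world -> camera (assumed extrinsics convention)
            pix = cam @ K.T                   # camera -> homogeneous pixel coordinates
            return pix[:, :2] / pix[:, 2:3].clamp(min=1e-6)

        for _ in range(steps):
            loss = 0.0
            for (K, R, t), flow, mask in zip(cameras, motion_maps_2d, masks):
                uv0 = project(points, K, R, t)               # projected start positions
                uv1 = project(points + motion_3d, K, R, t)   # projected displaced positions
                proj_flow = uv1 - uv0                        # 2D motion induced by the 3D motion
                # nearest-neighbor lookup of the estimated 2D motion at the projected pixels
                u = uv0[:, 0].round().long().clamp(0, flow.shape[1] - 1)
                v = uv0[:, 1].round().long().clamp(0, flow.shape[0] - 1)
                valid = mask[v, u] > 0
                loss = loss + (proj_flow[valid] - flow[v, u][valid]).abs().mean()  # per-view L1
            opt.zero_grad()
            loss.backward()
            opt.step()
        return motion_3d.detach()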
Quantitative Results
Results comparing our framework with previous dynamic scene video models: our approach outperforms the baselines on all metrics for view generation.
In particular, our method achieves the best PSNR, SSIM, and LPIPS scores, indicating that the generated views are of high fidelity and perceptually similar to the ground-truth views.
Additionally, we applied motion masks to the animated regions and compared the results within those regions.
This shows that the animation of the designated areas in the videos rendered from the 4D Gaussians is realistic and of high fidelity.
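As an illustration of this masked comparison, a metric can be restricted to the animated region; the minimal PSNR variant below is our own sketch (the paper also reports SSIM and LPIPS), not the evaluation code used for the tables.

    import torch

    def masked_psnr(pred, target, mask, max_val=1.0):
        """PSNR computed only inside the animated region selected by a binary motion mask.

        pred, target: (3, H, W) tensors in [0, max_val]; mask: (H, W), 1 inside the animated region.
        """
        region = mask.bool().unsqueeze(0).expand_as(pred)
        mse = ((pred - target)[region] ** 2).mean()
        return 10.0 * torch.log10(max_val ** 2 / mse)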
Visualization Results
We present qualitative comparisons with the baseline methods. Since our proposed model is an explicit representation, its output is rendered to 2D video for comparison.
The process of separating the input image into LDIs in 3D-Cinemagraphy leads to artifacts in animated regions and fails to provide natural motion, which reduces realism.
Make-It-4D likewise relies on LDIs to represent the 3D scene for multi-view generation, which results in lower visual quality.
Additionally, due to unclear layer separation, objects appear fragmented or exhibit ghosting effects, where objects seem to leave behind afterimages.
In contrast, the proposed model represents a complete 3D space with animation, producing fewer visual artifacts and high rendering quality from various camera viewpoints.
Therefore, our method provides more photorealistic results compared to others for various input images.
Ablation Study
1) 3D Motion Optimization Module
2D motion estimated independently from each of the multi-view images can assign different motion values to the same region of 3D space.
Directly using these 2D motions to animate the viewpoint videos can prevent the 4D Gaussians from learning natural motion.
Table II shows the end-point error (EPE) between multi-view flows with and without the 3D Motion Optimization Module.
The estimated motions are projected to the center viewpoint through the depth map so that they are measured at the same positions.
The results indicate that without 3D motion optimization the estimated flows differ significantly at the same positions, whereas our 3D motion remains consistent across all viewpoints with almost no variance.
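One plausible way to compute such a cross-view consistency score is to average the pairwise EPE over flows that have already been projected to the center viewpoint. The aggregation below is our own assumption about how such numbers could be obtained, not the paper's evaluation script.

    import torch

    def endpoint_error(flow_a, flow_b):
        """Average end-point error (EPE) between two flow fields of shape (H, W, 2)."""
        return torch.linalg.norm(flow_a - flow_b, dim=-1).mean()

    def cross_view_consistency(flows_at_center):
        """Mean pairwise EPE between per-view flows projected to the center viewpoint.

        flows_at_center: list of (H, W, 2) flows already warped to the center view.
        A small value means the motion is consistent across viewpoints.
        """
        errors = []
        for i in range(len(flows_at_center)):
            for j in range(i + 1, len(flows_at_center)):
                errors.append(endpoint_error(flows_at_center[i], flows_at_center[j]))
        return torch.stack(errors).mean()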
Panel (a) visualizes the 2D motion and the projected 3D motion.
It likewise shows that the 3D motion encodes motion information in 3D space, which ensures consistency when projected to different viewpoints.
We animated the viewpoint videos using each type of motion and trained 4D Gaussians on the resulting multi-view videos.
However, as shown in panel (b), the video rendered from the 4D Gaussians and its estimated optical flow reveal that the lack of motion consistency across the viewpoint videos causes unnatural movements.
2) Single image animation model
In our model, it is crucial to use a single-image animation model that generates viewpoint videos accurately reflecting the 3D motion, since these videos are used to train the 4D Gaussians.
The figure shows 4D Gaussians trained on videos animated by two single-image animation models, SLR-SFS and Text2Cinemagraph.
Comparing the rendered 4D Gaussians, we observe that the learned motion differs depending on the single-image animation model.
When trained with viewpoint videos animated by SLR-SFS, our model produced better results.
3) Effect of 3D motion initialization
To verify the importance of 3D motion in training 4D Gaussians, we compared the results of our method with and without 3D motion initialization.
When animation is applied to fluids, repeated patterns occur. Table III reports the EPE of the optical flow estimated from each rendered video.
It indicates that the explicit 4D Gaussian representation, which learns multiple views and motion jointly, has difficulty capturing the overall motion accurately when trained only with viewpoint videos.
The figure shows the optical flow estimated from the video rendered by the 4D Gaussians, demonstrating that motion is difficult to learn accurately without 3D motion initialization.
In contrast, our method is able to learn the overall 3D motion.
4) Effect of Two-stage training
To achieve faster and more stable results, we separate the 4D Gaussian learning process along the viewpoint and time axes.
In the first stage, we train 3D Gaussians using all viewpoints, and in the second stage, we train 4D Gaussians using videos from sampled viewpoints.
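A high-level sketch of this two-stage schedule is given below. The renderer callables, data fields, plain L1 photometric loss, optimizer, and iteration counts are all placeholders of our own, not the authors' implementation details.

    import torch

    def two_stage_training(render_3d, render_4d, gaussian_params, deform_params,
                           all_views, sampled_videos, iters_3d=1000, iters_4d=1000):
        """Stage 1 fits static 3D Gaussians on all viewpoints; stage 2 fits a temporal
        deformation (position, rotation, scaling over time) on a few sampled videos.

        render_3d / render_4d stand in for a differentiable Gaussian renderer.
        """
        # Stage 1: optimize the static 3D Gaussians against RGB renders of every viewpoint.
        opt = torch.optim.Adam(gaussian_params, lr=1e-3)
        for step in range(iters_3d):
            view = all_views[step % len(all_views)]
            loss = (render_3d(gaussian_params, view.camera) - view.image).abs().mean()
            opt.zero_grad(); loss.backward(); opt.step()

        # Stage 2: keep the static Gaussians as initialization and optimize only the
        # deformation parameters on the sampled viewpoint videos.
        opt = torch.optim.Adam(deform_params, lr=1e-3)
        for step in range(iters_4d):
            video = sampled_videos[step % len(sampled_videos)]
            t = step % len(video.frames)
            pred = render_4d(gaussian_params, deform_params, video.camera, t)
            loss = (pred - video.frames[t]).abs().mean()
            opt.zero_grad(); loss.backward(); opt.step()
        return gaussian_params, deform_params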
The top of the figure shows the results of training 4D Gaussians with animated videos from all viewpoints, while the bottom shows the results of our two-stage approach trained on only three viewpoint videos.
This demonstrates that our training method produces results almost identical to those obtained by training with videos from all viewpoints.
Additionally, as shown in Table IV, which was evaluated on a sampled validation set, our method not only maintains high performance but also achieves a significant efficiency improvement.
It is over 30 times faster in generating videos and requires about one-third less time to train the 4D Gaussians, demonstrating an optimal balance between speed and accuracy.