LongSplat addresses critical challenges in novel view synthesis (NVS) from casually captured long videos characterized by irregular camera motion, unknown camera poses, and expansive scenes. Current methods often suffer from pose drift, inaccurate geometry initialization, and severe memory limitations. To address these issues, we introduce LongSplat, a robust unposed 3D Gaussian Splatting framework featuring: (1) Incremental Joint Optimization that concurrently optimizes camera poses and 3D Gaussians to avoid local minima and ensure global consistency; (2) a Pose Estimation Module leveraging learned 3D priors; and (3) an adaptive Octree Anchor Formation mechanism that dynamically adjusts anchor densities, significantly reducing memory usage. Extensive experiments on challenging benchmarks demonstrate that LongSplat achieves state-of-the-art results, substantially improving rendering quality, pose accuracy, and computational efficiency compared to prior approaches.
Given a casually captured long video without known poses, LongSplat incrementally reconstructs the scene through tightly coupled pose estimation and 3D Gaussian Splatting. (a) Initialization converts MASt3R depth and correspondences into an octree-anchored 3DGS. (b) Global Optimization jointly refines all camera poses and Gaussians for global consistency. (c) Frame Insertion estimates each new frame's pose via correspondence-guided PnP, updates octree anchors using unprojected points, and applies photometric refinement. If PnP fails, a fallback triggers global re-optimization to recover. (d) Incremental Optimization alternates between Local Optimization within a visibility-adapted window and periodic Global Optimization to propagate consistent updates across frames.
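The correspondence-guided PnP step in (c) recovers a new frame's pose from 2D-3D matches. As an illustrative sketch (not the paper's implementation, which uses MASt3R correspondences and a RANSAC-style robust solver), the following minimal Direct Linear Transform (DLT) solver recovers a camera pose from known, already-normalized correspondences; the function name and thresholds are our own:

```python
import numpy as np

def estimate_pose_dlt(pts3d, pts2d_norm):
    """Recover camera pose (R, t) from 2D-3D correspondences via DLT.

    pts3d: (N, 3) world points; pts2d_norm: (N, 2) normalized image
    coordinates (pixels pre-multiplied by K^{-1}). Requires N >= 6.
    Illustrative only: a practical system wraps this in RANSAC.
    """
    n = len(pts3d)
    A = np.zeros((2 * n, 12))
    for i, (X, x) in enumerate(zip(pts3d, pts2d_norm)):
        Xh = np.append(X, 1.0)
        u, v = x
        # Each correspondence gives two linear equations in P's entries:
        # P1.Xh - u * P3.Xh = 0  and  P2.Xh - v * P3.Xh = 0
        A[2 * i, 0:4] = Xh
        A[2 * i, 8:12] = -u * Xh
        A[2 * i + 1, 4:8] = Xh
        A[2 * i + 1, 8:12] = -v * Xh
    # Null-space solution: right singular vector of the smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    P = Vt[-1].reshape(3, 4)
    # Fix the overall sign so points project with positive depth.
    Xh_all = np.hstack([pts3d, np.ones((n, 1))])
    if np.median(Xh_all @ P[2]) < 0:
        P = -P
    # P = sigma * [R | t]; project its left 3x3 block onto the rotation manifold.
    U, S, Vt = np.linalg.svd(P[:, :3])
    R = U @ Vt
    if np.linalg.det(R) < 0:  # safety: enforce a proper rotation
        R = U @ np.diag([1.0, 1.0, -1.0]) @ Vt
    sigma = S.mean()
    t = P[:, 3] / sigma
    return R, t
```

On noiseless synthetic correspondences this recovers the exact pose; with real matches, robust estimation (e.g. RANSAC over minimal subsets) and the photometric refinement described above are what make the estimate usable.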
Given an initial sparse voxelized point cloud, we iteratively perform density-guided adaptive voxel splitting and pruning. Voxels with point cloud density (ρ) exceeding a threshold are split, while those with density below the threshold are pruned. Repeated across multiple octree levels, this adaptive octree anchor design significantly reduces memory usage, allowing efficient representation and rendering of large-scale scenes.
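A minimal sketch of this density-guided split-and-prune loop is shown below. The function name, the use of raw point counts as the density proxy rho, and the specific thresholds are all our own assumptions for illustration; the actual anchor formation operates on the voxelized 3DGS representation:

```python
import numpy as np

def build_adaptive_octree(points, max_level=4, split_thresh=32, prune_thresh=2):
    """Density-guided adaptive octree anchors (illustrative sketch).

    Starting from one root voxel covering all points, voxels whose point
    count (proxy for density rho) exceeds split_thresh are split into 8
    children; voxels below prune_thresh are dropped. Surviving leaf-voxel
    centers serve as anchors.
    """
    lo, hi = points.min(0), points.max(0)
    size = (hi - lo).max() + 1e-9
    # Work queue of (level, voxel origin, voxel size, indices of contained points).
    queue = [(0, lo, size, np.arange(len(points)))]
    anchors = []
    while queue:
        level, origin, s, idx = queue.pop()
        rho = len(idx)
        if rho < prune_thresh:
            continue                      # prune: too sparse to keep
        if rho > split_thresh and level < max_level:
            half = s / 2.0
            # Assign each contained point to one of the 8 octants.
            octant = ((points[idx] - origin) // half).clip(0, 1).astype(int)
            codes = octant[:, 0] * 4 + octant[:, 1] * 2 + octant[:, 2]
            for c in range(8):
                sub = idx[codes == c]
                if len(sub):
                    off = np.array([c >> 2 & 1, c >> 1 & 1, c & 1]) * half
                    queue.append((level + 1, origin + off, half, sub))
        else:
            anchors.append(origin + s / 2.0)  # keep leaf center as anchor
    return np.array(anchors)
```

Because dense regions are subdivided to fine levels while sparse regions are pruned early, the anchor count grows with surface complexity rather than scene extent, which is what keeps memory bounded on large scenes.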
We visualize the reconstructed camera poses and pointmaps of LongSplat. Use the controls to switch between scenes.
We visualize rendering comparisons between LongSplat and other methods. LongSplat consistently produces higher-quality renderings.
Try selecting different methods and scenes!
We visualize the pose estimation results of LongSplat alongside other methods. LongSplat achieves the best pose accuracy.
This work was supported by NVIDIA Taiwan AI Research & Development Center (TRDC). This research was funded by the National Science and Technology Council, Taiwan, under Grants NSTC 112-2222-E-A49-004-MY2 and 113-2628-E-A49-023-. Yu-Lun Liu acknowledges the Yushan Young Fellow Program by the MOE in Taiwan.
Special thanks to Cookie, who contributed to part of the code implementation🐱.