LongSplat

Robust Unposed 3D Gaussian Splatting for Casual Long Videos

1National Yang Ming Chiao Tung University, 2NVIDIA

ICCV 2025

Rendering novel views from casually captured long videos.

Abstract


LongSplat addresses critical challenges in novel view synthesis (NVS) from casually captured long videos characterized by irregular camera motion, unknown camera poses, and expansive scenes. Current methods often suffer from pose drift, inaccurate geometry initialization, and severe memory limitations. To address these issues, we introduce LongSplat, a robust unposed 3D Gaussian Splatting framework featuring: (1) Incremental Joint Optimization that concurrently optimizes camera poses and 3D Gaussians to avoid local minima and ensure global consistency; (2) a Pose Estimation Module leveraging learned 3D priors; and (3) an adaptive Octree Anchor Formation mechanism that dynamically adjusts anchor densities, significantly reducing memory usage. Extensive experiments on challenging benchmarks demonstrate that LongSplat achieves state-of-the-art results, substantially improving rendering quality, pose accuracy, and computational efficiency compared to prior approaches.

Pipeline


Given a casually captured long video without known poses, LongSplat incrementally reconstructs the scene through tightly coupled pose estimation and 3D Gaussian Splatting. (a) Initialization converts MASt3R depth and correspondences into an octree-anchored 3DGS. (b) Global Optimization jointly refines all camera poses and Gaussians for global consistency. (c) Frame Insertion estimates each new frame pose via correspondence-guided PnP, updates octree anchors using unprojected points, and applies photometric refinement. If PnP fails, a fallback triggers global re-optimization to recover. (d) Incremental Optimization alternates between Local Optimization within a visibility-adapted window and periodic Global Optimization to propagate consistent updates across frames.
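The control flow above, combining (c) Frame Insertion with its PnP fallback and (d) the alternation of local and global optimization, can be sketched as follows. This is a minimal, illustrative skeleton only: all helper callables (`estimate_pose_pnp`, `local_opt`, `global_opt`, `update_anchors`) and the window and period defaults are hypothetical stand-ins for the paper's actual components, not the released implementation.

```python
def process_video(frames, estimate_pose_pnp, local_opt, global_opt,
                  update_anchors, window=5, global_every=10):
    """Sketch of LongSplat's incremental loop (scheduling logic only).

    estimate_pose_pnp returns a pose, or None on PnP failure.
    """
    poses = []
    for i, frame in enumerate(frames):
        pose = estimate_pose_pnp(frame)           # correspondence-guided PnP
        if pose is None:                          # PnP failure -> fallback:
            global_opt(poses)                     # global re-optimization
            pose = estimate_pose_pnp(frame)       # then retry the frame
        poses.append(pose)
        update_anchors(frame, pose)               # unproject points, update octree
        local_opt(poses[-window:])                # local opt. in recent window
        if (i + 1) % global_every == 0:
            global_opt(poses)                     # periodic global consistency pass
    return poses
```

The fixed-size window stands in for the paper's visibility-adapted window, which selects frames by shared visibility rather than recency.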

Octree Anchor Formation

Given an initial sparse voxelized point cloud, we iteratively perform density-guided adaptive voxel splitting and pruning. Voxels with point cloud density (ρ) exceeding a threshold are split, while those with density below the threshold are pruned. Repeated across multiple octree levels, this adaptive octree anchor design significantly reduces memory usage, allowing efficient representation and rendering of large-scale scenes.

Reconstruction Results


We visualize the reconstructed camera poses and pointmaps of LongSplat for each scene.

Novel-view Synthesis Results


We compare novel-view synthesis results of LongSplat against other methods; LongSplat produces higher-quality renderings across scenes.



Evaluated scenes: Free (grass, hydrant, lab, pillar, road, sky, stair), Hike (forest1, forest2, forest3, garden1, garden2, garden3, indoor, playground, university2, university3, university4), and Tanks and Temples (barn, francis).

Pose Estimation Results


We visualize the estimated camera trajectories of LongSplat and other methods; LongSplat achieves the best pose accuracy.


Citation


Acknowledgements


This work was supported by NVIDIA Taiwan AI Research & Development Center (TRDC). This research was funded by the National Science and Technology Council, Taiwan, under Grants NSTC 112-2222-E-A49-004-MY2 and 113-2628-E-A49-023-. Yu-Lun Liu acknowledges the Yushan Young Fellow Program by the MOE in Taiwan.

Special thanks to Cookie, who contributed to part of the code implementation🐱.