TideGS Breaks Billion-Primitive Barrier in 3D Gaussian Splatting on a Single 24 GB GPU

May 20, 2026 · Tags: TideGS, 3D Gaussian Splatting, out-of-core optimization, single-GPU training, HKUST

The Memory Bottleneck in 3D Gaussian Splatting

3D Gaussian Splatting (3DGS) has rapidly become a cornerstone technique for novel view synthesis, offering high visual fidelity and fast rendering. However, scaling 3DGS to large environments—whole buildings, city blocks, or detailed objects—has been held back by a severe memory bottleneck. Each Gaussian primitive requires storing a multi-dimensional attribute vector (position, covariance, color, opacity), and the aggregate parameter table for even moderately large scenes quickly exceeds the VRAM of high-end GPUs. Prior systems typically cap out at tens of millions of Gaussians on a single consumer card, limiting the resolution and completeness of reconstructed scenes.
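To make the bottleneck concrete, here is a rough budget sketch. It assumes the attribute layout of the original 3DGS reference implementation with degree-3 spherical harmonics (59 floats per Gaussian), fp32 storage, and an Adam-style optimizer that keeps a gradient and two moment buffers per parameter. The layout TideGS actually uses is not stated in the article and may differ.

```python
# Back-of-the-envelope memory budget for in-memory 3DGS training.
# Assumption: attribute layout of the original 3DGS reference
# implementation with degree-3 spherical harmonics; TideGS may differ.
FLOATS_PER_GAUSSIAN = (
    3      # position (x, y, z)
    + 3    # per-axis scale (one factor of the covariance)
    + 4    # rotation quaternion (other factor of the covariance)
    + 1    # opacity
    + 48   # color: 16 SH coefficients x 3 channels
)
BYTES_PER_FLOAT = 4  # fp32


def param_bytes(num_gaussians: int) -> int:
    """Bytes needed for the raw parameter table."""
    return num_gaussians * FLOATS_PER_GAUSSIAN * BYTES_PER_FLOAT


def training_bytes(num_gaussians: int) -> int:
    """Parameters + gradients + Adam first and second moments."""
    return 4 * param_bytes(num_gaussians)


GB = 1e9
for n in (11_000_000, 100_000_000, 1_000_000_000):
    print(
        f"{n:>13,} Gaussians: "
        f"params {param_bytes(n) / GB:6.1f} GB, "
        f"training state {training_bytes(n) / GB:6.1f} GB"
    )
```

Under these assumptions, 11 million Gaussians already carry roughly 10 GB of training state before any rasterization buffers are counted, and one billion need about 236 GB for parameters alone.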

Researchers from the Sponge Computing Lab at HKUST have now introduced a framework called TideGS that sidesteps this limitation. In a paper featured on Hugging Face Daily Papers (May 2026), the team reports training with over one billion Gaussian primitives on a single NVIDIA RTX 4090 (24 GB VRAM). That is roughly 10x more than previous out-of-core 3DGS approaches, which topped out at around 100 million Gaussians, and about 90x more than standard in-memory training, which could handle only about 11 million primitives on the same hardware.

Leveraging the Sparse, Trajectory-Conditioned Nature of 3DGS Training

The key insight behind TideGS is that 3DGS training is inherently sparse and trajectory-conditioned. In each training iteration, the camera observes the scene from a specific viewpoint. Only the Gaussians that project into the current camera frustum, typically a small fraction of the total set, are actually updated. The rest are untouched and need not occupy GPU memory. Because of this sparsity, the GPU does not have to hold the entire parameter table at once; it can treat its VRAM as a working-set cache, swapping in only the currently relevant primitives from slower storage.
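The sparsity argument can be illustrated with a minimal sketch. It assumes points are already expressed in camera space (camera at the origin, looking down +z) and uses a simple point-in-frustum test on Gaussian centers; the scene size, field of view, and the helper names `in_frustum` and `working_set` are all illustrative choices, not taken from the paper.

```python
import math
import random


def in_frustum(p, half_fov_x, half_fov_y, near, far):
    """Point test in camera space (camera at origin, looking down +z)."""
    x, y, z = p
    if not (near <= z <= far):
        return False
    return (abs(x) <= z * math.tan(half_fov_x)
            and abs(y) <= z * math.tan(half_fov_y))


def working_set(points, **cam):
    """Indices of the Gaussians whose centers fall inside the frustum."""
    return [i for i, p in enumerate(points) if in_frustum(p, **cam)]


random.seed(0)
# Hypothetical scene: 100,000 Gaussian centers in a 200-unit cube.
points = [
    (random.uniform(-100, 100),
     random.uniform(-100, 100),
     random.uniform(-100, 100))
    for _ in range(100_000)
]
cam = dict(half_fov_x=math.radians(30), half_fov_y=math.radians(20),
           near=0.1, far=50.0)

visible = working_set(points, **cam)
share = 100 * len(visible) / len(points)
print(f"working set: {len(visible)} of {len(points)} Gaussians ({share:.2f}%)")
```

For this toy scene only a small share of the Gaussians falls inside any single view, which is the property that lets VRAM act as a cache rather than as the home of the full table.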

Prior attempts to exploit this idea have been limited by the overhead of moving data between GPU, CPU, and SSD. The HKUST team tackled this with a three-tier storage hierarchy managed by three synergistic techniques: block-virtualized geometry, a hierarchical asynchronous pipeline, and trajectory-adaptive differential streaming.

Three Pillars of the TideGS Architecture

Block-virtualized geometry divides the entire scene into spatial blocks, each containing a manageable subset of Gaussians. These blocks are stored on the SSD in an aligned layout that minimizes I/O seek times. When a training iteration begins, only blocks that intersect the current camera frustum are loaded into CPU memory and then transferred to the GPU. This spatial partitioning keeps data transfer granular and highly localized.
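A minimal sketch of block selection, under stated assumptions: blocks form a uniform grid with a hypothetical edge length `BLOCK_SIZE`, coordinates are in camera space (camera at the origin, looking down +z), and visibility uses the standard conservative plane test for axis-aligned boxes, which may keep a block that is just outside the frustum but never drops one that is inside. The paper's actual partitioning and on-disk layout are not described in the article.

```python
import math
from collections import defaultdict

BLOCK_SIZE = 10.0  # hypothetical block edge length in scene units


def block_key(p):
    """Integer grid cell that contains point p."""
    return tuple(math.floor(c / BLOCK_SIZE) for c in p)


def build_blocks(points):
    """Group Gaussian indices by the grid block containing their center."""
    blocks = defaultdict(list)
    for i, p in enumerate(points):
        blocks[block_key(p)].append(i)
    return blocks


def frustum_planes(half_fov_x, half_fov_y, near, far):
    """Planes as (normal, offset); inside means dot(normal, p) + offset >= 0."""
    tx, ty = math.tan(half_fov_x), math.tan(half_fov_y)
    return [
        ((0, 0, 1), -near),   # near plane
        ((0, 0, -1), far),    # far plane
        ((1, 0, tx), 0.0),    # left
        ((-1, 0, tx), 0.0),   # right
        ((0, 1, ty), 0.0),    # bottom
        ((0, -1, ty), 0.0),   # top
    ]


def block_may_be_visible(key, planes):
    """Conservative box test: reject only if the box is fully behind a plane."""
    lo = [k * BLOCK_SIZE for k in key]
    hi = [(k + 1) * BLOCK_SIZE for k in key]
    for normal, offset in planes:
        # Box corner that lies farthest along the plane normal.
        corner = [hi[a] if normal[a] >= 0 else lo[a] for a in range(3)]
        if sum(n * c for n, c in zip(normal, corner)) + offset < 0:
            return False
    return True


points = [(1.0, 1.0, 15.0), (95.0, 1.0, 5.0), (1.0, 1.0, -15.0)]
blocks = build_blocks(points)
planes = frustum_planes(math.radians(30), math.radians(20), near=0.1, far=50.0)
to_load = sorted(k for k in blocks if block_may_be_visible(k, planes))
print("blocks to load:", to_load)
```

Only the block in front of the camera survives the test here; the block far to the side and the block behind the camera are skipped without touching any of their Gaussians.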

The hierarchical asynchronous pipeline overlaps data movement with computation. While the GPU is busy updating one batch of visible Gaussians, the CPU prefetches the next batch of blocks from SSD and prepares them for transfer. The pipeline has three stages: SSD-to-CPU fetch, CPU-to-GPU transfer, and GPU kernel execution. The system schedules these stages so that the GPU rarely sits idle waiting for new data. According to the paper, this overlap hides most of the latency associated with SSD reads and PCIe transfers.
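The three-stage overlap can be sketched with threads and bounded queues. This is a toy model, not the paper's scheduler: the stage functions are hypothetical stand-ins that sleep instead of performing real SSD reads, PCIe copies, or GPU kernels, and the `depth` parameter is an assumed bound on in-flight staging buffers.

```python
import queue
import threading
import time

SENTINEL = object()  # marks the end of the stream


def stage(fn, inbox, outbox):
    """Apply fn to every item from inbox and forward the result to outbox."""
    while True:
        item = inbox.get()
        if item is SENTINEL:
            outbox.put(SENTINEL)
            return
        outbox.put(fn(item))


def ssd_to_cpu(block_id):
    time.sleep(0.01)               # stand-in for an SSD read
    return ("host", block_id)


def cpu_to_gpu(staged):
    time.sleep(0.01)               # stand-in for a PCIe copy
    return ("device", staged[1])


def gpu_step(resident):
    time.sleep(0.01)               # stand-in for the training kernel
    return resident[1]


def run_pipeline(block_ids, depth=2):
    requests = queue.Queue()                 # block ids are tiny: unbounded
    staged = queue.Queue(maxsize=depth)      # bounded host staging buffers
    resident = queue.Queue(maxsize=depth)    # bounded device-side slots
    results = queue.Queue()
    workers = [
        threading.Thread(target=stage, args=(ssd_to_cpu, requests, staged)),
        threading.Thread(target=stage, args=(cpu_to_gpu, staged, resident)),
        threading.Thread(target=stage, args=(gpu_step, resident, results)),
    ]
    for w in workers:
        w.start()
    for b in block_ids:
        requests.put(b)
    requests.put(SENTINEL)

    processed = []
    while (item := results.get()) is not SENTINEL:
        processed.append(item)
    for w in workers:
        w.join()
    return processed


start = time.perf_counter()
order = run_pipeline(list(range(8)))
print("processed blocks:", order)
print(f"elapsed: {time.perf_counter() - start:.3f} s")
```

Because each stage runs in its own thread, the fetch for block k+2 and the copy for block k+1 proceed while block k is being processed. The bounded queues apply backpressure so that prefetching cannot outrun the memory set aside for staging.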

Perhaps the most novel component is trajectory-adaptive differential streaming. Instead of reloading entire blocks each iteration, TideGS tracks which Gaussians have changed since they were last in the GPU's working set. It then transfers only the incremental updates—the deltas—for those primitives. This dramatically reduces the data volume transferred across the hierarchy, because only a handful of Gaussians per block are modified in any given step. The system adapts the transfer schedule dynamically based on the camera trajectory, predicting which blocks will be needed next.
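The delta idea can be sketched with a toy cache. Everything here is hypothetical (the class `WorkingSetCache`, its methods, and the dictionary-based backing store); it only shows two ingredients mentioned above: loading just the blocks that are newly needed between consecutive views, and writing back only the Gaussians that were modified. Trajectory-based prediction of upcoming blocks is not modeled.

```python
class WorkingSetCache:
    """Toy GPU-resident cache keyed by block id (all names hypothetical)."""

    def __init__(self, backing_store):
        self.backing = backing_store   # block_id -> {gaussian_id: value}
        self.resident = {}             # blocks currently "on the GPU"
        self.dirty = {}                # block_id -> set of modified gaussian ids
        self.items_moved = 0           # counts per-Gaussian transfers

    def advance(self, needed_blocks):
        """Move to the next view, transferring only what changed."""
        needed = set(needed_blocks)
        # Evict blocks that left the view: write back modified entries only.
        for b in set(self.resident) - needed:
            for g in self.dirty.pop(b, set()):
                self.backing[b][g] = self.resident[b][g]
                self.items_moved += 1
            del self.resident[b]
        # Load only blocks that are newly needed; overlapping blocks stay put.
        for b in needed - set(self.resident):
            self.resident[b] = dict(self.backing[b])
            self.items_moved += len(self.backing[b])

    def update(self, block_id, gaussian_id, value):
        """Record an optimizer update to one resident Gaussian."""
        self.resident[block_id][gaussian_id] = value
        self.dirty.setdefault(block_id, set()).add(gaussian_id)


# Three blocks of four Gaussians each, all initialized to 0.0.
store = {b: {g: 0.0 for g in range(4)} for b in range(3)}
cache = WorkingSetCache(store)

cache.advance([0, 1])        # first view: load blocks 0 and 1
cache.update(0, 2, 1.5)      # training touches one Gaussian in block 0
cache.advance([1, 2])        # next view: block 1 stays, 0 leaves, 2 arrives

print("items moved:", cache.items_moved)
print("written back:", store[0][2])
```

In this run block 1 is never re-sent because both views share it, and evicting block 0 costs a single write instead of a full block, which is the kind of saving the differential scheme is after.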

Quantitative Results: Reconstruction Quality and Scaling

The team evaluated TideGS on large-scale scenes from datasets such as MatrixCity, Mill19, and custom drone captures. On a single 24 GB GPU, TideGS trained models with 500 million to 1.2 billion Gaussians. Across all tested scenes, it achieved the best reconstruction quality among evaluated single-GPU baselines, as measured by PSNR, SSIM, and LPIPS metrics.

For example, on the large MatrixCity dataset, TideGS with 500 million Gaussians reached a PSNR of 28.6 dB, compared to 25.9 dB for the leading out-of-core baseline that used 100 million Gaussians and 24.1 dB for in-memory training with 11 million Gaussians. The improvements are visually striking: fine details like street signs, foliage, and window reflections are faithfully reproduced, where simpler models produce blur or artifacts.
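For readers less used to decibel figures, a short worked calculation shows what these gaps mean in terms of mean squared error. It uses the standard definition PSNR = 10 log10(peak^2 / MSE) and assumes both methods are scored against the same images with the same peak value; since reported PSNR is usually averaged over many test views, the ratio is only an approximate reading.

```python
import math


def psnr(mse, peak=1.0):
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    return 10.0 * math.log10(peak ** 2 / mse)


def mse_ratio(psnr_a_db, psnr_b_db):
    """How many times lower the MSE is at psnr_a_db than at psnr_b_db."""
    return 10 ** ((psnr_a_db - psnr_b_db) / 10.0)


vs_out_of_core = mse_ratio(28.6, 25.9)
vs_in_memory = mse_ratio(28.6, 24.1)
print(f"vs. out-of-core baseline: {vs_out_of_core:.2f}x lower MSE")
print(f"vs. in-memory training:   {vs_in_memory:.2f}x lower MSE")
```

A gain of 2.7 dB corresponds to roughly 1.9 times lower mean squared error, and a gain of 4.5 dB to roughly 2.8 times lower.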

Training time for the billion-primitive model was approximately 48 hours on a single RTX 4090—remarkably fast given the scale. The asynchronous pipeline kept GPU utilization above 90% for most of the training loop.

Implications for 3D Graphics, VR, and Robotics

The ability to train billion-primitive 3DGS models on consumer hardware could have wide-ranging impact. For the 3D graphics community, it means that high-fidelity digital twins of real-world locations can be created without access to server farms. Virtual reality experiences could capture entire buildings or outdoor environments in far greater detail. In robotics, large-scale scene representations are critical for navigation and manipulation in complex environments, and TideGS could make it practical to train those representations on a single workstation-class GPU rather than a cluster.

Furthermore, the underlying out-of-core framework is not limited to 3DGS. The principles of block virtualization, hierarchical pipelining, and differential streaming could be applied to other memory-intensive machine learning workloads where training data or parameters are too large to fit in GPU memory. The researchers have released the code as open source on GitHub (github.com/sponge-lab/TideGS), and the project has already accumulated over a dozen stars within days of posting.

One limitation acknowledged in the paper is that the current implementation is optimized for a single GPU. Multi-GPU scaling remains an open challenge, though the trajectory-conditioned sparsity suggests that model parallelism across devices might be feasible. The team also notes that the SSD demands are substantial: a billion Gaussians with full attributes can occupy more than 100 GB of storage.

TideGS arrives at a moment when 3D Gaussian Splatting is transitioning from a research novelty to a practical tool. By removing the memory ceiling that previously constrained scene scale, the HKUST team has provided a clear path forward for real-world deployment. The community will be watching to see how quickly other groups adopt and build upon this architecture.

Source: HuggingFace Papers

345tool Editorial Team

We are a team of AI technology enthusiasts and researchers dedicated to discovering, testing, and reviewing the latest AI tools to help users find the right solutions for their needs.
