
Intellectual honesty · This document — its text, layout, math, and diagrams — was 100% generated by Claude Opus 4.8. It should be read as an AI-authored reference map and verified against primary sources before being relied upon.

Serious-games / SkySplat reference · build map

3D Gaussian Splatting
// training pipeline

The full optimization loop, end to end — where the loops are, what shape the data is in at each hop, and a precise statement of every fundamental operation. Stages are tinted by whether they fall out of stock geometry nodes or are an irreducible custom kernel.

expressible in stock geometry nodes

irreducible custom C++ / GPU kernel

data flowing on edges (shape annotated)

Structure-from-Motion // upstream preprocess

SfM turns unposed images into the posed cameras and sparse cloud that S0/S1 consume — and its color ratio is the inverse of the trainer. The load-bearing algorithms (feature extraction, descriptor matching, the robust geometric solvers, bundle adjustment) are all irreducible kernels; native geometry nodes cover only the connective tissue: loop orchestration (RANSAC, incremental registration as repeat/sim zones), representation (keypoints, tracks, cameras, points as point clouds), reprojection-error / inlier arithmetic, and closed-form projection + two-view triangulation. In practice you never build this — you call COLMAP or GLOMAP and import, and feed-forward neural methods (DUSt3R / MASt3R, VGGSfM) increasingly replace the whole block with a network that lives entirely outside the node model.

Amber solid = irreducible kernel · amber dashed = mixed (native loop/arithmetic wrapping a custom solver) · grey = handoff into the main pipeline below

F1

Feature detection & description

custom kernel

Find repeatable keypoints and encode a neighborhood descriptor for each. Classic SIFT takes local extrema of a Difference-of-Gaussians scale space; learned detectors (SuperPoint, ALIKED) regress them with a CNN.

in: images N × H × W × 3 → keypoints (per img) K × 2 · descriptors K × 128

\[ D(x,y,\sigma)=\big(G(k\sigma)-G(\sigma)\big)\!*I,\qquad \text{keypoints}=\text{local extrema in }(x,y,\sigma) \]

Why custom. Image convolution + scale-space extrema (or a CNN) — the same wall as D-SSIM. Not native.

F2

Descriptor matching

custom kernel

For each descriptor, find its nearest neighbor in candidate images; keep only confident, mutually-consistent matches (Lowe's ratio against the second-nearest).

in: descriptors K × 128 → matches (putative correspondences per pair)

\[ \text{accept if}\quad \frac{d_{\text{NN}_1}}{d_{\text{NN}_2}}<0.8 \quad\text{and}\quad \text{mutual nearest neighbor} \]

Why custom. Nearest-neighbor search in 128-D descriptor space — geometry nodes' "nearest" is 3D-spatial only. Not native.
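
The acceptance rule above can be sketched in a few lines. This is a brute-force illustration, assuming Euclidean distance and plain Python lists as descriptors; the function names are hypothetical, and a real matcher would use an approximate nearest-neighbor index or a GPU kernel rather than an exhaustive scan.

```python
import math

def l2(a, b):
    """Euclidean distance between two descriptors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def two_nearest(query, candidates):
    """Brute-force search: index of the nearest candidate plus the two smallest distances."""
    dists = sorted((l2(query, c), j) for j, c in enumerate(candidates))
    best_d, best_j = dists[0]
    second_d = dists[1][0] if len(dists) > 1 else math.inf
    return best_j, best_d, second_d

def match_descriptors(desc_a, desc_b, ratio=0.8):
    """Keep pairs that pass the ratio test (A to B) and are mutual nearest neighbors."""
    forward = {}
    for i, d in enumerate(desc_a):
        j, d1, d2 = two_nearest(d, desc_b)
        if d1 < ratio * d2:          # equivalent to d1 / d2 < ratio, without dividing by zero
            forward[i] = j
    matches = []
    for i, j in forward.items():
        back, _, _ = two_nearest(desc_b[j], desc_a)
        if back == i:                # mutual check: B's nearest neighbor in A is i again
            matches.append((i, j))
    return matches
```

A descriptor that sits equally close to two candidates fails the ratio test and is dropped, which is the intended behavior for repeated texture.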

F3

Geometric verification

RANSAC loop native · solver custom

Reject matches inconsistent with a single two-view geometry: RANSAC samples minimal sets, fits the fundamental/essential matrix, and counts inliers by the epipolar constraint.

in: matches (per pair) → verified: inliers + relative pose

\[ x'^{\top}F\,x=0,\qquad E=K'^{\top}F\,K,\qquad N_{\text{RANSAC}}=\frac{\log(1-p)}{\log\!\big(1-w^{s}\big)} \]

Split. The RANSAC loop is a native repeat zone and the Sampson inlier test is attribute math — but the minimal 5-/8-point solver needs an SVD. Mixed.
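
The trial-count formula is easy to evaluate directly. A minimal sketch, assuming p is the desired success probability, w the inlier ratio, and s the minimal sample size; the function name and default values are illustrative only.

```python
import math

def ransac_iterations(p=0.99, w=0.5, s=8):
    """Trials needed so that, with probability p, at least one sample of s
    correspondences is all-inlier, given inlier ratio w."""
    if not (0.0 < p < 1.0 and 0.0 < w < 1.0):
        raise ValueError("p and w must lie strictly between 0 and 1")
    return math.ceil(math.log(1.0 - p) / math.log(1.0 - w ** s))
```

Because the count scales with w^s, the 5-point solver needs far fewer trials than the 8-point solver at the same inlier ratio, which is one reason the harder polynomial solver is often preferred.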

F4

Incremental mapping

loop + reproj native · solvers custom · ⟳ register loop

Seed from a strong initial pair, then repeatedly register the next-best image by PnP against existing 3D points, and triangulate new points into the growing model.

in: verified pairs + inliers → poses (registered cams) · tracks (3D points)

\[ \textbf{PnP: }\ \min_{R,t}\sum_j\big\lVert \pi\!\big(K[R\,|\,t]X_j\big)-x_j\big\rVert^2,\qquad \textbf{triangulate: }\ A\,X=0\ \ (\text{DLT, null space}) \]

Split. The registration loop is a simulation zone and reprojection is native; 2-view midpoint triangulation is a small native linear solve. But PnP / relative-pose and DLT null-space need SVD / polynomial solvers. Mixed.
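
The "small native linear solve" for two-view midpoint triangulation reduces to a 2×2 system. The sketch below assumes each observation has already been turned into a world-space ray (origin plus direction); the function name and the ray parameterization are assumptions made for illustration, not part of the source.

```python
def midpoint_triangulate(o1, d1, o2, d2):
    """Closest-approach midpoint of rays o1 + s*d1 and o2 + t*d2 (3-vectors as lists)."""
    dot = lambda u, v: sum(x * y for x, y in zip(u, v))
    w = [a - b for a, b in zip(o1, o2)]
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w), dot(d2, w)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        raise ValueError("rays are (nearly) parallel; triangulation is ill-posed")
    # Normal equations of min |w + s*d1 - t*d2|^2 solved in closed form.
    s = (b * e - c * d) / denom
    t = (a * e - b * d) / denom
    p1 = [o + s * k for o, k in zip(o1, d1)]
    p2 = [o + t * k for o, k in zip(o2, d2)]
    return [(x + y) / 2.0 for x, y in zip(p1, p2)]
```

Only dot products and one scalar division appear, which is why this piece fits attribute math while the DLT null space does not.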

F5

Bundle adjustment

custom kernel

Jointly refine every camera, intrinsic, and 3D point to minimize total reprojection error — the nonlinear least-squares heart of SfM, solved with Levenberg-Marquardt exploiting the camera/point sparsity.

in: poses + points (all cams, all tracks) → refined: [R|t], K, sparse cloud P×3

\[ \min_{\{C_i\},\{X_j\}}\ \sum_{i,j} v_{ij}\,\rho\!\Big(\big\lVert \pi(C_i,X_j)-x_{ij}\big\rVert^2\Big) \]

\[ \textbf{LM normal eqns: }\ (J^{\top}J+\lambda D)\,\delta=-J^{\top}r,\qquad \textbf{sparse Schur complement on the point block} \]

Why custom. A sparse nonlinear-least-squares engine — analytic/auto Jacobians plus a large sparse linear solve — with no autodiff and no sparse solver in nodes. Call Ceres. Emphatically not native.

⟲ hands off · The refined cameras and sparse cloud are exactly the S0 inputs / S1 initialization below. Net: of the five SfM sub-blocks, only the orchestration and arithmetic inside F3–F4 are native; F1, F2, F5 are wholly custom.

Bird's-eye schematic // the trainer

Green = stock geometry nodes · amber = the C++/GPU kernels that must be forked · cyan edges carry data (M = live Gaussian count, grows ~10⁴ → ~10⁶)

Setup // runs once

S0

Inputs

given data

A multi-view capture, posed by Structure-from-Motion (COLMAP). Each training datum is an image plus its calibrated camera; a sparse point cloud seeds the model.

images: N × H × W × 3 · extrinsics: [R | t], R∈SO(3), t∈ℝ³ · intrinsics: K = (f_x, f_y, c_x, c_y) · SfM points: P × 3 (+ rgb)

\[ \mathbf{x}_{\text{cam}} = R\,\mathbf{x}_{\text{world}} + t, \qquad \begin{bmatrix} u \\ v \end{bmatrix} = \begin{bmatrix} f_x\, x_{\text{cam}}/z_{\text{cam}} + c_x \\[2pt] f_y\, y_{\text{cam}}/z_{\text{cam}} + c_y \end{bmatrix} \]

Note. Poses and intrinsics are the fixed conditioning of the problem; only the Gaussians below are optimized.
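
The world-to-pixel mapping above, written out as a minimal sketch. It assumes R is a row-major 3×3 list and K is the 4-tuple (f_x, f_y, c_x, c_y) from the input list; the function name is illustrative.

```python
def project(x_world, R, t, K):
    """Apply x_cam = R x_world + t, then the pinhole model. Returns ((u, v), x_cam)."""
    fx, fy, cx, cy = K
    x_cam = [sum(R[r][c] * x_world[c] for c in range(3)) + t[r] for r in range(3)]
    x, y, z = x_cam
    if z <= 0.0:
        raise ValueError("point is behind the camera")
    return (fx * x / z + cx, fy * y / z + cy), x_cam
```

Returning x_cam alongside the pixel is deliberate: its z component is the depth that later serves as a sort key.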

S1

Initialize the Gaussian model

stock nodes

Each SfM point becomes a Gaussian. The model is a point cloud with per-point attributes — this is the central object, and it maps one-to-one onto a geometry-nodes point domain.

mean: μ ∈ ℝ³ · log-scale: s ∈ ℝ³ · quat: q ∈ ℝ⁴ · opacity logit: o ∈ ℝ · SH coeffs: c ∈ ℝ^{(L+1)²×3}

Scale and opacity are stored pre-activation so the optimizer works unconstrained; the covariance is built so it is positive-semidefinite by construction:

\[ S=\operatorname{diag}\!\big(e^{s}\big),\quad \alpha_{\max}=\sigma(o),\qquad \Sigma \;=\; R(q)\,S\,S^{\top}R(q)^{\top}\in\mathbb{R}^{3\times3} \]

Shape. With SH degree L = 3 → 16 coeffs × 3 channels = 48, the per-Gaussian vector is 3 + 3 + 4 + 1 + 48 = 59 floats. Model tensor: M × 59, with M initialized to P (~10⁴) and free to change at S10.
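
A sketch of the activation and covariance construction, plus the float count. The (w, x, y, z) quaternion ordering and the normalization step are assumptions of this example; the source only states that q ∈ ℝ⁴.

```python
import math

def quat_to_rotmat(q):
    """Rotation matrix from a quaternion, assumed ordered (w, x, y, z); normalized first."""
    w, x, y, z = q
    n = math.sqrt(w * w + x * x + y * y + z * z)
    w, x, y, z = w / n, x / n, y / n, z / n
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]

def covariance(log_scale, q):
    """Sigma = R S S^T R^T with S = diag(exp(s)); symmetric PSD by construction."""
    R = quat_to_rotmat(q)
    var = [math.exp(2.0 * s) for s in log_scale]   # diagonal of S S^T
    return [[sum(R[i][k] * var[k] * R[j][k] for k in range(3)) for j in range(3)]
            for i in range(3)]

def max_opacity(opacity_logit):
    """alpha_max = sigmoid(o)."""
    return 1.0 / (1.0 + math.exp(-opacity_logit))

def floats_per_gaussian(sh_degree=3):
    """mean 3 + log-scale 3 + quat 4 + opacity 1 + SH (L+1)^2 * 3."""
    return 3 + 3 + 4 + 1 + 3 * (sh_degree + 1) ** 2
```

Since the optimizer only ever sees s, q and o, no projection back onto the PSD cone is needed after a gradient step.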

Training loop // ×K (≈30k)

⟳ outer loop · Repeat S2–S9 for k = 1 … K. State carried across iterations: the model M×59 plus Adam moments. In geometry nodes this is a simulation/repeat zone.

S2

Sample a training view

stock nodes

Draw one camera (and its ground-truth image) from the N views — usually a shuffled epoch order.

in: model M×59 → view k: [R|t]ₖ, Kₖ, I_gtₖ : H×W×3

Note. The GT image is fed in via the image-sequence frame index; the camera can live as one point in a "cameras" point cloud (pose + intrinsics as attributes).

S3

View transform & projection (EWA splatting)

stock nodes · tedious

Each 3D Gaussian is pushed into screen space. The mean projects by the camera; the covariance can't be projected exactly, so the local affine approximation of the projection (its Jacobian) transports it.

in: μ, Σ (M×3, M×3×3) → μ′: M×2 (screen) · Σ′: M×2×2 → conic M×3 · depth: M×1

With camera-space mean t = (t_x,t_y,t_z), the projection Jacobian and the screen-space covariance are

\[ J=\begin{bmatrix} f_x/t_z & 0 & -f_x t_x/t_z^{2}\\[2pt] 0 & f_y/t_z & -f_y t_y/t_z^{2} \end{bmatrix}, \qquad \Sigma' \;=\; J\,R\,\Sigma\,R^{\top}J^{\top}\ \big|_{2\times2}\;+\;h\,I \]

The conic is the inverse 2D covariance used to evaluate the splat at a pixel; writing \(\Sigma'=\begin{psmallmatrix}a&b\\b&c\end{psmallmatrix}\):

\[ \Sigma'^{-1}=\frac{1}{ac-b^{2}}\begin{bmatrix} c & -b\\ -b & a\end{bmatrix},\qquad r \;=\; \big\lceil 3\sqrt{\lambda_{\max}(\Sigma')}\,\big\rceil \quad\text{(footprint radius)} \]

Note. The +hI term (h≈0.3 px²) is a low-pass dilation guaranteeing each splat covers ≥1 pixel — it suppresses aliasing of sub-pixel Gaussians. Doable in nodes as matrix attribute math, but verbose.
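
The covariance transport, conic, and footprint radius fit in two short functions. A sketch under the formulas above, assuming R_cam is the camera rotation from the extrinsics and matrices are row-major lists; the names are illustrative, and the closed-form 2×2 eigenvalue is used for λ_max.

```python
import math

def project_covariance(t_cam, cov_world, R_cam, fx, fy, h=0.3):
    """Return (a, b, c) of the symmetric 2x2 screen covariance J R Sigma R^T J^T + h I."""
    tx, ty, tz = t_cam
    J = [[fx / tz, 0.0, -fx * tx / tz ** 2],
         [0.0, fy / tz, -fy * ty / tz ** 2]]
    T = [[sum(J[i][k] * R_cam[k][j] for k in range(3)) for j in range(3)] for i in range(2)]
    TC = [[sum(T[i][k] * cov_world[k][j] for k in range(3)) for j in range(3)] for i in range(2)]
    a = sum(TC[0][k] * T[0][k] for k in range(3)) + h
    b = sum(TC[0][k] * T[1][k] for k in range(3))
    c = sum(TC[1][k] * T[1][k] for k in range(3)) + h
    return a, b, c

def conic_and_radius(a, b, c):
    """Inverse 2D covariance packed as three floats, plus the 3-sigma pixel radius."""
    det = a * c - b * b
    if det <= 0.0:
        raise ValueError("degenerate screen-space covariance")
    conic = (c / det, -b / det, a / det)
    mid = 0.5 * (a + c)
    lam_max = mid + math.sqrt(max(mid * mid - det, 0.0))   # larger eigenvalue of [[a,b],[b,c]]
    return conic, math.ceil(3.0 * math.sqrt(lam_max))
```

Everything here is scalar arithmetic per Gaussian, which is consistent with the stage being expressible as (verbose) attribute math.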

S4

Spherical-harmonic color

stock nodes · tedious

View-dependent radiance: each Gaussian's color is an SH expansion evaluated along the viewing direction d (unit vector camera→Gaussian).

in: SH coeffs M×16×3 · dir d = (μ−o)/‖μ−o‖ (o is the camera center here, not the opacity logit) → color c : M×3

\[ c_i \;=\; \operatorname{clamp}\!\Big( 0.5+\sum_{\ell=0}^{L}\sum_{m=-\ell}^{\ell} c_i^{\ell m}\,Y_\ell^{m}(d),\; 0\Big), \qquad Y_0^{0}=0.2820948 \]

Note. Degree 0 alone gives flat, view-independent color (a good v0). Higher degrees add specular/view variation. If you skip SH in a first fork, this whole stage disappears.
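
A sketch of the color evaluation truncated at degree 1. The degree-0 constant is the one given above; the degree-1 constant √(3/4π) ≈ 0.4886025 is the standard value, but the sign and ordering convention used for the three degree-1 terms is an assumption of this example and must match whatever convention produced the coefficients.

```python
import math

SH_C0 = 0.2820948          # Y_0^0
SH_C1 = 0.4886025          # sqrt(3 / (4*pi)), magnitude of the degree-1 basis

def view_direction(mean, cam_center):
    """Unit vector from the camera center to the Gaussian mean."""
    d = [m - o for m, o in zip(mean, cam_center)]
    n = math.sqrt(sum(x * x for x in d))
    return [x / n for x in d]

def sh_color(coeffs, d):
    """coeffs: list of (r, g, b) triples for (l, m) = (0,0), (1,-1), (1,0), (1,1).
    Passing only the first triple gives the view-independent degree-0 color."""
    x, y, z = d
    basis = [SH_C0, -SH_C1 * y, SH_C1 * z, -SH_C1 * x]   # assumed sign convention
    rgb = []
    for ch in range(3):
        val = 0.5 + sum(b * c[ch] for b, c in zip(basis, coeffs))
        rgb.append(max(val, 0.0))                         # clamp below at zero
    return rgb
```

With a single coefficient triple the direction drops out entirely, which is the "flat color v0" case described in the note.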

S5

Tile binning & depth sort

custom kernel

Screen is cut into 16×16 tiles. Every Gaussian is replicated into each tile its footprint touches, keyed by (tile_id, depth); one global sort then yields, per tile, a front-to-back ordered list.

in: μ′, r, depth (M×2, M, M) → sorted keys: ≥ M (one per tile-touch) · tile ranges: #tiles × 2

\[ \text{key}_{i,\tau}=\big(\underbrace{\tau}_{\text{32b tile}}\,\|\,\underbrace{\text{depth}_i}_{\text{32b}}\big)\in\mathbb{Z}_{64},\qquad \text{sort}\uparrow\ \Rightarrow\ \text{per-tile ordered runs} \]

Why custom. Geometry nodes have no per-tile ordered radix sort over a variable-fanout replication. This is the spatial data-structure step the rasterizer needs.
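
A CPU sketch of the replicate, key, and sort step. It assumes positive depths (so the float32 bit pattern orders the same way as the value), a square bounding box of half-width r around each mean, and Python's built-in sort standing in for the GPU radix sort; all function names are illustrative.

```python
import struct

TILE = 16

def depth_bits(depth):
    """Reinterpret a positive float32 as uint32; ordering is preserved for positive floats."""
    return struct.unpack("<I", struct.pack("<f", depth))[0]

def bin_and_sort(means2d, radii, depths, width, height):
    """Return (sorted [(key, gaussian_id)], {tile_id: (start, end)} ranges into that list)."""
    tiles_x = (width + TILE - 1) // TILE
    tiles_y = (height + TILE - 1) // TILE
    keys = []
    for gid, ((u, v), r, z) in enumerate(zip(means2d, radii, depths)):
        x0, x1 = max(0, int((u - r) // TILE)), min(tiles_x - 1, int((u + r) // TILE))
        y0, y1 = max(0, int((v - r) // TILE)), min(tiles_y - 1, int((v + r) // TILE))
        for ty in range(y0, y1 + 1):            # variable fanout: one key per touched tile
            for tx in range(x0, x1 + 1):
                tile_id = ty * tiles_x + tx
                keys.append(((tile_id << 32) | depth_bits(z), gid))
    keys.sort()                                  # tile-major, then front-to-back
    ranges = {}
    for pos, (key, _) in enumerate(keys):
        tile_id = key >> 32
        start = ranges[tile_id][0] if tile_id in ranges else pos
        ranges[tile_id] = (start, pos + 1)
    return keys, ranges
```

Packing the tile index into the high bits is what lets a single global sort produce every tile's depth-ordered run at once.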

S6

Rasterize — alpha compositing

custom kernel · ⟳ inner loop

The heart of the method. Each pixel walks its tile's sorted list front-to-back, accumulating color weighted by per-Gaussian alpha and remaining transmittance — the volumetric "over" operator — and stops once it is opaque.

in: conic, μ′, c, α_max (per Gaussian) → render Î: H×W×3 · aux: T_final, n_contrib : H×W

For pixel p, the i-th Gaussian's evaluated weight and the running transmittance are

\[ \alpha_i \;=\; \alpha_{\max,i}\,\exp\!\Big(\!-\tfrac12 (p-\mu'_i)^{\top}\Sigma'^{-1}_i (p-\mu'_i)\Big),\qquad T_i=\prod_{j<i}\big(1-\alpha_j\big) \]

\[ \hat C(p)=\sum_{i=1}^{N_p} c_i\,\alpha_i\,T_i, \qquad \hat A(p)=\sum_{i=1}^{N_p}\alpha_i T_i, \qquad \textbf{stop when } T_i<10^{-4} \]

The inner loop. N_p = the per-pixel depth-ordered count, variable, terminated early by transmittance. Why custom: ordered scatter-composite with early-out — no gather-only field model expresses it.
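
The per-pixel inner loop, as a scalar sketch. It assumes the splat list is already sorted front-to-back for this pixel's tile and that each splat is a dict with the hypothetical keys shown; a real kernel runs this per thread with shared-memory batching.

```python
import math

def composite_pixel(p, splats, t_min=1e-4):
    """Front-to-back 'over' compositing at pixel p.
    Each splat: {'mean2d': (u, v), 'conic': (A, B, C), 'color': (r, g, b), 'alpha_max': float}.
    Returns (color, final transmittance, number of splats that contributed)."""
    color = [0.0, 0.0, 0.0]
    T = 1.0                                   # transmittance in front of the current splat
    n_contrib = 0
    for s in splats:
        if T < t_min:                         # early-out: pixel is effectively opaque
            break
        dx = p[0] - s["mean2d"][0]
        dy = p[1] - s["mean2d"][1]
        A, B, C = s["conic"]
        power = -0.5 * (A * dx * dx + 2.0 * B * dx * dy + C * dy * dy)
        alpha = s["alpha_max"] * math.exp(power)
        for ch in range(3):
            color[ch] += s["color"][ch] * alpha * T
        T *= 1.0 - alpha
        n_contrib += 1
    return color, T, n_contrib
```

The returned transmittance and contribution count correspond to the aux outputs listed for this stage, which the backward pass needs in order to retrace the same list.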

S7

Loss

L₁ native · D-SSIM kernel

A pixel-wise term for fidelity plus a structural term for local contrast/structure, blended.

in: Î, I_gt (H×W×3) → scalar L · grad ∂L/∂Î : H×W×3

\[ \mathcal{L}=(1-\lambda)\,\mathcal{L}_1+\lambda\,\mathcal{L}_{\text{D-SSIM}},\qquad \mathcal{L}_1=\tfrac{1}{|I|}\textstyle\sum_p\lVert \hat C(p)-I_{\text{gt}}(p)\rVert_1,\qquad \lambda\approx0.2 \]

\[ \text{SSIM}=\frac{(2\mu_x\mu_y+C_1)(2\sigma_{xy}+C_2)}{(\mu_x^2+\mu_y^2+C_1)(\sigma_x^2+\sigma_y^2+C_2)},\qquad \mathcal{L}_{\text{D-SSIM}}=1-\text{SSIM} \]

Note. SSIM statistics are computed over an 11×11 Gaussian window → a convolution. L₁ is native attribute math; the windowed conv + its gradient is the custom piece — drop it for an L₁-only v0.
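
A deliberately simplified sketch of the blended loss on flat lists of intensities in [0, 1]. To stay short it computes SSIM over one global, uniformly weighted window instead of sliding an 11×11 Gaussian window, so it illustrates the formula rather than reproducing the real kernel; C₁ = 0.01² and C₂ = 0.03² are the conventional constants for unit dynamic range, which the source does not state.

```python
def l1_loss(render, gt):
    """Mean absolute error over all samples."""
    return sum(abs(a - b) for a, b in zip(render, gt)) / len(render)

def ssim_global(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """SSIM from global means, variances, and covariance (single uniform window)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((a - mx) * (a - mx) for a in x) / n
    vy = sum((b - my) * (b - my) for b in y) / n
    cxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    return ((2 * mx * my + c1) * (2 * cxy + c2)) / ((mx * mx + my * my + c1) * (vx + vy + c2))

def blended_loss(render, gt, lam=0.2):
    """(1 - lam) * L1 + lam * (1 - SSIM)."""
    return (1.0 - lam) * l1_loss(render, gt) + lam * (1.0 - ssim_global(render, gt))
```

Setting lam to 0 gives the L₁-only variant mentioned in the note, with no convolution anywhere in the loss.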

S8

Backward pass (analytic)

custom kernel · ⟳ reverse inner

No autodiff — every gradient is closed-form. The compositing list is walked back-to-front, distributing ∂L/∂Î onto each Gaussian's parameters and chaining back through projection.

in: ∂L/∂Î (H×W×3) · fwd state: T_final, lists → ∂L/∂θ: M×59

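The load-bearing gradient is alpha's, because raising αᵢ both adds its own color and dims everything behind it. Write Gᵢ for the Gaussian falloff (the exponential factor evaluated at the pixel), so that αᵢ = α_max,ᵢ·Gᵢ, and Sᵢ = Σⱼ>ᵢ cⱼαⱼTⱼ for the color accumulated behind i: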

\[ \frac{\partial \hat C}{\partial \alpha_i}=c_i\,T_i-\frac{S_i}{1-\alpha_i},\qquad \frac{\partial \alpha_i}{\partial \alpha_{\max,i}}=G_i,\qquad \frac{\partial \alpha_i}{\partial G_i}=\alpha_{\max,i} \]

then the chain unwinds through the Gaussian, the conic, the covariance, and the projection:

\[ \frac{\partial \mathcal L}{\partial G_i}\!\to\!\frac{\partial \mathcal L}{\partial \mu'_i},\frac{\partial \mathcal L}{\partial \Sigma'^{-1}_i} \;\to\; \frac{\partial \mathcal L}{\partial \Sigma'_i}\!\to\!\frac{\partial \mathcal L}{\partial \Sigma_i}\;(\text{via } JR)\;\to\;\frac{\partial \mathcal L}{\partial q_i},\frac{\partial \mathcal L}{\partial s_i};\quad \frac{\partial \mathcal L}{\partial \mu'_i}\!\to\!\frac{\partial \mathcal L}{\partial \mu_i} \]

Why custom & why it's the work. This is the hand-derived adjoint of S3–S6. Gradient-checking every term against finite differences is the real labor of the fork. Tᵢ is recovered from the stored final transmittance while walking backward.
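
The alpha gradient and the finite-difference check can be shown on a single-channel pixel. This sketch assumes no early termination and alphas strictly below 1, and walks the list back-to-front while recovering each Tᵢ by dividing out (1 − αᵢ) from the final transmittance; the function names are illustrative.

```python
def composite(colors, alphas):
    """Forward pass: returns (C_hat, T_final) for one channel, front-to-back."""
    out, T = 0.0, 1.0
    for c, a in zip(colors, alphas):
        out += c * a * T
        T *= 1.0 - a
    return out, T

def grad_alpha(colors, alphas):
    """Analytic dC_hat/dalpha_i = c_i * T_i - S_i / (1 - alpha_i), walked back-to-front."""
    _, T = composite(colors, alphas)          # start from the stored final transmittance
    grads = [0.0] * len(alphas)
    behind = 0.0                              # S_i: color accumulated behind splat i
    for i in reversed(range(len(alphas))):
        T /= 1.0 - alphas[i]                  # recover T_i from T_{i+1}
        grads[i] = colors[i] * T - behind / (1.0 - alphas[i])
        behind += colors[i] * alphas[i] * T
    return grads

def finite_difference(colors, alphas, i, eps=1e-6):
    """Central-difference estimate of dC_hat/dalpha_i, for gradient checking."""
    hi, lo = list(alphas), list(alphas)
    hi[i] += eps
    lo[i] -= eps
    return (composite(colors, hi)[0] - composite(colors, lo)[0]) / (2.0 * eps)
```

Comparing grad_alpha against finite_difference for every index is the smallest instance of the gradient-checking labor described above; the same pattern extends to the conic, covariance, and projection terms.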

S9

Optimizer — Adam

stock nodes

Per-parameter adaptive step with bias-corrected first/second moments. Moments persist across iterations as extra attributes in the loop's state.

in: g = ∂L/∂θ (M×59) · m, v (M×59 each) → θ′: M×59

\[ m\leftarrow\beta_1 m+(1-\beta_1)g,\quad v\leftarrow\beta_2 v+(1-\beta_2)g^2,\quad \hat m=\tfrac{m}{1-\beta_1^{k}},\ \hat v=\tfrac{v}{1-\beta_2^{k}} \]

\[ \theta \leftarrow \theta-\eta\,\frac{\hat m}{\sqrt{\hat v}+\epsilon}, \qquad \eta_{\text{pos}}(k)=\eta_0\Big(\tfrac{\eta_K}{\eta_0}\Big)^{k/K}\ \text{(exp. decay; other groups fixed)} \]

Note. Each parameter group (position, SH, opacity, scale, rotation) carries its own learning rate; only position's decays. Fully buildable in attribute-math nodes.
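
The update rule as a sketch over flat Python lists. The β and ε defaults below are the commonly used Adam values and are an assumption here, since the source does not state them; the learning rates passed to the schedule are likewise placeholders.

```python
import math

def adam_step(theta, grad, m, v, k, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """One bias-corrected Adam update; k is the 1-based iteration index.
    Returns the new (theta, m, v) so the moments can ride along in the loop state."""
    new_theta, new_m, new_v = [], [], []
    for th, g, mi, vi in zip(theta, grad, m, v):
        mi = beta1 * mi + (1.0 - beta1) * g
        vi = beta2 * vi + (1.0 - beta2) * g * g
        m_hat = mi / (1.0 - beta1 ** k)
        v_hat = vi / (1.0 - beta2 ** k)
        new_theta.append(th - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(mi)
        new_v.append(vi)
    return new_theta, new_m, new_v

def position_lr(step, total_steps, lr_start, lr_end):
    """Exponential interpolation from lr_start (step 0) to lr_end (final step)."""
    return lr_start * (lr_end / lr_start) ** (step / total_steps)
```

Calling adam_step once per parameter group, each with its own lr, reproduces the per-group rates; only the position group would take its lr from position_lr.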

⟲ back to S2 · Next iteration with the updated model. The loop is the optimization; one pass over K≈30k steps is a full training run.

Adaptive density control // periodic

⟳ every ~100 it · Interleaved into the loop (after warm-up ~500, until ~15k). This is the only stage that changes M: it grows detail where the scene is under-fit and culls waste.

S10

Clone · split · prune

stock nodes

Decisions are driven by the accumulated view-space positional gradient — high gradient means the region wants more representational capacity.

\[ \tau_i \;=\; \big\langle\, \lVert \partial \mathcal L/\partial \mu'_i\rVert \,\big\rangle_{\text{interval}} \;>\; \tau_{\text{th}}\;(\approx2\!\times\!10^{-4}) \;\Rightarrow\; \text{densify } i \]

Flagged Gaussians are then branched by world-space size: ‖e^{s}‖ small ⇒ clone, large ⇒ split (two children, shrunk, repositioned by sampling the parent's own distribution):

\[ \textbf{split: }\ s_{\text{child}} = s-\ln\varphi\ (\varphi\!\approx\!1.6),\qquad \mu_{\text{child}} = \mu + R(q)\,S\,\xi,\quad \xi\sim\mathcal N(0,I) \]

\[ \textbf{prune: }\ \sigma(o_i)<\epsilon_\alpha\ \ (\approx0.005)\ \ \text{or footprint too large};\qquad \textbf{reset: }\ o\leftarrow\sigma^{-1}(0.01)\ \text{every}\sim3000 \]

effect: M grows / shrinks → new M×59

Note. Clone = Duplicate Elements, prune = Delete Geometry, split = duplicate + scale + jittered resample — all native topology ops. The trigger needs ∂L/∂μ′ from S8, accumulated in the loop state.
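
The decision logic and the split sampling, as a per-Gaussian sketch. The gradient and opacity thresholds are the values quoted above; size_th (the clone/split boundary) is a hypothetical placeholder because the source gives no number for it, and checking prune before densify is likewise an ordering assumed for this example.

```python
import math
import random

def densify_action(avg_grad, log_scale, opacity_logit,
                   grad_th=2e-4, size_th=0.01, min_alpha=0.005):
    """Return 'prune', 'keep', 'clone', or 'split' for one Gaussian."""
    alpha_max = 1.0 / (1.0 + math.exp(-opacity_logit))
    if alpha_max < min_alpha:
        return "prune"
    if avg_grad <= grad_th:
        return "keep"
    size = math.sqrt(sum(math.exp(2.0 * s) for s in log_scale))   # ||e^s||
    return "split" if size > size_th else "clone"

def split_children(mean, log_scale, R, rng, phi=1.6, n_children=2):
    """Sample children from the parent's own distribution and shrink their scales.
    R is the parent's 3x3 rotation matrix (row-major lists)."""
    S = [math.exp(s) for s in log_scale]
    children = []
    for _ in range(n_children):
        xi = [rng.gauss(0.0, 1.0) for _ in range(3)]
        local = [S[k] * xi[k] for k in range(3)]
        child_mean = [mean[i] + sum(R[i][k] * local[k] for k in range(3)) for i in range(3)]
        child_scale = [s - math.log(phi) for s in log_scale]
        children.append((child_mean, child_scale))
    return children
```

For example, split_children(mu, s, R, random.Random(0)) returns two (mean, log-scale) pairs; in a node graph the same thing is a duplicate followed by attribute math on the copies.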

Data-shape ledger

Object	Shape	Lives where / notes
Model parameters θ	M × 59	μ:3, s:3, q:4, o:1, SH:48 — M ≈ 10⁴→10⁶
Adam moments m, v	M × 59 (×2)	loop state
Gradients ∂L/∂θ	M × 59	output of S8
Screen means μ′	M × 2	S3 → S6
Conic (Σ′⁻¹)	M × 3	symmetric 2×2, packed
Depth / radius	M × 1 each	sort keys + culling
Per-Gaussian color c	M × 3	S4 → S6
Sorted tile keys	≥ M (Σ tile-touches)	S5; variable fanout
Ground-truth view I_gt	H × W × 3	one sampled per iteration
Render Î (+ aux)	H × W × 3 (+ T, n: H×W)	S6; aux retained for S8
Loss gradient ∂L/∂Î	H × W × 3	S7 → S8

The three loop scopes

Loop	Span	Count / exit	Nodes-equivalent
Outer — training	S2 → S9	K ≈ 30,000	simulation / repeat zone
Inner — per pixel	inside S6 / S8	N_p, early-out T<1e-4	inside the custom kernel
Periodic — densify	branch at S10	every ~100 it (warm-up→~15k)	duplicate / delete elements

minimal trainer fork ≈ S5 + S6 + S8 (one differentiable rasterizer, fwd+bwd) · everything green stays stock geometry nodes · SfM (F1–F5) is upstream and run externally — COLMAP / GLOMAP / neural