Intellectual honesty · This document — its text, layout, math, and diagrams — was 100% generated by Claude Opus 4.8. It should be read as an AI-authored reference map and verified against primary sources before being relied upon.
Serious-games / SkySplat reference · build map
The full optimization loop, end to end — where the loops are, what shape the data is in at each hop, and a precise statement of every fundamental operation. Stages are tinted by whether they fall out of stock geometry nodes or are an irreducible custom kernel.
SfM turns unposed images into the posed cameras and sparse cloud that S0/S1 consume — and its color ratio is the inverse of the trainer. The load-bearing algorithms (feature extraction, descriptor matching, the robust geometric solvers, bundle adjustment) are all irreducible kernels; native geometry nodes cover only the connective tissue: loop orchestration (RANSAC, incremental registration as repeat/sim zones), representation (keypoints, tracks, cameras, points as point clouds), reprojection-error / inlier arithmetic, and closed-form projection + two-view triangulation. In practice you never build this — you call COLMAP or GLOMAP and import, and feed-forward neural methods (DUSt3R / MASt3R, VGGSfM) increasingly replace the whole block with a network that lives entirely outside the node model.
Amber solid = irreducible kernel · amber dashed = mixed (native loop/arithmetic wrapping a custom solver) · grey = handoff into the main pipeline below
Find repeatable keypoints and encode a neighborhood descriptor for each. Classic SIFT takes local extrema of a Difference-of-Gaussians scale space; learned detectors (SuperPoint, ALIKED) regress them with a CNN.
Why custom. Image convolution + scale-space extrema (or a CNN) — the same wall as D-SSIM. Not native.
For each descriptor, find its nearest neighbor in candidate images; keep only confident matches (Lowe's ratio test against the second-nearest neighbor) that are also mutually consistent (each descriptor is the other's nearest neighbor).
Why custom. Nearest-neighbor search in 128-D descriptor space — geometry nodes' "nearest" is 3D-spatial only. Not native.
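The two filters can be sketched in plain Python. This is a brute-force illustration, not how COLMAP or any real matcher is implemented; the function names and the 0.8 ratio are assumptions chosen for the example.

```python
import math

def l2(a, b):
    """Euclidean distance between two equal-length descriptors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def ratio_matches(desc_a, desc_b, ratio=0.8):
    """For each descriptor in desc_a, keep its nearest neighbor in desc_b
    only if it is clearly closer than the second-nearest (ratio test)."""
    out = {}
    for i, d in enumerate(desc_a):
        ranked = sorted((l2(d, e), j) for j, e in enumerate(desc_b))
        if len(ranked) >= 2 and ranked[0][0] < ratio * ranked[1][0]:
            out[i] = ranked[0][1]
    return out

def mutual_matches(desc_a, desc_b, ratio=0.8):
    """Keep only pairs that survive the ratio test in both directions."""
    ab = ratio_matches(desc_a, desc_b, ratio)
    ba = ratio_matches(desc_b, desc_a, ratio)
    return sorted((i, j) for i, j in ab.items() if ba.get(j) == i)

# Toy 2-D "descriptors"; real ones are 128-D or larger.
a = [[0.0, 0.0], [10.0, 10.0]]
b = [[10.1, 10.0], [0.0, 0.1], [50.0, -50.0]]
print(mutual_matches(a, b))
```

The brute-force sort is quadratic in the number of descriptors; production matchers replace it with GPU batching or approximate nearest-neighbor indices, which is part of why this stage is a custom kernel.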
Reject matches inconsistent with a single two-view geometry: RANSAC samples minimal sets, fits the fundamental/essential matrix, and counts inliers by the epipolar constraint.
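Split. The RANSAC loop is a native repeat zone and the Sampson inlier test is attribute math, but the model solver is not: the linear 8-point method needs an SVD, and the minimal 5-point essential-matrix solver needs polynomial root-finding on top of a null-space computation. Mixed.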
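The "attribute math" half of this stage is the inlier score. A minimal sketch of the Sampson distance (the first-order approximation of the epipolar reprojection error) for a given hypothesis \(F\) follows; the solver that produces \(F\) is deliberately left out, and the helper names and threshold are assumptions.

```python
def matvec(M, v):
    """3x3 matrix times 3-vector."""
    return [sum(M[r][c] * v[c] for c in range(3)) for r in range(3)]

def transpose(M):
    return [[M[c][r] for c in range(3)] for r in range(3)]

def sampson_distance(F, x1, x2):
    """Squared Sampson distance of the pixel correspondence x1 -> x2
    under the fundamental matrix F (constraint: x2^T F x1 = 0)."""
    p1 = (x1[0], x1[1], 1.0)
    p2 = (x2[0], x2[1], 1.0)
    Fx1 = matvec(F, p1)
    Ftx2 = matvec(transpose(F), p2)
    num = sum(a * b for a, b in zip(p2, Fx1)) ** 2
    den = Fx1[0] ** 2 + Fx1[1] ** 2 + Ftx2[0] ** 2 + Ftx2[1] ** 2
    return num / den

def count_inliers(F, matches, thresh=1.0):
    """Number of (x1, x2) pairs whose squared Sampson distance is below thresh."""
    return sum(1 for x1, x2 in matches if sampson_distance(F, x1, x2) < thresh)

# Rectified stereo pair: corresponding points share the same image row.
F_rectified = [[0.0, 0.0, 0.0], [0.0, 0.0, -1.0], [0.0, 1.0, 0.0]]
pairs = [((3.0, 5.0), (7.0, 5.0)), ((3.0, 5.0), (7.0, 7.0))]
print(count_inliers(F_rectified, pairs))
```

Inside a RANSAC iteration this score would be evaluated once per hypothesis, and the hypothesis with the most inliers kept.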
Seed from a strong initial pair, then repeatedly register the next-best image by PnP against existing 3D points, and triangulate new points into the growing model.
Split. The registration loop is a simulation zone and reprojection is native; 2-view midpoint triangulation is a small native linear solve. But PnP / relative-pose and DLT null-space need SVD / polynomial solvers. Mixed.
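The "small native linear solve" for two-view midpoint triangulation is a 2×2 system. A sketch under the assumption that each observation has already been turned into a world-space ray (camera center plus direction); the function name is illustrative.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def midpoint_triangulate(o1, d1, o2, d2, eps=1e-12):
    """Closest-approach midpoint of the rays o1 + s*d1 and o2 + t*d2.
    Returns None when the rays are (near) parallel."""
    w = [p - q for p, q in zip(o1, o2)]
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w), dot(d2, w)
    denom = a * c - b * b
    if abs(denom) < eps:
        return None
    s = (b * e - c * d) / denom
    t = (a * e - b * d) / denom
    p1 = [o + s * v for o, v in zip(o1, d1)]
    p2 = [o + t * v for o, v in zip(o2, d2)]
    return [0.5 * (u + v) for u, v in zip(p1, p2)]

# Two cameras one unit apart, both looking at the point (1, 2, 5).
print(midpoint_triangulate((0, 0, 0), (1, 2, 5), (1, 0, 0), (0, 2, 5)))
```

The midpoint is not the reprojection-error optimum, which is one reason the result is later refined by bundle adjustment.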
Jointly refine every camera, intrinsic, and 3D point to minimize total reprojection error — the nonlinear least-squares heart of SfM, solved with Levenberg-Marquardt exploiting the camera/point sparsity.
Why custom. A sparse nonlinear-least-squares engine — analytic/auto Jacobians plus a large sparse linear solve — with no autodiff and no sparse solver in nodes. Call Ceres. Emphatically not native.
Green = stock geometry nodes · amber = the C++/GPU kernels that must be forked · cyan edges carry data (M = live Gaussian count, grows ~10⁴ → ~10⁶)
A multi-view capture, posed by Structure-from-Motion (COLMAP). Each training datum is an image plus its calibrated camera; a sparse point cloud seeds the model.
Note. Poses and intrinsics are the fixed conditioning of the problem; only the Gaussians below are optimized.
Each SfM point becomes a Gaussian. The model is a point cloud with per-point attributes — this is the central object, and it maps one-to-one onto a geometry-nodes point domain.
Scale and opacity are stored pre-activation so the optimizer works unconstrained; the covariance is built so it is positive-semidefinite by construction:
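The usual construction is \(\Sigma = R\,S\,S^\top R^\top\) with \(S=\operatorname{diag}(e^{s})\), \(R\) the rotation of the normalized quaternion \(q\), and opacity \(\alpha=\sigma(o)\) (a sigmoid). A minimal sketch, assuming the quaternion is stored as \((w,x,y,z)\); the function names are illustrative.

```python
import math

def quat_to_rot(q):
    """Rotation matrix of a quaternion (w, x, y, z); normalized first."""
    w, x, y, z = q
    n = math.sqrt(w * w + x * x + y * y + z * z)
    w, x, y, z = w / n, x / n, y / n, z / n
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]

def covariance(log_scale, quat):
    """Sigma = (R S)(R S)^T, positive-semidefinite for any inputs."""
    R = quat_to_rot(quat)
    sig = [math.exp(v) for v in log_scale]          # activated scales
    M = [[R[r][c] * sig[c] for c in range(3)] for r in range(3)]  # M = R S
    return [[sum(M[r][k] * M[c][k] for k in range(3)) for c in range(3)]
            for r in range(3)]

def sigmoid(o):
    """Opacity activation: maps the stored logit to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-o))

# Axis scales 1, 2, 3 rotated 90 degrees about z.
h = math.sqrt(0.5)
print(covariance([0.0, math.log(2.0), math.log(3.0)], (h, 0.0, 0.0, h)))
```

Because \(\Sigma\) is a product \(MM^\top\), no gradient step on \(s\) or \(q\) can make it indefinite, which is the point of storing the factors rather than the matrix.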
Shape. With SH degree L = 3 there are (L+1)² = 16 coefficients × 3 channels = 48 SH floats, so the per-Gaussian vector is 3 (μ) + 3 (s) + 4 (q) + 1 (o) + 48 (SH) = 59 floats. Model tensor: M × 59, with M initialized to P, the number of SfM points (~10⁴), and free to change at S10.
Draw one camera (and its ground-truth image) from the N views — usually a shuffled epoch order.
Note. Feeds the GT image via the image sequence frame index; the camera can live as one point in a "cameras" point cloud (pose + intrinsics as attributes).
Each 3D Gaussian is pushed into screen space. The mean projects by the camera; the covariance can't be projected exactly, so the local affine approximation of the projection (its Jacobian) transports it.
With camera-space mean t = (t_x,t_y,t_z), the projection Jacobian and the screen-space covariance are
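One common way to write the pair, assuming a pinhole camera with focal lengths \(f_x, f_y\) in pixels and a world-to-camera rotation \(W\) (neither symbol is defined in the text above), is \(J=\begin{psmallmatrix}f_x/t_z&0&-f_x t_x/t_z^2\\0&f_y/t_z&-f_y t_y/t_z^2\end{psmallmatrix}\) and \(\Sigma' = J\,W\,\Sigma\,W^\top J^\top + hI\). A minimal sketch of that arithmetic:

```python
def project_covariance(cov3d, W, t, fx, fy, h=0.3):
    """Screen-space 2x2 covariance of a 3D Gaussian.
    cov3d: 3x3 world covariance, W: 3x3 world-to-camera rotation,
    t: camera-space mean, fx/fy: focal lengths in pixels (assumed pinhole),
    h: low-pass dilation added to the diagonal."""
    tx, ty, tz = t
    J = [[fx / tz, 0.0, -fx * tx / (tz * tz)],
         [0.0, fy / tz, -fy * ty / (tz * tz)]]
    # A = J W is the 2x3 local affine map from world offsets to pixels.
    A = [[sum(J[r][k] * W[k][c] for k in range(3)) for c in range(3)]
         for r in range(2)]
    AS = [[sum(A[r][k] * cov3d[k][c] for k in range(3)) for c in range(3)]
          for r in range(2)]
    cov2d = [[sum(AS[r][k] * A[c][k] for k in range(3)) for c in range(2)]
             for r in range(2)]
    cov2d[0][0] += h
    cov2d[1][1] += h
    return cov2d

identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
cov = [[0.01, 0.0, 0.0], [0.0, 0.04, 0.0], [0.0, 0.0, 0.09]]
print(project_covariance(cov, identity, (0.0, 0.0, 2.0), 100.0, 100.0))
```

On the optical axis the Jacobian reduces to a uniform scale \(f/t_z\), so the example simply scales the x and y variances by \(50^2\) before the dilation is added.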
The conic is the inverse 2D covariance used to evaluate the splat at a pixel; writing \(\Sigma'=\begin{psmallmatrix}a&b\\b&c\end{psmallmatrix}\):
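For a symmetric 2×2 matrix the inverse is closed-form: with \(\det\Sigma' = ac-b^2\), the conic is \(\Sigma'^{-1}=\frac{1}{ac-b^2}\begin{psmallmatrix}c&-b\\-b&a\end{psmallmatrix}\), stored as three floats. The sketch below also derives a screen-space radius from the largest eigenvalue; the 3-sigma cutoff is an assumption made for the example, not something the text above specifies.

```python
import math

def conic_and_radius(a, b, c):
    """Packed inverse (A, B, C) of [[a, b], [b, c]] and a 3-sigma pixel radius.
    Returns None for a degenerate (non-invertible) covariance."""
    det = a * c - b * b
    if det <= 0.0:
        return None
    conic = (c / det, -b / det, a / det)
    mid = 0.5 * (a + c)
    lam_max = mid + math.sqrt(max(mid * mid - det, 0.0))  # largest eigenvalue
    radius = math.ceil(3.0 * math.sqrt(lam_max))
    return conic, radius

print(conic_and_radius(4.0, 0.0, 1.0))
```

The radius is what decides which tiles a splat touches, so it feeds the sort stage as well as culling.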
Note. The +hI term (h≈0.3 px²) is a low-pass dilation guaranteeing each splat covers ≥1 pixel — it suppresses aliasing of sub-pixel Gaussians. Doable in nodes as matrix attribute math, but verbose.
View-dependent radiance: each Gaussian's color is an SH expansion evaluated along the viewing direction d (unit vector camera→Gaussian).
Note. Degree 0 alone gives flat, view-independent color (a good v0). Higher degrees add specular/view variation. If you skip SH in a first fork, this whole stage disappears.
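A sketch of the evaluation up to degree 1 shows how little is involved in a v0. The basis signs and the +0.5 offset follow a convention commonly used in 3DGS code and are assumptions here; the coefficient layout (a list of RGB triples, one per basis function) is also illustrative.

```python
import math

SH_C0 = 0.28209479177387814   # 1 / (2 sqrt(pi))
SH_C1 = 0.4886025119029199    # sqrt(3) / (2 sqrt(pi))

def sh_color(coeffs, view_dir):
    """RGB from SH coefficients along view_dir (camera -> Gaussian).
    coeffs: 1 triple (degree 0) or 4 triples (degrees 0 and 1)."""
    x, y, z = view_dir
    n = math.sqrt(x * x + y * y + z * z)
    x, y, z = x / n, y / n, z / n
    rgb = []
    for ch in range(3):
        value = SH_C0 * coeffs[0][ch]
        if len(coeffs) >= 4:
            value += SH_C1 * (-y * coeffs[1][ch] + z * coeffs[2][ch]
                              - x * coeffs[3][ch])
        rgb.append(max(0.0, value + 0.5))  # offset, then clamp negatives
    return rgb

flat = [(1.0, 0.0, -2.0)]
print(sh_color(flat, (0.0, 0.0, 1.0)))
```

With only the degree-0 triple the output ignores `view_dir` entirely, which is the flat, view-independent case described above.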
Screen is cut into 16×16 tiles. Every Gaussian is replicated into each tile its footprint touches, keyed by (tile_id, depth); one global sort then yields, per tile, a front-to-back ordered list.
Why custom. Geometry nodes have no per-tile ordered radix sort over a variable-fanout replication. This is the spatial data-structure step the rasterizer needs.
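The replicate-then-sort step is easy to state in plain Python even though it needs a GPU radix sort at scale. A sketch, assuming each splat is given as a screen position, a pixel radius, and a depth; the tuple key here stands in for the packed integer key a GPU implementation would sort.

```python
def build_tile_lists(splats, width, height, tile=16):
    """splats: list of (x, y, radius, depth) in pixels.
    Returns {tile_id: [splat indices, front to back]}."""
    tiles_x = (width + tile - 1) // tile
    tiles_y = (height + tile - 1) // tile
    keys = []
    for idx, (x, y, r, depth) in enumerate(splats):
        # Tile rectangle covered by the splat's bounding square, clamped to the screen.
        x0 = max(0, int((x - r) // tile))
        x1 = min(tiles_x - 1, int((x + r) // tile))
        y0 = max(0, int((y - r) // tile))
        y1 = min(tiles_y - 1, int((y + r) // tile))
        for ty in range(y0, y1 + 1):
            for tx in range(x0, x1 + 1):
                keys.append((ty * tiles_x + tx, depth, idx))  # one key per touch
    keys.sort()  # one global sort: by tile first, then by depth
    lists = {}
    for tile_id, _, idx in keys:
        lists.setdefault(tile_id, []).append(idx)
    return lists

splats = [(8.0, 8.0, 2.0, 5.0), (16.0, 8.0, 4.0, 1.0), (24.0, 8.0, 2.0, 3.0)]
print(build_tile_lists(splats, width=32, height=16))
```

The middle splat straddles the tile boundary and therefore appears in both lists, which is the variable fanout that makes the key array longer than M.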
The heart of the method. Each pixel walks its tile's sorted list front-to-back, accumulating color weighted by per-Gaussian alpha and remaining transmittance — the volumetric "over" operator — and stops once it is opaque.
For pixel p, the i-th Gaussian's evaluated weight and the running transmittance are
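In the usual formulation, with \(\Delta_i = p-\mu'_i\): \(\alpha_i=\sigma(o_i)\exp\!\big(-\tfrac12\Delta_i^\top\Sigma_i'^{-1}\Delta_i\big)\), \(T_i=\prod_{j<i}(1-\alpha_j)\), and the pixel color is \(\hat I(p)=\sum_{i=1}^{N_p} c_i\,\alpha_i\,T_i\). A minimal single-pixel sketch follows; the function names are illustrative, and the exact point at which the early-out is tested is a design choice that varies between implementations.

```python
import math

def splat_alpha(pixel, mean2d, conic, opacity):
    """Alpha of one Gaussian at one pixel. conic = packed (A, B, C) of the
    inverse 2D covariance; opacity is the activated value sigma(o)."""
    dx = pixel[0] - mean2d[0]
    dy = pixel[1] - mean2d[1]
    A, B, C = conic
    power = -0.5 * (A * dx * dx + C * dy * dy) - B * dx * dy
    return opacity * math.exp(power)

def composite(alphas, colors, t_min=1e-4):
    """Front-to-back 'over' compositing with transmittance early-out.
    Returns (rgb, final transmittance, number of Gaussians consumed)."""
    T = 1.0
    rgb = [0.0, 0.0, 0.0]
    used = 0
    for alpha, color in zip(alphas, colors):
        weight = alpha * T
        for k in range(3):
            rgb[k] += weight * color[k]
        T *= 1.0 - alpha
        used += 1
        if T < t_min:      # pixel is effectively opaque: stop walking the list
            break
    return rgb, T, used

print(composite([0.5, 0.5], [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]))
```

The final transmittance and the count of consumed Gaussians are exactly the auxiliary values the backward pass needs to retrace this loop.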
The inner loop. N_p = the per-pixel depth-ordered count, variable, terminated early by transmittance. Why custom: ordered scatter-composite with early-out — no gather-only field model expresses it.
A pixel-wise term for fidelity plus a structural term for local contrast/structure, blended.
Note. SSIM statistics are computed over an 11×11 Gaussian window → a convolution. L₁ is native attribute math; the windowed conv + its gradient is the custom piece — drop it for an L₁-only v0.
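The blend is usually written \(\mathcal L=(1-\lambda)\,\mathcal L_1+\lambda\,(1-\mathrm{SSIM})\); \(\lambda=0.2\) is the value commonly quoted for 3DGS and is an assumption here. The sketch below evaluates SSIM over a single window with uniform weights, a deliberate simplification of the 11×11 Gaussian-weighted sliding window, so it illustrates the formula rather than reproducing the real loss.

```python
def l1(render, target):
    """Mean absolute error over flat lists of pixel values in [0, 1]."""
    return sum(abs(x - y) for x, y in zip(render, target)) / len(render)

def ssim_window(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """SSIM of one window, uniform weights (simplified stand-in)."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    vx = sum((v - mx) ** 2 for v in x) / n
    vy = sum((v - my) ** 2 for v in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    return ((2 * mx * my + c1) * (2 * cov + c2)) / (
        (mx * mx + my * my + c1) * (vx + vy + c2))

def photometric_loss(render, target, lam=0.2):
    """(1 - lam) * L1 + lam * (1 - SSIM)."""
    return (1 - lam) * l1(render, target) + lam * (1 - ssim_window(render, target))

patch = [0.1, 0.4, 0.9, 0.3]
print(photometric_loss(patch, patch))
```

Setting `lam=0` gives the L₁-only v0 mentioned above, with no convolution anywhere in the loss.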
No autodiff — every gradient is closed-form. The compositing list is walked back-to-front, distributing ∂L/∂Î onto each Gaussian's parameters and chaining back through projection.
The load-bearing gradient is alpha's, because raising \(\alpha_i\) both adds its own color and dims everything behind it. With \(S_i=\sum_{j>i} c_j\,\alpha_j\,T_j\) (the color accumulated behind i):
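Differentiating \(\hat I=\sum_j c_j\alpha_j T_j\) with \(T_j=\prod_{k<j}(1-\alpha_k)\) gives \(\partial\hat I/\partial\alpha_i = c_i\,T_i - S_i/(1-\alpha_i)\): the first term is the Gaussian's own contribution, the second the dimming of everything behind it. A single-channel sketch of the back-to-front walk, ignoring any background color and assuming every \(\alpha_i<1\):

```python
def forward(alphas, colors):
    """Front-to-back compositing of scalar colors; returns (color, final T)."""
    T, C = 1.0, 0.0
    for a, c in zip(alphas, colors):
        C += c * a * T
        T *= 1.0 - a
    return C, T

def alpha_grads(alphas, colors, T_final):
    """dC/dalpha_i for every i, walking the list back to front.
    T_i is recovered from the stored final transmittance by dividing
    out (1 - alpha_i) at each step."""
    grads = [0.0] * len(alphas)
    T = T_final   # transmittance behind the last Gaussian
    S = 0.0       # color accumulated behind the current Gaussian
    for i in reversed(range(len(alphas))):
        a, c = alphas[i], colors[i]
        T = T / (1.0 - a)                 # now T == T_i
        grads[i] = c * T - S / (1.0 - a)
        S += c * a * T
    return grads

alphas = [0.3, 0.5, 0.2]
colors = [1.0, 0.4, 0.8]
_, T_final = forward(alphas, colors)
print(alpha_grads(alphas, colors, T_final))
```

Comparing these values against central finite differences of `forward` is the kind of gradient check that each term of the backward pass needs.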
then the chain unwinds through the Gaussian, the conic, the covariance, and the projection:
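One way to write the first links of that chain, using \(Q_i=\Sigma_i'^{-1}\) for the conic, \(\Delta_i=p-\mu'_i\), and the \(J\), \(W\) of the projection stage (the symbol \(Q_i\) is introduced here for brevity). The remaining links, through \(\Sigma=RSS^\top R^\top\) to \(s\) and \(q\) and through the mean's own projection, are omitted from this sketch.

```latex
\begin{aligned}
\alpha_i &= \sigma(o_i)\,G_i, \qquad
G_i = \exp\!\left(-\tfrac12\,\Delta_i^\top Q_i\,\Delta_i\right) \\[2pt]
\frac{\partial \alpha_i}{\partial o_i} &= \sigma(o_i)\bigl(1-\sigma(o_i)\bigr)\,G_i \\[2pt]
\frac{\partial \alpha_i}{\partial \mu'_i} &= \alpha_i\,Q_i\,\Delta_i \\[2pt]
\frac{\partial \alpha_i}{\partial Q_i} &= -\tfrac12\,\alpha_i\,\Delta_i\Delta_i^\top \\[2pt]
\frac{\partial L}{\partial \Sigma'_i} &= -\,Q_i\,\frac{\partial L}{\partial Q_i}\,Q_i \\[2pt]
\frac{\partial L}{\partial \Sigma_i} &= (JW)^\top\,\frac{\partial L}{\partial \Sigma'_i}\,(JW)
\end{aligned}
```

Each line is a local derivative; the per-pixel values are summed over all pixels a Gaussian touches before the chain continues into its 3D parameters.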
Why custom & why it's the work. This is the hand-derived adjoint of S3–S6. Gradient-checking every term against finite differences is the real labor of the fork. Tᵢ is recovered from the stored final transmittance while walking backward.
Per-parameter adaptive step with bias-corrected first/second moments. Moments persist across iterations as extra attributes in the loop's state.
Note. Each parameter group (position, SH, opacity, scale, rotation) carries its own learning rate; only position's decays. Fully buildable in attribute-math nodes.
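The standard Adam update is \(m\leftarrow\beta_1 m+(1-\beta_1)g\), \(v\leftarrow\beta_2 v+(1-\beta_2)g^2\), \(\hat m=m/(1-\beta_1^t)\), \(\hat v=v/(1-\beta_2^t)\), \(\theta\leftarrow\theta-\eta\,\hat m/(\sqrt{\hat v}+\epsilon)\). A sketch for one parameter group, plus a log-linear decay for the position learning rate; the default rates and step count are illustrative placeholders, not values taken from the text above.

```python
import math

def adam_step(theta, grad, m, v, t, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """One bias-corrected Adam step over flat lists. t is the 1-based
    iteration count; returns (new_theta, new_m, new_v)."""
    new_theta, new_m, new_v = [], [], []
    for p, g, mi, vi in zip(theta, grad, m, v):
        mi = beta1 * mi + (1 - beta1) * g
        vi = beta2 * vi + (1 - beta2) * g * g
        m_hat = mi / (1 - beta1 ** t)
        v_hat = vi / (1 - beta2 ** t)
        new_theta.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(mi)
        new_v.append(vi)
    return new_theta, new_m, new_v

def position_lr(step, lr_init=1.6e-4, lr_final=1.6e-6, max_steps=30000):
    """Exponential (log-linear) decay from lr_init to lr_final."""
    frac = min(max(step / max_steps, 0.0), 1.0)
    return math.exp((1 - frac) * math.log(lr_init) + frac * math.log(lr_final))

theta, m, v = adam_step([1.0, 1.0], [0.5, -2.0], [0.0, 0.0], [0.0, 0.0],
                        t=1, lr=0.1)
print(theta, position_lr(15000))
```

On the first step the bias correction makes the update approximately `lr * sign(grad)` regardless of gradient magnitude, which is why the two parameters in the example move by the same amount in opposite directions. When densification adds or removes Gaussians, `m` and `v` have to be extended or filtered in the same way as `theta`.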
Decisions are driven by the accumulated view-space positional gradient — high gradient means the region wants more representational capacity.
Then by world-size, with ‖e^{s}‖ small ⇒ clone, large ⇒ split (two children, shrunk, repositioned by sampling the parent's own distribution):
Note. Clone = Duplicate Elements, prune = Delete Geometry, split = duplicate + scale + jittered resample — all native topology ops. The trigger needs ∂L/∂μ′ from S8, accumulated in the loop state.
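The per-Gaussian decision can be sketched as a small classifier. All thresholds below are hypothetical placeholders, the size measure is the largest activated axis (one possible reading of \(\|e^{s}\|\)), and the split ignores the parent's rotation for brevity, so the children are sampled from an axis-aligned version of the parent.

```python
import math
import random

def densify_action(avg_grad, log_scale, opacity, scene_extent,
                   grad_thresh=2e-4, size_frac=0.01, min_opacity=0.005):
    """Return 'prune', 'keep', 'clone', or 'split' for one Gaussian.
    avg_grad: accumulated view-space positional gradient magnitude,
    opacity: activated opacity sigma(o)."""
    if opacity < min_opacity:
        return "prune"
    if avg_grad < grad_thresh:
        return "keep"
    size = max(math.exp(s) for s in log_scale)
    return "split" if size > size_frac * scene_extent else "clone"

def split_children(mean, log_scale, shrink=1.6, rng=random):
    """Two children: positions sampled from the parent's (axis-aligned)
    distribution, scales divided by `shrink` in activated space."""
    sigma = [math.exp(s) for s in log_scale]
    child_scale = [s - math.log(shrink) for s in log_scale]
    return [([mu + rng.gauss(0.0, sd) for mu, sd in zip(mean, sigma)],
             list(child_scale)) for _ in range(2)]

big = [math.log(0.5)] * 3
small = [math.log(0.01)] * 3
print(densify_action(1e-3, big, 0.5, 10.0), densify_action(1e-3, small, 0.5, 10.0))
```

Dividing the scale in activated space is a subtraction in the stored log space, which keeps the operation inside plain attribute math.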
| Object | Shape | Lives where |
|---|---|---|
| Model parameters θ | M × 59 | μ:3, s:3, q:4, o:1, SH:48 — M ≈ 10⁴→10⁶ |
| Adam moments m, v | M × 59 (×2) | loop state |
| Gradients ∂L/∂θ | M × 59 | output of S8 |
| Screen means μ′ | M × 2 | S3 → S6 |
| Conic (Σ′⁻¹) | M × 3 | symmetric 2×2, packed |
| Depth / radius | M × 1 each | sort keys + culling |
| Per-Gaussian color c | M × 3 | S4 → S6 |
| Sorted tile keys | ≥ M (Σ tile-touches) | S5; variable fanout |
| Ground-truth view I_gt | H × W × 3 | one sampled per iteration |
| Render Î (+ aux) | H × W × 3 (+ T, n: H×W) | S6; aux retained for S8 |
| Loss gradient ∂L/∂Î | H × W × 3 | S7 → S8 |
| Loop | Span | Count / exit | Nodes-equivalent |
|---|---|---|---|
| Outer — training | S2 → S9 | K ≈ 30,000 | simulation / repeat zone |
| Inner — per pixel | inside S6 / S8 | N_p, early-out T<1e-4 | inside the custom kernel |
| Periodic — densify | branch at S10 | every ~100 it (warm-up→~15k) | duplicate / delete elements |
minimal trainer fork ≈ S5 + S6 + S8 (one differentiable rasterizer, fwd+bwd) · everything green stays stock geometry nodes · SfM (F1–F5) is upstream and run externally — COLMAP / GLOMAP / neural