SEAR: Bridging RGB and Thermal for Robust 3D Reconstruction

In this work, Vsevolod Skorokhodov, Chenghao Xu, Shuo Sun, Olga Fink and I developed SEAR, a simple yet efficient fine-tuning strategy to adapt visual geometric transformers for RGB+Thermal 3D reconstruction.
Obtaining consistent 3D reconstructions from mixed sensing modalities is challenging: while foundation models like VGGT excel with RGB inputs, they struggle to align RGB and thermal images when the two are processed jointly, often producing disjoint reconstructions. We propose SEAR, which bridges this modality gap with minimal parameter updates, enabling reliable multimodal pose estimation and reconstruction even under challenging conditions such as low lighting and dense smoke.
Method
SEAR adapts a pretrained VGGT model for joint estimation of RGB and thermal camera parameters and depth maps through three key innovations:
- LoRA Integration: Lightweight LoRA adapters are integrated into all linear and multi-head attention layers of the alternating-attention (AA) module, preserving pretrained RGB knowledge while enabling adaptation to mixed inputs (see the sketches after this list).
- Thermal Camera Token: We introduce learnable thermal camera tokens (counterparts to VGGT's RGB camera tokens) to capture modality-specific features and enable differentiated processing of RGB and thermal inputs.
- Batching Strategy: Independent RGB and thermal images are batched together without shared camera poses, forcing the model to learn inter-modal relationships across viewpoints rather than relying on trivial correspondences.
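As a rough illustration of the LoRA point, here is a minimal sketch of wrapping a frozen linear layer with a trainable low-rank residual, the way it would be applied to the linear and attention projections of the AA module. The names (`LoRALinear`, `inject_lora`) and the rank/scale values are illustrative assumptions, not SEAR's released code.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pretrained linear layer plus a trainable low-rank residual (LoRA)."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)      # keep pretrained RGB weights frozen
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)          # adapter starts as a zero update
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))


def inject_lora(module: nn.Module, rank: int = 8) -> None:
    """Recursively swap nn.Linear layers (e.g. QKV/output projections) for LoRA-wrapped ones."""
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            setattr(module, name, LoRALinear(child, rank=rank))
        else:
            inject_lora(child, rank=rank)
```

Applied across the AA blocks, the trainable adapter weights remain a small fraction of the backbone, consistent with the parameter budget noted below.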
The complete adaptation trains <5% of the original model's parameters and adds negligible memory and inference overhead compared to the base VGGT model.
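The thermal camera token and the mixed batching can be pictured as follows; this is a sketch under assumptions (the `ModalityCameraTokens` module, the `build_mixed_batch` helper, and the tensor shapes are mine, not the actual SEAR implementation). Each frame gets an RGB or thermal camera token prepended to its patch tokens, and RGB and thermal frames from unrelated viewpoints are concatenated into one sequence, so no camera pose is shared across modalities.

```python
import torch
import torch.nn as nn

class ModalityCameraTokens(nn.Module):
    """Prepend a per-modality learnable camera token to each frame's patch tokens."""

    def __init__(self, dim: int):
        super().__init__()
        self.rgb_camera_token = nn.Parameter(torch.zeros(1, 1, dim))      # VGGT-style RGB token
        self.thermal_camera_token = nn.Parameter(torch.zeros(1, 1, dim))  # new thermal counterpart

    def forward(self, patch_tokens: torch.Tensor, is_thermal: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (num_frames, num_patches, dim); is_thermal: (num_frames,) bool mask
        num_frames = patch_tokens.shape[0]
        cam = torch.where(
            is_thermal[:, None, None],
            self.thermal_camera_token.expand(num_frames, -1, -1),
            self.rgb_camera_token.expand(num_frames, -1, -1),
        )
        return torch.cat([cam, patch_tokens], dim=1)    # (num_frames, 1 + num_patches, dim)


def build_mixed_batch(rgb_frames: torch.Tensor, thermal_frames: torch.Tensor):
    """Concatenate independently sampled RGB and thermal frames into one sequence."""
    frames = torch.cat([rgb_frames, thermal_frames], dim=0)
    is_thermal = torch.cat([
        torch.zeros(rgb_frames.shape[0], dtype=torch.bool),
        torch.ones(thermal_frames.shape[0], dtype=torch.bool),
    ])
    return frames, is_thermal
```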
Results
Our experiments show that SEAR significantly outperforms state-of-the-art methods across all metrics. Despite being trained on a relatively small RGB-T dataset (~15,000 pairs), SEAR achieves over a 30% improvement in AUC@30 for camera pose estimation while delivering higher detail and consistency between modalities. It also achieves the best point cloud reconstruction metrics among all baselines (PCC of 0.06, PCA of 0.47, Chamfer distance of 0.27).
SEAR maintains a 100% registration rate and processes frames at ~10 FPS, nearly matching the speed of the original VGGT model and running roughly 200× faster than the closest competitor (MINIMAROMA at 0.05 FPS). The approach is robust across varying thermal-to-RGB ratios and works reliably even when the two modalities are captured at different times or under different lighting conditions.
Additionally, we introduce a new dataset featuring 9 scenes (~2,000 images) with distinct RGB/thermal trajectories captured under varying illumination and viewpoints, providing a robust benchmark for future work in multimodal 3D scene reconstruction.
Quantitative Performance
On our new SEAR dataset (9 scenes, 1,890 images with varying illumination and viewpoints), SEAR achieves the best pose estimation results among methods with full registration; COLMAP + SPSG reports higher pose accuracy but registers only 27.9% of frames:
| Method | AUC@30 ↑ | RRA@30 ↑ | RTA@30 ↑ | Reg. (%) ↑ | FPS ↑ |
|---|---|---|---|---|---|
| COLMAP + SPSG | 74.4 | 99.9 | 95.9 | 27.9 | 0.66 |
| MINIMAROMA | 48.2 | 65.9 | 68.9 | 100 | 0.04 |
| VGGT | 23.3 | 56.4 | 56.4 | 100 | 10.46 |
| SEAR (ours) | 62.8 | 83.7 | 84.2 | 100 | 9.94 |
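For context, these metrics are commonly defined as follows: RRA@30 and RTA@30 are the fractions of image pairs whose relative rotation and translation angular errors fall below 30°, and AUC@30 averages the accuracy curve over thresholds up to 30°. A minimal sketch under those standard definitions (not the paper's evaluation code; the exact protocol may differ):

```python
import numpy as np

def rra_at(rot_err_deg: np.ndarray, tau: float = 30.0) -> float:
    """Relative Rotation Accuracy: fraction of pairs with rotation error below tau degrees."""
    return float((rot_err_deg < tau).mean())

def rta_at(trans_err_deg: np.ndarray, tau: float = 30.0) -> float:
    """Relative Translation Accuracy: fraction of pairs with translation angle error below tau degrees."""
    return float((trans_err_deg < tau).mean())

def auc_at(rot_err_deg: np.ndarray, trans_err_deg: np.ndarray, tau: int = 30) -> float:
    """Area under the accuracy curve: mean over thresholds t = 1..tau of min(RRA@t, RTA@t)."""
    accs = [min(rra_at(rot_err_deg, t), rta_at(trans_err_deg, t)) for t in range(1, tau + 1)]
    return float(np.mean(accs))
```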
On public datasets (ThermoScenes, ThermalNeRF, ThermalGaussian, ThermalMix, and Radar Forest (RF)), SEAR leads on every pose and reconstruction metric:
| Method | AUC@30 ↑ | RRA@30 ↑ | RTA@30 ↑ | Chamfer Distance ↓ | FPS ↑ |
|---|---|---|---|---|---|
| COLMAP + SPSG | 57.6 | 82.5 | 74.6 | 1.42 | 0.44 |
| MINIMAROMA | 41.0 | 68.3 | 63.0 | 1.03 | 0.05 |
| VGGT | 22.9 | 50.7 | 48.5 | 1.91 | 10.46 |
| SEAR (ours) | 70.0 | 90.6 | 87.6 | 0.27 | 9.94 |
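Chamfer distance here is the usual two-sided nearest-neighbour distance between predicted and ground-truth point clouds. A brute-force sketch of that standard definition (illustrative only; the paper's evaluation may normalise or scale the clouds differently):

```python
import numpy as np

def chamfer_distance(pred: np.ndarray, gt: np.ndarray) -> float:
    """Symmetric Chamfer distance between point clouds of shape (N, 3) and (M, 3).

    Brute-force O(N*M) pairwise distances; fine for small clouds, use a KD-tree for large ones.
    """
    d = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1)  # (N, M) pairwise distances
    return 0.5 * (float(d.min(axis=1).mean()) + float(d.min(axis=0).mean()))
```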
- Robust alignment: SEAR reconstructs coherent, multimodal point clouds even when RGB and thermal images are captured at different times or under varying lighting (e.g., day vs. night).
- Challenging conditions: Works reliably in smoke-occluded scenes (SmokeSeer dataset) where RGB-only methods fail.
- Efficiency: ~200× faster than MINIMAROMA (9.94 FPS vs. 0.05 FPS).