SEAR: Bridging RGB and Thermal for Robust 3D Reconstruction

In this work, Vsevolod Skorokhodov, Chenghao Xu, Shuo Sun, Olga Fink and I developed SEAR, a simple yet efficient fine-tuning strategy to adapt visual geometric transformers for RGB+Thermal 3D reconstruction.
Obtaining consistent 3D reconstructions from mixed sensing modalities is challenging: while foundation models like VGGT excel with RGB inputs, they struggle to align RGB and thermal images when the two are processed jointly, often producing disjoint reconstructions. We propose SEAR, which bridges this modality gap with minimal parameter updates, enabling reliable multimodal pose estimation and reconstruction even under challenging conditions such as low lighting and dense smoke.
Method
SEAR adapts a pretrained VGGT model for joint estimation of RGB and thermal camera parameters and depth maps through three key innovations:
- LoRA Integration: Lightweight LoRA adapters are integrated into all linear and multi-head attention layers of the alternating-attention (AA) module, preserving pretrained RGB knowledge while enabling adaptation to mixed inputs (see the sketches after this list).
- Thermal Camera Token: We introduce learnable thermal camera tokens (counterparts to VGGT's RGB camera tokens) to capture modality-specific features and enable differentiated processing of RGB and thermal inputs.
- Batching Strategy: Independent RGB and thermal images are batched together without shared camera poses, forcing the model to learn inter-modal relationships across viewpoints rather than relying on trivial correspondences.
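As a rough illustration of the LoRA point, here is a minimal sketch of wrapping a frozen linear layer with a trainable low-rank residual, the way it would be applied to the linear and attention projections of the AA module. The names (`LoRALinear`, `inject_lora`) and the rank/scale values are illustrative assumptions, not SEAR's released code.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pretrained linear layer plus a trainable low-rank residual (LoRA)."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)      # keep pretrained RGB weights frozen
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)          # adapter starts as a zero update
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))


def inject_lora(module: nn.Module, rank: int = 8) -> None:
    """Recursively swap nn.Linear layers (e.g. QKV/output projections) for LoRA-wrapped ones."""
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            setattr(module, name, LoRALinear(child, rank=rank))
        else:
            inject_lora(child, rank=rank)
```

Applied across the AA blocks, the trainable adapter weights remain a small fraction of the backbone, consistent with the parameter budget noted below.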
The complete adaptation trains <5% of the original model's parameters and adds negligible memory and inference overhead compared to the base VGGT model.
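The thermal camera token and the mixed batching can be pictured as follows; this is a sketch under assumptions (the `ModalityCameraTokens` module, the `build_mixed_batch` helper, and the tensor shapes are mine, not the actual SEAR implementation). Each frame gets an RGB or thermal camera token prepended to its patch tokens, and RGB and thermal frames from unrelated viewpoints are concatenated into one sequence, so no camera pose is shared across modalities.

```python
import torch
import torch.nn as nn

class ModalityCameraTokens(nn.Module):
    """Prepend a per-modality learnable camera token to each frame's patch tokens."""

    def __init__(self, dim: int):
        super().__init__()
        self.rgb_camera_token = nn.Parameter(torch.zeros(1, 1, dim))      # VGGT-style RGB token
        self.thermal_camera_token = nn.Parameter(torch.zeros(1, 1, dim))  # new thermal counterpart

    def forward(self, patch_tokens: torch.Tensor, is_thermal: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (num_frames, num_patches, dim); is_thermal: (num_frames,) bool mask
        num_frames = patch_tokens.shape[0]
        cam = torch.where(
            is_thermal[:, None, None],
            self.thermal_camera_token.expand(num_frames, -1, -1),
            self.rgb_camera_token.expand(num_frames, -1, -1),
        )
        return torch.cat([cam, patch_tokens], dim=1)    # (num_frames, 1 + num_patches, dim)


def build_mixed_batch(rgb_frames: torch.Tensor, thermal_frames: torch.Tensor):
    """Concatenate independently sampled RGB and thermal frames into one sequence."""
    frames = torch.cat([rgb_frames, thermal_frames], dim=0)
    is_thermal = torch.cat([
        torch.zeros(rgb_frames.shape[0], dtype=torch.bool),
        torch.ones(thermal_frames.shape[0], dtype=torch.bool),
    ])
    return frames, is_thermal
```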
Results
Our experiments show that SEAR significantly outperforms state-of-the-art methods across all metrics. Despite being trained on a relatively small RGB-T dataset (~15,000 pairs), SEAR achieves over a 30% improvement in AUC@30 for camera pose estimation while delivering higher detail and consistency between modalities. It also achieves the best point cloud reconstruction metrics among all baselines (PCC of 0.06, PCA of 0.47, Chamfer distance of 0.27).
SEAR maintains a 100% registration rate and processes frames at ~10 FPS, nearly matching the speed of the original VGGT model and running roughly 200× faster than the closest competitor (MINIMAROMA at 0.05 FPS). The approach is robust across varying thermal-to-RGB ratios and works reliably even when the two modalities are captured at different times or under different lighting conditions.
Additionally, we introduce a new dataset featuring 9 scenes (~2,000 images) with distinct RGB/thermal trajectories captured under varying illumination and viewpoints, providing a robust benchmark for future work in multimodal 3D scene reconstruction.
Quantitative Performance
On our new SEAR dataset (9 scenes, 1,890 images with varying illumination and viewpoints), SEAR achieves the best pose estimation results among methods with full registration; COLMAP + SPSG reports higher pose accuracy but registers only 27.9% of frames:
| Method | AUC@30 ↑ | RRA@30 ↑ | RTA@30 ↑ | Reg. (%) ↑ | FPS ↑ |
|---|---|---|---|---|---|
| COLMAP + SPSG | 74.4 | 99.9 | 95.9 | 27.9 | 0.66 |
| MINIMAROMA | 48.2 | 65.9 | 68.9 | 100 | 0.04 |
| VGGT | 23.3 | 56.4 | 56.4 | 100 | 10.46 |
| SEAR (ours) | 62.8 | 83.7 | 84.2 | 100 | 9.94 |
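For context, these metrics are commonly defined as follows: RRA@30 and RTA@30 are the fractions of image pairs whose relative rotation and translation angular errors fall below 30°, and AUC@30 averages the accuracy curve over thresholds up to 30°. A minimal sketch under those standard definitions (not the paper's evaluation code; the exact protocol may differ):

```python
import numpy as np

def rra_at(rot_err_deg: np.ndarray, tau: float = 30.0) -> float:
    """Relative Rotation Accuracy: fraction of pairs with rotation error below tau degrees."""
    return float((rot_err_deg < tau).mean())

def rta_at(trans_err_deg: np.ndarray, tau: float = 30.0) -> float:
    """Relative Translation Accuracy: fraction of pairs with translation angle error below tau degrees."""
    return float((trans_err_deg < tau).mean())

def auc_at(rot_err_deg: np.ndarray, trans_err_deg: np.ndarray, tau: int = 30) -> float:
    """Area under the accuracy curve: mean over thresholds t = 1..tau of min(RRA@t, RTA@t)."""
    accs = [min(rra_at(rot_err_deg, t), rta_at(trans_err_deg, t)) for t in range(1, tau + 1)]
    return float(np.mean(accs))
```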
On public datasets (ThermoScenes, ThermalNeRF, ThermalGaussian, ThermalMix, and Radar Forest (RF)), SEAR leads on every pose and reconstruction metric:
| Method | AUC@30 ↑ | RRA@30 ↑ | RTA@30 ↑ | Chamfer Distance ↓ | FPS ↑ |
|---|---|---|---|---|---|
| COLMAP + SPSG | 57.6 | 82.5 | 74.6 | 1.42 | 0.44 |
| MINIMAROMA | 41.0 | 68.3 | 63.0 | 1.03 | 0.05 |
| VGGT | 22.9 | 50.7 | 48.5 | 1.91 | 10.46 |
| SEAR (ours) | 70.0 | 90.6 | 87.6 | 0.27 | 9.94 |
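Chamfer distance here is the usual two-sided nearest-neighbour distance between predicted and ground-truth point clouds. A brute-force sketch of that standard definition (illustrative only; the paper's evaluation may normalise or scale the clouds differently):

```python
import numpy as np

def chamfer_distance(pred: np.ndarray, gt: np.ndarray) -> float:
    """Symmetric Chamfer distance between point clouds of shape (N, 3) and (M, 3).

    Brute-force O(N*M) pairwise distances; fine for small clouds, use a KD-tree for large ones.
    """
    d = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1)  # (N, M) pairwise distances
    return 0.5 * (float(d.min(axis=1).mean()) + float(d.min(axis=0).mean()))
```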
- Robust alignment: SEAR reconstructs coherent, multimodal point clouds even when RGB and thermal images are captured at different times or under varying lighting (e.g., day vs. night).
- Challenging conditions: Works reliably in smoke-occluded scenes (SmokeSeer dataset) where RGB-only methods fail.
- Efficiency: ~200× faster than MINIMAROMA (9.94 FPS vs. 0.05 FPS).