A lightweight and robust driver gaze estimation system powered by self-supervised learning and geometry guidance.
GazeLoom is a driver gaze estimation framework designed for intelligent traffic safety and human-vehicle interaction.
It leverages multi-modal geometric guidance and self-supervised feature extraction to accurately predict driver gaze points in 3D space.
- Lightweight Model: only 4.97M parameters
- High-Precision Estimation: joint prediction of head pose and eye movement
- Real-Time Inference: suitable for in-vehicle edge deployment
- Strong Generalization: robust to lighting changes, occlusions, and pose variations
- ONNX Deployment: run onnx.py to export and deploy GazeLoom in ONNX format. Google Drive: Download
Before training or evaluation, please download the required datasets and run the corresponding preprocessing scripts.
The preprocessing scripts convert raw annotations into a unified JSON format for each split, including head bounding boxes, gaze points, in/out labels, and metadata required by GazeLoom.
| Dataset | Download | Preprocessing Script |
|---|---|---|
| GazeFollow | Download | data_prep/preprocess_gazefollow.py |
| VideoAttentionTarget | Download | data_prep/preprocess_vat.py |
| ChildPlay | Download | data_prep/preprocess_childplay.py |
| GOO-Real | Download | data_prep/preprocess_goo_real.py |
python data_prep/preprocess_gazefollow.py \
--data_path /path/to/gazefollow/data_new

python data_prep/preprocess_vat.py \
--data_path /path/to/videoattentiontarget

python data_prep/preprocess_childplay.py \
--data_path /path/to/childplay

python data_prep/preprocess_goo_real.py \
--data_path /path/to/goo_real

After preprocessing, each dataset directory will contain JSON annotation files that can be used directly for GazeLoom training and evaluation.
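The exact JSON schema is defined by the preprocessing scripts and is not reproduced here. As a rough illustration only, the sketch below checks one record that uses hypothetical field names (`path`, `head_bbox`, `gaze`, `inout`) and assumes normalized coordinates; the real files may use different keys and conventions.

```python
import json

# Hypothetical record layout; the real keys are defined by the data_prep scripts.
REQUIRED_KEYS = {"path", "head_bbox", "gaze", "inout"}


def validate_record(record):
    """Return a list of problems found in one annotation record (empty if none)."""
    missing = REQUIRED_KEYS - set(record)
    if missing:
        return [f"missing keys: {sorted(missing)}"]

    problems = []
    x_min, y_min, x_max, y_max = record["head_bbox"]
    if not (0.0 <= x_min < x_max <= 1.0 and 0.0 <= y_min < y_max <= 1.0):
        problems.append("head_bbox must be normalized with min < max")
    if record["inout"] not in (0, 1):
        problems.append("inout must be 0 (out of frame) or 1 (in frame)")
    # A gaze point is only checked when the target is inside the frame.
    if record["inout"] == 1:
        gx, gy = record["gaze"]
        if not (0.0 <= gx <= 1.0 and 0.0 <= gy <= 1.0):
            problems.append("gaze point must be normalized to [0, 1]")
    return problems


example = {
    "path": "images/example.jpg",  # placeholder path
    "head_bbox": [0.42, 0.10, 0.58, 0.31],
    "gaze": [0.71, 0.64],
    "inout": 1,
}

# Round-trip through JSON to mimic reading an annotation file from disk.
loaded = json.loads(json.dumps(example))
print(validate_record(loaded))
```

A check of this kind can be run over every record right after preprocessing, before starting a long training job.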
GazeLoom uses Depth Anything V2 to generate monocular depth maps as geometric guidance.
Download official Depth Anything V2 checkpoints:
| Model | Checkpoint |
|---|---|
| Depth-Anything-V2-Small | Download |
| Depth-Anything-V2-Base | Download |
| Depth-Anything-V2-Large | Download |
Place the downloaded checkpoint under:
checkpoints/
└── depth_anything_v2_vitl.pth
python depthany/depth.py \
--img_path /path/to/images \
--outdir /path/to/depth \
--encoder vitl \
--checkpoint checkpoints/depth_anything_v2_vitl.pth \
--input_size 518 \
--grayscale

For GazeLoom datasets, the generated depth maps should preserve the same relative image paths:
dataset_root/
├── gazefollow/
│   └── xxx.jpg
├── videoattentiontarget/
│   └── xxx.jpg
└── depth/
    └── xxx.png
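To make the path convention concrete, here is a minimal sketch of how an image path could be mapped to its depth map. The helper name `depth_path_for` is hypothetical. It assumes depth maps are stored as PNG files and that "relative path" means relative to the dataset's own image folder; the actual data loader may resolve paths differently.

```python
from pathlib import Path


def depth_path_for(image_path, image_root, depth_root):
    """Map an image path to the depth map stored at the same relative location.

    Assumptions (not confirmed by the repository): depth maps are PNG files,
    and the relative path is taken with respect to image_root.
    """
    relative = Path(image_path).relative_to(Path(image_root))
    return Path(depth_root) / relative.with_suffix(".png")


print(depth_path_for(
    "dataset_root/gazefollow/train/xxx.jpg",
    image_root="dataset_root/gazefollow",
    depth_root="dataset_root/depth",
))
```

`relative_to` raises `ValueError` when the image is not under `image_root`, which surfaces a misconfigured path early.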
We provide training scripts in scripts/ for training GazeLoom with the SimDINOv2 backbone.
Before running the training script, please:
- Download the dataset and run the preprocessing script following the Data Processing section.
- Prepare the extracted depth maps if geometry-guided training is enabled.
- Optionally install wandb for metric logging:

pip install wandb

By default, checkpoints are saved to ./experiments.
You can use --ckpt_save_dir to customize the checkpoint directory.
Train GazeLoom with the SimDINOv2 ViT-B/14 backbone:
python scripts/train_gazefollow.py \
--data_path /path/to/gazefollow/data_new \
--model gazeloom_cgf_simdinov2_vitb14_inout \
--exp_name train_gazeloom_simdinov2_vitb14_gazefollow \
--batch_size 48 \
--max_epochs 30 \
--lr 5e-4

Train GazeLoom with the SimDINOv2 ViT-L/14 backbone:
python scripts/train_gazefollow.py \
--data_path /path/to/gazefollow/data_new \
--model gazeloom_cgf_simdinov2_vitl14_inout \
--exp_name train_gazeloom_simdinov2_vitl14_gazefollow \
--batch_size 32 \
--max_epochs 30 \
--lr 5e-4

Resume training from a saved checkpoint:
python scripts/train_gazefollow.py \
--data_path /path/to/gazefollow/data_new \
--model gazeloom_cgf_simdinov2_vitb14_inout \
--resume /path/to/checkpoint.pt \
--exp_name resume_gazeloom_simdinov2_vitb14

By default, the backbone is frozen.
To fine-tune the last several SimDINOv2 transformer blocks, use --unfreeze_layers:
python scripts/train_gazefollow.py \
--data_path /path/to/gazefollow/data_new \
--model gazeloom_cgf_simdinov2_vitb14_inout \
--unfreeze_layers 2 \
--exp_name finetune_gazeloom_simdinov2_vitb14

| Argument | Description |
|---|---|
| --model | Model name, e.g. gazeloom_cgf_simdinov2_vitb14_inout |
| --data_path | Path to the preprocessed dataset |
| --ckpt_save_dir | Directory for saving checkpoints |
| --exp_name | Experiment name |
| --batch_size | Training batch size |
| --max_epochs | Number of training epochs |
| --lr | Learning rate for trainable heads |
| --resume | Path to checkpoint for resuming training |
| --unfreeze_layers | Number of final backbone layers to fine-tune |
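For readers who want to wrap or script the training entry point, the flags above can be mirrored with `argparse`. This is only a sketch, not the parser in scripts/train_gazefollow.py: the flag names come from the table, while the types and defaults are assumptions (./experiments and a frozen backbone are stated in this README; the numeric defaults are borrowed from the example commands).

```python
import argparse


def build_parser():
    """Sketch of a parser exposing the documented training flags."""
    parser = argparse.ArgumentParser(description="GazeLoom training (sketch)")
    parser.add_argument("--model", required=True, help="Model name")
    parser.add_argument("--data_path", required=True, help="Preprocessed dataset path")
    parser.add_argument("--ckpt_save_dir", default="./experiments")
    parser.add_argument("--exp_name", default=None, help="Experiment name")
    # Assumed defaults, taken from the example commands rather than the script.
    parser.add_argument("--batch_size", type=int, default=48)
    parser.add_argument("--max_epochs", type=int, default=30)
    parser.add_argument("--lr", type=float, default=5e-4)
    parser.add_argument("--resume", default=None, help="Checkpoint to resume from")
    # 0 keeps the whole backbone frozen.
    parser.add_argument("--unfreeze_layers", type=int, default=0)
    return parser


args = build_parser().parse_args([
    "--data_path", "/path/to/gazefollow/data_new",
    "--model", "gazeloom_cgf_simdinov2_vitb14_inout",
    "--unfreeze_layers", "2",
])
print(vars(args))
```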
We provide a visualization script in scripts/visualize.py for qualitative analysis of GazeLoom predictions.
The script automatically detects faces using RetinaFace, predicts gaze heatmaps for each detected person, and saves the visualization results, including:
- detected head bounding boxes
- predicted gaze target points
- high-response heatmap regions
- gaze direction lines
- heatmap overlay images
- JSON prediction results
python scripts/visualize.py \
--image_dir test \
--depth_dir test_depth \
--ckpt_path checkpoints/epoch_14.pt \
--model_name gazeloom_cgf_simdinov2_vitb14_inout \
--output_dir output

| Argument | Description |
|---|---|
| --image_dir | Directory containing input images |
| --depth_dir | Directory containing extracted depth maps |
| --ckpt_path | Path to the trained model checkpoint |
| --model_name | Model architecture used for inference |
| --output_dir | Directory for saving visualization results |
| --inout_threshold | Threshold for filtering out-of-frame gaze predictions |
| --heatmap_threshold | Threshold for highlighting high-response gaze regions |
After running the script, results will be saved under the output directory:
output/
├── image_result.jpg
├── image_heatmap_0.jpg
├── image_overlay_0.jpg
└── gaze_predictions.json
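The two thresholds can be understood through a small decoding sketch. It is one plausible reading, not the logic of scripts/visualize.py: the function name, the default values of 0.5, the use of absolute heatmap values, and the cell-center normalization are all assumptions.

```python
def decode_prediction(heatmap, inout_score, inout_threshold=0.5, heatmap_threshold=0.5):
    """Turn one person's raw outputs into a gaze point and high-response cells.

    heatmap: 2D list of floats (rows x cols).
    inout_score: predicted probability that the gaze target is in the frame.
    """
    if inout_score < inout_threshold:
        # Treated as looking outside the frame: no gaze point is reported.
        return {"in_frame": False, "gaze_point": None, "high_response": []}

    rows, cols = len(heatmap), len(heatmap[0])
    best_row, best_col = 0, 0
    high_response = []
    for r in range(rows):
        for c in range(cols):
            if heatmap[r][c] > heatmap[best_row][best_col]:
                best_row, best_col = r, c
            if heatmap[r][c] >= heatmap_threshold:
                high_response.append((r, c))

    # Normalized (x, y) of the peak cell center, so it scales to any image size.
    gaze_point = ((best_col + 0.5) / cols, (best_row + 0.5) / rows)
    return {"in_frame": True, "gaze_point": gaze_point, "high_response": high_response}


toy_heatmap = [[0.1, 0.2], [0.9, 0.6]]
print(decode_prediction(toy_heatmap, inout_score=0.8))
print(decode_prediction(toy_heatmap, inout_score=0.2))
```

With a 64 x 64 heatmap, multiplying the normalized point by the image width and height gives the pixel location to draw.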
We provide evaluation scripts in scripts/ to validate GazeLoom on standard gaze-target benchmarks.
Before evaluation, please make sure that:
- The dataset has been downloaded.
- The preprocessing script has been executed.
- Depth maps have been generated if the model uses geometry guidance.
- The pretrained checkpoint has been downloaded.
Evaluate GazeLoom on the GazeFollow test split:
python scripts/eval_gazefollow.py \
--data_path /path/to/gazefollow/data_new \
--model_name gazeloom_cgf_simdinov2_vitl14 \
--ckpt_path /path/to/checkpoint.pt \
--batch_size 128

The script reports:

| Metric | Description |
|---|---|
| AUC ↑ | Area under the gaze heatmap ROC curve |
| Avg L2 ↓ | Average L2 distance between prediction and ground-truth gaze points |
| Min L2 ↓ | Minimum L2 distance to the closest ground-truth annotation |
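The two distance metrics can be illustrated for a single test image with several annotators. The sketch assumes normalized image coordinates and computes Avg L2 as the distance to the mean annotation, a common convention on GazeFollow; the evaluation script may instead average the per-annotator distances. The function name is illustrative.

```python
import math


def l2_metrics(pred, annotations):
    """Return (avg_l2, min_l2) for one prediction and a list of (x, y) annotations."""
    mean_x = sum(x for x, _ in annotations) / len(annotations)
    mean_y = sum(y for _, y in annotations) / len(annotations)
    # Assumed convention: average distance is measured to the mean annotation.
    avg_l2 = math.dist(pred, (mean_x, mean_y))
    # Minimum distance is measured to the closest individual annotation.
    min_l2 = min(math.dist(pred, gt) for gt in annotations)
    return avg_l2, min_l2


avg_l2, min_l2 = l2_metrics((0.5, 0.5), [(0.5, 0.6), (0.5, 0.8)])
print(f"Avg L2: {avg_l2:.3f}, Min L2: {min_l2:.3f}")
```

Dataset-level numbers are then the mean of these per-image values over the test split.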
Evaluate GazeLoom on VideoAttentionTarget:
python scripts/eval_vat.py \
--data_path /path/to/videoattentiontarget \
--model_name gazeloom_cgf_simdinov2_vitl14_inout \
--ckpt_path /path/to/checkpoint.pt \
--batch_size 64

For VideoAttentionTarget, the _inout model is recommended because the dataset includes both in-frame and out-of-frame gaze targets.
Evaluate GazeLoom on ChildPlay:
python scripts/eval_childplay.py \
--data_path /path/to/childplay \
--model_name gazeloom_cgf_simdinov2_vitl14_inout \
--ckpt_path /path/to/checkpoint.pt \
--batch_size 64

Evaluate GazeLoom on GOO-Real:
python scripts/eval_goo_real.py \
--data_path /path/to/goo_real \
--model_name gazeloom_cgf_simdinov2_vitl14 \
--ckpt_path /path/to/checkpoint.pt \
--batch_size 64

Running on cuda
Evaluating: 100%|████████████████████| 100/100
AUC: 0.967
Avg L2: 0.112
Min L2: 0.079
The generated visualizations can be used to inspect gaze direction, predicted attention targets, and model behavior under different driving scenarios.
| Feature | Description |
|---|---|
| Geometry-Guided Learning | Combines semantic and geometric priors for robust gaze estimation |
| Self-Supervised Backbone | Reduces dependency on large-scale labeled data |
| Driver-Centric Design | Optimized for railway and in-cabin driving environments |
| Lightweight Deployment | 4.97M parameters with real-time edge inference capability |
GazeLoom is a lightweight and geometry-guided framework for 3D driver gaze estimation in railway driving scenarios.
It consists of three key stages:
- Feature Extraction
- Geometry Guidance
- Fusion & Prediction
A pre-trained self-supervised backbone, SimDINOv2, is used to encode driving scene images and generate robust global visual representations.
To enhance semantic understanding and spatial perception, GazeLoom incorporates multiple auxiliary cues:
- Depth Map: provides 3D structural priors
- DISM Saliency Map: highlights attention-relevant visual regions
- Head Pose Features: offer geometric priors of gaze orientation
These features are fed into the Multi-modal Geometry Guidance module for semantic-geometric fusion.
The Multi-modal Geometry Guidance module, abbreviated as MGG, enhances 3D spatial reasoning and structural perception.
The head branch uses head feature maps with pseudo-heatmap supervision to explicitly model local geometric constraints of gaze direction.
The depth branch fuses depth maps and DISM saliency maps to inject global 3D structural priors.
Together, these branches generate a structure-consistent visual-spatial representation for downstream gaze prediction.
The Cross-modal Gating Fusion module, abbreviated as CGF, adaptively integrates semantic and spatial features through a gating attention mechanism.
After fusion, the model performs two prediction tasks:
- In-Out Gaze Classification
- Gaze Heatmap Generation
The entire model is trained using a multi-task joint optimization framework, improving robustness, generalization, and real-time performance.
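A joint objective over the two prediction tasks can be sketched as a weighted sum of a heatmap loss and an in/out classification loss. The loss types (mean squared error and binary cross-entropy), the weight `inout_weight`, and the choice to skip the heatmap term for out-of-frame targets are assumptions made for illustration; the README does not specify them.

```python
import math


def heatmap_mse(pred, target):
    """Mean squared error over all cells of two equally sized 2D lists."""
    total, count = 0.0, 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            total += (p - t) ** 2
            count += 1
    return total / count


def inout_bce(prob, label, eps=1e-7):
    """Binary cross-entropy for one in/out probability and a 0/1 label."""
    prob = min(max(prob, eps), 1.0 - eps)  # avoid log(0)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))


def joint_loss(pred_heatmap, target_heatmap, inout_prob, inout_label, inout_weight=1.0):
    """Weighted sum of the heatmap loss and the in/out classification loss."""
    # Assumption: the heatmap term only applies when the target is in frame.
    heatmap_term = heatmap_mse(pred_heatmap, target_heatmap) if inout_label == 1 else 0.0
    return heatmap_term + inout_weight * inout_bce(inout_prob, inout_label)


print(joint_loss([[0.0, 1.0]], [[0.0, 1.0]], inout_prob=0.5, inout_label=1))
```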
MGG integrates geometric priors from multiple modalities to enhance robustness under complex driving conditions.
- Facial landmarks
- Head pose
- Eye-region depth features
- Builds multi-modal geometric representations
- Captures spatial relationships between facial structure and orientation
- Applies geometry consistency constraints
- Uses a lightweight transformer to model spatial dependencies
MGG helps GazeLoom maintain high precision under lighting changes, head rotations, and partial occlusions.
CGF introduces a gating mechanism to dynamically balance semantic and geometric features.
- Learns adaptive weights between geometry and semantic branches
- Prevents over-reliance on a single modality
- Enables geometry-constrained cross-modal fusion
- Improves semantic coherence
- Enhances spatial continuity
- Strengthens generalization and stability
CGF improves inter-modal cooperation, making GazeLoom accurate and reliable in real-world in-cabin scenarios.
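The gating idea can be shown with a minimal per-channel sketch: a sigmoid gate computed from both inputs decides how much of the semantic feature versus the geometric feature passes through. This is a generic gated fusion written in plain Python, not the CGF module itself; the weights stand in for learned parameters, and grouping or attention details are omitted.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(semantic, geometric, w_sem, w_geo, bias):
    """Fuse two feature vectors channel by channel with a gate.

    All arguments are equal-length lists of floats. w_sem, w_geo, and bias
    are placeholders for parameters that would be learned during training.
    """
    fused = []
    for s, g, ws, wg, b in zip(semantic, geometric, w_sem, w_geo, bias):
        gate = sigmoid(ws * s + wg * g + b)  # value in (0, 1)
        # gate near 1 favors the semantic branch, near 0 the geometric branch.
        fused.append(gate * s + (1.0 - gate) * g)
    return fused


semantic = [1.0, 2.0]
geometric = [3.0, 4.0]
zeros = [0.0, 0.0]
# With zero weights and bias the gate is 0.5, so the output is the plain average.
print(gated_fusion(semantic, geometric, zeros, zeros, zeros))
```

Because the gate is computed per channel from both inputs, neither branch can dominate everywhere unless training pushes every gate to the same extreme.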
The GazeLoom architecture estimates 3D driver gaze points through the following pipeline:
- Camera Input
- Face Landmark Extraction
- Head Pose Estimation
- Eye Gaze Vector Modeling
- Multi-modal Geometry Guidance
- Cross-modal Gating Fusion
- 3D Gaze Point Prediction
| Dataset | AUC ↑ | L2 ↓ | AP ↑ |
|---|---|---|---|
| GazeFollow | 0.967 | 0.079 | - |
| VideoAttentionTarget | 0.953 | 0.098 | 0.942 |
GazeLoom achieves strong performance across multiple benchmarks while maintaining a lightweight architecture.
Clone the repository:
git clone https://github.com/yourname/GAZELOOM.git
cd GAZELOOM
To facilitate reproducibility, we provide the training seeds, hyperparameters, optimizer settings, and learning-rate schedules used in our experiments.
Due to hardware differences, CUDA/cuDNN behavior, and dataloader randomness, exact numerical results may vary slightly across environments.
| Item | Configuration |
|---|---|
| Framework | PyTorch 2.2.0 |
| CUDA | CUDA 12.1 |
| GPU | NVIDIA GPU |
| Backbone | SimDINOv2 |
| Input Size | 448 × 448 |
| Heatmap Size | 64 × 64 |
All experiments are initialized with fixed random seeds:
import random

import numpy as np
import torch

random.seed(0)
np.random.seed(0)
torch.manual_seed(0)
torch.cuda.manual_seed_all(0)

| Setting | Value |
|---|---|
| Optimizer | Adam |
| Batch Size | 48 |
| Max Epochs | 30 |
| Base Learning Rate | 5e-4 |
| Backbone Learning Rate | 1e-5 |
| Scheduler | CosineAnnealingLR |
| Minimum LR | 1e-7 |
| Weight Decay | Not used by default |
| Backbone | Frozen by default |
| Drop Path | 0.1 |
| CGF Groups | 8 |
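To see what the scheduler settings imply, the closed-form cosine annealing curve used by PyTorch's `CosineAnnealingLR` can be evaluated directly. The sketch assumes the period `t_max` equals the 30 training epochs and that the schedule is stepped once per epoch; neither detail is stated in the table.

```python
import math


def cosine_annealing_lr(epoch, base_lr=5e-4, min_lr=1e-7, t_max=30):
    """Learning rate at a given epoch under closed-form cosine annealing."""
    cosine = math.cos(math.pi * epoch / t_max)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + cosine)


# Print the schedule at the start, middle, and end of training.
for epoch in (0, 15, 30):
    print(epoch, cosine_annealing_lr(epoch))
```

The rate starts at the base learning rate, passes the midpoint between base and minimum halfway through, and reaches the minimum at the final epoch.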
| Dataset | Initialization | Epochs | Model |
|---|---|---|---|
| GazeFollow | From SimDINOv2 backbone | 30 | gazeloom_cgf_simdinov2_vitb14_inout |
| VideoAttentionTarget | Fine-tuned from GazeFollow checkpoint | 8 | gazeloom_cgf_simdinov2_vitb14_inout |
| ChildPlay | Fine-tuned from GazeFollow checkpoint | 3 | gazeloom_cgf_simdinov2_vitb14_inout |
| GOO-Real | Fine-tuned from GazeFollow checkpoint | 3 | gazeloom_cgf_simdinov2_vitb14 |
By default, the SimDINOv2 backbone is frozen.
If needed, the last N Transformer blocks can be unfrozen using:
--unfreeze_layers N

For example:
python scripts/train_gazefollow.py \
--data_path /path/to/gazefollow/data_new \
--model gazeloom_cgf_simdinov2_vitb14_inout \
--batch_size 48 \
--max_epochs 30 \
--lr 5e-4 \
--unfreeze_layers 0

- The reported results are obtained using the configuration above.
- Checkpoints are saved after each epoch under ./experiments.
- Training can be resumed with --resume /path/to/checkpoint.pt.
- Small performance variations may occur due to GPU type, CUDA kernels, and multi-worker dataloading.

















