⚡ GazeLoom ⚡

Train Driver Gaze Estimation Framework

A lightweight and robust driver gaze estimation system powered by self-supervised learning and geometry guidance.

🚀 About

GazeLoom is a driver gaze estimation framework designed for intelligent traffic safety and human-vehicle interaction.

It leverages multi-modal geometric guidance and self-supervised feature extraction to accurately predict driver gaze points in 3D space.

Highlights

🔹 Lightweight Model — only 4.97M parameters
🔹 High-Precision Estimation — joint prediction of head pose and eye movement
🔹 Real-Time Inference — suitable for in-vehicle edge deployment
🔹 Strong Generalization — robust to lighting changes, occlusions, and pose variations
🚀 ONNX Deployment — Run onnx.py to export and deploy GazeLoom in ONNX format. Google Drive: Download

🗂️ Data Processing

Before training or evaluation, please download the required datasets and run the corresponding preprocessing scripts.

The preprocessing scripts convert raw annotations into a unified JSON format for each split, including head bounding boxes, gaze points, in/out labels, and metadata required by GazeLoom.

Dataset	Download	Preprocessing Script
👀 GazeFollow	Download	`data_prep/preprocess_gazefollow.py`
🎥 VideoAttentionTarget	Download	`data_prep/preprocess_vat.py`
🧒 ChildPlay	Download	`data_prep/preprocess_childplay.py`
🛒 GOO-Real	Download	`data_prep/preprocess_goo_real.py`

👀 GazeFollow

python data_prep/preprocess_gazefollow.py \
  --data_path /path/to/gazefollow/data_new

🎥 VideoAttentionTarget

python data_prep/preprocess_vat.py \
  --data_path /path/to/videoattentiontarget

🧒 ChildPlay

python data_prep/preprocess_childplay.py \
  --data_path /path/to/childplay

🛒 GOO-Real

python data_prep/preprocess_goo_real.py \
  --data_path /path/to/goo_real

After preprocessing, each dataset directory will contain JSON annotation files that can be directly used for GazeLoom training and evaluation.
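As a quick sanity check after preprocessing, a small script can confirm that every record carries the fields listed above. The sketch below is only an illustration: the key names (`head_bbox`, `gaze_point`, `inout`) and the assumption that each file holds a flat list of records are hypothetical, so adapt them to the keys the preprocessing scripts actually write.

```python
import json
from pathlib import Path

# Hypothetical field names; the real keys are defined by the preprocessing scripts.
REQUIRED_KEYS = ("head_bbox", "gaze_point", "inout")


def find_incomplete_records(records):
    """Return the indices of records that lack any of REQUIRED_KEYS."""
    return [
        index
        for index, record in enumerate(records)
        if any(key not in record for key in REQUIRED_KEYS)
    ]


def check_annotation_file(json_path):
    """Load a JSON file assumed to hold a list of records and report incomplete ones."""
    records = json.loads(Path(json_path).read_text(encoding="utf-8"))
    return find_incomplete_records(records)
```

An empty result from `check_annotation_file` means every record has all of the assumed fields.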

🌊 Depth Map Extraction

GazeLoom uses Depth Anything V2 to generate monocular depth maps as geometric guidance.

Pretrained Weights

Download official Depth Anything V2 checkpoints:

Model	Checkpoint
Depth-Anything-V2-Small	Download
Depth-Anything-V2-Base	Download
Depth-Anything-V2-Large	Download

Place the downloaded checkpoint under:

checkpoints/
└── depth_anything_v2_vitl.pth

Extract Depth Maps

python depthany/depth.py \
  --img_path /path/to/images \
  --outdir /path/to/depth \
  --encoder vitl \
  --checkpoint checkpoints/depth_anything_v2_vitl.pth \
  --input_size 518 \
  --grayscale

For GazeLoom datasets, the generated depth maps should preserve the same relative image paths:

dataset_root/
├── gazefollow/
│   └── xxx.jpg
├── videoattentiontarget/
│   └── xxx.jpg
└── depth/
    └── xxx.png
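The path convention above can be checked with a short helper. This is a minimal sketch that assumes each depth map keeps the image's path relative to its dataset folder and only swaps the extension to `.png`; the function names are illustrative and not part of the repository.

```python
from pathlib import Path


def depth_path_for(image_path, image_root, depth_root):
    """Map an image to its depth map, keeping the path relative to image_root."""
    relative = Path(image_path).relative_to(image_root)
    return Path(depth_root) / relative.with_suffix(".png")


def missing_depth_maps(image_root, depth_root, pattern="*.jpg"):
    """List images under image_root that have no matching depth map yet."""
    return [
        image
        for image in sorted(Path(image_root).rglob(pattern))
        if not depth_path_for(image, image_root, depth_root).exists()
    ]
```

Calling `missing_depth_maps("dataset_root/gazefollow", "dataset_root/depth")` before training gives a list of images that still need depth extraction.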

🏋️ Train

We provide training scripts in scripts/ for training GazeLoom with the SimDINOv2 backbone.

Before running the training script, please:

📥 Download the dataset and run the preprocessing script following the Data Processing section.
🌊 Prepare the extracted depth maps if geometry-guided training is enabled.
📊 Optionally install wandb for metric logging:

pip install wandb

By default, checkpoints are saved to ./experiments.
You can use --ckpt_save_dir to customize the checkpoint directory.

👀 GazeFollow

Train GazeLoom with the SimDINOv2 ViT-B/14 backbone:

python scripts/train_gazefollow.py \
  --data_path /path/to/gazefollow/data_new \
  --model gazeloom_cgf_simdinov2_vitb14_inout \
  --exp_name train_gazeloom_simdinov2_vitb14_gazefollow \
  --batch_size 48 \
  --max_epochs 30 \
  --lr 5e-4

Train GazeLoom with the SimDINOv2 ViT-L/14 backbone:

python scripts/train_gazefollow.py \
  --data_path /path/to/gazefollow/data_new \
  --model gazeloom_cgf_simdinov2_vitl14_inout \
  --exp_name train_gazeloom_simdinov2_vitl14_gazefollow \
  --batch_size 32 \
  --max_epochs 30 \
  --lr 5e-4

🔁 Resume Training

Resume training from a saved checkpoint:

python scripts/train_gazefollow.py \
  --data_path /path/to/gazefollow/data_new \
  --model gazeloom_cgf_simdinov2_vitb14_inout \
  --resume /path/to/checkpoint.pt \
  --exp_name resume_gazeloom_simdinov2_vitb14

🔓 Backbone Fine-tuning

By default, the backbone is frozen.
To fine-tune the last several SimDINOv2 transformer blocks, use --unfreeze_layers:

python scripts/train_gazefollow.py \
  --data_path /path/to/gazefollow/data_new \
  --model gazeloom_cgf_simdinov2_vitb14_inout \
  --unfreeze_layers 2 \
  --exp_name finetune_gazeloom_simdinov2_vitb14
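One plausible reading of --unfreeze_layers is "freeze everything, then re-enable gradients for the last N transformer blocks". The helper below sketches that idea; it is not taken from the training script, and it only assumes that each block exposes `.parameters()` the way a `torch.nn.Module` does.

```python
def unfreeze_last_blocks(blocks, unfreeze_layers):
    """Freeze every block, then re-enable gradients for the last `unfreeze_layers`.

    `blocks` is any sequence of modules exposing `.parameters()`, such as the
    transformer blocks of a ViT backbone held in a `torch.nn.ModuleList`.
    Returns the index of the first trainable block.
    """
    blocks = list(blocks)
    if not 0 <= unfreeze_layers <= len(blocks):
        raise ValueError("unfreeze_layers must be between 0 and the block count")
    # Computing the boundary explicitly avoids the `blocks[-0:]` slicing pitfall,
    # which would select every block when unfreeze_layers is 0.
    first_trainable = len(blocks) - unfreeze_layers
    for index, block in enumerate(blocks):
        trainable = index >= first_trainable
        for parameter in block.parameters():
            parameter.requires_grad = trainable
    return first_trainable
```

With `unfreeze_layers=0` every block stays frozen, which matches the default behavior described above.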

⚙️ Main Arguments

Argument	Description
`--model`	Model name, e.g. `gazeloom_cgf_simdinov2_vitb14_inout`
`--data_path`	Path to the preprocessed dataset
`--ckpt_save_dir`	Directory for saving checkpoints
`--exp_name`	Experiment name
`--batch_size`	Training batch size
`--max_epochs`	Number of training epochs
`--lr`	Learning rate for trainable heads
`--resume`	Path to checkpoint for resuming training
`--unfreeze_layers`	Number of final backbone layers to fine-tune

🎨 Visualization

We provide a visualization script in scripts/visualize.py for qualitative analysis of GazeLoom predictions.

The script automatically detects faces using RetinaFace, predicts gaze heatmaps for each detected person, and saves the visualization results, including:

🟧 detected head bounding boxes
🔴 predicted gaze target points
🟢 high-response heatmap regions
📍 gaze direction lines
🌈 heatmap overlay images
📄 JSON prediction results

Run Visualization

python scripts/visualize.py \
  --image_dir test \
  --depth_dir test_depth \
  --ckpt_path checkpoints/epoch_14.pt \
  --model_name gazeloom_cgf_simdinov2_vitb14_inout \
  --output_dir output

Arguments

Argument	Description
`--image_dir`	Directory containing input images
`--depth_dir`	Directory containing extracted depth maps
`--ckpt_path`	Path to the trained model checkpoint
`--model_name`	Model architecture used for inference
`--output_dir`	Directory for saving visualization results
`--inout_threshold`	Threshold for filtering out-of-frame gaze predictions
`--heatmap_threshold`	Threshold for highlighting high-response gaze regions
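To make the two thresholds concrete, the sketch below shows one way they could be applied to a single person's outputs. It is an illustration rather than the script's actual logic: the default values of 0.5, the heatmap value range, and the function name are all assumptions.

```python
def decode_prediction(heatmap, inout_score, inout_threshold=0.5, heatmap_threshold=0.5):
    """Turn one person's outputs into a gaze point and a high-response mask.

    heatmap: 2D list of floats (rows of equal length), values assumed in [0, 1].
    inout_score: estimated probability that the gaze target is inside the frame.
    """
    if inout_score < inout_threshold:
        return None, None  # treated as an out-of-frame gaze target

    height, width = len(heatmap), len(heatmap[0])
    # Locate the peak response (the first maximum in row-major order).
    _, row, col = max(
        ((value, r, c) for r, line in enumerate(heatmap) for c, value in enumerate(line)),
        key=lambda item: item[0],
    )
    # Normalise the centre of the peak cell to [0, 1] image coordinates as (x, y).
    gaze_point = ((col + 0.5) / width, (row + 0.5) / height)
    # Mark the cells whose response reaches the heatmap threshold.
    mask = [[value >= heatmap_threshold for value in line] for line in heatmap]
    return gaze_point, mask
```

Raising `inout_threshold` drops more uncertain predictions, while raising `heatmap_threshold` shrinks the highlighted region.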

Output Files

After running the script, results will be saved under the output directory:

output/
├── image_result.jpg
├── image_heatmap_0.jpg
├── image_overlay_0.jpg
└── gaze_predictions.json

🧪 Evaluation

We provide evaluation scripts in scripts/ to validate GazeLoom on standard gaze-target benchmarks.

Before evaluation, please make sure that:

📥 The dataset has been downloaded.
🗂️ The preprocessing script has been executed.
🌊 Depth maps have been generated if the model uses geometry guidance.
📦 The pretrained checkpoint has been downloaded.

👀 GazeFollow

Evaluate GazeLoom on the GazeFollow test split:

python scripts/eval_gazefollow.py \
  --data_path /path/to/gazefollow/data_new \
  --model_name gazeloom_cgf_simdinov2_vitl14 \
  --ckpt_path /path/to/checkpoint.pt \
  --batch_size 128

The script reports:

Metric	Description
AUC ↑	Area under the gaze heatmap ROC curve
Avg L2 ↓	Average L2 distance between prediction and ground-truth gaze points
Min L2 ↓	Minimum L2 distance to the closest ground-truth annotation
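The two distance metrics can be illustrated for a single test image with several annotations. The sketch follows the table's wording literally (mean of the per-annotation distances, and the smallest of them); the evaluation script may aggregate differently, so treat this as an assumption-laden example.

```python
import math


def l2_metrics(prediction, annotations):
    """Return (avg_l2, min_l2) for one predicted point and its annotations.

    Points are (x, y) pairs in normalised image coordinates.
    """
    distances = [math.dist(prediction, annotation) for annotation in annotations]
    return sum(distances) / len(distances), min(distances)
```

Dataset-level scores would then be the mean of these per-image values.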

🎥 VideoAttentionTarget

Evaluate GazeLoom on VideoAttentionTarget:

python scripts/eval_vat.py \
  --data_path /path/to/videoattentiontarget \
  --model_name gazeloom_cgf_simdinov2_vitl14_inout \
  --ckpt_path /path/to/checkpoint.pt \
  --batch_size 64

For VideoAttentionTarget, the _inout model is recommended because the dataset includes both in-frame and out-of-frame gaze targets.

🧒 ChildPlay

Evaluate GazeLoom on ChildPlay:

python scripts/eval_childplay.py \
  --data_path /path/to/childplay \
  --model_name gazeloom_cgf_simdinov2_vitl14_inout \
  --ckpt_path /path/to/checkpoint.pt \
  --batch_size 64

🛒 GOO-Real

Evaluate GazeLoom on GOO-Real:

python scripts/eval_goo_real.py \
  --data_path /path/to/goo_real \
  --model_name gazeloom_cgf_simdinov2_vitl14 \
  --ckpt_path /path/to/checkpoint.pt \
  --batch_size 64

📊 Example Output

Running on cuda
Evaluating: 100%|████████████████████| 100/100
AUC: 0.967
Avg L2: 0.112
Min L2: 0.079

The generated visualizations can be used to inspect gaze direction, predicted attention targets, and model behavior under different driving scenarios.

📸 Visuals

✨ Key Features

Feature	Description
🧠 Geometry-Guided Learning	Combines semantic and geometric priors for robust gaze estimation
⚙️ Self-Supervised Backbone	Reduces dependency on large-scale labeled data
🚗 Driver-Centric Design	Optimized for railway and in-cabin driving environments
⚡ Lightweight Deployment	4.97M parameters with real-time edge inference capability

🧩 Overall Framework

GazeLoom is a lightweight and geometry-guided framework for 3D driver gaze estimation in railway driving scenarios.

It consists of three key stages:

Feature Extraction
Geometry Guidance
Fusion & Prediction

🎯 Stage 1 — Feature Extraction

A pre-trained self-supervised backbone, SimDINOv2, is used to encode driving scene images and generate robust global visual representations.

To enhance semantic understanding and spatial perception, GazeLoom incorporates multiple auxiliary cues:

🌫️ Depth Map — provides 3D structural priors
✨ DISM Saliency Map — highlights attention-relevant visual regions
👤 Head Pose Features — offer geometric priors of gaze orientation

These features are fed into the Multi-modal Geometry Guidance module for semantic-geometric fusion.

📐 Stage 2 — Geometry Guidance

The Multi-modal Geometry Guidance (MGG) module enhances 3D spatial reasoning and structural perception.

Head Branch

The head branch uses head feature maps with pseudo-heatmap supervision to explicitly model local geometric constraints of gaze direction.

Depth Branch

The depth branch fuses depth maps and DISM saliency maps to inject global 3D structural priors.

Together, these branches generate a structure-consistent visual-spatial representation for downstream gaze prediction.

🔗 Stage 3 — Fusion & Prediction

The Cross-modal Gating Fusion (CGF) module adaptively integrates semantic and spatial features through a gating attention mechanism.

After fusion, the model performs two prediction tasks:

🎯 In-Out Gaze Classification
🔥 Gaze Heatmap Generation

The entire model is trained with a multi-task joint objective covering both prediction tasks, which is intended to improve robustness and generalization.
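A multi-task objective of this kind is often written as a weighted sum of a heatmap loss and an in/out classification loss. The sketch below assumes mean squared error for the heatmap, binary cross-entropy for the in/out label, and a hypothetical `inout_weight`; the losses actually used by GazeLoom are not stated here and may differ.

```python
import math


def multitask_loss(pred_heatmap, target_heatmap, inout_prob, inout_label, inout_weight=1.0):
    """Weighted sum of a heatmap regression loss and an in/out classification loss."""
    flat_pred = [value for row in pred_heatmap for value in row]
    flat_target = [value for row in target_heatmap for value in row]
    heatmap_mse = sum((p - t) ** 2 for p, t in zip(flat_pred, flat_target)) / len(flat_pred)

    # Binary cross-entropy, with the probability clamped away from 0 and 1.
    eps = 1e-7
    prob = min(max(inout_prob, eps), 1.0 - eps)
    inout_bce = -(inout_label * math.log(prob) + (1 - inout_label) * math.log(1.0 - prob))

    # The heatmap term is masked by the label, so out-of-frame samples (label 0)
    # contribute only the classification loss.
    return inout_label * heatmap_mse + inout_weight * inout_bce
```

Masking the heatmap term by the in/out label is a common design choice, since out-of-frame samples have no valid gaze target to regress.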

🔍 Module Details

🧩 MGG — Multi-modal Geometry Guidance

MGG integrates geometric priors from multiple modalities to enhance robustness under complex driving conditions.

Input Sources

Facial landmarks
Head pose
Eye-region depth features

Core Functions

Builds multi-modal geometric representations
Captures spatial relationships between facial structure and orientation
Applies geometry consistency constraints
Uses a lightweight transformer to model spatial dependencies

💡 MGG helps GazeLoom maintain high precision under lighting changes, head rotations, and partial occlusions.

🔗 CGF — Cross-modal Gating Fusion

CGF introduces a gating mechanism to dynamically balance semantic and geometric features.

Mechanism

Learns adaptive weights between geometry and semantic branches
Prevents over-reliance on a single modality
Enables geometry-constrained cross-modal fusion

Advantages

Improves semantic coherence
Enhances spatial continuity
Strengthens generalization and stability

⚙️ CGF improves inter-modal cooperation, making GazeLoom accurate and reliable in real-world in-cabin scenarios.
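The gating idea can be sketched in a few lines: a sigmoid gate decides, per feature, how much of the semantic branch versus the geometric branch passes through. This is a generic illustration of gated fusion, not the CGF implementation; in a trained model the gate logits would come from a learned layer, whereas here they are passed in directly.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def gated_fusion(semantic, geometric, gate_logits):
    """Blend two feature vectors element-wise with a sigmoid gate.

    output[i] = gate[i] * semantic[i] + (1 - gate[i]) * geometric[i]
    """
    fused = []
    for s, g, z in zip(semantic, geometric, gate_logits):
        gate = sigmoid(z)
        fused.append(gate * s + (1.0 - gate) * g)
    return fused
```

Because the gate lies strictly between 0 and 1, neither branch is ever fully discarded, which is one way to limit over-reliance on a single modality.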

🧠 Architecture Overview

The GazeLoom architecture estimates 3D driver gaze points through the following pipeline:

Camera Input
Face Landmark Extraction
Head Pose Estimation
Eye Gaze Vector Modeling
Multi-modal Geometry Guidance
Cross-modal Gating Fusion
3D Gaze Point Prediction

📊 Datasets & Results

Dataset	AUC ↑	L2 ↓	AP ↑
GazeFollow	0.967	0.079	-
VideoAttentionTarget	0.953	0.098	0.942

GazeLoom achieves strong performance across multiple benchmarks while maintaining a lightweight architecture.

⚙️ Installation

Clone the repository:

git clone https://github.com/yourname/GAZELOOM.git

cd GAZELOOM

🔁 Reproducibility

To facilitate reproducibility, we provide the training seeds, hyperparameters, optimizer settings, and learning-rate schedules used in our experiments.

⚠️ Due to hardware differences, CUDA/cuDNN behavior, and dataloader randomness, exact numerical results may vary slightly across environments.

🧪 Training Environment

Item	Configuration
🧠 Framework	PyTorch 2.2.0
⚙️ CUDA	CUDA 12.1
🖥️ GPU	NVIDIA GPU
📦 Backbone	SimDINOv2
📐 Input Size	`448 × 448`
🔥 Heatmap Size	`64 × 64`

🎲 Random Seeds

All experiments are initialized with fixed random seeds:

import random
import numpy as np
import torch

random.seed(0)
np.random.seed(0)
torch.manual_seed(0)
torch.cuda.manual_seed_all(0)

⚙️ Default Training Configuration

Setting	Value
Optimizer	Adam
Batch Size	48
Max Epochs	30
Base Learning Rate	`5e-4`
Backbone Learning Rate	`1e-5`
Scheduler	CosineAnnealingLR
Minimum LR	`1e-7`
Weight Decay	Not used by default
Backbone	Frozen by default
Drop Path	`0.1`
CGF Groups	`8`

📊 Dataset-specific Schedule

Dataset	Initialization	Epochs	Model
👀 GazeFollow	From SimDINOv2 backbone	30	`gazeloom_cgf_simdinov2_vitb14_inout`
🎥 VideoAttentionTarget	Fine-tuned from GazeFollow checkpoint	8	`gazeloom_cgf_simdinov2_vitb14_inout`
🧒 ChildPlay	Fine-tuned from GazeFollow checkpoint	3	`gazeloom_cgf_simdinov2_vitb14_inout`
🛒 GOO-Real	Fine-tuned from GazeFollow checkpoint	3	`gazeloom_cgf_simdinov2_vitb14`

🔓 Backbone Fine-tuning

By default, the SimDINOv2 backbone is frozen.
If needed, the last N Transformer blocks can be unfrozen using:

--unfreeze_layers N

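For example, the command below passes --unfreeze_layers 0, which keeps the backbone fully frozen (the default behavior); pass a positive N to fine-tune the last N blocks: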

python scripts/train_gazefollow.py \
  --data_path /path/to/gazefollow/data_new \
  --model gazeloom_cgf_simdinov2_vitb14_inout \
  --batch_size 48 \
  --max_epochs 30 \
  --lr 5e-4 \
  --unfreeze_layers 0

📝 Notes

The reported results are obtained using the configuration above.
Checkpoints are saved after each epoch under ./experiments.
Training can be resumed with --resume /path/to/checkpoint.pt.
Small performance variations may occur due to GPU type, CUDA kernels, and multi-worker dataloading.
