thailand88/GAZELOOM

⚡ GazeLoom ⚡

Train Driver Gaze Estimation Framework


A lightweight and robust driver gaze estimation system powered by self-supervised learning and geometry guidance.


🚀 About

GazeLoom is a driver gaze estimation framework designed for intelligent traffic safety and human-vehicle interaction.

It leverages multi-modal geometric guidance and self-supervised feature extraction to accurately predict driver gaze points in 3D space.

Highlights

  • 🔹 Lightweight Model: only 4.97M parameters
  • 🔹 High-Precision Estimation: joint prediction of head pose and eye movement
  • 🔹 Real-Time Inference: suitable for in-vehicle edge deployment
  • 🔹 Strong Generalization: robust to lighting changes, occlusions, and pose variations
  • 🚀 ONNX Deployment: run onnx.py to export and deploy GazeLoom in ONNX format. Google Drive: Download

πŸ—‚οΈ Data Processing

Before training or evaluation, please download the required datasets and run the corresponding preprocessing scripts.

The preprocessing scripts convert raw annotations into a unified JSON format for each split, including head bounding boxes, gaze points, in/out labels, and metadata required by GazeLoom.

Dataset Download Preprocessing Script
πŸ‘€ GazeFollow Download data_prep/preprocess_gazefollow.py
πŸŽ₯ VideoAttentionTarget Download data_prep/preprocess_vat.py
πŸ§’ ChildPlay Download data_prep/preprocess_childplay.py
πŸ›’ GOO-Real Download data_prep/preprocess_goo_real.py

πŸ‘€ GazeFollow

python data_prep/preprocess_gazefollow.py \
  --data_path /path/to/gazefollow/data_new

🎥 VideoAttentionTarget

python data_prep/preprocess_vat.py \
  --data_path /path/to/videoattentiontarget

🧒 ChildPlay

python data_prep/preprocess_childplay.py \
  --data_path /path/to/childplay

🛒 GOO-Real

python data_prep/preprocess_goo_real.py \
  --data_path /path/to/goo_real

After preprocessing, each dataset directory will contain JSON annotation files that can be directly used for GazeLoom training and evaluation.
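
As a quick sanity check after preprocessing, a short script can confirm that every record carries the fields listed above. The sketch below is illustrative only: the file layout (a JSON list of records) and the key names `path`, `head_bbox`, `gaze_point`, and `inout` are assumptions, not the confirmed schema of the preprocessing scripts, so adjust them to match the generated files.

```python
import json
from pathlib import Path

# Hypothetical key names; replace them with the keys found in your JSON files.
REQUIRED_KEYS = ("path", "head_bbox", "gaze_point", "inout")


def find_incomplete_records(records):
    """Return (index, missing_keys) pairs for records lacking a required key."""
    problems = []
    for index, record in enumerate(records):
        missing = [key for key in REQUIRED_KEYS if key not in record]
        if missing:
            problems.append((index, missing))
    return problems


def check_annotation_file(json_path):
    """Load a JSON list of records from disk and report incomplete ones."""
    records = json.loads(Path(json_path).read_text(encoding="utf-8"))
    return find_incomplete_records(records)


# Small in-memory example used instead of a real dataset file.
example = [
    {"path": "a.jpg", "head_bbox": [10, 20, 50, 60], "gaze_point": [0.4, 0.5], "inout": 1},
    {"path": "b.jpg", "head_bbox": [5, 5, 40, 40]},
]
print(find_incomplete_records(example))
```

An empty result means every record has the expected keys; anything else points to the records that need a closer look before training.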

🌊 Depth Map Extraction

GazeLoom uses Depth Anything V2 to generate monocular depth maps as geometric guidance.

Pretrained Weights

Download official Depth Anything V2 checkpoints:

| Model | Checkpoint |
| --- | --- |
| Depth-Anything-V2-Small | Download |
| Depth-Anything-V2-Base | Download |
| Depth-Anything-V2-Large | Download |

Place the downloaded checkpoint under:

checkpoints/
└── depth_anything_v2_vitl.pth

Extract Depth Maps

python depthany/depth.py \
  --img_path /path/to/images \
  --outdir /path/to/depth \
  --encoder vitl \
  --checkpoint checkpoints/depth_anything_v2_vitl.pth \
  --input_size 518 \
  --grayscale

For GazeLoom datasets, the generated depth maps should preserve the same relative image paths:

dataset_root/
├── gazefollow/
│   └── xxx.jpg
├── videoattentiontarget/
│   └── xxx.jpg
└── depth/
    └── xxx.png
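
One way to read "the same relative image paths" is that a depth map sits at the same path below the depth directory as its image does below the dataset directory, with a .png extension. The helper below sketches that mapping; the function name `depth_path_for` and the example file names are hypothetical, and the exact layout GazeLoom expects should be checked against the data loader.

```python
from pathlib import Path


def depth_path_for(image_path, dataset_dir, depth_dir):
    """Map an image path to the depth map that mirrors its relative path.

    Assumes depth maps are stored as PNG files, as in the layout above.
    """
    relative = Path(image_path).relative_to(dataset_dir)
    return Path(depth_dir) / relative.with_suffix(".png")


print(depth_path_for(
    "dataset_root/gazefollow/train/0001.jpg",
    "dataset_root/gazefollow",
    "dataset_root/depth",
).as_posix())
```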

πŸ‹οΈ Train

We provide training scripts in scripts/ for training GazeLoom with the SimDINOv2 backbone.

Before running the training script, please:

  • πŸ“₯ Download the dataset and run the preprocessing script following the Data Processing section.
  • 🌊 Prepare the extracted depth maps if geometry-guided training is enabled.
  • πŸ“Š Optionally install wandb for metric logging:
pip install wandb

By default, checkpoints are saved to ./experiments.
You can use --ckpt_save_dir to customize the checkpoint directory.


👀 GazeFollow

Train GazeLoom with the SimDINOv2 ViT-B/14 backbone:

python scripts/train_gazefollow.py \
  --data_path /path/to/gazefollow/data_new \
  --model gazeloom_cgf_simdinov2_vitb14_inout \
  --exp_name train_gazeloom_simdinov2_vitb14_gazefollow \
  --batch_size 48 \
  --max_epochs 30 \
  --lr 5e-4

Train GazeLoom with the SimDINOv2 ViT-L/14 backbone:

python scripts/train_gazefollow.py \
  --data_path /path/to/gazefollow/data_new \
  --model gazeloom_cgf_simdinov2_vitl14_inout \
  --exp_name train_gazeloom_simdinov2_vitl14_gazefollow \
  --batch_size 32 \
  --max_epochs 30 \
  --lr 5e-4

πŸ” Resume Training

Resume training from a saved checkpoint:

python scripts/train_gazefollow.py \
  --data_path /path/to/gazefollow/data_new \
  --model gazeloom_cgf_simdinov2_vitb14_inout \
  --resume /path/to/checkpoint.pt \
  --exp_name resume_gazeloom_simdinov2_vitb14

🔓 Backbone Fine-tuning

By default, the backbone is frozen.
To fine-tune the last several SimDINOv2 transformer blocks, use --unfreeze_layers:

python scripts/train_gazefollow.py \
  --data_path /path/to/gazefollow/data_new \
  --model gazeloom_cgf_simdinov2_vitb14_inout \
  --unfreeze_layers 2 \
  --exp_name finetune_gazeloom_simdinov2_vitb14
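
To make the effect of the flag concrete, the sketch below computes which transformer blocks would stay trainable for a given value. It assumes blocks are counted from the end of the backbone, which is how "last several blocks" is read here; the helper name is hypothetical and the training script may select parameters differently.

```python
def trainable_block_indices(num_blocks, unfreeze_layers):
    """Return the 0-based indices of the final `unfreeze_layers` blocks."""
    if not 0 <= unfreeze_layers <= num_blocks:
        raise ValueError("unfreeze_layers must be between 0 and num_blocks")
    return list(range(num_blocks - unfreeze_layers, num_blocks))


# A standard ViT-B has 12 transformer blocks; a ViT-L has 24.
print(trainable_block_indices(12, 2))  # [10, 11]
print(trainable_block_indices(12, 0))  # []
```

In PyTorch, freezing usually amounts to setting `requires_grad = False` on the parameters of every block outside this list.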

βš™οΈ Main Arguments

Argument Description
--model Model name, e.g. gazeloom_cgf_simdinov2_vitb14_inout
--data_path Path to the preprocessed dataset
--ckpt_save_dir Directory for saving checkpoints
--exp_name Experiment name
--batch_size Training batch size
--max_epochs Number of training epochs
--lr Learning rate for trainable heads
--resume Path to checkpoint for resuming training
--unfreeze_layers Number of final backbone layers to fine-tune
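
For readers who want to wrap or script the training entry point, the flags above could be declared roughly as follows. This is a sketch, not the parser in scripts/train_gazefollow.py: the numeric defaults are taken from the Default Training Configuration section, the checkpoint directory from the note above, and the defaults for `--model`, `--exp_name`, and `--resume` are assumptions.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented training flags."""
    parser = argparse.ArgumentParser(description="GazeLoom training (sketch)")
    parser.add_argument("--model", default="gazeloom_cgf_simdinov2_vitb14_inout")
    parser.add_argument("--data_path", required=True)
    parser.add_argument("--ckpt_save_dir", default="./experiments")
    parser.add_argument("--exp_name", default="")
    parser.add_argument("--batch_size", type=int, default=48)
    parser.add_argument("--max_epochs", type=int, default=30)
    parser.add_argument("--lr", type=float, default=5e-4)
    parser.add_argument("--resume", default=None)
    parser.add_argument("--unfreeze_layers", type=int, default=0)
    return parser


args = build_parser().parse_args(["--data_path", "/path/to/gazefollow/data_new"])
print(args.batch_size, args.lr, args.unfreeze_layers)
```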

🎨 Visualization

We provide a visualization script in scripts/visualize.py for qualitative analysis of GazeLoom predictions.

The script automatically detects faces using RetinaFace, predicts gaze heatmaps for each detected person, and saves the visualization results, including:

  • 🟧 detected head bounding boxes
  • 🔴 predicted gaze target points
  • 🟢 high-response heatmap regions
  • 📏 gaze direction lines
  • 🌈 heatmap overlay images
  • 📄 JSON prediction results

Run Visualization

python scripts/visualize.py \
  --image_dir test \
  --depth_dir test_depth \
  --ckpt_path checkpoints/epoch_14.pt \
  --model_name gazeloom_cgf_simdinov2_vitb14_inout \
  --output_dir output

Arguments

| Argument | Description |
| --- | --- |
| --image_dir | Directory containing input images |
| --depth_dir | Directory containing extracted depth maps |
| --ckpt_path | Path to the trained model checkpoint |
| --model_name | Model architecture used for inference |
| --output_dir | Directory for saving visualization results |
| --inout_threshold | Threshold for filtering out-of-frame gaze predictions |
| --heatmap_threshold | Threshold for highlighting high-response gaze regions |
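
The two thresholds can be pictured with a small decoding routine: a prediction is dropped when its in/out score falls below `--inout_threshold`, and heatmap cells at or above `--heatmap_threshold` are marked as high-response. The sketch below is an assumption about how such post-processing might look, not the code in scripts/visualize.py; the function name, the 0.5 defaults, and the peak-based gaze point are all illustrative.

```python
def decode_prediction(heatmap, inout_score, inout_threshold=0.5, heatmap_threshold=0.5):
    """Turn one person's outputs into a gaze point and high-response cells.

    heatmap: 2D list of floats in [0, 1]; inout_score: float in [0, 1].
    Returns None when the gaze is judged to be out of frame.
    """
    if inout_score < inout_threshold:
        return None
    height, width = len(heatmap), len(heatmap[0])
    # Peak location of the heatmap, taken as the gaze target.
    row, col = max(
        ((r, c) for r in range(height) for c in range(width)),
        key=lambda rc: heatmap[rc[0]][rc[1]],
    )
    high_response = [
        (r, c)
        for r in range(height)
        for c in range(width)
        if heatmap[r][c] >= heatmap_threshold
    ]
    # Normalize the cell centre to [0, 1] image coordinates (x, y).
    gaze_xy = ((col + 0.5) / width, (row + 0.5) / height)
    return {"gaze_xy": gaze_xy, "high_response": high_response}


heatmap = [
    [0.0, 0.1, 0.0, 0.0],
    [0.1, 0.9, 0.6, 0.0],
    [0.0, 0.2, 0.1, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
print(decode_prediction(heatmap, inout_score=0.8))
print(decode_prediction(heatmap, inout_score=0.2))
```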

Output Files

After running the script, results will be saved under the output directory:

output/
├── image_result.jpg
├── image_heatmap_0.jpg
├── image_overlay_0.jpg
└── gaze_predictions.json

🧪 Evaluation

We provide evaluation scripts in scripts/ to validate GazeLoom on standard gaze-target benchmarks.

Before evaluation, please make sure that:

  • 📥 The dataset has been downloaded.
  • 🗂️ The preprocessing script has been executed.
  • 🌊 Depth maps have been generated if the model uses geometry guidance.
  • 📦 The pretrained checkpoint has been downloaded.

👀 GazeFollow

Evaluate GazeLoom on the GazeFollow test split:

python scripts/eval_gazefollow.py \
  --data_path /path/to/gazefollow/data_new \
  --model_name gazeloom_cgf_simdinov2_vitl14 \
  --ckpt_path /path/to/checkpoint.pt \
  --batch_size 128

The script reports:

| Metric | Description |
| --- | --- |
| AUC ↑ | Area under the gaze heatmap ROC curve |
| Avg L2 ↓ | Average L2 distance between prediction and ground-truth gaze points |
| Min L2 ↓ | Minimum L2 distance to the closest ground-truth annotation |
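
The two distance metrics can be illustrated for a single test sample with several annotators. The sketch follows the literal wording of the table (mean of the distances to each annotation, and the smallest such distance) in normalized image coordinates; this reading is an assumption, and some evaluation code instead measures Avg L2 against the mean annotation point, so check scripts/eval_gazefollow.py for the definition actually used.

```python
import math


def l2_metrics(prediction, annotations):
    """Return (avg_l2, min_l2) for one sample.

    prediction: (x, y) in normalized [0, 1] image coordinates.
    annotations: list of (x, y) ground-truth gaze points.
    """
    distances = [math.dist(prediction, point) for point in annotations]
    return sum(distances) / len(distances), min(distances)


avg_l2, min_l2 = l2_metrics((0.5, 0.5), [(0.5, 0.6), (0.8, 0.9)])
print(round(avg_l2, 3), round(min_l2, 3))
```

Dataset-level scores are then the averages of these per-sample values.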

🎥 VideoAttentionTarget

Evaluate GazeLoom on VideoAttentionTarget:

python scripts/eval_vat.py \
  --data_path /path/to/videoattentiontarget \
  --model_name gazeloom_cgf_simdinov2_vitl14_inout \
  --ckpt_path /path/to/checkpoint.pt \
  --batch_size 64

For VideoAttentionTarget, the _inout model is recommended because the dataset includes both in-frame and out-of-frame gaze targets.


🧒 ChildPlay

Evaluate GazeLoom on ChildPlay:

python scripts/eval_childplay.py \
  --data_path /path/to/childplay \
  --model_name gazeloom_cgf_simdinov2_vitl14_inout \
  --ckpt_path /path/to/checkpoint.pt \
  --batch_size 64

🛒 GOO-Real

Evaluate GazeLoom on GOO-Real:

python scripts/eval_goo_real.py \
  --data_path /path/to/goo_real \
  --model_name gazeloom_cgf_simdinov2_vitl14 \
  --ckpt_path /path/to/checkpoint.pt \
  --batch_size 64

📊 Example Output

Running on cuda
Evaluating: 100%|████████████████████| 100/100
AUC: 0.967
Avg L2: 0.112
Min L2: 0.079

The generated visualizations can be used to inspect gaze direction, predicted attention targets, and model behavior under different driving scenarios.

📸 Visuals


✨ Key Features

| Feature | Description |
| --- | --- |
| 🧠 Geometry-Guided Learning | Combines semantic and geometric priors for robust gaze estimation |
| ⚙️ Self-Supervised Backbone | Reduces dependency on large-scale labeled data |
| 🚗 Driver-Centric Design | Optimized for railway and in-cabin driving environments |
| ⚡ Lightweight Deployment | 4.97M parameters with real-time edge inference capability |

🧩 Overall Framework

GazeLoom is a lightweight and geometry-guided framework for 3D driver gaze estimation in railway driving scenarios.

It consists of three key stages:

  1. Feature Extraction
  2. Geometry Guidance
  3. Fusion & Prediction

🎯 Stage 1: Feature Extraction

A pre-trained self-supervised backbone, SimDINOv2, is used to encode driving scene images and generate robust global visual representations.

To enhance semantic understanding and spatial perception, GazeLoom incorporates multiple auxiliary cues:

  • 🌫️ Depth Map: provides 3D structural priors
  • ✨ DISM Saliency Map: highlights attention-relevant visual regions
  • 👀 Head Pose Features: offer geometric priors of gaze orientation

These features are fed into the Multi-modal Geometry Guidance module for semantic-geometric fusion.


πŸ“ Stage 2 β€” Geometry Guidance

The Multi-modal Geometry Guidance module, abbreviated as MGG, enhances 3D spatial reasoning and structural perception.

Head Branch

The head branch uses head feature maps with pseudo-heatmap supervision to explicitly model local geometric constraints of gaze direction.
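
This README does not spell out how the pseudo-heatmap is constructed. A common choice for heatmap supervision is a 2D Gaussian centred on a reference point, and the sketch below illustrates that idea only. The reference point (for instance the head-box centre), the `sigma` value, and the function name are assumptions; the default size of 64 matches the heatmap size listed under Training Environment.

```python
import math


def gaussian_heatmap(center_xy, size=64, sigma=3.0):
    """Build a size x size heatmap with a Gaussian peak at `center_xy`.

    center_xy: (x, y) in normalized [0, 1] coordinates.
    """
    cx = center_xy[0] * (size - 1)
    cy = center_xy[1] * (size - 1)
    return [
        [
            math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))
            for x in range(size)
        ]
        for y in range(size)
    ]


# Tiny 5 x 5 example so the peak position is easy to verify by hand.
heatmap = gaussian_heatmap((0.5, 0.25), size=5, sigma=1.0)
peak = max((value, x, y) for y, row in enumerate(heatmap) for x, value in enumerate(row))
print(peak)
```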

Depth Branch

The depth branch fuses depth maps and DISM saliency maps to inject global 3D structural priors.

Together, these branches generate a structure-consistent visual-spatial representation for downstream gaze prediction.


🔗 Stage 3: Fusion & Prediction

The Cross-modal Gating Fusion module, abbreviated as CGF, adaptively integrates semantic and spatial features through a gating attention mechanism.

After fusion, the model performs two prediction tasks:

  • 🎯 In-Out Gaze Classification
  • 🔥 Gaze Heatmap Generation

The entire model is trained using a multi-task joint optimization framework, improving robustness, generalization, and real-time performance.
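
The individual loss terms and their weights are not listed here. As an illustration of multi-task joint optimization over the two outputs, the sketch below pairs a mean-squared-error heatmap term with a binary cross-entropy in/out term, a combination often seen in gaze-target models. The choice of losses, the `inout_weight` parameter, and the masking of the heatmap term for out-of-frame samples are all assumptions.

```python
import math


def multitask_loss(pred_heatmap, target_heatmap, inout_prob, inout_label, inout_weight=1.0):
    """Combine heatmap regression and in/out classification losses.

    Heatmaps are flat lists of floats; inout_prob is a probability in (0, 1);
    inout_label is 1 for in-frame gaze and 0 for out-of-frame gaze.
    """
    # Mean squared error over heatmap cells.
    mse = sum((p - t) ** 2 for p, t in zip(pred_heatmap, target_heatmap)) / len(target_heatmap)
    # Binary cross-entropy, clamped for numerical stability.
    prob = min(max(inout_prob, 1e-7), 1.0 - 1e-7)
    bce = -(inout_label * math.log(prob) + (1 - inout_label) * math.log(1.0 - prob))
    # The heatmap term only applies when the target is inside the frame.
    return inout_label * mse + inout_weight * bce


loss = multitask_loss([0.0, 1.0], [0.0, 0.5], inout_prob=0.5, inout_label=1)
print(loss)
```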


πŸ” Module Details

🧩 MGG β€” Multi-modal Geometry Guidance

MGG integrates geometric priors from multiple modalities to enhance robustness under complex driving conditions.

Input Sources

  • Facial landmarks
  • Head pose
  • Eye-region depth features

Core Functions

  • Builds multi-modal geometric representations
  • Captures spatial relationships between facial structure and orientation
  • Applies geometry consistency constraints
  • Uses a lightweight transformer to model spatial dependencies

💡 MGG helps GazeLoom maintain high precision under lighting changes, head rotations, and partial occlusions.


🔗 CGF: Cross-modal Gating Fusion

CGF introduces a gating mechanism to dynamically balance semantic and geometric features.

Mechanism

  • Learns adaptive weights between geometry and semantic branches
  • Prevents over-reliance on a single modality
  • Enables geometry-constrained cross-modal fusion
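
The core idea of adaptive weighting can be shown with a per-channel sigmoid gate that blends a semantic feature with a geometric one. This is a generic gated-fusion sketch in plain Python, not the CGF implementation: the function names are hypothetical, and in a trained model the gate logits would be produced by a learned layer rather than passed in by hand.

```python
import math


def sigmoid(value):
    """Logistic function mapping any real number to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-value))


def gated_fusion(semantic, geometric, gate_logits):
    """Blend two feature vectors with a per-channel gate in (0, 1).

    A gate near 1 favours the semantic feature; near 0, the geometric one.
    """
    gates = [sigmoid(logit) for logit in gate_logits]
    return [g * s + (1.0 - g) * q for g, s, q in zip(gates, semantic, geometric)]


fused = gated_fusion([1.0, 2.0], [3.0, 4.0], gate_logits=[0.0, 0.0])
print(fused)
```

Because the gate weights always sum to one per channel, neither branch can be ignored entirely unless the gate saturates, which matches the stated goal of avoiding over-reliance on a single modality.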

Advantages

  • Improves semantic coherence
  • Enhances spatial continuity
  • Strengthens generalization and stability

βš™οΈ CGF improves inter-modal cooperation, making GazeLoom accurate and reliable in real-world in-cabin scenarios.


🧠 Architecture Overview

The GazeLoom architecture estimates 3D driver gaze points through the following pipeline:

  1. Camera Input
  2. Face Landmark Extraction
  3. Head Pose Estimation
  4. Eye Gaze Vector Modeling
  5. Multi-modal Geometry Guidance
  6. Cross-modal Gating Fusion
  7. 3D Gaze Point Prediction

📊 Datasets & Results

| Dataset | AUC ↑ | L2 ↓ | AP ↑ |
| --- | --- | --- | --- |
| GazeFollow | 0.967 | 0.079 | - |
| VideoAttentionTarget | 0.953 | 0.098 | 0.942 |

GazeLoom achieves strong performance across multiple benchmarks while maintaining a lightweight architecture.


βš™οΈ Installation

Clone the repository:

git clone https://github.com/yourname/GAZELOOM.git

cd GAZELOOM


πŸ” Reproducibility

To facilitate reproducibility, we provide the training seeds, hyperparameters, optimizer settings, and learning-rate schedules used in our experiments.

⚠️ Due to hardware differences, CUDA/cuDNN behavior, and dataloader randomness, exact numerical results may slightly vary across different environments.

πŸ§ͺ Training Environment

Item Configuration
🧠 Framework PyTorch 2.2.0
βš™οΈ CUDA CUDA 12.1
πŸ–₯️ GPU NVIDIA GPU
πŸ“¦ Backbone SimDINOv2
πŸ“ Input Size 448 Γ— 448
πŸ”₯ Heatmap Size 64 Γ— 64

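🎲 Random Seeds

All experiments are initialized with fixed random seeds:

import random

import numpy as np
import torch

# Seed every random number generator used during training.
random.seed(0)
np.random.seed(0)
torch.manual_seed(0)
torch.cuda.manual_seed_all(0)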

βš™οΈ Default Training Configuration

Setting Value
Optimizer Adam
Batch Size 48
Max Epochs 30
Base Learning Rate 5e-4
Backbone Learning Rate 1e-5
Scheduler CosineAnnealingLR
Minimum LR 1e-7
Weight Decay Not used by default
Backbone Frozen by default
Drop Path 0.1
CGF Groups 8
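
To show what the scheduler settings imply, the closed-form cosine annealing curve can be evaluated directly. The sketch assumes the schedule spans the full 30 epochs (that is, `t_max` equals Max Epochs) and is stepped once per epoch; neither detail is stated here, so treat the printed values as illustrative.

```python
import math


def cosine_annealing_lr(epoch, base_lr=5e-4, min_lr=1e-7, t_max=30):
    """Closed-form cosine annealing from base_lr down to min_lr."""
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * epoch / t_max))


for epoch in (0, 15, 30):
    print(epoch, cosine_annealing_lr(epoch))
```

The learning rate starts at the base value, passes roughly the midpoint between base and minimum at epoch 15, and reaches the minimum at epoch 30.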

📊 Dataset-specific Schedule

| Dataset | Initialization | Epochs | Model |
| --- | --- | --- | --- |
| 👀 GazeFollow | From SimDINOv2 backbone | 30 | gazeloom_cgf_simdinov2_vitb14_inout |
| 🎥 VideoAttentionTarget | Fine-tuned from GazeFollow checkpoint | 8 | gazeloom_cgf_simdinov2_vitb14_inout |
| 🧒 ChildPlay | Fine-tuned from GazeFollow checkpoint | 3 | gazeloom_cgf_simdinov2_vitb14_inout |
| 🛒 GOO-Real | Fine-tuned from GazeFollow checkpoint | 3 | gazeloom_cgf_simdinov2_vitb14 |

🔓 Backbone Fine-tuning

By default, the SimDINOv2 backbone is frozen.
If needed, the last N Transformer blocks can be unfrozen using:

--unfreeze_layers N

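For example, the following command spells out the default setting, in which no backbone blocks are unfrozen (N = 0):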

python scripts/train_gazefollow.py \
  --data_path /path/to/gazefollow/data_new \
  --model gazeloom_cgf_simdinov2_vitb14_inout \
  --batch_size 48 \
  --max_epochs 30 \
  --lr 5e-4 \
  --unfreeze_layers 0

πŸ“ Notes

  • The reported results are obtained using the configuration above.
  • Checkpoints are saved after each epoch under ./experiments.
  • Training can be resumed with --resume /path/to/checkpoint.pt.
  • Small performance variations may occur due to GPU type, CUDA kernels, and multi-worker dataloading.
