[MICCAI'25] MAKE: Multi-Aspect Knowledge-Enhanced Vision-Language Pretraining for Zero-shot Dermatological Assessment
Dermatological diagnosis represents a complex multimodal challenge that requires integrating visual features with specialized clinical knowledge. We introduce MAKE, a Multi-Aspect Knowledge-Enhanced vision-language pretraining framework for zero-shot dermatological tasks. Our approach addresses the limitations of existing vision-language models by decomposing clinical narratives into knowledge-enhanced subcaptions, connecting subcaptions with relevant image features, and adaptively prioritizing different knowledge aspects. Through pretraining on 403,563 dermatological image-text pairs, MAKE significantly outperforms state-of-the-art vision-language pretraining (VLP) models on eight datasets across zero-shot skin disease classification, concept annotation, and cross-modal retrieval tasks.
- 📦 Code and pretrained model will be released in this repository
- 🔍 Pretraining data will be available at Derm1M Repository
- 20/06/2025: Released the MAKE checkpoint and the evaluation pipeline.
- 25/06/2025: Released knowledge-extraction details.
06/09/2025: Released the training code.
Set up the conda environment (recommended):
conda create -n MAKE python=3.9.20
conda activate MAKE
Clone the MAKE repository and install the requirements:
git clone [email protected]:SiyuanYan1/MAKE.git
cd MAKE
pip install -r requirements.txt
Our model is available on Hugging Face for easy access. Here is a simple example demonstrating zero-shot disease classification with MAKE.
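A minimal sketch of what that looks like with the `open_clip` API, assuming the `hf-hub:xieji-x/MAKE` checkpoint used throughout this README; the image path and candidate disease list are placeholders for illustration:

```python
import torch
import open_clip
from PIL import Image

# Load the MAKE checkpoint from the Hugging Face Hub via open_clip
model, _, preprocess = open_clip.create_model_and_transforms('hf-hub:xieji-x/MAKE')
tokenizer = open_clip.get_tokenizer('hf-hub:xieji-x/MAKE')
model.eval()

# Placeholder image and candidate diseases, for illustration only
image = preprocess(Image.open('example_skin_image.jpg')).unsqueeze(0)
classnames = ['melanoma', 'basal cell carcinoma', 'psoriasis', 'eczema']
text = tokenizer([f'This is a skin image of {c}' for c in classnames])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

for classname, p in zip(classnames, probs[0].tolist()):
    print(f'{classname}: {p:.3f}')
```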
- Download our downstream tasks dataset from Google Drive, unzip it, and place the contents in the `data/` directory.
Once downloaded, your project directory should be organized as follows:
Expected Project Structure
├── concept_annotation
│ ├── automatic_concept_annotation.py
│ ├── dataset.py
│ ├── infer.py
│ ├── model.py
│ ├── term_lists
│ └── utils.py
├── data
│ ├── pretrain
│ ├── derm7pt
│ ├── F17K
│ ├── PAD
│ ├── SD-128
│ ├── skin_cap
│ ├── skincon
│ └── SNU
├── README.md
├── requirements.txt
├── script
│ ├── concept_annotation_script_open_clip.sh
│ └── test.sh
└── src
├── infer.py
├── main.py
├── open_clip
├── open_clip_train
└── test.py
- Script: `script/pretrain.sh`
- Training data: Our training data will be released in October.
- Hardware requirements: We trained on a single NVIDIA H200 GPU (about 140 GB of GPU memory in use during training; training takes around 6 hours). If your GPU has less memory, enable gradient accumulation by setting the `--accum-freq` parameter (e.g., `--batch-size 512 --accum-freq 4` keeps the effective batch size at 2048).
# MAKE-specific flags:
#   --lambda_m: weight of the MKCL (Multi-aspect Knowledge-Image Contrastive Learning) loss
#   --lambda_s: weight of the local alignment loss
#   --MKCL: enable the MKCL loss
#   --subcaptions: use the split knowledge-enhanced subcaptions
#   --use_disease_specific_weight: enable the local alignment loss
#   --num_subcaptions: number of subcaptions used per image
python src/main.py \
--zeroshot-frequency 1 \
--train-data=data/pretrain/MAKE_training.csv \
--val-data=data/pretrain/MAKE_valid.csv \
--csv-caption-key truncated_caption \
--csv-label-key label \
--aug-cfg scale="(0.4, 1.0)" color_jitter="(0.32, 0.32, 0.32, 0.08)" color_jitter_prob=0.8 gray_scale_prob=0.2 \
--csv-img-key filename \
--warmup 1500 \
--wd=0.1 \
--batch-size 2048 \
--lr=1e-4 \
--epochs=15 \
--workers=32 \
--model ViT-B-16 \
--pretrained OPENAI \
--logs logs/ \
--local-loss \
--grad-checkpointing \
--dataset-resampled \
--lambda_m 1.0 \
--lambda_s 0.7 \
--MKCL \
--subcaptions \
--use_disease_specific_weight \
--num_subcaptions 8 \
--save-frequency 15
- Metric: Accuracy
- Note: We use specialized prompt templates (`OPENAI_SKIN_TEMPLATES` in `src/open_clip/zero_shot_metadata.py:120`) that are optimized for dermatological contexts, providing diverse phrasings to improve robustness across different linguistic expressions of medical concepts.
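For intuition, here is a minimal sketch of the template-ensembling pattern used in CLIP-style zero-shot evaluation: each class is embedded once per template and the resulting text features are averaged. The template strings below are illustrative placeholders, not the actual `OPENAI_SKIN_TEMPLATES` entries:

```python
import torch
import open_clip

model, _, _ = open_clip.create_model_and_transforms('hf-hub:xieji-x/MAKE')
tokenizer = open_clip.get_tokenizer('hf-hub:xieji-x/MAKE')
model.eval()

# Illustrative placeholders; the real templates live in
# src/open_clip/zero_shot_metadata.py (OPENAI_SKIN_TEMPLATES).
templates = [
    'a clinical photo of {}.',
    'a dermoscopy image of {}.',
    'skin affected by {}.',
]

def class_embedding(classname: str) -> torch.Tensor:
    """Embed one class by averaging its text features over all templates."""
    texts = tokenizer([t.format(classname) for t in templates])
    with torch.no_grad():
        feats = model.encode_text(texts)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    mean = feats.mean(dim=0)
    return mean / mean.norm()  # renormalize the averaged embedding
```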
# Zero-shot skin disease classification on the four downstream datasets:
#   --batch-size: batch size for inference
#   --zeroshot-eval1..4: evaluation CSVs for the PAD, F17K, SNU, and SD-128 datasets
#   --csv-label-key: column name for class labels in the CSVs
#   --csv-img-key: column name for image filenames in the CSVs
python src/test.py \
--val-data="" \
--dataset-type "csv" \
--batch-size 2048 \
--zeroshot-eval1=data/PAD/MAKE_PAD.csv \
--zeroshot-eval2=data/F17K/MAKE_F17K.csv \
--zeroshot-eval3=data/SNU/MAKE_SNU.csv \
--zeroshot-eval4=data/SD-128/MAKE_SD-128.csv \
--csv-label-key label \
--csv-img-key filename \
--model 'hf-hub:xieji-x/MAKE'  # MAKE checkpoint from the Hugging Face Hub
- Metric: AUROC
# Clinical concept annotation (SkinCon dataset):
#   --data_dir: directory containing the clinical images
#   --batch_size: batch size for processing images
#   --concept_list: clinical concept names (32 concepts)
#   --concept_terms_json: JSON mapping each concept to its synonym terms
python concept_annotation/automatic_concept_annotation.py \
--model_api open_clip_hf-hub:xieji-x/MAKE \
--data_dir "data/skincon" \
--batch_size 32 \
--concept_list "data/skincon/concept_list.txt" \
--concept_terms_json "concept_annotation/term_lists/ConceptTerms.json"
# Dermoscopic concept annotation (Derm7pt dataset):
#   --data_dir: directory containing the dermoscopic images
#   --concept_list: dermoscopic concept names (7-point checklist)
python concept_annotation/automatic_concept_annotation.py \
--model_api open_clip_hf-hub:xieji-x/MAKE \
--data_dir "data/derm7pt" \
--batch_size 32 \
--concept_list "data/derm7pt/concept_list.txt" \
--concept_terms_json "concept_annotation/term_lists/ConceptTerms.json"
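Conceptually, annotation scores each image against every concept's synonym list. The sketch below shows one common way to aggregate such scores (max similarity over synonyms); it is an illustration under an assumed JSON shape, not the repository's exact implementation:

```python
import json
import torch

# Assumed shape for illustration: {"concept": ["synonym 1", "synonym 2", ...], ...}
with open('concept_annotation/term_lists/ConceptTerms.json') as f:
    concept_terms = json.load(f)

def concept_scores(image_feat: torch.Tensor, term_feats: dict) -> dict:
    """Score one image against every concept.

    image_feat: (D,) L2-normalized image embedding.
    term_feats: concept name -> (T, D) L2-normalized embeddings of its synonyms.
    A concept's score is the max cosine similarity over its synonym terms.
    """
    return {concept: (feats @ image_feat).max().item()
            for concept, feats in term_feats.items()}
```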
- Metrics: Recall@10, Recall@50, Recall@100
# Cross-modal retrieval on the SkinCap dataset:
#   --val-data: SkinCap dataset metadata CSV
#   --csv-img-key: column name for image filenames
#   --csv-caption-key: column containing the English captions
python src/main.py \
--val-data="data/skin_cap/skin_cap_meta.csv" \
--dataset-type "csv" \
--batch-size=2048 \
--csv-img-key filename \
--csv-caption-key 'caption_zh_polish_en' \
--model 'hf-hub:xieji-x/MAKE'  # MAKE checkpoint from the Hugging Face Hub
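For reference, a minimal sketch of how Recall@K is typically computed for image-to-text retrieval, assuming L2-normalized features where row i of each modality forms a matched image-caption pair:

```python
import torch

def recall_at_k(image_feats: torch.Tensor, text_feats: torch.Tensor, k: int) -> float:
    """Fraction of images whose paired caption ranks within the top-k matches."""
    sims = image_feats @ text_feats.T                  # (N, N) cosine similarities
    topk = sims.topk(k, dim=1).indices                 # top-k caption indices per image
    targets = torch.arange(sims.size(0)).unsqueeze(1)  # ground-truth caption index per image
    return (topk == targets).any(dim=1).float().mean().item()
```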
Please refer to `prompt.md` for more details.
If you find our work useful in your research, please consider citing our papers:
@misc{yan2025makemultiaspectknowledgeenhancedvisionlanguage,
title={MAKE: Multi-Aspect Knowledge-Enhanced Vision-Language Pretraining for Zero-shot Dermatological Assessment},
author={Siyuan Yan and Xieji Li and Ming Hu and Yiwen Jiang and Zhen Yu and Zongyuan Ge},
year={2025},
eprint={2505.09372},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2505.09372},
}
@misc{yan2025derm1mmillionscalevisionlanguagedataset,
title={Derm1M: A Million-scale Vision-Language Dataset Aligned with Clinical Ontology Knowledge for Dermatology},
author={Siyuan Yan and Ming Hu and Yiwen Jiang and Xieji Li and Hao Fei and Philipp Tschandl and Harald Kittler and Zongyuan Ge},
year={2025},
eprint={2503.14911},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2503.14911},
}