Official repository for paper: "Towards Unified Vision Language Models for Forest Ecological Analysis in Earth Observation" (AAAI-26 AI4ES Workshop)
We present REO-Instruct, a large-scale EO benchmark designed for both descriptive (generation) and scientific regression tasks, with a cognitively interpretable logical chain for forest ecological analysis.
Vision-Language Models (VLMs) show strong perception and reasoning abilities, but scientific regression in EO brings specific challenges:

- 🔢 **Token-based outputs vs. continuous targets**: VLMs generate discrete tokens, while we need precise continuous variables (e.g., AGB in Mg/ha).
- 🎯 **Semantic loss vs. numeric loss**: Training objectives favor fluent language and semantic coherence, not numeric accuracy.
- 🧱 **Error accumulation in multi-token numbers**: Numbers are split into multiple tokens; one wrong token can corrupt the entire value.
- 🧬 **Different feature needs for description vs. regression**: Human-readable descriptions rely on salient visual patterns, but scientific regression depends on subtle spectral/spatial signals invisible to the naked eye.

REO-Instruct is built around exactly these pain points: to test and drive regression-aware VLMs for EO.
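The multi-token error-accumulation problem can be illustrated with a minimal sketch. The digit-level tokenization below is an assumption for illustration, not any specific model's tokenizer:

```python
# Minimal sketch (assumed digit-level tokenization) of why one wrong token
# corrupts a regressed value: the ground-truth AGB "142.7" and a prediction
# that gets only the leading digit token wrong differ by roughly 800 Mg/ha.

def detokenize(tokens):
    """Join digit/character tokens back into a float value."""
    return float("".join(tokens))

truth_tokens = ["1", "4", "2", ".", "7"]   # AGB in Mg/ha, token by token
wrong_tokens = ["9", "4", "2", ".", "7"]   # a single wrong leading token

truth = detokenize(truth_tokens)  # 142.7
wrong = detokenize(wrong_tokens)  # 942.7
error = abs(wrong - truth)        # roughly 800 Mg/ha from one token
```

A semantic training loss barely penalizes this single-token slip, even though the numeric error is catastrophic.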
A central design of REO-Instruct is a logical chain that links semantic understanding and quantitative prediction in a forest ecological scenario:

**Human activity → Land-cover classification → Ecological patch counting → Above-Ground Biomass (AGB) regression**

- **Human Activity (VQA-style)**: Detects anthropogenic structures and activities (urban, agriculture, infrastructure) and their ecological impacts.
- **Land-Cover Classification**: Assigns each patch to professional land-cover classes (e.g., closed forest, cropland), based on Copernicus Global Land Cover.
- **Ecological Patch Counting (Regression)**: Estimates the number of distinct ecological patches within the tile, reflecting fragmentation and biodiversity.
- **AGB Regression (Scientific Regression)**: Predicts Above-Ground Biomass in Mg/ha, linking EO signals and ecological structure to biophysical quantities.

This chain ensures that all tasks are:

- derivable from EO inputs,
- logically coupled (not arbitrary multi-tasking),
- scientifically interpretable for forest monitoring.
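One sample walking this logical chain might look as follows; the field names are a hypothetical sketch for illustration, not the dataset's actual schema:

```python
# Hypothetical record for one REO-Instruct tile following the logical chain:
# human activity -> land cover -> patch count -> AGB regression target.
from dataclasses import dataclass

@dataclass
class ReoInstructSample:
    human_activity: str    # VQA-style finding, e.g. agricultural clearing
    land_cover: str        # Copernicus-style class, e.g. "closed forest"
    patch_count: int       # distinct ecological patches (regression)
    agb_mg_per_ha: float   # Above-Ground Biomass target in Mg/ha

sample = ReoInstructSample(
    human_activity="agricultural clearing at the tile edge",
    land_cover="closed forest",
    patch_count=3,
    agb_mg_per_ha=185.4,
)
```

The point of the coupling is that each field constrains the next: a tile labeled closed forest with little human activity should plausibly have few patches and high AGB.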
To be useful for scientific VLMs, REO-Instruct follows explicit EO data construction principles:

- **Sufficient & Necessary Modalities**
  - Multispectral (Sentinel-2 L2A, 13 bands, 10 m)
  - SAR (ALOS-2 PALSAR-2, HH & HV, 25 m)
  - RGB extracted from Sentinel-2 (bands [4, 3, 2])

  These modalities jointly capture canopy structure, vegetation status, and land use.

- **Balanced, Diverse, Representative**
  - Wide coverage of land-cover types
  - Broad AGB range
  - Varied degrees of human influence
  - Different geographic/ecological regions

  → Aims at generalizable models rather than overfitting to a few ecozones.

- **Strict Spatial Alignment**
  - Multispectral, SAR, and RGB patches are co-registered.
  - Each sample corresponds to a 25 × 25 pixel patch (~250 m × 250 m), ensuring consistent observation units across sensors.

- **Scientifically Anchored Labels**
  - EO imagery is derived from the AGBD dataset (2019–2020).
  - AGB values are consistent with established biomass mapping efforts.
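The footprint arithmetic behind the alignment can be sketched as follows. The array layouts and the nearest-neighbor resampling are illustrative assumptions, not the dataset's actual preprocessing:

```python
# Sketch of the strict spatial alignment: every modality covers the same
# ~250 m x 250 m footprint. Sentinel-2 at 10 m gives 25 x 25 pixels;
# PALSAR-2 at 25 m gives 10 x 10, resampled here (crudely, nearest-
# neighbor) to the common 25 x 25 observation grid.
import numpy as np

FOOTPRINT_M = 250                    # ground footprint per sample, meters
S2_RES_M, SAR_RES_M = 10, 25         # native pixel sizes, meters

s2_px = FOOTPRINT_M // S2_RES_M      # 25 pixels per side
sar_px = FOOTPRINT_M // SAR_RES_M    # 10 pixels per side

multispectral = np.zeros((13, s2_px, s2_px))  # Sentinel-2 L2A, 13 bands
rgb = multispectral[[3, 2, 1]]                # bands 4, 3, 2 (0-indexed)
sar = np.zeros((2, sar_px, sar_px))           # HH & HV polarizations

# Upsample SAR by 3x then crop, so all modalities share one 25 x 25 grid.
sar_up = sar.repeat(3, axis=1).repeat(3, axis=2)[:, :s2_px, :s2_px]
```

Co-registration like this is what lets a model fuse canopy structure (SAR) with spectral signals (multispectral) pixel for pixel.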
**Dataset scale**

- 1.6M image–text pairs for training
- ~20K pairs for validation
- ~36K pairs for testing

**Modalities per sample**

- ✅ RGB
- ✅ Multispectral (Sentinel-2)
- ✅ SAR (ALOS-2 PALSAR-2, HH & HV)
REO-Instruct’s text annotations are not generic captions; they are domain-aware, task-aware, and follow clear rules to reduce noise in regression:

- **Task-aligned & derivable from EO data**: Each question/description is designed so that the answer can be inferred from the EO patch and associated data.
- **Logically coupled across tasks**: Text for human activity, land cover, patch counting, and AGB is mutually consistent and respects the logical chain.
- **Clear, concise, professionally structured**: Slight textual perturbations can drastically change regression outputs in LLMs; we therefore avoid noisy, redundant, or stylistically ambiguous descriptions.
- **Land-Cover Descriptions**
  - Based on Copernicus Global Land Cover classes
  - Over 20 categories; distribution documented in the paper (Figure 3)
- **Ecological Patch Counting**
  - Patch = a contiguous land-cover unit with distinct ecological traits
  - Counts describe fragmentation and habitat complexity
- **VQA for Human Activity Monitoring**
  - Q&A about urban areas, roads, croplands, and their potential ecological impact
  - Supports analysis of human–environment interactions (e.g., deforestation, urban sprawl)
- **AGB Values (Regression Target)**
  - Ground-truth AGB in Mg/ha
  - Enables direct supervision for numeric regression and joint training with text
- **Model-Assisted Generation with ChatGPT-4o**
  - Carefully designed prompts control:
    - Land-cover label sets
    - Question templates (100+ templates)
    - Separation of regression-related vs. descriptive content
- **Automatic Consistency Checks**
  - Scripts cross-check annotations against trusted sources, e.g., Copernicus land cover.
  - Obvious mismatches are corrected or removed.
- **Expert Manual Review (Val/Test)**
  - Senior experts manually inspect validation and test annotations.
  - Ensures scientific correctness and consistent terminology for benchmarking.
This two-stage pipeline (automatic + expert) aims to deliver a benchmark that is both scalable and trustworthy.
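The automatic stage can be sketched as a simple label cross-check; the class set and function below are illustrative stand-ins, not the actual scripts:

```python
# Hypothetical sketch of an automatic consistency check: a generated
# land-cover label is cross-checked against the trusted Copernicus label,
# and mismatches are flagged for correction or removal.

COPERNICUS_CLASSES = {"closed forest", "open forest", "cropland", "urban"}

def check_annotation(text_label: str, copernicus_label: str):
    """Return (keep, reason) for one generated annotation."""
    if text_label not in COPERNICUS_CLASSES:
        return False, "label outside the allowed class set"
    if text_label != copernicus_label:
        return False, "mismatch with trusted Copernicus source"
    return True, "consistent"

ok = check_annotation("cropland", "cropland")       # kept
bad = check_annotation("wetland", "closed forest")  # flagged: unknown class
```

Checks like this scale to millions of pairs, while the expert review in the second stage catches the subtler scientific errors automation misses.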
REO-Instruct provides standardized evaluation splits and protocols for four representative tasks:

- **Land-Cover Classification (Generation)**: Overall Accuracy, macro-averaged Precision/Recall/F1.
- **Ecological Patch Counting (Regression)**: RMSE, MAE, R², and discretized Overall Accuracy.
- **Human Activity Monitoring (VQA)**: Answer Accuracy (%).
- **Above-Ground Biomass (AGB) Regression (Scientific Regression)**: RMSE, MAE, R².
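The shared regression metrics can be computed with plain NumPy; the values below are toy numbers for illustration, not benchmark results:

```python
# RMSE, MAE, and R^2 as used for the patch-counting and AGB regression
# tasks, implemented with plain NumPy for clarity.
import numpy as np

def regression_metrics(y_true, y_pred):
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true
    rmse = float(np.sqrt(np.mean(err ** 2)))          # root mean squared error
    mae = float(np.mean(np.abs(err)))                 # mean absolute error
    ss_res = float(np.sum(err ** 2))                  # residual sum of squares
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot                        # coefficient of determination
    return rmse, mae, r2

# Toy AGB values in Mg/ha.
rmse, mae, r2 = regression_metrics([100, 150, 200], [110, 140, 190])
print(rmse, mae, r2)  # 10.0 10.0 0.94
```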
We benchmark both domain-specific VLMs (e.g., GeoChat, LHRS-Bot) and general-purpose VLMs (e.g., LLaVA, Qwen2-VL, ChatGPT-4o), as well as classical EO models (e.g., U-Net on multispectral inputs).
- Current VLMs handle content understanding and VQA, but show low accuracy in fine-grained land-cover classification and poor numeric performance in regression tasks.
- Even with fine-tuning, AGB regression with VLMs often underperforms a dedicated U-Net on multispectral imagery.
- Many VLMs refuse or fail to answer numeric queries reliably, resulting in many “unanswerable” cases.
These results indicate that scientific VLMs for EO need regression-aware architectures and better exploitation of multimodal EO inputs, beyond straightforward instruction tuning.
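Scoring free-form VLM answers on numeric tasks requires an extraction step; the sketch below (with assumed answer formats) shows one way refusals become "unanswerable" cases:

```python
# Sketch of extracting a numeric AGB answer from free-form VLM text;
# responses with no parseable number count as "unanswerable".
import re

def parse_agb(answer: str):
    """Return the first number found in the answer, or None."""
    match = re.search(r"-?\d+(?:\.\d+)?", answer)
    return float(match.group()) if match else None

answers = [
    "The above-ground biomass is approximately 132.5 Mg/ha.",
    "I cannot determine the biomass from this image.",  # refusal
]
parsed = [parse_agb(a) for a in answers]
unanswerable = sum(p is None for p in parsed)  # counts the refusal
```

A refusal is excluded from RMSE/MAE but still reported, so models cannot improve their numeric scores simply by declining hard queries.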
📥 Download All (extraction code: 8efp)

- Training: 📥 Download (extraction code: 5vw6)
- Validation: 📥 Download (extraction code: gwjy)
- Test: 📥 Download (extraction code: y7gv)

- Training: Stage 1 · Stage 2
- Validation: 📥 Download
- Test: 📥 Download
- 🤗 Hugging Face Dataset: Coming Soon
- 🗺️ TorchGeo Integration (with help of outstanding colleague Adam J. Stewart): In Progress
We gratefully acknowledge the authors of AGBD: A Global-scale Biomass Dataset for providing high-quality biomass-related imagery data:
- Ghjulia Sialelli, Torben Peters, Jan D. Wegner, Konrad Schindler
Please also consider citing the following work if you use REO-Instruct data or biomass-related imagery:
@article{xue2025towards,
title = {Towards Unified Vision Language Models for Forest Ecological Analysis in Earth Observation},
author = {Xizhe Xue and Xiaoxiang Zhu},
journal = {Proceedings of the AAAI Conference on Artificial Intelligence, AI4ES Workshop},
year = {2025}
}
