This repository contains the code and data for the paper "ChEmREF: Evaluating Language Model Readiness for Chemical Emergency Response" by Risha Surana, Qinyuan Ye, and Swabha Swayamdipta from the University of Southern California.
Figure 1-2: Overview of the ChEmREF evaluation framework showing the three main tasks: (1) Chemical Information Representation, (2) Emergency Response Generation, and (3) Domain Knowledge Question Answering.
Emergency responders managing hazardous material (HazMat) incidents face critical, time-sensitive decisions, manually navigating extensive chemical guidelines. We investigate whether today's language models can assist responders by rapidly and reliably understanding critical information, identifying hazards, and recommending appropriate actions. We introduce the Chemical Emergency Response Evaluation Framework (ChEmREF), a new benchmark comprising 1,035 HazMat scenarios from the Emergency Response Guidebook and the PubChem Database. ChEmREF is organized into three tasks: (1) chemical information representation, translating between structured and unstructured forms (e.g., converting "C₂H₆O" to "ethanol"), (2) emergency response generation (e.g., recommending appropriate evacuation distances), and (3) domain knowledge question answering drawn from chemical safety and certification exams. Our best automated models achieved an exact match of 68.0% on unstructured HazMat chemical representation tasks, an LLM judge score of 52.7% on incident response recommendations, and a multiple-choice accuracy of 63.9% on HazMat examinations. These findings suggest that while language models show potential to assist emergency responders in various tasks, they require careful human oversight due to their current limitations.
- Python 3.8 or higher
- CUDA-compatible GPU (optional, for faster inference)
- Clone this repository:
git clone https://github.com/rishasurana/ChEmREF.git
cd ChEmREF

- Create a virtual environment:
python -m venv chemref_env
source chemref_env/bin/activate  # On Windows: chemref_env\Scripts\activate

- Install required dependencies:
pip install -r requirements.txt

Create a .env file in the root directory with your API keys:
OPENAI_API_KEY=your_openai_api_key
HUGGINGFACE_TOKEN=your_huggingface_token

The ChEmREF benchmark consists of 1,035 HazMat scenarios organized into three main evaluation tasks:
- Structured to Unstructured: Converting chemical formulas/IUPAC names to common names (e.g., "C₂H₆O" → "ethanol")
- Unstructured to Structured: Converting common names to chemical formulas (e.g., "ethanol" → "C₂H₆O")
- Data Source: data/task1_representation/hazmat_1035.csv
- Evaluation: Exact match accuracy on bidirectional conversions
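The scoring for this task is handled by the scripts under scripts/evaluation/. Purely as an illustration of the exact-match metric, here is a minimal sketch on hypothetical reference/prediction pairs; the normalization details and column names in the actual scripts may differ.

```python
import pandas as pd

def exact_match(prediction: str, reference: str) -> bool:
    """Case- and whitespace-insensitive exact match."""
    return prediction.strip().lower() == reference.strip().lower()

# Toy rows standing in for hazmat_1035.csv; the real column names may differ.
rows = pd.DataFrame({
    "reference":  ["ethanol", "ammonia"],
    "prediction": ["Ethanol", "nitrogen trihydride"],
})
accuracy = sum(
    exact_match(p, r) for p, r in zip(rows["prediction"], rows["reference"])
) / len(rows)
print(f"Exact match: {accuracy:.1%}")  # 50.0% on this toy pair
```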
- Evacuation distance recommendations
- Safety protocol suggestions
- Incident-specific response procedures
- Fire/explosion hazard management
- Health hazard assessment
- Protective equipment recommendations
- Data Source: data/task2_emergency_response/hazmat_100.csv
- Evaluation: LLM judge scoring for response quality and safety
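The judge itself is implemented in scripts/run_scripts/gpt4-llm-judge/gpt4_judge_test.py. The sketch below only illustrates the general pattern of asking GPT-4 to grade a generated response against ERG reference guidance; the prompt wording, scoring scale, and function name are assumptions, not the repository's implementation.

```python
import os
from openai import OpenAI  # openai>=1.0.0 client style

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def judge_response(chemical: str, generated: str, reference: str) -> str:
    """Ask GPT-4 to rate a generated emergency response against ERG reference guidance."""
    prompt = (
        f"Chemical: {chemical}\n"
        f"Reference guidance (ERG): {reference}\n"
        f"Model response: {generated}\n"
        "On a scale of 0-100, how safe and complete is the model response "
        "relative to the reference? Reply with the number only."
    )
    completion = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return completion.choices[0].message.content.strip()
```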
- Commercial Driver's License (CDL) HazMat certification exams
- Chemical safety certification exam questions
- HazMat handling and transportation procedures
- Emergency response protocols
- Data Source: data/task3_domain_qa/hazmat_cdl_mc.csv, hazmat_quizzes_combined.csv
- Evaluation: Multiple-choice accuracy
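As a rough sketch of how a Task 3 item can be rendered as an A/B/C/D prompt and scored by letter match (the helper names and example question below are hypothetical, not taken from the dataset):

```python
from typing import List

def format_mc_prompt(question: str, options: List[str]) -> str:
    """Render a multiple-choice question as a lettered prompt."""
    letters = "ABCD"
    lines = [question] + [f"{letters[i]}. {opt}" for i, opt in enumerate(options)]
    lines.append("Answer with a single letter.")
    return "\n".join(lines)

def is_correct(model_answer: str, answer_key: str) -> bool:
    """Multiple-choice accuracy reduces to a letter match."""
    return model_answer.strip().upper().startswith(answer_key.strip().upper())

prompt = format_mc_prompt(
    "Which hazard class covers flammable liquids?",
    ["Class 1", "Class 3", "Class 5", "Class 7"],
)
print(is_correct("B", "B"))  # True
```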
- Emergency Response Guidebook (ERG): Complete 2024 ERG data in JSON and CSV formats
- PubChem Integration: Chemical property data from PubChem Database
- Raw Processing Files: Intermediate datasets for reproducibility
For detailed information about all datasets, see DATA_OVERVIEW.md.
ChEmREF/
├── data/                                      # Dataset files
│   ├── task1_representation/                  # Information representation tasks
│   │   └── hazmat_1035.csv                    # Main chemical representation dataset (1,035 chemicals)
│   ├── task2_emergency_response/              # Emergency response generation
│   │   └── hazmat_100.csv                     # Emergency response scenarios (100 chemicals)
│   ├── task3_domain_qa/                       # Domain knowledge Q&A
│   │   ├── hazmat_cdl_mc.csv                  # Commercial Driver's License exam questions
│   │   ├── hazmat_quizzes_combined.csv        # Combined HazMat certification quizzes
│   │   ├── quiz_reference.csv                 # Reference quiz data
│   │   └── Hazmat_Awareness_Practice_Test.csv # Additional practice test questions
│   └── build/                                 # Data processing and source files
│       ├── build-ERG/                         # Emergency Response Guidebook processing
│       │   ├── ERG_2024_Guide_Materials.json  # Structured ERG data
│       │   ├── erg_table_data_full.csv        # Complete ERG table data
│       │   └── [ERG Excel files]              # Raw ERG data by section (Yellow, Blue, Orange, Green)
│       ├── build-hazmat-exams/                # HazMat exam data processing
│       │   ├── hazmat-cdl/                    # CDL-specific exam data
│       │   │   ├── cdl.ipynb                  # CDL data processing notebook
│       │   │   ├── hazmat_cdl_mc.csv          # CDL multiple choice questions
│       │   │   └── hazmat_100s.csv            # CDL practice questions (100-item sets)
│       │   ├── hazmat_quizzes_proprofs/       # ProProfs quiz data
│       │   ├── build.ipynb                    # Main exam data processing notebook
│       │   └── hazmat_quizzes_combined.csv    # All quiz data combined
│       ├── build_erg.ipynb                    # ERG data processing notebook
│       └── build_hazmat_data.ipynb            # HazMat data compilation notebook
├── src/                                       # Source code
│   └── run_evaluation.py                      # Main evaluation script
├── scripts/                                   # Execution and processing scripts
│   ├── evaluation/                            # Model evaluation scripts
│   │   ├── run_t1.py                          # Task 1: Chemical representation evaluation (all models)
│   │   ├── run_t1_1.py                        # Task 1 subtask: IUPAC to common name
│   │   ├── run_t1_2.py                        # Task 1 subtask: Formula to common name
│   │   ├── run_t1_gpt4.py                     # Task 1 evaluation with GPT-4
│   │   ├── run_t1_un_gpt4.py                  # Task 1 unstructured evaluation with GPT-4
│   │   ├── run_t2_gpt4.py                     # Task 2: Emergency response with GPT-4
│   │   ├── run_t3.py                          # Task 3: Domain knowledge evaluation (all models)
│   │   └── run_t3_gpt4.py                     # Task 3 evaluation with GPT-4
│   └── run_scripts/                           # Shell scripts and supporting files
│       ├── gpt4-llm-judge/                    # GPT-4 judge evaluation scripts
│       │   ├── gpt4_judge_test.py             # LLM judge implementation
│       │   └── hazmat_100.csv                 # Test data for judge evaluation
│       └── [shell scripts]                    # Batch execution scripts (.sh files)
├── results/                                   # Experimental results
│   ├── t1_evaluations/                        # Task 1 evaluation results
│   ├── t2_evaluations/                        # Task 2 evaluation results
│   ├── t3_evaluations/                        # Task 3 evaluation results
│   └── outputs_final/                         # Final model outputs and comparisons
├── configs/                                   # Configuration files
│   └── evaluation.yaml                        # Evaluation settings and model configurations
├── graphical-abstract.pdf                     # Framework overview figure
├── requirements.txt                           # Python dependencies
├── .env.template                              # Template for environment variables
├── DATA_OVERVIEW.md                           # Detailed dataset documentation
└── README.md                                  # This file
# Run all tasks with GPT-4 (recommended)
python src/run_evaluation.py --task all --model gpt-4
# Run specific task
python src/run_evaluation.py --task task1 --model gpt-4
python src/run_evaluation.py --task task2 --model gpt-4
python src/run_evaluation.py --task task3 --model gpt-4
# Run with open-source models
python src/run_evaluation.py --task task1 --model phi4
python src/run_evaluation.py --task task1 --model chemllm-7B

# Run chemical name/formula conversion evaluation
python scripts/evaluation/run_t1.py
# Run with GPT-4 specifically
python scripts/evaluation/run_t1_gpt4.py
# Run bidirectional conversion (structured ↔ unstructured)
python scripts/evaluation/run_t1_1.py # IUPAC to Common Name
python scripts/evaluation/run_t1_2.py # Formula to Common Name

# Run emergency response evaluation with GPT-4
python scripts/evaluation/run_t2_gpt4.py
# Generate responses for specific hazard scenarios
python scripts/evaluation/run_t2_gpt4.py --scenario_type "Fire or Explosion Hazard"

# Run HazMat certification exam evaluation
python scripts/evaluation/run_t3.py
# Run with GPT-4 specifically
python scripts/evaluation/run_t3_gpt4.py

To recreate the datasets from raw sources:
# Process Emergency Response Guidebook data
jupyter notebook data/build/build_erg.ipynb
# Build comprehensive HazMat dataset
jupyter notebook data/build/build_hazmat_data.ipynb
# Process HazMat exam and certification data
jupyter notebook data/build/build-hazmat-exams/build.ipynb
# Process CDL-specific exam data
jupyter notebook data/build/build-hazmat-exams/hazmat-cdl/cdl.ipynb

- Copy the environment template:

cp .env.template .env

- Edit .env with your API keys:
OPENAI_API_KEY=your_openai_api_key_here
HUGGINGFACE_TOKEN=your_huggingface_token_here

- torch>=1.13.0 - PyTorch for model implementations
- transformers>=4.25.0 - Hugging Face Transformers library
- vllm>=0.2.0 - vLLM for efficient inference with large models
- openai>=1.0.0 - OpenAI API client for GPT models
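With the .env file populated, the keys can be loaded via python-dotenv before any client is constructed. The snippet below is a minimal sketch of this pattern, not a transcript of the repository's scripts:

```python
import os
from dotenv import load_dotenv  # python-dotenv
from openai import OpenAI

load_dotenv()  # reads OPENAI_API_KEY and HUGGINGFACE_TOKEN from the .env file

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
hf_token = os.getenv("HUGGINGFACE_TOKEN")  # used for gated Hugging Face model downloads
```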
The codebase supports evaluation with:
- OpenAI Models: GPT-4, GPT-3.5-turbo
- Open Source Models:
  - microsoft/phi-4 - Microsoft Phi-4 model
  - AI4Chem/ChemLLM-7B-Chat - Chemistry-specialized language model
  - meta-llama/Llama-3.1-70B-Instruct - Llama 3.1 70B
  - m42-health/Llama3-Med42-70B - Medical/health specialized model
  - microsoft/Phi-3-mini-4k-instruct - Phi-3 mini model
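The open-source models are served locally through vLLM. Below is a minimal loading sketch using the smallest listed model; the sampling settings here are illustrative, and the settings actually used for evaluation live in configs/evaluation.yaml.

```python
from vllm import LLM, SamplingParams

# Smallest listed model chosen for the example; any of the IDs above can be substituted.
llm = LLM(model="microsoft/Phi-3-mini-4k-instruct", trust_remote_code=True)
params = SamplingParams(temperature=0.0, max_tokens=64)

outputs = llm.generate(["What is the common name for C2H6O?"], params)
print(outputs[0].outputs[0].text)
```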
- pandas>=1.5.0 - Data manipulation
- numpy>=1.21.0 - Numerical computing
- datasets>=2.8.0 - Hugging Face Datasets
- scikit-learn>=1.1.0 - Metrics and evaluation
- rouge-score>=0.1.2 - ROUGE metrics
- bleurt>=0.0.2 - BLEURT evaluation
- tqdm>=4.64.0 - Progress bars
- pyyaml>=6.0 - YAML configuration
- python-dotenv>=0.19.0 - Environment variables
See requirements.txt for complete dependency list with exact versions.
Our evaluation reveals that current language models show promise but require significant improvements for reliable emergency response assistance:
- Information Representation: 68.0% exact match accuracy on chemical name/formula conversion (GPT-4o)
- Emergency Response: 52.7% LLM judge score on incident response recommendations (GPT-4o)
- Domain Knowledge: 63.9% accuracy on HazMat certification exam questions (GPT-4o)
| Model | Task 1: Translation (Unstructured EM %) | Task 2: Emergency Response (LLM Judge %) | Task 3: Exam (Acc %) | Overall Avg (%) |
|---|---|---|---|---|
| GPT-4o | 68.0 | 52.7 | 63.9 | 71.4 |
| Llama-3.1 (70B) | 67.1 | 50.7 | 60.0 | 67.2 |
| Med42 (70B) | 61.9 | 50.8 | 58.0 | 68.0 |
| Phi-4 (14B) | 48.7 | 25.2 | 60.0 | 62.2 |
| ChemLLM (7B) | 56.8 | 46.3 | 47.3 | 62.0 |
| Phi-3 (3.8B) | 60.0 | 42.7 | 49.0 | 59.5 |
- Best Performance: GPT-4o with 68.0% exact match on unstructured translation
- Key Challenge: Converting between chemical representations accurately
- Model Insights: Larger models generally perform better, with GPT-4o leading
- Evaluation Method: GPT-4 as judge for response quality assessment
- Best Performance: GPT-4o with 52.7% judge approval rating
- Challenge: All models struggle with comprehensive emergency protocols
- Dataset: HazMat certification exam questions
- Best Performance: GPT-4o with 63.9% accuracy
- Observation: Performance varies significantly across model sizes
- Challenge Areas: Specific regulatory knowledge, technical procedures
Complete evaluation results are available in the results/ directory:
- results/t1_evaluations/ - Task 1 detailed results and metrics
- results/t2_evaluations/ - Task 2 response quality assessments
- results/t3_evaluations/ - Task 3 multiple choice accuracy
- results/outputs_final/ - Raw model outputs for analysis
Key result files:
- combined_model_scores_summary.xlsx - Comprehensive model comparison
- exact_match_results_[model].csv - Detailed Task 1 results per model
- translation_eval_hazmat.csv - Task 1 evaluation metrics
If you use ChEmREF in your research, please cite our paper:
@misc{surana2025chemref,
title={ChEmREF: Evaluating Language Model Readiness for Chemical Emergency Response},
author={Surana, Risha and Ye, Qinyuan and Swayamdipta, Swabha},
year={2025},
eprint={arXiv:XXXX.XXXXX},
archivePrefix={arXiv},
primaryClass={cs.CL}
}

- Risha Surana - [email protected]
- Qinyuan Ye - [email protected]
- Swabha Swayamdipta - [email protected]
This research is intended to assist emergency responders and should not be used as a replacement for professional training, official emergency protocols, or human expertise. Always consult certified emergency response professionals for actual incident management.
This project is licensed under the MIT License - see the LICENSE file for details.
Note: This repository contains research code. For production emergency response systems, please ensure proper validation, testing, and integration with official protocols.
