ChEmREF: Evaluating Language Model Readiness for Chemical Emergency Response


This repository contains the code and data for the paper "ChEmREF: Evaluating Language Model Readiness for Chemical Emergency Response" by Risha Surana, Qinyuan Ye, and Swabha Swayamdipta from the University of Southern California.

Framework Overview

[Figure: ChEmREF framework overview (graphical-abstract.pdf)]

Figure 1-2: Overview of the ChEmREF evaluation framework showing the three main tasks: (1) Chemical Information Representation, (2) Emergency Response Generation, and (3) Domain Knowledge Question Answering.

Abstract

Emergency responders managing hazardous material (HazMat) incidents face critical, time-sensitive decisions while manually navigating extensive chemical guidelines. We investigate whether today's language models can assist responders by rapidly and reliably understanding critical information, identifying hazards, and recommending appropriate action. We introduce the Chemical Emergency Response Evaluation Framework (ChEmREF), a new benchmark comprising 1,035 HazMat scenarios from the Emergency Response Guidebook and the PubChem Database. ChEmREF is organized into three tasks: (1) chemical information representation, translating between structured and unstructured forms (e.g., converting "C₂H₅OH" to "ethanol"), (2) emergency response generation (e.g., recommending appropriate evacuation distances), and (3) domain knowledge question answering from chemical safety and certification exams. Our best automated models achieved an exact-match score of 68.0% on unstructured HazMat chemical representation tasks, an LLM judge score of 52.7% on incident response recommendations, and a multiple-choice accuracy of 63.9% on HazMat examinations. These findings suggest that while language models show potential to assist emergency responders in various tasks, they require careful human oversight due to their current limitations.

🚀 Quick Start

Prerequisites

  • Python 3.8 or higher
  • CUDA-compatible GPU (optional, for faster inference)

Installation

  1. Clone this repository:
git clone https://github.com/rishasurana/ChEmREF.git
cd ChEmREF
  2. Create a virtual environment:
python -m venv chemref_env
source chemref_env/bin/activate  # On Windows: chemref_env\Scripts\activate
  3. Install required dependencies:
pip install -r requirements.txt

Environment Setup

Create a .env file in the root directory with your API keys; a short loading sketch follows the keys below:

OPENAI_API_KEY=your_openai_api_key
HUGGINGFACE_TOKEN=your_huggingface_token
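
A minimal sketch of how a script might load these keys at startup, assuming python-dotenv (listed under Dependencies) and the openai client; this is illustrative, not the repository's exact loading code:

```python
# Illustrative only: load .env keys with python-dotenv and hand the
# OpenAI key to the client. Variable names match the .env entries above.
import os

from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()  # reads .env from the current working directory

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
hf_token = os.environ.get("HUGGINGFACE_TOKEN")  # for gated Hugging Face models
```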

📊 Dataset

The ChEmREF benchmark consists of 1,035 HazMat scenarios organized into three main evaluation tasks:

Task 1: Information Representation (1,035 chemicals)

  • Structured to Unstructured: Converting chemical formulas/IUPAC names to common names (e.g., "C₂H₅OH" → "ethanol")
  • Unstructured to Structured: Converting common names to chemical formulas (e.g., "ethanol" → "C₂H₅OH")
  • Data Source: data/task1_representation/hazmat_1035.csv
  • Evaluation: Exact match accuracy on bidirectional conversions (a minimal scoring sketch follows this list)
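
As a rough illustration of the metric, a normalization-insensitive exact match might look like the sketch below; the repository's scoring code may normalize differently.

```python
# Hedged sketch of Task 1 exact-match scoring; the actual normalization
# used by the evaluation scripts may differ.
def exact_match(pred: str, gold: str) -> bool:
    """Case- and surrounding-whitespace-insensitive string equality."""
    return pred.strip().lower() == gold.strip().lower()

# Hypothetical model outputs and gold labels, for illustration only.
predictions = ["Ethanol", "acetone"]
references = ["ethanol", "2-propanone"]

accuracy = sum(exact_match(p, g) for p, g in zip(predictions, references)) / len(references)
print(f"Exact match: {accuracy:.1%}")  # -> Exact match: 50.0%
```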

Task 2: Emergency Response Generation (100 chemical scenarios)

  • Evacuation distance recommendations
  • Safety protocol suggestions
  • Incident-specific response procedures
  • Fire/explosion hazard management
  • Health hazard assessment
  • Protective equipment recommendations
  • Data Source: data/task2_emergency_response/hazmat_100.csv
  • Evaluation: LLM judge scoring for response quality and safety (a hedged judge-call sketch follows this list)
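
In the same spirit as scripts/run_scripts/gpt4-llm-judge/gpt4_judge_test.py, an LLM-judge call might look like the sketch below; the prompt wording and score parsing are assumptions, not that script's actual implementation.

```python
# Hypothetical LLM-judge sketch: ask GPT-4 to grade a response's quality
# and safety on a 0-100 scale. Prompt and parsing are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def judge_response(scenario: str, recommendation: str) -> int:
    prompt = (
        "You are grading an emergency response recommendation for a HazMat incident.\n\n"
        f"Scenario: {scenario}\n\n"
        f"Recommendation: {recommendation}\n\n"
        "Rate the recommendation's quality and safety from 0 to 100. "
        "Reply with the integer only."
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return int(resp.choices[0].message.content.strip())
```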

Task 3: Domain Knowledge Question Answering (617+ questions)

  • Commercial Driver's License (CDL) HazMat certification exams
  • Chemical safety certification exam questions
  • HazMat handling and transportation procedures
  • Emergency response protocols
  • Data Source: data/task3_domain_qa/hazmat_cdl_mc.csv, hazmat_quizzes_combined.csv
  • Evaluation: Multiple-choice accuracy (a short scoring sketch follows this list)
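
Scoring reduces to comparing the parsed model choice against the answer key. A small pandas sketch, with hypothetical column names rather than the CSVs' actual schema:

```python
# Hypothetical Task 3 accuracy computation; "answer" and "model_choice"
# are illustrative column names, not the datasets' real headers.
import pandas as pd

df = pd.DataFrame(
    {
        "answer": ["A", "C", "B"],        # gold multiple-choice labels
        "model_choice": ["A", "C", "D"],  # letters parsed from model output
    }
)

accuracy = (df["answer"] == df["model_choice"]).mean()
print(f"Multiple-choice accuracy: {accuracy:.1%}")  # -> 66.7%
```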

Supporting Data

  • Emergency Response Guidebook (ERG): Complete 2024 ERG data in JSON and CSV formats
  • PubChem Integration: Chemical property data from PubChem Database
  • Raw Processing Files: Intermediate datasets for reproducibility

For detailed information about all datasets, see DATA_OVERVIEW.md.

πŸ—‚οΈ Repository Structure

ChEmREF/
├── data/                              # Dataset files
│   ├── task1_representation/          # Information representation tasks
│   │   └── hazmat_1035.csv            # Main chemical representation dataset (1,035 chemicals)
│   ├── task2_emergency_response/      # Emergency response generation
│   │   └── hazmat_100.csv             # Emergency response scenarios (100 chemicals)
│   ├── task3_domain_qa/               # Domain knowledge Q&A
│   │   ├── hazmat_cdl_mc.csv          # Commercial Driver's License exam questions
│   │   ├── hazmat_quizzes_combined.csv # Combined HazMat certification quizzes
│   │   ├── quiz_reference.csv         # Reference quiz data
│   │   └── Hazmat_Awareness_Practice_Test.csv # Additional practice test questions
│   └── build/                         # Data processing and source files
│       ├── build-ERG/                 # Emergency Response Guidebook processing
│       │   ├── ERG_2024_Guide_Materials.json # Structured ERG data
│       │   ├── erg_table_data_full.csv # Complete ERG table data
│       │   └── [ERG Excel files]      # Raw ERG data by section (Yellow, Blue, Orange, Green)
│       ├── build-hazmat-exams/        # HazMat exam data processing
│       │   ├── hazmat-cdl/            # CDL-specific exam data
│       │   │   ├── cdl.ipynb          # CDL data processing notebook
│       │   │   ├── hazmat_cdl_mc.csv  # CDL multiple choice questions
│       │   │   └── hazmat_100s.csv    # CDL practice questions (100-item sets)
│       │   ├── hazmat_quizzes_proprofs/ # ProProfs quiz data
│       │   ├── build.ipynb            # Main exam data processing notebook
│       │   └── hazmat_quizzes_combined.csv # All quiz data combined
│       ├── build_erg.ipynb            # ERG data processing notebook
│       └── build_hazmat_data.ipynb    # HazMat data compilation notebook
├── src/                               # Source code
│   └── run_evaluation.py              # Main evaluation script
├── scripts/                           # Execution and processing scripts
│   ├── evaluation/                    # Model evaluation scripts
│   │   ├── run_t1.py                  # Task 1: Chemical representation evaluation (all models)
│   │   ├── run_t1_1.py                # Task 1 subtask: IUPAC to common name
│   │   ├── run_t1_2.py                # Task 1 subtask: Formula to common name
│   │   ├── run_t1_gpt4.py             # Task 1 evaluation with GPT-4
│   │   ├── run_t1_un_gpt4.py          # Task 1 unstructured evaluation with GPT-4
│   │   ├── run_t2_gpt4.py             # Task 2: Emergency response with GPT-4
│   │   ├── run_t3.py                  # Task 3: Domain knowledge evaluation (all models)
│   │   └── run_t3_gpt4.py             # Task 3 evaluation with GPT-4
│   └── run_scripts/                   # Shell scripts and supporting files
│       ├── gpt4-llm-judge/            # GPT-4 judge evaluation scripts
│       │   ├── gpt4_judge_test.py     # LLM judge implementation
│       │   └── hazmat_100.csv         # Test data for judge evaluation
│       └── [shell scripts]            # Batch execution scripts (.sh files)
├── results/                           # Experimental results
│   ├── t1_evaluations/                # Task 1 evaluation results
│   ├── t2_evaluations/                # Task 2 evaluation results
│   ├── t3_evaluations/                # Task 3 evaluation results
│   └── outputs_final/                 # Final model outputs and comparisons
├── configs/                           # Configuration files
│   └── evaluation.yaml                # Evaluation settings and model configurations
├── graphical-abstract.pdf             # Framework overview figure
├── requirements.txt                   # Python dependencies
├── .env.template                      # Template for environment variables
├── DATA_OVERVIEW.md                   # Detailed dataset documentation
└── README.md                          # This file

🔧 Usage

Quick Start Evaluation

# Run all tasks with GPT-4 (recommended)
python src/run_evaluation.py --task all --model gpt-4

# Run specific task
python src/run_evaluation.py --task task1 --model gpt-4
python src/run_evaluation.py --task task2 --model gpt-4  
python src/run_evaluation.py --task task3 --model gpt-4

# Run with open-source models
python src/run_evaluation.py --task task1 --model phi4
python src/run_evaluation.py --task task1 --model chemllm-7B

Individual Task Evaluation

Task 1: Chemical Information Representation

# Run chemical name/formula conversion evaluation
python scripts/evaluation/run_t1.py

# Run with GPT-4 specifically
python scripts/evaluation/run_t1_gpt4.py

# Run bidirectional conversion (structured ↔ unstructured)
python scripts/evaluation/run_t1_1.py  # IUPAC to Common Name
python scripts/evaluation/run_t1_2.py  # Formula to Common Name

Task 2: Emergency Response Generation

# Run emergency response evaluation with GPT-4
python scripts/evaluation/run_t2_gpt4.py

# Generate responses for specific hazard scenarios
python scripts/evaluation/run_t2_gpt4.py --scenario_type "Fire or Explosion Hazard"

Task 3: Domain Knowledge Q&A

# Run HazMat certification exam evaluation
python scripts/evaluation/run_t3.py

# Run with GPT-4 specifically
python scripts/evaluation/run_t3_gpt4.py

Data Processing

To recreate the datasets from raw sources:

# Process Emergency Response Guidebook data
jupyter notebook data/build/build_erg.ipynb

# Build comprehensive HazMat dataset
jupyter notebook data/build/build_hazmat_data.ipynb

# Process HazMat exam and certification data
jupyter notebook data/build/build-hazmat-exams/build.ipynb

# Process CDL-specific exam data
jupyter notebook data/build/build-hazmat-exams/hazmat-cdl/cdl.ipynb

Environment Setup

  1. Copy the environment template:
cp .env.template .env
  2. Edit .env with your API keys:
OPENAI_API_KEY=your_openai_api_key_here
HUGGINGFACE_TOKEN=your_huggingface_token_here

📋 Dependencies

Core Dependencies

  • torch>=1.13.0 - PyTorch for model implementations
  • transformers>=4.25.0 - Hugging Face Transformers library
  • vllm>=0.2.0 - vLLM for efficient inference with large models
  • openai>=1.0.0 - OpenAI API client for GPT models

Model Support

The codebase supports evaluation with the models below (a minimal vLLM loading sketch for the open-source models follows this list):

  • OpenAI Models: GPT-4, GPT-3.5-turbo
  • Open Source Models:
    • microsoft/phi-4 - Microsoft Phi-4 model
    • AI4Chem/ChemLLM-7B-Chat - Chemistry-specialized language model
    • meta-llama/Llama-3.1-70B-Instruct - Llama 3.1 70B
    • m42-health/Llama3-Med42-70B - Medical/health specialized model
    • microsoft/Phi-3-mini-4k-instruct - Phi-3 mini model
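
A minimal sketch of loading one of these open-source models with vLLM, which requirements.txt lists for efficient inference; the sampling parameters and prompt are illustrative, not the repository's evaluation settings.

```python
# Illustrative vLLM usage; generation settings here are assumptions.
from vllm import LLM, SamplingParams

llm = LLM(model="microsoft/Phi-3-mini-4k-instruct")
params = SamplingParams(temperature=0.0, max_tokens=64)

outputs = llm.generate(["What is the common name for C2H5OH?"], params)
print(outputs[0].outputs[0].text)
```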

Data Processing

  • pandas>=1.5.0 - Data manipulation
  • numpy>=1.21.0 - Numerical computing
  • datasets>=2.8.0 - Hugging Face Datasets

Evaluation

  • scikit-learn>=1.1.0 - Metrics and evaluation
  • rouge-score>=0.1.2 - ROUGE metrics (usage sketch after this list)
  • bleurt>=0.0.2 - BLEURT evaluation
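
As an example of the text-overlap metrics above, a hedged rouge-score usage sketch; how the repository applies ROUGE to generated responses is an assumption here.

```python
# Illustrative rouge-score usage; reference/prediction strings are made up.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(
    "Evacuate 800 meters in all directions.",       # reference text
    "Evacuate at least 800 m in every direction.",  # model output
)
print(scores["rougeL"].fmeasure)
```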

Utilities

  • tqdm>=4.64.0 - Progress bars
  • pyyaml>=6.0 - YAML configuration
  • python-dotenv>=0.19.0 - Environment variables

See requirements.txt for complete dependency list with exact versions.

📈 Results

Our evaluation reveals that current language models show promise but require significant improvements for reliable emergency response assistance:

Key Findings

  • Information Representation: 68.0% exact match accuracy on chemical name/formula conversion (GPT-4o)
  • Emergency Response: 52.7% LLM judge score on incident response recommendations (GPT-4o)
  • Domain Knowledge: 63.9% accuracy on HazMat certification exam questions (GPT-4o)

Model Performance Comparison

| Model | Task 1: Translation (Unstructured EM %) | Task 2: Emergency Response (LLM Judge %) | Task 3: Exam (Acc %) | Overall Avg |
|---|---|---|---|---|
| GPT-4o | 68.0 | 52.7 | 63.9 | 71.4 |
| Llama-3.1 (70B) | 67.1 | 50.7 | 60.0 | 67.2 |
| Med42 (70B) | 61.9 | 50.8 | 58.0 | 68.0 |
| Phi-4 (14B) | 48.7 | 25.2 | 60.0 | 62.2 |
| ChemLLM (7B) | 56.8 | 46.3 | 47.3 | 62.0 |
| Phi-3 (3.8B) | 60.0 | 42.7 | 49.0 | 59.5 |

Detailed Results by Task

Task 1: Information Representation

  • Best Performance: GPT-4o with 68.0% exact match on unstructured translation
  • Key Challenge: Converting between chemical representations accurately
  • Model Insights: Larger models generally perform better, with GPT-4o leading

Task 2: Emergency Response Generation

  • Evaluation Method: GPT-4 as judge for response quality assessment
  • Best Performance: GPT-4o with 52.7% judge approval rating
  • Challenge: All models struggle with comprehensive emergency protocols

Task 3: Domain Knowledge Q&A

  • Dataset: HazMat certification exam questions
  • Best Performance: GPT-4o with 63.9% accuracy
  • Observation: Performance varies significantly across model sizes
  • Challenge Areas: Specific regulatory knowledge, technical procedures

Accessing Full Results

Complete evaluation results are available in the results/ directory:

  • results/t1_evaluations/ - Task 1 detailed results and metrics
  • results/t2_evaluations/ - Task 2 response quality assessments
  • results/t3_evaluations/ - Task 3 multiple choice accuracy
  • results/outputs_final/ - Raw model outputs for analysis

Key result files:

  • combined_model_scores_summary.xlsx - Comprehensive model comparison
  • exact_match_results_[model].csv - Detailed Task 1 results per model
  • translation_eval_hazmat.csv - Task 1 evaluation metrics

📄 Citation

If you use ChEmREF in your research, please cite our paper:

@misc{surana2025chemref,
  title={ChEmREF: Evaluating Language Model Readiness for Chemical Emergency Response},
  author={Surana, Risha and Ye, Qinyuan and Swayamdipta, Swabha},
  year={2025},
  eprint={arXiv:XXXX.XXXXX},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}

📧 Contact

πŸ›‘οΈ Safety Notice

This research is intended to assist emergency responders and should not be used as a replacement for professional training, official emergency protocols, or human expertise. Always consult certified emergency response professionals for actual incident management.

πŸ“ License

This project is licensed under the MIT License - see the LICENSE file for details.

Note: This repository contains research code. For production emergency response systems, please ensure proper validation, testing, and integration with official protocols.
