This repository contains the code and data for the paper "ChEmREF: Evaluating Language Model Readiness for Chemical Emergency Response" by Risha Surana, Qinyuan Ye, and Swabha Swayamdipta from the University of Southern California.
Figure 1-2: Overview of the ChEmREF evaluation framework showing the three main tasks: (1) Chemical Information Representation, (2) Emergency Response Generation, and (3) Domain Knowledge Question Answering.
Emergency responders managing hazardous material (HazMat) incidents face critical, time-sensitive decisions, manually navigating extensive chemical guidelines. We investigate whether today's language models can assist responders by rapidly and reliably understanding critical information, identifying hazards, and recommending appropriate actions. We introduce the Chemical Emergency Response Evaluation Framework (ChEmREF), a new benchmark comprising 1,035 HazMat scenarios from the Emergency Response Guidebook and the PubChem Database. ChEmREF is organized into three tasks: (1) chemical information representation, translating between structured and unstructured forms (e.g., converting "C₂H₆O" to "ethanol"), (2) emergency response generation (e.g., recommending appropriate evacuation distances), and (3) domain knowledge question answering drawn from chemical safety and certification exams. Our best automated models achieved an exact match of 68.0% on unstructured HazMat chemical representation tasks, an LLM judge score of 52.7% on incident response recommendations, and a multiple-choice accuracy of 63.9% on HazMat examinations. These findings suggest that while language models show potential to assist emergency responders in various tasks, they require careful human oversight due to their current limitations.
- Python 3.8 or higher
- CUDA-compatible GPU (optional, for faster inference)
- Clone this repository:
git clone https://github.com/rishasurana/ChEmREF.git
cd ChEmREF

- Create a virtual environment:
python -m venv chemref_env
source chemref_env/bin/activate  # On Windows: chemref_env\Scripts\activate

- Install required dependencies:
pip install -r requirements.txt

Create a .env file in the root directory with your API keys:
OPENAI_API_KEY=your_openai_api_key
HUGGINGFACE_TOKEN=your_huggingface_token

The ChEmREF benchmark consists of 1,035 HazMat scenarios organized into three main evaluation tasks:
- Structured to Unstructured: Converting chemical formulas/IUPAC names to common names (e.g., "C₂H₆O" → "ethanol")
- Unstructured to Structured: Converting common names to chemical formulas (e.g., "ethanol" → "C₂H₆O")
- Data Source: data/task1_representation/hazmat_1035.csv
- Evaluation: Exact match accuracy on bidirectional conversions
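The scoring for this task is handled by the scripts under scripts/evaluation/. Purely as an illustration of the exact-match metric, here is a minimal sketch on hypothetical reference/prediction pairs; the normalization details and column names in the actual scripts may differ.

```python
import pandas as pd

def exact_match(prediction: str, reference: str) -> bool:
    """Case- and whitespace-insensitive exact match."""
    return prediction.strip().lower() == reference.strip().lower()

# Toy rows standing in for hazmat_1035.csv; the real column names may differ.
rows = pd.DataFrame({
    "reference":  ["ethanol", "ammonia"],
    "prediction": ["Ethanol", "nitrogen trihydride"],
})
accuracy = sum(
    exact_match(p, r) for p, r in zip(rows["prediction"], rows["reference"])
) / len(rows)
print(f"Exact match: {accuracy:.1%}")  # 50.0% on this toy pair
```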
- Evacuation distance recommendations
- Safety protocol suggestions
- Incident-specific response procedures
- Fire/explosion hazard management
- Health hazard assessment
- Protective equipment recommendations
- Data Source: data/task2_emergency_response/hazmat_100.csv
- Evaluation: LLM judge scoring for response quality and safety
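The judge itself is implemented in scripts/run_scripts/gpt4-llm-judge/gpt4_judge_test.py. The sketch below only illustrates the general pattern of asking GPT-4 to grade a generated response against ERG reference guidance; the prompt wording, scoring scale, and function name are assumptions, not the repository's implementation.

```python
import os
from openai import OpenAI  # openai>=1.0.0 client style

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def judge_response(chemical: str, generated: str, reference: str) -> str:
    """Ask GPT-4 to rate a generated emergency response against ERG reference guidance."""
    prompt = (
        f"Chemical: {chemical}\n"
        f"Reference guidance (ERG): {reference}\n"
        f"Model response: {generated}\n"
        "On a scale of 0-100, how safe and complete is the model response "
        "relative to the reference? Reply with the number only."
    )
    completion = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return completion.choices[0].message.content.strip()
```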
- Commercial Driver's License (CDL) HazMat certification exams
- Chemical safety certification exam questions
- HazMat handling and transportation procedures
- Emergency response protocols
- Data Source: data/task3_domain_qa/hazmat_cdl_mc.csv, hazmat_quizzes_combined.csv
- Evaluation: Multiple-choice accuracy
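As a rough sketch of how a Task 3 item can be rendered as an A/B/C/D prompt and scored by letter match (the helper names and example question below are hypothetical, not taken from the dataset):

```python
from typing import List

def format_mc_prompt(question: str, options: List[str]) -> str:
    """Render a multiple-choice question as a lettered prompt."""
    letters = "ABCD"
    lines = [question] + [f"{letters[i]}. {opt}" for i, opt in enumerate(options)]
    lines.append("Answer with a single letter.")
    return "\n".join(lines)

def is_correct(model_answer: str, answer_key: str) -> bool:
    """Multiple-choice accuracy reduces to a letter match."""
    return model_answer.strip().upper().startswith(answer_key.strip().upper())

prompt = format_mc_prompt(
    "Which hazard class covers flammable liquids?",
    ["Class 1", "Class 3", "Class 5", "Class 7"],
)
print(is_correct("B", "B"))  # True
```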
- Emergency Response Guidebook (ERG): Complete 2024 ERG data in JSON and CSV formats
- PubChem Integration: Chemical property data from PubChem Database
- Raw Processing Files: Intermediate datasets for reproducibility
For detailed information about all datasets, see DATA_OVERVIEW.md.
ChEmREF/
├── data/                                      # Dataset files
│   ├── task1_representation/                  # Information representation tasks
│   │   └── hazmat_1035.csv                    # Main chemical representation dataset (1,035 chemicals)
│   ├── task2_emergency_response/              # Emergency response generation
│   │   └── hazmat_100.csv                     # Emergency response scenarios (100 chemicals)
│   ├── task3_domain_qa/                       # Domain knowledge Q&A
│   │   ├── hazmat_cdl_mc.csv                  # Commercial Driver's License exam questions
│   │   ├── hazmat_quizzes_combined.csv        # Combined HazMat certification quizzes
│   │   ├── quiz_reference.csv                 # Reference quiz data
│   │   └── Hazmat_Awareness_Practice_Test.csv # Additional practice test questions
│   └── build/                                 # Data processing and source files
│       ├── build-ERG/                         # Emergency Response Guidebook processing
│       │   ├── ERG_2024_Guide_Materials.json  # Structured ERG data
│       │   ├── erg_table_data_full.csv        # Complete ERG table data
│       │   └── [ERG Excel files]              # Raw ERG data by section (Yellow, Blue, Orange, Green)
│       ├── build-hazmat-exams/                # HazMat exam data processing
│       │   ├── hazmat-cdl/                    # CDL-specific exam data
│       │   │   ├── cdl.ipynb                  # CDL data processing notebook
│       │   │   ├── hazmat_cdl_mc.csv          # CDL multiple choice questions
│       │   │   └── hazmat_100s.csv            # CDL practice questions (100-item sets)
│       │   ├── hazmat_quizzes_proprofs/       # ProProfs quiz data
│       │   ├── build.ipynb                    # Main exam data processing notebook
│       │   └── hazmat_quizzes_combined.csv    # All quiz data combined
│       ├── build_erg.ipynb                    # ERG data processing notebook
│       └── build_hazmat_data.ipynb            # HazMat data compilation notebook
├── src/                                       # Source code
│   └── run_evaluation.py                      # Main evaluation script
├── scripts/                                   # Execution and processing scripts
│   ├── evaluation/                            # Model evaluation scripts
│   │   ├── run_t1.py                          # Task 1: Chemical representation evaluation (all models)
│   │   ├── run_t1_1.py                        # Task 1 subtask: IUPAC to common name
│   │   ├── run_t1_2.py                        # Task 1 subtask: Formula to common name
│   │   ├── run_t1_gpt4.py                     # Task 1 evaluation with GPT-4
│   │   ├── run_t1_un_gpt4.py                  # Task 1 unstructured evaluation with GPT-4
│   │   ├── run_t2_gpt4.py                     # Task 2: Emergency response with GPT-4
│   │   ├── run_t3.py                          # Task 3: Domain knowledge evaluation (all models)
│   │   └── run_t3_gpt4.py                     # Task 3 evaluation with GPT-4
│   └── run_scripts/                           # Shell scripts and supporting files
│       ├── gpt4-llm-judge/                    # GPT-4 judge evaluation scripts
│       │   ├── gpt4_judge_test.py             # LLM judge implementation
│       │   └── hazmat_100.csv                 # Test data for judge evaluation
│       └── [shell scripts]                    # Batch execution scripts (.sh files)
├── results/                                   # Experimental results
│   ├── t1_evaluations/                        # Task 1 evaluation results
│   ├── t2_evaluations/                        # Task 2 evaluation results
│   ├── t3_evaluations/                        # Task 3 evaluation results
│   └── outputs_final/                         # Final model outputs and comparisons
├── configs/                                   # Configuration files
│   └── evaluation.yaml                        # Evaluation settings and model configurations
├── graphical-abstract.pdf                     # Framework overview figure
├── requirements.txt                           # Python dependencies
├── .env.template                              # Template for environment variables
├── DATA_OVERVIEW.md                           # Detailed dataset documentation
└── README.md                                  # This file
# Run all tasks with GPT-4 (recommended)
python src/run_evaluation.py --task all --model gpt-4
# Run specific task
python src/run_evaluation.py --task task1 --model gpt-4
python src/run_evaluation.py --task task2 --model gpt-4
python src/run_evaluation.py --task task3 --model gpt-4
# Run with open-source models
python src/run_evaluation.py --task task1 --model phi4
python src/run_evaluation.py --task task1 --model chemllm-7B

# Run chemical name/formula conversion evaluation
python scripts/evaluation/run_t1.py
# Run with GPT-4 specifically
python scripts/evaluation/run_t1_gpt4.py
# Run bidirectional conversion (structured ↔ unstructured)
python scripts/evaluation/run_t1_1.py # IUPAC to Common Name
python scripts/evaluation/run_t1_2.py # Formula to Common Name

# Run emergency response evaluation with GPT-4
python scripts/evaluation/run_t2_gpt4.py
# Generate responses for specific hazard scenarios
python scripts/evaluation/run_t2_gpt4.py --scenario_type "Fire or Explosion Hazard"

# Run HazMat certification exam evaluation
python scripts/evaluation/run_t3.py
# Run with GPT-4 specifically
python scripts/evaluation/run_t3_gpt4.py

To recreate the datasets from raw sources:
# Process Emergency Response Guidebook data
jupyter notebook data/build/build_erg.ipynb
# Build comprehensive HazMat dataset
jupyter notebook data/build/build_hazmat_data.ipynb
# Process HazMat exam and certification data
jupyter notebook data/build/build-hazmat-exams/build.ipynb
# Process CDL-specific exam data
jupyter notebook data/build/build-hazmat-exams/hazmat-cdl/cdl.ipynb

- Copy the environment template:

cp .env.template .env

- Edit .env with your API keys:
OPENAI_API_KEY=your_openai_api_key_here
HUGGINGFACE_TOKEN=your_huggingface_token_here

- torch>=1.13.0 - PyTorch for model implementations
- transformers>=4.25.0 - Hugging Face Transformers library
- vllm>=0.2.0 - vLLM for efficient inference with large models
- openai>=1.0.0 - OpenAI API client for GPT models
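With the .env file populated, the keys can be loaded via python-dotenv before any client is constructed. The snippet below is a minimal sketch of this pattern, not a transcript of the repository's scripts:

```python
import os
from dotenv import load_dotenv  # python-dotenv
from openai import OpenAI

load_dotenv()  # reads OPENAI_API_KEY and HUGGINGFACE_TOKEN from the .env file

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
hf_token = os.getenv("HUGGINGFACE_TOKEN")  # used for gated Hugging Face model downloads
```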
The codebase supports evaluation with:
- OpenAI Models: GPT-4, GPT-3.5-turbo
- Open Source Models:
  - microsoft/phi-4 - Microsoft Phi-4 model
  - AI4Chem/ChemLLM-7B-Chat - Chemistry-specialized language model
  - meta-llama/Llama-3.1-70B-Instruct - Llama 3.1 70B
  - m42-health/Llama3-Med42-70B - Medical/health specialized model
  - microsoft/Phi-3-mini-4k-instruct - Phi-3 mini model
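The open-source models are served locally through vLLM. Below is a minimal loading sketch using the smallest listed model; the sampling settings here are illustrative, and the settings actually used for evaluation live in configs/evaluation.yaml.

```python
from vllm import LLM, SamplingParams

# Smallest listed model chosen for the example; any of the IDs above can be substituted.
llm = LLM(model="microsoft/Phi-3-mini-4k-instruct", trust_remote_code=True)
params = SamplingParams(temperature=0.0, max_tokens=64)

outputs = llm.generate(["What is the common name for C2H6O?"], params)
print(outputs[0].outputs[0].text)
```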
- pandas>=1.5.0 - Data manipulation
- numpy>=1.21.0 - Numerical computing
- datasets>=2.8.0 - Hugging Face Datasets
- scikit-learn>=1.1.0 - Metrics and evaluation
- rouge-score>=0.1.2 - ROUGE metrics
- bleurt>=0.0.2 - BLEURT evaluation
- tqdm>=4.64.0 - Progress bars
- pyyaml>=6.0 - YAML configuration
- python-dotenv>=0.19.0 - Environment variables
See requirements.txt for complete dependency list with exact versions.
Our evaluation reveals that current language models show promise but require significant improvements for reliable emergency response assistance:
- Information Representation: 68.0% exact match accuracy on chemical name/formula conversion (GPT-4o)
- Emergency Response: 52.7% LLM judge score on incident response recommendations (GPT-4o)
- Domain Knowledge: 63.9% accuracy on HazMat certification exam questions (GPT-4o)
| Model | Task 1: Translation (Unstructured EM %) | Task 2: Emergency Response (LLM Judge %) | Task 3: Exam (Acc %) | Overall Avg (%) |
|---|---|---|---|---|
| GPT-4o | 68.0 | 52.7 | 63.9 | 71.4 |
| Llama-3.1 (70B) | 67.1 | 50.7 | 60.0 | 67.2 |
| Med42 (70B) | 61.9 | 50.8 | 58.0 | 68.0 |
| Phi-4 (14B) | 48.7 | 25.2 | 60.0 | 62.2 |
| ChemLLM (7B) | 56.8 | 46.3 | 47.3 | 62.0 |
| Phi-3 (3.8B) | 60.0 | 42.7 | 49.0 | 59.5 |
- Best Performance: GPT-4o with 68.0% exact match on unstructured translation
- Key Challenge: Converting between chemical representations accurately
- Model Insights: Larger models generally perform better, with GPT-4o leading
- Evaluation Method: GPT-4 as judge for response quality assessment
- Best Performance: GPT-4o with 52.7% judge approval rating
- Challenge: All models struggle with comprehensive emergency protocols
- Dataset: HazMat certification exam questions
- Best Performance: GPT-4o with 63.9% accuracy
- Observation: Performance varies significantly across model sizes
- Challenge Areas: Specific regulatory knowledge, technical procedures
Complete evaluation results are available in the results/ directory:
- results/t1_evaluations/ - Task 1 detailed results and metrics
- results/t2_evaluations/ - Task 2 response quality assessments
- results/t3_evaluations/ - Task 3 multiple choice accuracy
- results/outputs_final/ - Raw model outputs for analysis
Key result files:
- combined_model_scores_summary.xlsx - Comprehensive model comparison
- exact_match_results_[model].csv - Detailed Task 1 results per model
- translation_eval_hazmat.csv - Task 1 evaluation metrics
If you use ChEmREF in your research, please cite our paper:
@misc{surana2025chemref,
title={ChEmREF: Evaluating Language Model Readiness for Chemical Emergency Response},
author={Surana, Risha and Ye, Qinyuan and Swayamdipta, Swabha},
year={2025},
eprint={arXiv:XXXX.XXXXX},
archivePrefix={arXiv},
primaryClass={cs.CL}
}

- Risha Surana - [email protected]
- Qinyuan Ye - [email protected]
- Swabha Swayamdipta - [email protected]
This research is intended to assist emergency responders and should not be used as a replacement for professional training, official emergency protocols, or human expertise. Always consult certified emergency response professionals for actual incident management.
This project is licensed under the MIT License - see the LICENSE file for details.
Note: This repository contains research code. For production emergency response systems, please ensure proper validation, testing, and integration with official protocols.
