Based on "Adaptive-RAG: Learning to Adapt Retrieval-Augmented Large Language Models through Question Complexity" (NAACL 2024), with Self-RAG reflection layers for improved answer quality.
Adaptive-RAG dynamically selects retrieval strategies based on query complexity:
- Strategy A (no retrieval) — direct LLM answer
- Strategy B (single-step retrieval) — one retrieval pass + LLM
- Strategy C (multi-step / IRCoT) — iterative chain-of-thought retrieval
A t5-large classifier routes each query to the appropriate strategy.
This fork adds Self-RAG (Self-Reflective RAG) as an inference-time enhancement:
- Phase 1 — Relevance Filtering: Embedding + LLM hybrid check on retrieved documents
- Phase 2 — Answer Verification: Support level assessment (fully/partially/not supported)
- Phase 3 — Confidence Scoring & Escalation: Routes to more complex strategies when confidence is low
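Composed, the three phases form a filter, verify, then score-and-escalate loop. Below is a minimal sketch of that control flow; the weights, thresholds, and config fields are illustrative assumptions, not the repository's actual `SelfRAGConfig` API.

```python
# Minimal sketch of the three-phase Self-RAG loop. The weights, thresholds,
# and config fields below are illustrative assumptions, not the repo's API.
from dataclasses import dataclass

@dataclass
class SelfRAGConfig:
    relevance_threshold: float = 0.5   # Phase 1: minimum score to keep a doc
    confidence_threshold: float = 0.6  # Phase 3: below this, escalate

# Phase 2 verdicts mapped to numeric support scores
SUPPORT_SCORES = {"fully_supported": 1.0, "partially_supported": 0.5, "not_supported": 0.0}

def filter_relevant(scored_docs, cfg):
    """Phase 1: keep documents whose hybrid relevance score clears the bar."""
    return [doc for doc, score in scored_docs if score >= cfg.relevance_threshold]

def confidence(support_label, mean_relevance):
    """Phase 3: blend the Phase 2 verdict with retrieval relevance."""
    return 0.7 * SUPPORT_SCORES[support_label] + 0.3 * mean_relevance

def next_strategy(current, conf_score, cfg):
    """Escalate A -> B -> C while confidence stays low; C is terminal."""
    order = ["A", "B", "C"]
    if conf_score >= cfg.confidence_threshold or current == "C":
        return current
    return order[order.index(current) + 1]
```

Escalation re-runs the query with the next strategy, so a Strategy A miss can still be recovered by single-step or multi-step retrieval.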
```bash
conda create -n adaptiverag python=3.8
conda activate adaptiverag
pip install 'torch>=1.7,!=1.12.0,<2.0'
```

On Apple Silicon (M1/M2/M3), PyTorch supports MPS acceleration out of the box. The LLM server uses CPU by default.

```bash
pip install -r requirements.txt
python -m spacy download en_core_web_sm
```

Key dependencies:
| Package | Version | Purpose |
|---|---|---|
| torch | >=1.7, <2.0 | LLM inference |
| transformers | git pin (8637316) | Model loading |
| sentence-transformers | 2.2.2 | Embedding-based relevance (Self-RAG Phase 1) |
| huggingface_hub | 0.36.1 | Model downloads |
| pydantic | >=1.10.26, <2.0 | Config validation |
| spacy | 3.4.4 | NLP (Strategy C / IRCoT) |
| hypothesis | 6.113.0 | Property-based tests |
| elasticsearch | 7.9.1 | BM25 retrieval |
macOS (Homebrew):

```bash
# Option A: Homebrew (recommended for Apple Silicon)
brew tap elastic/tap
brew install elastic/tap/elasticsearch-full@7
elasticsearch  # starts on localhost:9200

# Option B: Manual download
# For Intel Mac:
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.10.2-darwin-x86_64.tar.gz
tar -xzf elasticsearch-7.10.2-darwin-x86_64.tar.gz
# For Apple Silicon (M1/M2/M3) — use the no-jdk version + install JDK separately:
# brew install openjdk@17
# wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.10.2-no-jdk-darwin-x86_64.tar.gz
cd elasticsearch-7.10.2/
./bin/elasticsearch  # starts on localhost:9200
```

Verify it's running:
```bash
curl localhost:9200/_cat/health
```

```bash
# Multi-hop datasets (MuSiQue, HotpotQA, 2WikiMultiHopQA) — preprocessed test sets
bash ./download/processed_data.sh

# Single-hop datasets (NQ, TriviaQA, SQuAD) — requires manual download from DPR
# See "Single-Hop Dataset Setup" section below

# Raw data for training the classifier (optional — only needed if retraining)
bash ./download/raw_data.sh
```

```bash
# Multi-hop datasets (each uses its own corpus)
python retriever_server/build_index.py hotpotqa
python retriever_server/build_index.py 2wikimultihopqa
python retriever_server/build_index.py musique

# Single-hop datasets (all share the Wikipedia corpus)
python retriever_server/build_index.py wiki
```

Verify indices:
```bash
curl localhost:9200/_cat/indices
# Expected sizes: hotpotqa (5,233,329), 2wikimultihopqa (430,225),
# musique (139,416), wiki (21,015,324)
```
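The same check can be scripted; a stdlib-only sketch that compares each index's document count (via Elasticsearch's standard `/<index>/_count` API) against the sizes listed above:

```python
# Compare live index document counts against the expected sizes listed above.
# Stdlib only; assumes Elasticsearch's standard /<index>/_count API.
import json
import urllib.request

EXPECTED = {
    "hotpotqa": 5_233_329,
    "2wikimultihopqa": 430_225,
    "musique": 139_416,
    "wiki": 21_015_324,
}

def diff_counts(actual, expected=EXPECTED):
    """Pure comparison: {index: (actual, expected, matches)}."""
    return {
        idx: (actual.get(idx, 0), want, actual.get(idx, 0) == want)
        for idx, want in expected.items()
    }

def fetch_counts(base="http://localhost:9200", expected=EXPECTED):
    """Query the _count endpoint for each expected index."""
    counts = {}
    for idx in expected:
        with urllib.request.urlopen(f"{base}/{idx}/_count", timeout=10) as resp:
            counts[idx] = json.load(resp)["count"]
    return counts

if __name__ == "__main__":
    for idx, (got, want, ok) in diff_counts(fetch_counts()).items():
        print(f"{idx}: {got:,}/{want:,} {'OK' if ok else 'MISMATCH'}")
```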
You need two servers running in separate terminals:

Terminal 1 — Retriever server:

```bash
conda activate adaptiverag
uvicorn serve:app --port 8000 --app-dir retriever_server
```

Terminal 2 — LLM server (flan-t5-xl):

```bash
conda activate adaptiverag
MODEL_NAME=flan-t5-xl uvicorn serve:app --port 8010 --app-dir llm_server
```

First run downloads the flan-t5-xl model (~12 GB). Subsequent runs use the cached model. On a Mac with 16 GB RAM, expect ~8-10 GB memory usage for flan-t5-xl.
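Before launching an evaluation, a quick stdlib-only probe can confirm both servers are accepting connections (the host and ports mirror the commands above; nothing is assumed about their routes):

```python
# Check that the retriever (:8000) and LLM (:8010) servers started above are
# accepting TCP connections; stdlib only, no assumptions about their routes.
import socket

def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    for name, port in [("retriever", 8000), ("llm", 8010)]:
        status = "up" if port_open("localhost", port) else "down"
        print(f"{name} server is {status} on :{port}")
```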
Baseline Adaptive-RAG evaluation:

```bash
python evaluate_final_acc.py
```

Self-RAG end-to-end evaluation:

```bash
# Dry run (5 questions)
python evaluate_selfrag_e2e.py --dataset trivia --max-questions 5

# Single dataset (full ~500 questions)
python evaluate_selfrag_e2e.py --dataset trivia

# All 6 datasets
python evaluate_selfrag_e2e.py --all-datasets

# Available datasets: nq, trivia, squad, musique, hotpotqa, 2wikimultihopqa
```

Output files:
- `selfrag_e2e_results_{dataset}.json` — aggregate metrics
- `selfrag_e2e_results_{dataset}_detailed.json` — per-question results
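To consume the aggregate file programmatically, something like the reader below works; note that the metric keys shown are assumptions, not a documented schema, so inspect your own output file first:

```python
# Load selfrag_e2e_results_{dataset}.json and pull out a few headline numbers.
# The key names used here are illustrative assumptions, not a documented schema.
import json

def summarize(path, keys=("accuracy", "avg_confidence", "escalation_rate")):
    with open(path) as f:
        metrics = json.load(f)
    # Keep only the keys that actually exist in this file
    return {k: metrics[k] for k in keys if k in metrics}
```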
Run the tests:

```bash
pytest tests/ -v
```

## Single-Hop Dataset Setup

Download from Facebook DPR:
```bash
# Natural Questions
mkdir -p raw_data/nq && cd raw_data/nq
wget https://dl.fbaipublicfiles.com/dpr/data/retriever/biencoder-nq-dev.json.gz
gzip -d biencoder-nq-dev.json.gz
wget https://dl.fbaipublicfiles.com/dpr/data/retriever/biencoder-nq-train.json.gz
gzip -d biencoder-nq-train.json.gz
cd ../..

# TriviaQA
mkdir -p raw_data/trivia && cd raw_data/trivia
wget https://dl.fbaipublicfiles.com/dpr/data/retriever/biencoder-trivia-dev.json.gz
gzip -d biencoder-trivia-dev.json.gz
wget https://dl.fbaipublicfiles.com/dpr/data/retriever/biencoder-trivia-train.json.gz
gzip -d biencoder-trivia-train.json.gz
cd ../..

# SQuAD
mkdir -p raw_data/squad && cd raw_data/squad
wget https://dl.fbaipublicfiles.com/dpr/data/retriever/biencoder-squad1-dev.json.gz
gzip -d biencoder-squad1-dev.json.gz
wget https://dl.fbaipublicfiles.com/dpr/data/retriever/biencoder-squad1-train.json.gz
gzip -d biencoder-squad1-train.json.gz
cd ../..

# Wikipedia passages (shared corpus for all single-hop datasets)
mkdir -p raw_data/wiki && cd raw_data/wiki
wget https://dl.fbaipublicfiles.com/dpr/wikipedia_split/psgs_w100.tsv.gz
gzip -d psgs_w100.tsv.gz
cd ../..
```
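A quick sanity check on the decompressed passages file helps catch truncated downloads; DPR ships `psgs_w100.tsv` as a tab-separated file with an `(id, text, title)` header row:

```python
# Sanity-check the decompressed psgs_w100.tsv: DPR ships it as a tab-separated
# file with an (id, text, title) header row and ~21M passage rows.
import csv

def peek_passages(path, n=3):
    """Return the header row and the first n data rows."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f, delimiter="\t")
        header = next(reader)
        rows = [row for _, row in zip(range(n), reader)]
    return header, rows
```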
```bash
# Process and subsample
python ./processing_scripts/process_nq.py
python ./processing_scripts/process_trivia.py
python ./processing_scripts/process_squad.py
python processing_scripts/subsample_dataset_and_remap_paras.py nq test 500
python processing_scripts/subsample_dataset_and_remap_paras.py trivia test 500
python processing_scripts/subsample_dataset_and_remap_paras.py squad test 500
```

If you want to retrain the query complexity classifier:
```bash
# Generate training data from retrieval strategy predictions
SYSTEM=ircot_qa MODEL=flan-t5-xl DATASET=nq LLM_PORT_NUM=8010
bash run_retrieval_dev.sh $SYSTEM $MODEL $DATASET $LLM_PORT_NUM
# Repeat for all 6 datasets × 3 strategies (ircot_qa, oner_qa, nor_qa)

# Preprocess classifier training data
python ./classifier/preprocess/preprocess_silver_train.py flan_t5_xl
python ./classifier/preprocess/preprocess_silver_valid.py flan_t5_xl
python ./classifier/preprocess/preprocess_binary_train.py
python ./classifier/preprocess/concat_binary_silver_train.py

# Train
cd classifier
bash ./run/run_large_train_xl.sh
cd ..

# Generate predictions
python ./classifier/postprocess/predict_complexity_on_classification_results.py
```

Pre-computed classifier predictions are provided in `predictions/classifier/`.
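The "generate training data" step produces silver labels in the style of the Adaptive-RAG paper: each query is labeled with the simplest strategy that answered it correctly. A sketch of that rule (the function name and dict shape are this example's own, not the repo's):

```python
# Illustrative silver-labeling rule from the Adaptive-RAG paper: a query gets
# the label of the simplest strategy that answered it correctly (A < B < C).
def silver_label(correct_by_strategy):
    """correct_by_strategy: dict like {"A": False, "B": True, "C": True}."""
    for strategy in ("A", "B", "C"):  # simplest first
        if correct_by_strategy.get(strategy):
            return strategy
    return None  # no strategy succeeded; the paper falls back to a dataset-level heuristic

print(silver_label({"A": False, "B": True, "C": True}))  # -> B
```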
```
├── commaqa/                    # Core inference library
│   ├── inference/
│   │   ├── self_rag/           # Self-RAG modules (added)
│   │   │   ├── relevance_checker.py    # Phase 1: embedding + LLM relevance
│   │   │   ├── answer_verifier.py      # Phase 2: support level verification
│   │   │   ├── confidence_scorer.py    # Phase 3: confidence scoring
│   │   │   ├── strategy_escalator.py   # Escalation routing logic
│   │   │   ├── config.py               # SelfRAGConfig dataclass
│   │   │   └── stats_tracker.py        # Metrics tracking
│   │   ├── configurable_inference.py   # Pipeline builder
│   │   ├── ircot.py                    # Strategy C (multi-step)
│   │   └── participant_execution_routed_selfrag.py  # Self-RAG participant
│   └── models/
│       ├── llm_client_generator.py     # LLM server client
│       └── gpt3generator.py            # OpenAI API client
├── classifier/                 # Query complexity classifier (t5-large)
├── retriever_server/           # Elasticsearch BM25 retriever
├── llm_server/                 # FLAN-T5 inference server (uvicorn)
├── tests/                      # Property-based + unit tests
├── evaluate_selfrag_e2e.py     # Self-RAG end-to-end evaluation
├── evaluate_final_acc.py       # Baseline Adaptive-RAG evaluation
├── base_configs/               # Jsonnet configs per dataset × strategy × model
├── predictions/                # Pre-computed baseline predictions
├── processed_data/             # Preprocessed test/dev JSONL files
└── requirements.txt
```
- flan-t5-xl (~3B params): ~2-5 seconds per LLM call on CPU. Strategy C questions can take 60-90s each due to multi-step reasoning (up to 10 steps × multiple LLM calls).
- Full 500-question dataset: ~1-2 hours for Strategy A/B heavy datasets, longer for Strategy C heavy ones.
- All 6 datasets: ~6-12 hours total.
- Memory: flan-t5-xl needs ~8-10 GB RAM. Elasticsearch needs ~2-4 GB depending on indices loaded.
- Disk: ~12 GB for flan-t5-xl model cache, ~5 GB for Elasticsearch indices, ~2 GB for datasets.
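The per-dataset estimate can be sanity-checked with simple arithmetic; the per-question timings below come from the notes above, while the A/B/C mix is an illustrative assumption (real mixes vary by dataset):

```python
# Back-of-envelope runtime for one 500-question dataset. Per-question timings
# follow the notes above; the A/B/C strategy mix is a made-up example of an
# A/B-heavy dataset.
QUESTIONS = 500
mix = {
    "A": (0.30, 3),   # (fraction of questions, seconds per question)
    "B": (0.65, 10),
    "C": (0.05, 75),
}
total_seconds = QUESTIONS * sum(frac * sec for frac, sec in mix.values())
print(f"~{total_seconds / 3600:.1f} hours")
```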
Based on the IRCoT skeleton code and the Adaptive-RAG repository.
```bibtex
@inproceedings{jeong2024adaptiverag,
    author    = {Soyeong Jeong and Jinheon Baek and Sukmin Cho and Sung Ju Hwang and Jong Park},
    title     = {Adaptive-RAG: Learning to Adapt Retrieval-Augmented Large Language Models through Question Complexity},
    booktitle = {NAACL},
    year      = {2024},
    url       = {https://arxiv.org/abs/2403.14403}
}
```