Based on "Adaptive-RAG: Learning to Adapt Retrieval-Augmented Large Language Models through Question Complexity" (NAACL 2024), with Self-RAG reflection layers for improved answer quality.
Adaptive-RAG dynamically selects retrieval strategies based on query complexity:
- Strategy A (no retrieval) — direct LLM answer
- Strategy B (single-step retrieval) — one retrieval pass + LLM
- Strategy C (multi-step / IRCoT) — iterative chain-of-thought retrieval
A t5-large classifier routes each query to the appropriate strategy.
This fork adds Self-RAG (Self-Reflective RAG) as an inference-time enhancement:
- Phase 1 — Relevance Filtering: Embedding + LLM hybrid check on retrieved documents
- Phase 2 — Answer Verification: Support level assessment (fully/partially/not supported)
- Phase 3 — Confidence Scoring & Escalation: Routes to more complex strategies when confidence is low
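Composed, the three phases form a filter, verify, then score-and-escalate loop. Below is a minimal sketch of that control flow; the weights, thresholds, and config fields are illustrative assumptions, not the repository's actual `SelfRAGConfig` API.

```python
# Minimal sketch of the three-phase Self-RAG loop. The weights, thresholds,
# and config fields below are illustrative assumptions, not the repo's API.
from dataclasses import dataclass

@dataclass
class SelfRAGConfig:
    relevance_threshold: float = 0.5   # Phase 1: minimum score to keep a doc
    confidence_threshold: float = 0.6  # Phase 3: below this, escalate

# Phase 2 verdicts mapped to numeric support scores
SUPPORT_SCORES = {"fully_supported": 1.0, "partially_supported": 0.5, "not_supported": 0.0}

def filter_relevant(scored_docs, cfg):
    """Phase 1: keep documents whose hybrid relevance score clears the bar."""
    return [doc for doc, score in scored_docs if score >= cfg.relevance_threshold]

def confidence(support_label, mean_relevance):
    """Phase 3: blend the Phase 2 verdict with retrieval relevance."""
    return 0.7 * SUPPORT_SCORES[support_label] + 0.3 * mean_relevance

def next_strategy(current, conf_score, cfg):
    """Escalate A -> B -> C while confidence stays low; C is terminal."""
    order = ["A", "B", "C"]
    if conf_score >= cfg.confidence_threshold or current == "C":
        return current
    return order[order.index(current) + 1]
```

Escalation re-runs the query with the next strategy, so a Strategy A miss can still be recovered by single-step or multi-step retrieval.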
```bash
conda create -n adaptiverag python=3.8
conda activate adaptiverag
pip install 'torch>=1.7,!=1.12.0,<2.0'
```

On Apple Silicon (M1/M2/M3), PyTorch supports MPS acceleration out of the box. The LLM server uses CPU by default.

```bash
pip install -r requirements.txt
python -m spacy download en_core_web_sm
```

Key dependencies:
| Package | Version | Purpose |
|---|---|---|
| torch | >=1.7, <2.0 | LLM inference |
| transformers | git pin (8637316) | Model loading |
| sentence-transformers | 2.2.2 | Embedding-based relevance (Self-RAG Phase 1) |
| huggingface_hub | 0.36.1 | Model downloads |
| pydantic | >=1.10.26, <2.0 | Config validation |
| spacy | 3.4.4 | NLP (Strategy C / IRCoT) |
| hypothesis | 6.113.0 | Property-based tests |
| elasticsearch | 7.9.1 | BM25 retrieval |
macOS (Homebrew):

```bash
# Option A: Homebrew (recommended for Apple Silicon)
brew tap elastic/tap
brew install elastic/tap/elasticsearch-full@7
elasticsearch  # starts on localhost:9200

# Option B: Manual download
# For Intel Mac:
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.10.2-darwin-x86_64.tar.gz
tar -xzf elasticsearch-7.10.2-darwin-x86_64.tar.gz
# For Apple Silicon (M1/M2/M3) — use the no-jdk version + install JDK separately:
# brew install openjdk@17
# wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.10.2-no-jdk-darwin-x86_64.tar.gz
cd elasticsearch-7.10.2/
./bin/elasticsearch  # starts on localhost:9200
```

Verify it's running:
```bash
curl localhost:9200/_cat/health
```

```bash
# Multi-hop datasets (MuSiQue, HotpotQA, 2WikiMultiHopQA) — preprocessed test sets
bash ./download/processed_data.sh

# Single-hop datasets (NQ, TriviaQA, SQuAD) — requires manual download from DPR
# See "Single-Hop Dataset Setup" section below

# Raw data for training the classifier (optional — only needed if retraining)
bash ./download/raw_data.sh
```

```bash
# Multi-hop datasets (each uses its own corpus)
python retriever_server/build_index.py hotpotqa
python retriever_server/build_index.py 2wikimultihopqa
python retriever_server/build_index.py musique

# Single-hop datasets (all share the Wikipedia corpus)
python retriever_server/build_index.py wiki
```

Verify indices:
```bash
curl localhost:9200/_cat/indices
# Expected sizes: hotpotqa (5,233,329), 2wikimultihopqa (430,225),
# musique (139,416), wiki (21,015,324)
```
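The same check can be scripted; a stdlib-only sketch that compares each index's document count (via Elasticsearch's standard `/<index>/_count` API) against the sizes listed above:

```python
# Compare live index document counts against the expected sizes listed above.
# Stdlib only; assumes Elasticsearch's standard /<index>/_count API.
import json
import urllib.request

EXPECTED = {
    "hotpotqa": 5_233_329,
    "2wikimultihopqa": 430_225,
    "musique": 139_416,
    "wiki": 21_015_324,
}

def diff_counts(actual, expected=EXPECTED):
    """Pure comparison: {index: (actual, expected, matches)}."""
    return {
        idx: (actual.get(idx, 0), want, actual.get(idx, 0) == want)
        for idx, want in expected.items()
    }

def fetch_counts(base="http://localhost:9200", expected=EXPECTED):
    """Query the _count endpoint for each expected index."""
    counts = {}
    for idx in expected:
        with urllib.request.urlopen(f"{base}/{idx}/_count", timeout=10) as resp:
            counts[idx] = json.load(resp)["count"]
    return counts

if __name__ == "__main__":
    for idx, (got, want, ok) in diff_counts(fetch_counts()).items():
        print(f"{idx}: {got:,}/{want:,} {'OK' if ok else 'MISMATCH'}")
```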
You need two servers running in separate terminals:

Terminal 1 — Retriever server:

```bash
conda activate adaptiverag
uvicorn serve:app --port 8000 --app-dir retriever_server
```

Terminal 2 — LLM server (flan-t5-xl):

```bash
conda activate adaptiverag
MODEL_NAME=flan-t5-xl uvicorn serve:app --port 8010 --app-dir llm_server
```

First run downloads the flan-t5-xl model (~12 GB). Subsequent runs use the cached model. On a Mac with 16 GB RAM, expect ~8-10 GB memory usage for flan-t5-xl.
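Before launching an evaluation, a quick stdlib-only probe can confirm both servers are accepting connections (the host and ports mirror the commands above; nothing is assumed about their routes):

```python
# Check that the retriever (:8000) and LLM (:8010) servers started above are
# accepting TCP connections; stdlib only, no assumptions about their routes.
import socket

def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    for name, port in [("retriever", 8000), ("llm", 8010)]:
        status = "up" if port_open("localhost", port) else "down"
        print(f"{name} server is {status} on :{port}")
```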
Baseline Adaptive-RAG evaluation:

```bash
python evaluate_final_acc.py
```

Self-RAG end-to-end evaluation:

```bash
# Dry run (5 questions)
python evaluate_selfrag_e2e.py --dataset trivia --max-questions 5

# Single dataset (full ~500 questions)
python evaluate_selfrag_e2e.py --dataset trivia

# All 6 datasets
python evaluate_selfrag_e2e.py --all-datasets

# Available datasets: nq, trivia, squad, musique, hotpotqa, 2wikimultihopqa
```

Output files:
- `selfrag_e2e_results_{dataset}.json` — aggregate metrics
- `selfrag_e2e_results_{dataset}_detailed.json` — per-question results
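To consume the aggregate file programmatically, something like the reader below works; note that the metric keys shown are assumptions, not a documented schema, so inspect your own output file first:

```python
# Load selfrag_e2e_results_{dataset}.json and pull out a few headline numbers.
# The key names used here are illustrative assumptions, not a documented schema.
import json

def summarize(path, keys=("accuracy", "avg_confidence", "escalation_rate")):
    with open(path) as f:
        metrics = json.load(f)
    # Keep only the keys that actually exist in this file
    return {k: metrics[k] for k in keys if k in metrics}
```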
Run the tests:

```bash
pytest tests/ -v
```

## Single-Hop Dataset Setup

Download from Facebook DPR:
```bash
# Natural Questions
mkdir -p raw_data/nq && cd raw_data/nq
wget https://dl.fbaipublicfiles.com/dpr/data/retriever/biencoder-nq-dev.json.gz
gzip -d biencoder-nq-dev.json.gz
wget https://dl.fbaipublicfiles.com/dpr/data/retriever/biencoder-nq-train.json.gz
gzip -d biencoder-nq-train.json.gz
cd ../..

# TriviaQA
mkdir -p raw_data/trivia && cd raw_data/trivia
wget https://dl.fbaipublicfiles.com/dpr/data/retriever/biencoder-trivia-dev.json.gz
gzip -d biencoder-trivia-dev.json.gz
wget https://dl.fbaipublicfiles.com/dpr/data/retriever/biencoder-trivia-train.json.gz
gzip -d biencoder-trivia-train.json.gz
cd ../..

# SQuAD
mkdir -p raw_data/squad && cd raw_data/squad
wget https://dl.fbaipublicfiles.com/dpr/data/retriever/biencoder-squad1-dev.json.gz
gzip -d biencoder-squad1-dev.json.gz
wget https://dl.fbaipublicfiles.com/dpr/data/retriever/biencoder-squad1-train.json.gz
gzip -d biencoder-squad1-train.json.gz
cd ../..

# Wikipedia passages (shared corpus for all single-hop datasets)
mkdir -p raw_data/wiki && cd raw_data/wiki
wget https://dl.fbaipublicfiles.com/dpr/wikipedia_split/psgs_w100.tsv.gz
gzip -d psgs_w100.tsv.gz
cd ../..
```
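A quick sanity check on the decompressed passages file helps catch truncated downloads; DPR ships `psgs_w100.tsv` as a tab-separated file with an `(id, text, title)` header row:

```python
# Sanity-check the decompressed psgs_w100.tsv: DPR ships it as a tab-separated
# file with an (id, text, title) header row and ~21M passage rows.
import csv

def peek_passages(path, n=3):
    """Return the header row and the first n data rows."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f, delimiter="\t")
        header = next(reader)
        rows = [row for _, row in zip(range(n), reader)]
    return header, rows
```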
```bash
# Process and subsample
python ./processing_scripts/process_nq.py
python ./processing_scripts/process_trivia.py
python ./processing_scripts/process_squad.py
python processing_scripts/subsample_dataset_and_remap_paras.py nq test 500
python processing_scripts/subsample_dataset_and_remap_paras.py trivia test 500
python processing_scripts/subsample_dataset_and_remap_paras.py squad test 500
```

If you want to retrain the query complexity classifier:
```bash
# Generate training data from retrieval strategy predictions
SYSTEM=ircot_qa MODEL=flan-t5-xl DATASET=nq LLM_PORT_NUM=8010
bash run_retrieval_dev.sh $SYSTEM $MODEL $DATASET $LLM_PORT_NUM
# Repeat for all 6 datasets × 3 strategies (ircot_qa, oner_qa, nor_qa)

# Preprocess classifier training data
python ./classifier/preprocess/preprocess_silver_train.py flan_t5_xl
python ./classifier/preprocess/preprocess_silver_valid.py flan_t5_xl
python ./classifier/preprocess/preprocess_binary_train.py
python ./classifier/preprocess/concat_binary_silver_train.py

# Train
cd classifier
bash ./run/run_large_train_xl.sh
cd ..

# Generate predictions
python ./classifier/postprocess/predict_complexity_on_classification_results.py
```

Pre-computed classifier predictions are provided in `predictions/classifier/`.
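The "generate training data" step produces silver labels in the style of the Adaptive-RAG paper: each query is labeled with the simplest strategy that answered it correctly. A sketch of that rule (the function name and dict shape are this example's own, not the repo's):

```python
# Illustrative silver-labeling rule from the Adaptive-RAG paper: a query gets
# the label of the simplest strategy that answered it correctly (A < B < C).
def silver_label(correct_by_strategy):
    """correct_by_strategy: dict like {"A": False, "B": True, "C": True}."""
    for strategy in ("A", "B", "C"):  # simplest first
        if correct_by_strategy.get(strategy):
            return strategy
    return None  # no strategy succeeded; the paper falls back to a dataset-level heuristic

print(silver_label({"A": False, "B": True, "C": True}))  # -> B
```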
```
├── commaqa/                    # Core inference library
│   ├── inference/
│   │   ├── self_rag/           # Self-RAG modules (added)
│   │   │   ├── relevance_checker.py    # Phase 1: embedding + LLM relevance
│   │   │   ├── answer_verifier.py      # Phase 2: support level verification
│   │   │   ├── confidence_scorer.py    # Phase 3: confidence scoring
│   │   │   ├── strategy_escalator.py   # Escalation routing logic
│   │   │   ├── config.py               # SelfRAGConfig dataclass
│   │   │   └── stats_tracker.py        # Metrics tracking
│   │   ├── configurable_inference.py   # Pipeline builder
│   │   ├── ircot.py                    # Strategy C (multi-step)
│   │   └── participant_execution_routed_selfrag.py  # Self-RAG participant
│   └── models/
│       ├── llm_client_generator.py     # LLM server client
│       └── gpt3generator.py            # OpenAI API client
├── classifier/                 # Query complexity classifier (t5-large)
├── retriever_server/           # Elasticsearch BM25 retriever
├── llm_server/                 # FLAN-T5 inference server (uvicorn)
├── tests/                      # Property-based + unit tests
├── evaluate_selfrag_e2e.py     # Self-RAG end-to-end evaluation
├── evaluate_final_acc.py       # Baseline Adaptive-RAG evaluation
├── base_configs/               # Jsonnet configs per dataset × strategy × model
├── predictions/                # Pre-computed baseline predictions
├── processed_data/             # Preprocessed test/dev JSONL files
└── requirements.txt
```
- flan-t5-xl (~3B params): ~2-5 seconds per LLM call on CPU. Strategy C questions can take 60-90s each due to multi-step reasoning (up to 10 steps × multiple LLM calls).
- Full 500-question dataset: ~1-2 hours for Strategy A/B heavy datasets, longer for Strategy C heavy ones.
- All 6 datasets: ~6-12 hours total.
- Memory: flan-t5-xl needs ~8-10 GB RAM. Elasticsearch needs ~2-4 GB depending on indices loaded.
- Disk: ~12 GB for flan-t5-xl model cache, ~5 GB for Elasticsearch indices, ~2 GB for datasets.
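The per-dataset estimate can be sanity-checked with simple arithmetic; the per-question timings below come from the notes above, while the A/B/C mix is an illustrative assumption (real mixes vary by dataset):

```python
# Back-of-envelope runtime for one 500-question dataset. Per-question timings
# follow the notes above; the A/B/C strategy mix is a made-up example of an
# A/B-heavy dataset.
QUESTIONS = 500
mix = {
    "A": (0.30, 3),   # (fraction of questions, seconds per question)
    "B": (0.65, 10),
    "C": (0.05, 75),
}
total_seconds = QUESTIONS * sum(frac * sec for frac, sec in mix.values())
print(f"~{total_seconds / 3600:.1f} hours")
```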
Based on the IRCoT skeleton code and the Adaptive-RAG repository.
```bibtex
@inproceedings{jeong2024adaptiverag,
    author    = {Soyeong Jeong and Jinheon Baek and Sukmin Cho and Sung Ju Hwang and Jong Park},
    title     = {Adaptive-RAG: Learning to Adapt Retrieval-Augmented Large Language Models through Question Complexity},
    booktitle = {NAACL},
    year      = {2024},
    url       = {https://arxiv.org/abs/2403.14403}
}
```