d3bn4th/AdaRouteRAG

Adaptive-RAG + Self-RAG

An implementation based on "Adaptive-RAG: Learning to Adapt Retrieval-Augmented Large Language Models through Question Complexity" (NAACL 2024), extended with Self-RAG reflection layers for improved answer quality.

Overview

[Figure: Adaptive-RAG overview]

Adaptive-RAG dynamically selects retrieval strategies based on query complexity:

  • Strategy A (no retrieval) — direct LLM answer
  • Strategy B (single-step retrieval) — one retrieval pass + LLM
  • Strategy C (multi-step / IRCoT) — iterative chain-of-thought retrieval

A t5-large classifier routes each query to the appropriate strategy.
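The routing step can be sketched as follows. This is an illustrative sketch only: the label names ("A", "B", "C") and the strategy identifiers are assumptions for illustration, while the repository's actual interface lives in the classifier code.

```python
# Illustrative sketch of Adaptive-RAG routing. The labels and strategy
# names below are assumptions; in the repository, a fine-tuned t5-large
# model predicts a complexity label for each incoming query.

STRATEGIES = {
    "A": "no_retrieval",      # direct LLM answer
    "B": "single_step",       # one retrieval pass + LLM
    "C": "multi_step_ircot",  # iterative chain-of-thought retrieval
}

def route(predicted_label: str) -> str:
    """Map a complexity label from the classifier to a retrieval strategy."""
    if predicted_label not in STRATEGIES:
        # Fall back to the cheapest strategy on an unexpected label.
        return STRATEGIES["A"]
    return STRATEGIES[predicted_label]
```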

This fork adds Self-RAG (Self-Reflective RAG) as an inference-time enhancement:

  1. Phase 1 — Relevance Filtering: Embedding + LLM hybrid check on retrieved documents
  2. Phase 2 — Answer Verification: Support level assessment (fully/partially/not supported)
  3. Phase 3 — Confidence Scoring & Escalation: Routes to more complex strategies when confidence is low
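The three phases compose into a simple decision rule. Below is a minimal sketch of the Phase 3 escalation step, assuming a 0-1 confidence score, a hypothetical threshold, and hypothetical support labels; the repository's `confidence_scorer.py` and `strategy_escalator.py` implement the actual logic.

```python
# Minimal sketch of Self-RAG Phase 3 escalation. The weights, threshold,
# and label names are hypothetical assumptions, not the repository's values.

SUPPORT_WEIGHT = {"fully": 1.0, "partially": 0.5, "not": 0.0}
ESCALATION_ORDER = ["A", "B", "C"]  # no retrieval -> single-step -> multi-step

def confidence(relevant_fraction: float, support_level: str) -> float:
    """Combine Phase 1 relevance and Phase 2 support into one score."""
    return 0.5 * relevant_fraction + 0.5 * SUPPORT_WEIGHT[support_level]

def next_strategy(current: str, score: float, threshold: float = 0.6) -> str:
    """Escalate to the next strategy when confidence falls below threshold."""
    if score >= threshold or current == ESCALATION_ORDER[-1]:
        return current  # confident enough, or nothing left to escalate to
    return ESCALATION_ORDER[ESCALATION_ORDER.index(current) + 1]
```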

Setup (macOS)

1. Create Conda Environment

conda create -n adaptiverag python=3.8
conda activate adaptiverag

2. Install PyTorch (CPU)

pip install 'torch>=1.7,!=1.12.0,<2.0'

On Apple Silicon (M1/M2/M3), PyTorch supports MPS acceleration out of the box. The LLM server uses CPU by default.

3. Install Python Dependencies

pip install -r requirements.txt
python -m spacy download en_core_web_sm

Key dependencies:

| Package               | Version             | Purpose                                      |
|-----------------------|---------------------|----------------------------------------------|
| torch                 | >=1.7, <2.0         | LLM inference                                |
| transformers          | git pin (8637316)   | Model loading                                |
| sentence-transformers | 2.2.2               | Embedding-based relevance (Self-RAG Phase 1) |
| huggingface_hub       | 0.36.1              | Model downloads                              |
| pydantic              | >=1.10.26, <2.0     | Config validation                            |
| spacy                 | 3.4.4               | NLP (Strategy C / IRCoT)                     |
| hypothesis            | 6.113.0             | Property-based tests                         |
| elasticsearch         | 7.9.1               | BM25 retrieval                               |

4. Set Up Elasticsearch (Retriever Backend)

macOS (Homebrew):

# Option A: Homebrew (recommended for Apple Silicon)
brew tap elastic/tap
brew install elastic/tap/elasticsearch-full@7
elasticsearch  # starts on localhost:9200

# Option B: Manual download
# For Intel Mac:
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.10.2-darwin-x86_64.tar.gz
tar -xzf elasticsearch-7.10.2-darwin-x86_64.tar.gz

# For Apple Silicon (M1/M2/M3) — use the no-jdk version + install JDK separately:
# brew install openjdk@17
# wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.10.2-no-jdk-darwin-x86_64.tar.gz

cd elasticsearch-7.10.2/
./bin/elasticsearch  # starts on localhost:9200

Verify it's running:

curl localhost:9200/_cat/health

5. Download Datasets

# Multi-hop datasets (MuSiQue, HotpotQA, 2WikiMultiHopQA) — preprocessed test sets
bash ./download/processed_data.sh

# Single-hop datasets (NQ, TriviaQA, SQuAD) — requires manual download from DPR
# See "Single-Hop Dataset Setup" section below

# Raw data for training the classifier (optional — only needed if retraining)
bash ./download/raw_data.sh

6. Build Elasticsearch Indices

# Multi-hop datasets (each uses its own corpus)
python retriever_server/build_index.py hotpotqa
python retriever_server/build_index.py 2wikimultihopqa
python retriever_server/build_index.py musique

# Single-hop datasets (all share the Wikipedia corpus)
python retriever_server/build_index.py wiki

Verify indices:

curl localhost:9200/_cat/indices
# Expected sizes: hotpotqa (5,233,329), 2wikimultihopqa (430,225),
# musique (139,416), wiki (21,015,324)
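An index can also be sanity-checked from Python. The query body below is a standard Elasticsearch 7.x BM25 match query, but the field name (`paragraph_text`) is an assumption about the index mapping; check what `build_index.py` actually stores.

```python
# Sketch of a BM25 sanity check against a built index. The field name
# "paragraph_text" is an assumption about the mapping created by
# build_index.py; adjust it to match the real schema.

def bm25_query(text: str, size: int = 3) -> dict:
    """Build an Elasticsearch 7.x search body for a BM25 match query."""
    return {
        "size": size,
        "query": {"match": {"paragraph_text": text}},
    }

# With a running cluster you would send it like this (requires the
# elasticsearch==7.x client and a live server on localhost:9200):
#
#   from elasticsearch import Elasticsearch
#   es = Elasticsearch("localhost:9200")
#   hits = es.search(index="hotpotqa", body=bm25_query("Who wrote Hamlet?"))
```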

7. Start Servers

You need two servers running in separate terminals:

Terminal 1 — Retriever server:

conda activate adaptiverag
uvicorn serve:app --port 8000 --app-dir retriever_server

Terminal 2 — LLM server (flan-t5-xl):

conda activate adaptiverag
MODEL_NAME=flan-t5-xl uvicorn serve:app --port 8010 --app-dir llm_server

First run downloads the flan-t5-xl model (~12 GB). Subsequent runs use the cached model. On a Mac with 16 GB RAM, expect ~8-10 GB memory usage for flan-t5-xl.
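Once both servers are up, they can be exercised from Python. The route names and payload fields below are guesses at the FastAPI interface, not documented endpoints; check `retriever_server/serve.py` and `llm_server/serve.py` for the real ones.

```python
# Hypothetical smoke test for the two servers. The route ("/retrieve") and
# the payload field names are assumptions about the FastAPI apps.

RETRIEVER_URL = "http://localhost:8000"
LLM_URL = "http://localhost:8010"

def retrieve_payload(question: str, corpus: str = "hotpotqa", k: int = 3) -> dict:
    """Build a retrieval request body (field names are assumptions)."""
    return {"query_text": question, "corpus_name": corpus, "max_hits_count": k}

# With `requests` installed and both servers running, a call might look like:
#
#   import requests
#   docs = requests.post(f"{RETRIEVER_URL}/retrieve",
#                        json=retrieve_payload("Who wrote Hamlet?")).json()
```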


Running Evaluations

Baseline Adaptive-RAG (classifier-only)

python evaluate_final_acc.py

Self-RAG Enhanced Evaluation (live inference)

# Dry run (5 questions)
python evaluate_selfrag_e2e.py --dataset trivia --max-questions 5

# Single dataset (full ~500 questions)
python evaluate_selfrag_e2e.py --dataset trivia

# All 6 datasets
python evaluate_selfrag_e2e.py --all-datasets

# Available datasets: nq, trivia, squad, musique, hotpotqa, 2wikimultihopqa

Output files:

  • selfrag_e2e_results_{dataset}.json — aggregate metrics
  • selfrag_e2e_results_{dataset}_detailed.json — per-question results
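The aggregate file can be inspected with a few lines of Python. The metric key shown in the comment is hypothetical; open one JSON file to see the real schema before relying on it.

```python
# Sketch for inspecting an aggregate results file. The "accuracy" key is a
# hypothetical example of the schema, not a documented field.
import json
from pathlib import Path

def load_metrics(path: str) -> dict:
    """Load a selfrag_e2e_results_{dataset}.json aggregate file."""
    return json.loads(Path(path).read_text())

# Example with a synthetic file:
#   Path("demo.json").write_text(json.dumps({"accuracy": 0.62}))
#   print(load_metrics("demo.json")["accuracy"])
```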

Running Tests

pytest tests/ -v

Single-Hop Dataset Setup

Download from Facebook DPR:

# Natural Questions
mkdir -p raw_data/nq && cd raw_data/nq
wget https://dl.fbaipublicfiles.com/dpr/data/retriever/biencoder-nq-dev.json.gz
gzip -d biencoder-nq-dev.json.gz
wget https://dl.fbaipublicfiles.com/dpr/data/retriever/biencoder-nq-train.json.gz
gzip -d biencoder-nq-train.json.gz
cd ../..

# TriviaQA
mkdir -p raw_data/trivia && cd raw_data/trivia
wget https://dl.fbaipublicfiles.com/dpr/data/retriever/biencoder-trivia-dev.json.gz
gzip -d biencoder-trivia-dev.json.gz
wget https://dl.fbaipublicfiles.com/dpr/data/retriever/biencoder-trivia-train.json.gz
gzip -d biencoder-trivia-train.json.gz
cd ../..

# SQuAD
mkdir -p raw_data/squad && cd raw_data/squad
wget https://dl.fbaipublicfiles.com/dpr/data/retriever/biencoder-squad1-dev.json.gz
gzip -d biencoder-squad1-dev.json.gz
wget https://dl.fbaipublicfiles.com/dpr/data/retriever/biencoder-squad1-train.json.gz
gzip -d biencoder-squad1-train.json.gz
cd ../..

# Wikipedia passages (shared corpus for all single-hop datasets)
mkdir -p raw_data/wiki && cd raw_data/wiki
wget https://dl.fbaipublicfiles.com/dpr/wikipedia_split/psgs_w100.tsv.gz
gzip -d psgs_w100.tsv.gz
cd ../..

# Process and subsample
python ./processing_scripts/process_nq.py
python ./processing_scripts/process_trivia.py
python ./processing_scripts/process_squad.py
python processing_scripts/subsample_dataset_and_remap_paras.py nq test 500
python processing_scripts/subsample_dataset_and_remap_paras.py trivia test 500
python processing_scripts/subsample_dataset_and_remap_paras.py squad test 500

Training the Classifier (Optional)

If you want to retrain the query complexity classifier:

# Generate training data from retrieval strategy predictions
SYSTEM=ircot_qa MODEL=flan-t5-xl DATASET=nq LLM_PORT_NUM=8010
bash run_retrieval_dev.sh $SYSTEM $MODEL $DATASET $LLM_PORT_NUM
# Repeat for all 6 datasets × 3 strategies (ircot_qa, oner_qa, nor_qa)

# Preprocess classifier training data
python ./classifier/preprocess/preprocess_silver_train.py flan_t5_xl
python ./classifier/preprocess/preprocess_silver_valid.py flan_t5_xl
python ./classifier/preprocess/preprocess_binary_train.py
python ./classifier/preprocess/concat_binary_silver_train.py

# Train
cd classifier
bash ./run/run_large_train_xl.sh
cd ..

# Generate predictions
python ./classifier/postprocess/predict_complexity_on_classification_results.py

Pre-computed classifier predictions are provided in predictions/classifier/.


Project Structure

├── commaqa/                    # Core inference library
│   ├── inference/
│   │   ├── self_rag/           # Self-RAG modules (added)
│   │   │   ├── relevance_checker.py   # Phase 1: embedding + LLM relevance
│   │   │   ├── answer_verifier.py     # Phase 2: support level verification
│   │   │   ├── confidence_scorer.py   # Phase 3: confidence scoring
│   │   │   ├── strategy_escalator.py  # Escalation routing logic
│   │   │   ├── config.py              # SelfRAGConfig dataclass
│   │   │   └── stats_tracker.py       # Metrics tracking
│   │   ├── configurable_inference.py  # Pipeline builder
│   │   ├── ircot.py                   # Strategy C (multi-step)
│   │   └── participant_execution_routed_selfrag.py  # Self-RAG participant
│   └── models/
│       ├── llm_client_generator.py    # LLM server client
│       └── gpt3generator.py           # OpenAI API client
├── classifier/                 # Query complexity classifier (t5-large)
├── retriever_server/           # Elasticsearch BM25 retriever
├── llm_server/                 # FLAN-T5 inference server (uvicorn)
├── tests/                      # Property-based + unit tests
├── evaluate_selfrag_e2e.py     # Self-RAG end-to-end evaluation
├── evaluate_final_acc.py       # Baseline Adaptive-RAG evaluation
├── base_configs/               # Jsonnet configs per dataset × strategy × model
├── predictions/                # Pre-computed baseline predictions
├── processed_data/             # Preprocessed test/dev JSONL files
└── requirements.txt

Performance Notes (macOS)

  • flan-t5-xl (~3B params): ~2-5 seconds per LLM call on CPU. Strategy C questions can take 60-90s each due to multi-step reasoning (up to 10 steps × multiple LLM calls).
  • Full 500-question dataset: ~1-2 hours for Strategy A/B heavy datasets, longer for Strategy C heavy ones.
  • All 6 datasets: ~6-12 hours total.
  • Memory: flan-t5-xl needs ~8-10 GB RAM. Elasticsearch needs ~2-4 GB depending on indices loaded.
  • Disk: ~12 GB for flan-t5-xl model cache, ~5 GB for Elasticsearch indices, ~2 GB for datasets.

Acknowledgement

Based on the IRCoT skeleton code and the Adaptive-RAG repository.

Citation

@inproceedings{jeong2024adaptiverag,
  author       = {Soyeong Jeong and Jinheon Baek and Sukmin Cho and Sung Ju Hwang and Jong Park},
  title        = {Adaptive-RAG: Learning to Adapt Retrieval-Augmented Large Language Models through Question Complexity},
  booktitle    = {NAACL},
  year         = {2024},
  url          = {https://arxiv.org/abs/2403.14403}
}
