An interactive web application for visualizing how prompt tokens influence LLM outputs through attention scaling interventions. It identifies overshadower tokens that encourage hallucinations and overshadowee tokens whose information is suppressed.
```bash
# Clone or copy the project files
cd overshadow_vis

# Create virtual environment (recommended)
python -m venv venv
source venv/bin/activate  # Linux/Mac
# or: venv\Scripts\activate  # Windows

# Install dependencies
pip install -r requirements.txt

# Start the server
python app.py --port 8888
```

Then open your browser to http://localhost:8888.
- Load Model: Click "Load Model" to load OLMo-2-7B (requires ~14GB GPU memory)
- Enter Prompt: Type your prompt in the text area
- Generate: Click "Generate Response" to get model output
- Select Tokens: Click on response tokens to select a sequence for analysis
- Compute Attributions: Click to run attention scaling interventions
- Analyze:
- View colored prompt tokens (red=overshadower, blue=overshadowee)
- Hover over tokens to see detailed impact metrics
- Greedy Decoding (Temperature=0): Generates deterministic model outputs
- Position Attention Scaling: Applies interventions on layers 22-28
- Interactive Token Selection: Click to select output sequences for analysis
- Real-time Attribution: Computes log probability changes under interventions
- Visual Classification:
- 🔴 Red = Overshadower (scaling up increases hallucinated sequence probability)
- 🔵 Blue = Overshadowee (scaling up decreases hallucinated sequence probability)
- Detailed Tooltips: Hover to see per-scale-factor impacts
A token is classified as an overshadower when:
- Scaling DOWN its positional attention (e.g., ×0.3) decreases the highlighted sequence's log probability
- Scaling UP its positional attention (e.g., ×3.0) increases the highlighted sequence's log probability
This means the token is "encouraging" the hallucinated output.
A token is classified as an overshadowee when:
- Scaling DOWN its positional attention increases the highlighted sequence's log probability
- Scaling UP its positional attention decreases the highlighted sequence's log probability
This suggests the token carries information that is being suppressed in favor of the hallucinated output.
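The two classification rules above can be sketched as a small helper. The function name, the example scale factors, and the `neutral` fallback are illustrative assumptions, not code from the app:

```python
def classify_token(deltas):
    """Classify a prompt token from its log-prob deltas under attention scaling.

    deltas: dict mapping scale factor -> delta log P of the highlighted sequence.
    The specific factors (0.3, 3.0) and the "neutral" fallback are illustrative.
    """
    down = [d for s, d in deltas.items() if s < 1.0]  # scale-down interventions
    up = [d for s, d in deltas.items() if s > 1.0]    # scale-up interventions
    if all(d < 0 for d in down) and all(d > 0 for d in up):
        return "overshadower"   # token encourages the hallucinated output
    if all(d > 0 for d in down) and all(d < 0 for d in up):
        return "overshadowee"   # token's information is being suppressed
    return "neutral"            # mixed or flat response to scaling

print(classify_token({0.3: -0.8, 3.0: 1.2}))  # prints "overshadower"
```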
The intervention scales the hidden states at specific positions before they enter the attention mechanism:
```python
scaled_hidden = hidden_states * scale_mask.view(1, -1, 1)
```

where `scale_mask[pos] = scale_factor` for target positions.
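As a concrete illustration of how such a mask could be built and broadcast over the hidden states, here is a toy sketch. The `make_scale_mask` helper and the tensor shapes are assumptions for this example, not the app's actual code:

```python
import torch

def make_scale_mask(seq_len, target_positions, scale_factor):
    # 1.0 everywhere except the target positions, which get scale_factor
    mask = torch.ones(seq_len)
    mask[list(target_positions)] = scale_factor
    return mask

# Toy example: 5-token sequence, scale positions 1 and 2 up by 3.0
hidden_states = torch.ones(1, 5, 4)           # (batch, seq_len, hidden_dim)
scale_mask = make_scale_mask(5, [1, 2], 3.0)  # (seq_len,)
scaled_hidden = hidden_states * scale_mask.view(1, -1, 1)
```

The `view(1, -1, 1)` reshapes the mask so it broadcasts across the batch and hidden dimensions, scaling every feature at the target positions uniformly.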
For a highlighted sequence of tokens, we compute:
log P(seq | prompt) = Σ_i log P(token_i | prompt, token_1 .. token_{i-1})

The delta is: Δ = log P(seq | prompt, intervention) − log P(seq | prompt, baseline)
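This computation can be sketched as follows. The function name and the toy logits are illustrative; in the app, the logits would come from the model's forward passes with and without the intervention:

```python
import torch
import torch.nn.functional as F

def sequence_log_prob(logits, token_ids):
    """Sum of per-token log probabilities for a highlighted sequence.

    logits:    (seq_len, vocab_size) logits predicting each highlighted token
    token_ids: (seq_len,) the highlighted token ids
    """
    log_probs = F.log_softmax(logits, dim=-1)
    return log_probs[torch.arange(len(token_ids)), token_ids].sum()

# Toy example: 3-token "sequence" over a 10-word vocabulary
torch.manual_seed(0)
baseline_logits = torch.randn(3, 10)
intervened_logits = torch.randn(3, 10)
tokens = torch.tensor([2, 7, 1])

delta = (sequence_log_prob(intervened_logits, tokens)
         - sequence_log_prob(baseline_logits, tokens))
```

A positive `delta` means the intervention made the highlighted sequence more likely; a negative one means it made it less likely.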
- Python 3.9+
- CUDA-capable GPU with ~14GB memory (for OLMo-2-7B)
- PyTorch 2.0+
- Transformers 4.35+
Prompt: "A famous rock musician from North Korea is named"
Expected Output: "Kim Jong-un ..."