sohv/non-identifiability

On the Non-Identifiability of Steering Vectors in Large Language Models


This repository provides empirical validation of non-identifiability in persona steering vectors for language models.

Installation

Setup

uv sync

HuggingFace Authentication

Authenticate with Hugging Face to access both models used in the experiments (Qwen/Qwen2.5-3B-Instruct and meta-llama/Llama-3.1-8B-Instruct):

huggingface-cli login

Configuration

Configuration files are located in the config/ directory:

  • prompts.json: Persona prompts for all traits
  • config.yml: Model configurations

Running Experiments

Available Traits: formality, politeness, sentiment, truthfulness, and agreeableness. You can specify any combination of these traits.

Test orthogonal component irrelevance:

python src/experiments/test_orthogonal.py --traits formality politeness sentiment truthfulness agreeableness --n_seeds 10 --model Qwen/Qwen2.5-3B-Instruct
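
As rough intuition for what "orthogonal component irrelevance" can mean, here is a toy sketch in pure Python. It is not the repository's implementation: it assumes a hypothetical linear readout direction `w` and shows that adding a component orthogonal to `w` changes the steering vector but not the readout. The names `dot`, `orthogonal_part`, `w`, `v`, and `v_alt` are illustrative assumptions.

```python
import random

def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))

def orthogonal_part(u, w):
    """Return the part of u orthogonal to w (one Gram-Schmidt step)."""
    scale = dot(u, w) / dot(w, w)
    return [ui - scale * wi for ui, wi in zip(u, w)]

rng = random.Random(0)
dim = 8
w = [rng.gauss(0, 1) for _ in range(dim)]   # hypothetical readout direction
v = [rng.gauss(0, 1) for _ in range(dim)]   # hypothetical steering vector

# Random perturbation with its component along w removed.
noise = orthogonal_part([rng.gauss(0, 1) for _ in range(dim)], w)
v_alt = [vi + ni for vi, ni in zip(v, noise)]

# The two vectors differ, yet the readout along w is the same
# up to floating-point error.
print(v != v_alt, abs(dot(w, v) - dot(w, v_alt)) < 1e-9)
```

In a real model the map from hidden state to output is nonlinear, so whether orthogonal components are irrelevant there is an empirical question, which is what the script above tests.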

Test alpha sweep (varying steering strength):

python src/experiments/alpha_sweep.py --traits formality politeness sentiment truthfulness agreeableness --alphas 0.0 0.5 1.0 2.0 --n_seeds 10 --model Qwen/Qwen2.5-3B-Instruct
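
The sweep can be pictured with a minimal sketch that assumes the common additive form of steering, `h + alpha * v`. Whether `alpha_sweep.py` applies steering in exactly this way is an assumption; the `steer` function and the toy `hidden` and `vector` values are hypothetical.

```python
def steer(hidden, vector, alpha):
    """Add alpha times the steering vector to a hidden state (additive steering)."""
    return [h + alpha * v for h, v in zip(hidden, vector)]

alphas = [0.0, 0.5, 1.0, 2.0]   # same values as the --alphas flag above
hidden = [0.2, -0.1, 0.4]       # toy hidden state
vector = [1.0, 0.0, -1.0]       # toy steering vector

for alpha in alphas:
    print(alpha, steer(hidden, vector, alpha))
```

With `alpha = 0.0` the hidden state is returned unchanged, which makes that setting a natural unsteered baseline for the sweep.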

Test multi-environment validation:

python src/experiments/multi_environment_validation.py --traits formality politeness sentiment truthfulness agreeableness --model Qwen/Qwen2.5-3B-Instruct

Test logit distance equivalence:

python src/experiments/logit_distance_equivalence_test.py --traits formality politeness sentiment truthfulness agreeableness

Test vector equivalence (non-orthogonal):

python src/experiments/test_vector_equivalence.py --models Qwen/Qwen2.5-3B-Instruct meta-llama/Llama-3.1-8B-Instruct --traits formality politeness sentiment truthfulness agreeableness

Measure null-space dimensionality:

python src/experiments/nullspace_dimensionality.py
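
For a linear map, the null-space dimension follows from the rank-nullity theorem: the number of columns minus the rank. The sketch below illustrates this on a small integer matrix using exact fractions. It is a toy illustration only; the repository's script may measure dimensionality differently (for example numerically, on model activations), and `matrix_rank`, `nullspace_dim`, and `W` are hypothetical names.

```python
from fractions import Fraction

def matrix_rank(rows):
    """Rank of a matrix via Gaussian elimination on exact fractions."""
    m = [[Fraction(x) for x in row] for row in rows]
    n_cols = len(m[0]) if m else 0
    rank = 0
    for col in range(n_cols):
        # Find a row at or below `rank` with a non-zero entry in this column.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        # Eliminate the column from all rows below the pivot row.
        for r in range(rank + 1, len(m)):
            factor = m[r][col] / m[rank][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank

def nullspace_dim(rows):
    """Rank-nullity: number of columns minus rank."""
    return len(rows[0]) - matrix_rank(rows)

W = [[1, 0, 2, 0],
     [0, 1, 1, 0],
     [1, 1, 3, 0]]   # third row is the sum of the first two

print(nullspace_dim(W))  # rank is 2, so this prints 2
```

Every direction in that null space can be added to an input without changing `W` applied to it, which is the linear-algebra core of non-identifiability.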

Test null-space spanning (subspace equivalence):

python src/experiments/nullspace_spanning.py --trait formality --n_individual_checks 50 --n_subspace_samples 5
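
Since any combination of the five traits is accepted, a small wrapper can check trait names before launching a script. This is a convenience sketch, not part of the repository: `build_command` and `AVAILABLE_TRAITS` are hypothetical names, and only the script path and flags shown in the commands above are used.

```python
import sys

AVAILABLE_TRAITS = ("formality", "politeness", "sentiment",
                    "truthfulness", "agreeableness")

def build_command(script, traits, extra_args=()):
    """Return an argv list for one experiment script, rejecting unknown traits."""
    unknown = [t for t in traits if t not in AVAILABLE_TRAITS]
    if unknown:
        raise ValueError(f"unknown traits: {unknown}")
    return [sys.executable, script, "--traits", *traits, *extra_args]

cmd = build_command(
    "src/experiments/test_orthogonal.py",
    ["formality", "sentiment"],
    ["--n_seeds", "10", "--model", "Qwen/Qwen2.5-3B-Instruct"],
)
print(" ".join(cmd[1:]))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the repository root to launch the experiment.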

Project Structure

.
├── config/
│   ├── prompts.json
│   ├── config.yml
│   └── style.yaml
├── src/
│   └── experiments/
│       ├── persona_vector_experiment.py
│       ├── test_orthogonal.py
│       ├── alpha_sweep.py
│       ├── multi_environment_validation.py
│       ├── logit_distance_equivalence_test.py
│       ├── test_vector_equivalence.py
│       ├── nullspace_dimensionality.py
│       └── nullspace_spanning.py
└── data/

Citation

If you use this repository in your research, please cite:

@article{venkatesh2026non,
  title={On the Non-Identifiability of Steering Vectors in Large Language Models},
  author={Venkatesh, Sohan and Mahendran Kurapath, Ashish},
  journal={arXiv e-prints},
  pages={arXiv--2602},
  year={2026}
}
