Fine-tuning affects Personas in LLMs

This repo contains experiments from my research internship at IIIT Hyderabad studying how fine-tuning shifts personality trait representations in LLMs particularly covering Big Five and Dark Triad traits. We demonstrate that personality traits are learnable, steerable directions in transformer activation space.

Setup

Environment Configuration

Create and activate a virtual environment:

python -m venv .venv
source .venv/bin/activate

Install dependencies:

pip install -r requirements.txt

Key dependencies include:

torch - PyTorch for model training
transformers - HuggingFace transformers library
datasets - For data loading and processing
peft - Parameter-efficient fine-tuning (LoRA support)
accelerate - Distributed training utilities
scikit-learn - For AUC and evaluation metrics
openai - For judge models in evaluation

Configure API keys (if using Anthropic or OpenAI models):

cp .env.example .env
# Fill in your API keys in the .env file

Dataset Preparation

The TRAIT dataset is pre-configured with personality trait annotations. Key dataset locations:

persona-anthropic/training_datasets/ - Training data for personality traits
TRAIT/TRAIT.json - Complete TRAIT evaluation dataset with personality scenarios

Methodology

Persona Vector Extraction: Difference-of-Means Approach

The core methodology uses the difference-of-means technique to extract steerable personality vectors from fine-tuned language models:

1. Data Collection

Collect model activations for persona-on vs persona-off behaviors
Use positive and negative system prompts (e.g., "You are a helpful assistant" vs "You are an evil assistant")
Generate responses to personality-targeted scenarios from the TRAIT dataset
Record activations at selected transformer layers

2. Vector Computation

Extract hidden states at each transformer layer
Calculate mean activations for personas: μ_on and μ_off
Compute persona vector: v = μ_on - μ_off
This creates a discoverable direction in activation space representing the personality trait

3. Vector Quality Assessment

AUC Scores: Measure separation between high and low trait conditions
- AUC > 0.8: Excellent separation
- AUC 0.6-0.8: Good separation
- AUC < 0.6: Poor separation
Mean Separation: Magnitude of difference in activations
Projection Scores: How well responses separate along the vector direction
Variance Patterns: Inter-layer consistency of trait representation

Fine-tuning Strategy

Models are fine-tuned on personality-targeted datasets using instruction-following approach:

Positive examples emphasize high-trait behaviors
Negative examples show low-trait behaviors
Multiple training datasets target different personality traits:
- Openness to experience
- Machiavellianism
- And other personality dimensions

Trait Measurement Framework (TRAIT Dataset)

The TRAIT dataset provides:

Personality scenarios: Situational context requiring personality-aligned responses
High trait responses: Behaviors aligned with the personality dimension
Low trait responses: Behaviors opposing the personality dimension
Evaluation queries: Questions to assess personality manifestation
Multiple traits and splits for comprehensive evaluation

Evaluation Pipeline

Baseline Evaluation: Test models without interventions
System Prompt Evaluation: Generate activations using:
- Positive instructions: "You are a [trait] assistant"
- Negative instructions: "You are a [opposite_trait] assistant"
Vector Extraction: Compute difference-of-means vectors
Steering Application: Add vectors to activations during inference at specified layers
Quality Verification: Assess changes in model responses

Results

Openness Trait Analysis

Vector Quality Metrics:

Layer 8: AUC ≈ 0.89 (Excellent separation)
Layer 9: AUC ≈ 0.99 (Near-perfect separation) ⭐
Layer 12: AUC ≈ 0.89 (Excellent separation)

Activation Separation:

Layer 8: Separation ≈ 7.2
Layer 9: Separation ≈ 8.5
Layer 12: Separation ≈ 9.9 (Highest magnitude) ⭐

Key Findings:

All tested layers show very high AUC scores (>0.8), indicating strong personality encoding
Layer 9 optimal for quality with AUC ≈ 1.0
Layer 12 optimal for magnitude with highest separation value
Consistent high performance indicates successful fine-tuning

Steering Recommendations:

For quality-focused control: Use Layer 9 vectors
For maximum effect: Use Layer 12 vectors
For robust steering: Either layer works excellently
Starting steering strength: ±1.0 (vectors are very powerful)

Machiavellianism Trait Analysis

Vector Quality Metrics:

Layers 8, 9, 11-14: Perfect or near-perfect AUC = 1.0
Layer 10: AUC ≈ 0.98 (Still excellent)
Outstanding linear separability between high and low conditions

Activation Separation:

Progressive improvement from layer 8 to 14
Best performance: Layers 13-14 with highest separation (~1.8)
Layer 8: Lowest separation (~1.49) but still strong

Variance Patterns:

Clear difference between persona-on and persona-off variance
Peak separation at layer 10
Evidence of layer-specific personality processing

Key Findings:

Perfect classification (AUC ≈ 1.0) across multiple layers
High mean differences indicate powerful steering potential
Multiple optimal layers provide flexibility
Well-trained model with robust personality encoding

Steering Recommendations:

For maximum effect: Use Layer 13-14 vectors
For balanced control: Use Layer 10 vectors
For baseline performance: Use Layer 8-9 vectors
Starting steering strength: ±2.0 (vectors are very powerful)

Cross-Trait Performance Summary

Metric	Openness	Machiavellianism
AUC Score	0.89-1.0	0.98-1.0
Separation	7.2-9.9	1.49-1.8
Layer Coverage	8-12	8-14
Fine-tuning Quality	✓ Excellent	✓ Excellent

Scientific Implications

1. Internal Personality Representation

Evidence: Perfect AUC scores demonstrate linear separability of personality traits
Finding: LLMs develop genuine, measurable internal representations for personality traits during fine-tuning, beyond surface-level mimicry

2. Layered Personality Processing

Evidence: Different layers encode personality with varying strengths and consistency
Finding: Early layers (8-9) provide baseline encoding; middle layers (10-11) show peak variance; late layers (13-14) maximize steering magnitude

3. Controllable Personality Emergence

Evidence: Consistent separability across layers with system prompt conditioning
Finding: Personality traits can be reliably induced and controlled through prompt engineering with fine-tuned models

4. Geometric Personality Space

Evidence: Clean linear separation enabling simple vector arithmetic (μ_high - μ_low)
Finding: Personality traits exist as discoverable directions in the model's activation space, enabling mathematical analysis and manipulation

Safety and Interpretability Outcomes

Auditability: Personality traits are measurable, enabling model audits for personality biases
Controllability: Precise personality control possible without model retraining
Predictability: Changes follow linear relationships across steering strengths
Mechanistic Understanding: Reveals how models encode personality across layers

Conclusion

Fine-tuning successfully embeds personality traits into language models as learnable, steerable directions in activation space. The consistent near-perfect AUC scores and high separation metrics demonstrate that:

Models develop genuine personality representations, not just behavioral mimicry
These representations are geometrically structured and mathematically manipulable
Different transformer layers specialize in personality encoding
Personality traits can be controlled precisely through activation steering

This provides a foundation for building controllable, interpretable AI systems with measurable personality properties.

Name		Name	Last commit message	Last commit date
Latest commit History 30 Commits
TRAIT		TRAIT
model_organisms		model_organisms
persona-analysis		persona-analysis
persona-anthropic		persona-anthropic
persona-vectors		persona-vectors
persona		persona
qwen2.5-0.6B/0.6B-auto-incorrect		qwen2.5-0.6B/0.6B-auto-incorrect
qwen2.5-1.5B		qwen2.5-1.5B
qwen3		qwen3
steering		steering
.gitignore		.gitignore
README.md		README.md
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Fine-tuning affects Personas in LLMs

Setup

Environment Configuration

Dataset Preparation

Methodology

Persona Vector Extraction: Difference-of-Means Approach

1. Data Collection

2. Vector Computation

3. Vector Quality Assessment

Fine-tuning Strategy

Trait Measurement Framework (TRAIT Dataset)

Evaluation Pipeline

Results

Openness Trait Analysis

Machiavellianism Trait Analysis

Cross-Trait Performance Summary

Scientific Implications

1. Internal Personality Representation

2. Layered Personality Processing

3. Controllable Personality Emergence

4. Geometric Personality Space

Safety and Interpretability Outcomes

Conclusion

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Fine-tuning affects Personas in LLMs

Setup

Environment Configuration

Dataset Preparation

Methodology

Persona Vector Extraction: Difference-of-Means Approach

1. Data Collection

2. Vector Computation

3. Vector Quality Assessment

Fine-tuning Strategy

Trait Measurement Framework (TRAIT Dataset)

Evaluation Pipeline

Results

Openness Trait Analysis

Machiavellianism Trait Analysis

Cross-Trait Performance Summary

Scientific Implications

1. Internal Personality Representation

2. Layered Personality Processing

3. Controllable Personality Emergence

4. Geometric Personality Space

Safety and Interpretability Outcomes

Conclusion

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages