gao-1/Towards-Intrinsic-Interpretability-of-Large-Language-Models

Towards Intrinsic Interpretability of Large Language Models:
A Survey of Design Principles and Architectures

Awesome License: MIT

The first systematic review of intrinsic interpretability for LLMs, categorizing approaches into five design paradigms: functional transparency, concept alignment, representational decomposability, explicit modularization, and latent sparsity induction.

[📄 Read the Paper]


🗓️ Release

  • [2026/05/18] 🗣️ Our paper was selected for an Oral Presentation at ACL 2026.
  • [2026/04/07] 🎉 Our paper was accepted to the ACL 2026 main conference.
  • [2026/01/16] 🎉 We launched the paper list for this survey.

📖 Overview

Figure (taxonomy framework): A taxonomy of intrinsic architectural designs for interpretable LLMs. We categorize existing approaches into five primary families based on their core mechanism for transparency.

While Large Language Models (LLMs) have achieved strong performance, their opaque internal mechanisms hinder trustworthiness. Unlike post-hoc explanation methods that interpret trained models through external approximations, Intrinsic Interpretability builds transparency directly into model architectures and computations.

This survey presents the first systematic review of recent advances in intrinsic interpretability for LLMs. We distinguish intrinsic methods by their structural fidelity: the model's internal computation is itself interpretable, without relying on external surrogates. We categorize existing approaches into five core design paradigms and discuss open challenges in balancing interpretability with performance at scale.

🏷 Taxonomy & Design Principles

To help navigate the paper list, we organize studies around the Five Core Design Principles proposed in our survey:

⚙️ 1. Functional Transparency

Advocates architectures whose computations are both structurally explicit and semantically meaningful (e.g., GAMs, KANs).
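
As a rough illustration of what "structurally explicit" can mean, here is a minimal sketch of an additive model in the GAM style, where the prediction is an intercept plus one term per feature. The feature names, shape functions, and intercept are hypothetical placeholders chosen for the example; in a real GAM the shape functions are learned from data.

```python
import math

# Hypothetical per-feature shape functions (assumed for illustration only).
SHAPE_FUNCTIONS = {
    "age": lambda v: 0.02 * (v - 40.0),
    "dose": lambda v: math.log1p(v),
}
INTERCEPT = 0.5  # hypothetical bias term


def additive_predict(features):
    """Return the prediction and the per-feature terms that sum to it."""
    contributions = {
        name: fn(features[name]) for name, fn in SHAPE_FUNCTIONS.items()
    }
    prediction = INTERCEPT + sum(contributions.values())
    return prediction, contributions


prediction, contributions = additive_predict({"age": 50.0, "dose": 0.0})
print(prediction, contributions)
```

Because the output is a plain sum, each entry in `contributions` can be read directly as that feature's effect on the prediction.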

🔗 2. Concept Alignment

Aligns latent representations with human-understandable concepts to reduce polysemanticity (e.g., Concept Bottleneck Models).
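
A minimal sketch of the concept-bottleneck idea, assuming a toy setup: the input is first mapped to named concept scores, and the label head sees only those scores. The concept names and all weights below are hypothetical, not taken from any listed paper.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical concepts and weights (assumed for illustration only).
CONCEPT_NAMES = ["mentions_price", "negative_tone"]
CONCEPT_WEIGHTS = [[2.0, 0.0, -1.0], [0.0, 3.0, 0.0]]  # one row per concept
LABEL_WEIGHTS = [1.5, -2.0]  # the label head reads concepts, never raw input


def cbm_forward(x, interventions=None):
    """Predict concepts from x, optionally override some, then predict the label."""
    concepts = [
        sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in CONCEPT_WEIGHTS
    ]
    for index, value in (interventions or {}).items():
        concepts[index] = value  # a human sets a concept by hand
    label = sigmoid(sum(w * c for w, c in zip(LABEL_WEIGHTS, concepts)))
    return dict(zip(CONCEPT_NAMES, concepts)), label


x = [1.0, 1.0, 0.0]
concepts, score = cbm_forward(x)
edited_concepts, edited_score = cbm_forward(x, interventions={1: 0.0})
print(concepts, score, edited_score)
```

Overriding `negative_tone` changes the label score through a path that is visible in the concept dictionary, which is the property the bottleneck is meant to provide.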

🧩 3. Representational Decomposability

Structures the latent space to disentangle representations into independent subspaces (e.g., Backpack models).
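
One way to picture a decomposable representation, loosely in the spirit of Backpack models, is an output built as a weighted sum of per-token sense vectors. The sketch below assumes toy two-dimensional sense vectors and hand-picked weights; both are hypothetical, and in a real model the vectors and weights would be learned.

```python
# Hypothetical sense vectors: SENSES[token][sense_index] is a 2-d vector.
SENSES = {
    "bank": [[1.0, 0.0], [0.0, 1.0]],
    "river": [[0.0, 2.0], [0.5, 0.5]],
}


def combine_senses(tokens, weights):
    """Sum weighted sense vectors; weights[j][l] scales sense l of token j."""
    output = [0.0, 0.0]
    contributions = []
    for j, token in enumerate(tokens):
        for l, sense in enumerate(SENSES[token]):
            part = [weights[j][l] * v for v in sense]
            contributions.append(((token, l), part))
            output = [o + p for o, p in zip(output, part)]
    return output, contributions


output, contributions = combine_senses(
    ["river", "bank"], [[0.5, 0.0], [0.0, 0.5]]
)
print(output)
for key, part in contributions:
    print(key, part)
```

Since the output is a sum, each `(token, sense)` pair has an exact, separable share of the final representation.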

📦 4. Explicit Modularization

Decomposes computation into distinct, independently functioning modules with traceable routing pathways (e.g., Mixture-of-Experts).
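
The traceable routing described here can be sketched with a toy top-k Mixture-of-Experts layer. The expert functions and router logits below are hypothetical; in a real MoE the router logits are produced by a learned gate applied to the input rather than passed in by hand.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical experts (assumed for illustration only).
EXPERTS = {
    "double": lambda x: 2.0 * x,
    "negate": lambda x: -x,
    "square": lambda x: x * x,
}


def route(x, router_logits, k=2):
    """Mix the top-k experts and return the output with a routing trace."""
    names = list(EXPERTS)
    probs = softmax(router_logits)
    top = sorted(range(len(names)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)  # renormalize over the selected experts
    trace = [(names[i], probs[i] / norm) for i in top]
    output = sum(weight * EXPERTS[name](x) for name, weight in trace)
    return output, trace


output, trace = route(3.0, [2.0, 0.0, 2.0], k=2)
print(output, trace)
```

The `trace` list records which experts handled the input and with what weight, so the routing decision can be inspected alongside the output.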

📉 5. Latent Sparsity Induction

Induces modularity and sparsity within standard dense architectures through regularization or gating (e.g., Sparse Training).
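
The two mechanisms named here, gating and regularization, can be sketched as follows. The unit multiplies a linear value by a sigmoid gate, as in gated linear units, and the penalty is a plain L1 term that would be added to a training loss. All weights and the penalty strength are hypothetical values chosen for the example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dot(weights, x):
    return sum(w * xi for w, xi in zip(weights, x))


def glu_unit(x, value_weights, gate_weights):
    """Gated unit: a linear value scaled by a sigmoid gate in (0, 1)."""
    gate = sigmoid(dot(gate_weights, x))
    return dot(value_weights, x) * gate, gate


def l1_penalty(weights, strength=0.01):
    """L1 regularizer; adding it to a loss pushes weights toward zero."""
    return strength * sum(abs(w) for w in weights)


x = [1.0, -1.0]
output, gate = glu_unit(x, value_weights=[2.0, 0.0], gate_weights=[-10.0, 10.0])
penalty = l1_penalty([1.0, -2.0, 0.0], strength=0.1)
print(output, gate, penalty)
```

For this input the gate is close to zero, so the unit is effectively switched off; inspecting `gate` shows which units are active for a given input.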


📑 Paper List

1. Functional Transparency

| Paper | Venue | Year | Link |
| --- | --- | --- | --- |
| Generalized Additive Models | Statistical Science | 1986 | Link |
| Accurate Intelligible Models with Pairwise Interactions (GA²M) | KDD | 2013 | Link |
| InterpretML: A Unified Framework for Machine Learning Interpretability (EBMs) | NeurIPS | 2019 | Link |
| GAMI-Net: An Explainable Neural Network Based on Generalized Additive Models (NAMs) | Pattern Recognition | 2021 | Link |
| KAN: Kolmogorov-Arnold Networks | ICLR | 2025 | Link |
| Bilinear MLPs Enable Weight-Based Mechanistic Interpretability | ICLR | 2025 | Link |

2. Concept Alignment

| Paper | Venue | Year | Link |
| --- | --- | --- | --- |
| Concept Bottleneck Models (Standard CBMs) | ICML | 2020 | Link |
| Stochastic Concept Bottleneck Models (SCBMs) | NeurIPS | 2024 | Link |
| Addressing Leakage in Concept Bottleneck Models (Hybrid CBMs) | NeurIPS | 2022 | Link |
| CB-LLM: Concept Bottleneck Large Language Models | ICLR | 2025 | Link |
| Label-Free Concept Bottleneck Models | ICLR | 2023 | Link |
| Concept Embedding Models (CEMs / IntCEMs) | NeurIPS | 2022 | Link |
| Codebook Features: Sparse and Discrete Interpretability for Neural Networks | ICML | 2024 | Link |

3. Representational Decomposability

| Paper | Venue | Year | Link |
| --- | --- | --- | --- |
| Backpack Language Models | ACL | 2023 | Link |
| Character-level Chinese Backpack Language Models | BlackboxNLP | 2023 | Link |
| LLM Pretraining with Continuous Concepts (CoCoMix) | arXiv | 2025 | Link |

4. Explicit Modularization (MoEs)

Intra-Expert Sparsity

| Paper | Venue | Year | Link |
| --- | --- | --- | --- |
| Mixture of Experts Made Intrinsically Interpretable (MoE-X) | ICML | 2025 | Link |
| Pushing Mixture of Experts to the Limit (MoV) | ICLR | 2024 | Link |
| Pushing Mixture of Experts to the Limit (MoLORA) | ICLR | 2024 | Link |

Fine-Grained Decomposition

| Paper | Venue | Year | Link |
| --- | --- | --- | --- |
| MONET: Mixture of Monosemantic Experts for Transformers | ICLR | 2025 | Link |
| Towards Interpretability Without Sacrifice (MxD) | arXiv | 2025 | Link |
| Parameter-Efficient Mixture-of-Experts Architecture (MPO-MoE) | COLING | 2022 | Link |

Semantically Aligned Routing

| Paper | Venue | Year | Link |
| --- | --- | --- | --- |
| Task-Based MoE for Multitask Multilingual Machine Translation | arXiv | 2023 | Link |
| Sparse MoE with Language Guided Routing (Lingual-SMoE) | ICLR | 2024 | Link |
| THOR-MoE: Hierarchical Task-Guided and Context-Responsive Routing | ACL | 2025 | Link |
| Apollo-MoE: Large-Scale Communication-Efficient Training | arXiv | 2025 | Link |
| Routing Manifold Alignment Improves Generalization (RoMA) | arXiv | 2025 | Link |
| Unified Sparse Mixture of Experts (USMoE) | arXiv | 2025 | Link |
| Advancing Expert Specialization for Better MoE (Orthogonality) | arXiv | 2025 | Link |

5. Latent Sparsity Induction

| Paper | Venue | Year | Link |
| --- | --- | --- | --- |
| Weight-Sparse Transformers Have Interpretable Circuits | arXiv | 2025 | Link |
| Language Modeling with Gated Convolutional Networks (GLUs) | ICML | 2017 | Link |

🌟 Reference

Please cite our paper if you find this survey useful for your research.

@misc{gao2026intrinsicinterpretabilitylargelanguage,
      title={Towards Intrinsic Interpretability of Large Language Models: A Survey of Design Principles and Architectures},
      author={Yutong Gao and Qinglin Meng and Yuan Zhou and Liangming Pan},
      year={2026},
      eprint={2604.16042},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2604.16042}, 
}
