Towards Intrinsic Interpretability of Large Language Models:
A Survey of Design Principles and Architectures

The first systematic review of intrinsic interpretability for LLMs, categorizing approaches into five design paradigms: functional transparency, concept alignment, representational decomposability, explicit modularization, and latent sparsity induction.

[📄 Read the Paper]

🗓️ Release

[2026/05/18] 🗣️ Our paper was selected for an Oral Presentation at ACL 2026.
[2026/04/07] 🎉 Our paper was accepted to the ACL 2026 main conference.
[2026/01/16] 🎉 We launched the paper list for this survey.

📖 Overview

Figure: A taxonomy of intrinsic architectural designs for interpretable LLMs. We categorize existing approaches into five primary families based on their core mechanism for transparency.

While Large Language Models (LLMs) have achieved strong performance, their opaque internal mechanisms hinder trustworthiness. Unlike post-hoc explanation methods that interpret trained models through external approximations, Intrinsic Interpretability builds transparency directly into model architectures and computations.

This survey presents the first systematic review of recent advances in intrinsic interpretability for LLMs. We distinguish intrinsic methods by their structural fidelity: the model's internal computation is itself interpretable, with no reliance on external surrogates. We categorize existing approaches into five core design paradigms and discuss open challenges in balancing interpretability with performance at scale.

🏷 Taxonomy & Design Principles

To help navigate the paper list, we organize studies around the Five Core Design Principles proposed in our survey:

⚙️ 1. Functional Transparency

Advocates architectures whose computations are both structurally explicit and semantically meaningful (e.g., GAMs, KANs).
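As a rough illustration of what "structurally explicit" can mean, the sketch below shows an additive model in the style of a GAM: the prediction is a bias plus one shape function per feature, so each feature's contribution can be read off directly. The feature names, shape functions, and bias are hypothetical placeholders, not taken from any paper in the list; a real GAM would learn the shape functions from data.

```python
import math

# Hypothetical shape functions, one per feature (assumed for illustration).
SHAPE_FUNCTIONS = {
    "age": lambda x: 0.02 * (x - 40),
    "dose": lambda x: math.log1p(x),
}
BIAS = 0.5  # hypothetical intercept


def gam_predict(features):
    """Return (prediction, per-feature contributions) for an additive model."""
    contributions = {
        name: shape(features[name]) for name, shape in SHAPE_FUNCTIONS.items()
    }
    return BIAS + sum(contributions.values()), contributions


prediction, contributions = gam_predict({"age": 50, "dose": 0.0})
for name, value in contributions.items():
    print(f"{name}: {value:+.3f}")
print(f"prediction: {prediction:.3f}")
```

Because the model is a plain sum, the contributions dictionary is the explanation; no separate attribution method is needed.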

🔗 2. Concept Alignment

Aligns latent representations with human-understandable concepts to reduce polysemanticity (e.g., Concept Bottleneck Models).
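The following is a minimal sketch of the concept bottleneck idea, assuming a two-stage linear model: the input is first mapped to named concept scores, and the label is predicted from those scores alone. All concept names and weights are hypothetical; a trained model would learn both stages.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical weights (assumed for illustration only).
CONCEPT_WEIGHTS = {
    "mentions_price": [2.0, -1.0, 0.0],
    "negative_tone": [0.0, 1.5, -0.5],
}
LABEL_WEIGHTS = {"mentions_price": 0.3, "negative_tone": -2.0}


def predict_concepts(x):
    """Stage 1: map raw features to human-readable concept scores in [0, 1]."""
    return {
        concept: sigmoid(sum(w_i * x_i for w_i, x_i in zip(weights, x)))
        for concept, weights in CONCEPT_WEIGHTS.items()
    }


def predict_label(concepts):
    """Stage 2: the label depends on the concept scores only."""
    return sigmoid(sum(LABEL_WEIGHTS[c] * v for c, v in concepts.items()))


x = [0.0, 2.0, 0.0]
concepts = predict_concepts(x)
score = predict_label(concepts)

# Test-time intervention: overwrite one concept and re-run the label head.
corrected = dict(concepts, negative_tone=0.0)
score_after = predict_label(corrected)
```

Since the label head sees nothing but the concepts, editing a concept score changes the prediction in a way that can be traced to that one edit.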

🧩 3. Representational Decomposability

Structures the latent space to disentangle representations into independent subspaces (e.g., Backpack models).
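One way to picture this is a simplified sketch in the spirit of Backpack models, under the assumption that each word owns a small set of context-independent sense vectors and the output is a non-negative weighted sum of them. The vocabulary, vectors, and weights below are hypothetical, and the weights are passed in by hand where a real model would compute them from context.

```python
# Hypothetical sense vectors: word -> list of k=2 sense vectors of dimension 2.
SENSE_VECTORS = {
    "bank": [[1.0, 0.0], [0.0, 1.0]],
    "river": [[0.0, 0.5], [0.0, 0.5]],
}
DIM = 2


def backpack_output(words, weights):
    """Sum sense vectors, where weights[j][l] >= 0 scales sense l of word j."""
    out = [0.0] * DIM
    for j, word in enumerate(words):
        for l, sense in enumerate(SENSE_VECTORS[word]):
            for d in range(DIM):
                out[d] += weights[j][l] * sense[d]
    return out


# Here the (assumed) context puts all of "bank"'s weight on its second sense.
out = backpack_output(["river", "bank"], [[0.5, 0.5], [0.0, 1.0]])
print(out)
```

Because the output is linear in the sense vectors, removing or rescaling a single sense has a directly predictable effect on the result.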

📦 4. Explicit Modularization

Decomposes computation into distinct, independently functioning modules with traceable routing pathways (e.g., Mixture-of-Experts).
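To make "traceable routing pathways" concrete, here is a minimal top-k routing sketch. It assumes a softmax router whose top-k gate values are renormalized, which is one common choice rather than the method of any specific paper listed here. The expert names and functions are hypothetical stand-ins for learned feed-forward experts.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical experts (assumed for illustration).
EXPERTS = {
    "syntax": lambda x: 2.0 * x,
    "math": lambda x: x + 1.0,
    "code": lambda x: -x,
}


def moe_forward(x, router_logits, k=2):
    """Route x to the top-k experts and return (output, routing trace)."""
    names = list(EXPERTS)
    probs = softmax(router_logits)
    top = sorted(range(len(names)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    # The trace records which experts fired and with what weight.
    trace = {names[i]: probs[i] / norm for i in top}
    output = sum(weight * EXPERTS[name](x) for name, weight in trace.items())
    return output, trace


output, trace = moe_forward(3.0, [2.0, 2.0, -5.0])
print(trace, output)
```

The returned trace is the interpretable artifact: for each input it names the modules that contributed and their mixing weights.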

📉 5. Latent Sparsity Induction

Induces modularity and sparsity within standard dense architectures through regularization or gating (e.g., Sparse Training).
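As an example of the gating route, the sketch below implements the elementwise form of a gated linear unit: each candidate value is multiplied by a sigmoid gate, so a strongly negative gate pre-activation switches that unit nearly off. The linear projections that would normally produce the two inputs are omitted, and the input values are made up for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def glu(values, gate_preactivations):
    """Elementwise gated linear unit: value * sigmoid(gate)."""
    return [v * sigmoid(g) for v, g in zip(values, gate_preactivations)]


# Gate of +10 passes the value, -10 suppresses it, 0 halves it.
hidden = glu([1.0, -2.0, 3.0], [10.0, -10.0, 0.0])
print([round(h, 4) for h in hidden])
```

Inspecting which gates are close to zero for a given input gives a simple view of which units are effectively inactive.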

📑 Paper List

1. Functional Transparency

Paper	Venue	Year	Link
Generalized Additive Models	Statistical Science	1986	Link
Accurate Intelligible Models with Pairwise Interactions (GA²M)	KDD	2013	Link
InterpretML: A Unified Framework for Machine Learning Interpretability (EBMs)	NeurIPS	2019	Link
GAMI-Net: An Explainable Neural Network Based on Generalized Additive Models (NAMs)	Pattern Recog.	2021	Link
KAN: Kolmogorov-Arnold Networks	ICLR	2025	Link
Bilinear MLPs Enable Weight-Based Mechanistic Interpretability	ICLR	2025	Link

2. Concept Alignment

Paper	Venue	Year	Link
Concept Bottleneck Models (Standard CBMs)	ICML	2020	Link
Stochastic Concept Bottleneck Models (SCBMs)	NeurIPS	2024	Link
Addressing Leakage in Concept Bottleneck Models (Hybrid CBMs)	NeurIPS	2022	Link
CB-LLM: Concept Bottleneck Large Language Models	ICLR	2025	Link
Label-Free Concept Bottleneck Models	ICLR	2023	Link
Concept Embedding Models (CEMs / IntCEMs)	NeurIPS	2022	Link
Codebook Features: Sparse and Discrete Interpretability for Neural Networks	ICML	2024	Link

3. Representational Decomposability

Paper	Venue	Year	Link
Backpack Language Models	ACL	2023	Link
Character-level Chinese Backpack Language Models	BlackboxNLP	2023	Link
LLM Pretraining with Continuous Concepts (CoCoMix)	ArXiv	2025	Link

4. Explicit Modularization (MoEs)

Intra-Expert Sparsity

Paper	Venue	Year	Link
Mixture of Experts Made Intrinsically Interpretable (MoE-X)	ICML	2025	Link
Pushing Mixture of Experts to the Limit (MoV)	ICLR	2024	Link
Pushing Mixture of Experts to the Limit (MoLORA)	ICLR	2024	Link

Fine-Grained Decomposition

Paper	Venue	Year	Link
MONET: Mixture of Monosemantic Experts for Transformers	ICLR	2025	Link
Towards Interpretability Without Sacrifice (MxD)	ArXiv	2025	Link
Parameter-Efficient Mixture-of-Experts Architecture (MPO-MoE)	COLING	2022	Link

Semantically Aligned Routing

Paper	Venue	Year	Link
Task-Based MoE for Multitask Multilingual Machine Translation	ArXiv	2023	Link
Sparse MoE with Language Guided Routing (Lingual-SMoE)	ICLR	2024	Link
THOR-MoE: Hierarchical Task-Guided and Context-Responsive Routing	ACL	2025	Link
Apollo-MoE: Large-Scale Communication-Efficient Training	ArXiv	2025	Link
Routing Manifold Alignment Improves Generalization (RoMA)	ArXiv	2025	Link
Unified Sparse Mixture of Experts (USMoE)	ArXiv	2025	Link
Advancing Expert Specialization for Better MoE (Orthogonality)	ArXiv	2025	Link

5. Latent Sparsity Induction

Paper	Venue	Year	Link
Weight-Sparse Transformers Have Interpretable Circuits	ArXiv	2025	Link
Language Modeling with Gated Convolutional Networks (GLUs)	ICML	2017	Link

🌟 Reference

Please cite our paper if you find this survey useful for your research.

@misc{gao2026intrinsicinterpretabilitylargelanguage,
      title={Towards Intrinsic Interpretability of Large Language Models: A Survey of Design Principles and Architectures},
      author={Yutong Gao and Qinglin Meng and Yuan Zhou and Liangming Pan},
      year={2026},
      eprint={2604.16042},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2604.16042}, 
}
