Towards Intrinsic Interpretability of Large Language Models:
A Survey of Design Principles and Architectures
The first systematic review of intrinsic interpretability for LLMs, categorizing approaches into five design paradigms: functional transparency, concept alignment, representational decomposability, explicit modularization, and latent sparsity induction.
- [2026/05/18] Our paper was selected for an Oral Presentation at ACL 2026.
- [2026/4/7] This paper was accepted to the ACL 2026 main conference.
- [2026/1/16] We launched the paper list for this survey.
Figure: A taxonomy of intrinsic architectural designs for interpretable LLMs. We categorize existing approaches into five primary families based on their core mechanism for transparency.
While Large Language Models (LLMs) have achieved strong performance, their opaque internal mechanisms hinder trustworthiness. Unlike post-hoc explanation methods that interpret trained models through external approximations, Intrinsic Interpretability builds transparency directly into model architectures and computations.
This survey presents the first systematic review of recent advances in intrinsic interpretability for LLMs. We distinguish intrinsic methods by their structural fidelity, ensuring that the model's internal computation is itself interpretable without relying on external surrogates. We categorize existing approaches into five core design paradigms and discuss open challenges in balancing interpretability with performance at scale.
To help navigate the paper list, we organize studies around the Five Core Design Principles proposed in our survey:
- Functional Transparency: Advocates architectures whose computations are both structurally explicit and semantically meaningful (e.g., GAMs, KANs).
- Concept Alignment: Aligns latent representations with human-understandable concepts to reduce polysemanticity (e.g., Concept Bottleneck Models).
- Representational Decomposability: Structures the latent space to disentangle representations into independent subspaces (e.g., Backpack models).
- Explicit Modularization: Decomposes computation into distinct, independently functioning modules with traceable routing pathways (e.g., Mixture-of-Experts).
- Latent Sparsity Induction: Induces modularity and sparsity within standard dense architectures through regularization or gating (e.g., Sparse Training).
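As a rough illustration of what "structurally explicit" means for functional transparency, the sketch below shows a toy additive model in the spirit of a GAM. The prediction is a bias plus a sum of per-feature shape functions, so each feature's contribution can be read off directly. The feature names, shape functions, and bias are hypothetical and hand-written for illustration; in a real GAM they would be learned from data, and this is not the implementation of any paper listed here.

```python
import math

# Hypothetical shape functions, one per feature (assumed, not learned).
SHAPE_FUNCTIONS = {
    "age": lambda x: 0.02 * (x - 40),
    "dose": lambda x: math.log1p(x),
}
BIAS = 0.5  # hypothetical intercept


def additive_predict(features):
    """Return the prediction and the contribution of each feature."""
    contributions = {
        name: shape(features[name]) for name, shape in SHAPE_FUNCTIONS.items()
    }
    # The output is just the bias plus the sum of the per-feature terms,
    # so the explanation is the computation itself, not a surrogate.
    prediction = BIAS + sum(contributions.values())
    return prediction, contributions


prediction, contributions = additive_predict({"age": 50, "dose": 0.0})
for name, value in contributions.items():
    print(f"{name}: {value:+.3f}")
print(f"prediction: {prediction:.3f}")
```

Because the model has no interaction terms, changing one feature changes only its own contribution, which is the property that makes this family easy to inspect.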
| Paper | Venue | Year | Link |
|---|---|---|---|
| Generalized Additive Models | Statistical Science | 1986 | Link |
| Accurate Intelligible Models with Pairwise Interactions (GA²M) | KDD | 2013 | Link |
| InterpretML: A Unified Framework for Machine Learning Interpretability (EBMs) | NeurIPS | 2019 | Link |
| GAMI-Net: An Explainable Neural Network Based on Generalized Additive Models (NAMs) | Pattern Recog. | 2021 | Link |
| KAN: Kolmogorov-Arnold Networks | ICLR | 2025 | Link |
| Bilinear MLPs Enable Weight-Based Mechanistic Interpretability | ICLR | 2025 | Link |
| Paper | Venue | Year | Link |
|---|---|---|---|
| Concept Bottleneck Models (Standard CBMs) | ICML | 2020 | Link |
| Stochastic Concept Bottleneck Models (SCBMs) | NeurIPS | 2024 | Link |
| Addressing Leakage in Concept Bottleneck Models (Hybrid CBMs) | NeurIPS | 2022 | Link |
| CB-LLM: Concept Bottleneck Large Language Models | ICLR | 2025 | Link |
| Label-Free Concept Bottleneck Models | ICLR | 2023 | Link |
| Concept Embedding Models (CEMs / IntCEMs) | NeurIPS | 2022 | Link |
| Codebook Features: Sparse and Discrete Interpretability for Neural Networks | ICML | 2024 | Link |
| Paper | Venue | Year | Link |
|---|---|---|---|
| Backpack Language Models | ACL | 2023 | Link |
| Character-level Chinese Backpack Language Models | BlackboxNLP | 2023 | Link |
| LLM Pretraining with Continuous Concepts (CoCoMix) | ArXiv | 2025 | Link |
| Paper | Venue | Year | Link |
|---|---|---|---|
| Mixture of Experts Made Intrinsically Interpretable (MoE-X) | ICML | 2025 | Link |
| Pushing Mixture of Experts to the Limit (MoV) | ICLR | 2024 | Link |
| Pushing Mixture of Experts to the Limit (MoLORA) | ICLR | 2024 | Link |
| Paper | Venue | Year | Link |
|---|---|---|---|
| MONET: Mixture of Monosemantic Experts for Transformers | ICLR | 2025 | Link |
| Towards Interpretability Without Sacrifice (MxD) | ArXiv | 2025 | Link |
| Parameter-Efficient Mixture-of-Experts Architecture (MPO-MoE) | COLING | 2022 | Link |
| Paper | Venue | Year | Link |
|---|---|---|---|
| Task-Based MoE for Multitask Multilingual Machine Translation | ArXiv | 2023 | Link |
| Sparse MoE with Language Guided Routing (Lingual-SMoE) | ICLR | 2024 | Link |
| THOR-MoE: Hierarchical Task-Guided and Context-Responsive Routing | ACL | 2025 | Link |
| Apollo-MoE: Large-Scale Communication-Efficient Training | ArXiv | 2025 | Link |
| Routing Manifold Alignment Improves Generalization (RoMA) | ArXiv | 2025 | Link |
| Unified Sparse Mixture of Experts (USMoE) | ArXiv | 2025 | Link |
| Advancing Expert Specialization for Better MoE (Orthogonality) | ArXiv | 2025 | Link |
| Paper | Venue | Year | Link |
|---|---|---|---|
| Weight-Sparse Transformers Have Interpretable Circuits | ArXiv | 2025 | Link |
| Language Modeling with Gated Convolutional Networks (GLUs) | ICML | 2017 | Link |
Please cite our paper if you find this survey useful for your research.
@misc{gao2026intrinsicinterpretabilitylargelanguage,
title={Towards Intrinsic Interpretability of Large Language Models: A Survey of Design Principles and Architectures},
author={Yutong Gao and Qinglin Meng and Yuan Zhou and Liangming Pan},
year={2026},
eprint={2604.16042},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2604.16042},
}