Large language models display impressive capabilities. However, for the most part, the mechanisms by which they do so are unknown. The black-box nature of models is increasingly unsatisfactory as they advance in intelligence and are deployed in a growing number of applications. Our goal is to reverse engineer how these models work on the inside, so we may better understand them and assess their fitness for purpose.
The challenges we face in understanding language models resemble those faced by biologists. Living organisms are complex systems which have been sculpted by billions of years of evolution. While the basic principles of evolution are straightforward, the biological mechanisms it produces are spectacularly intricate. Likewise, while language models are generated by simple, human-designed training algorithms, the mechanisms born of these algorithms appear to be quite complex.
Article: https://transformer-circuits.pub/2025/attribution-graphs/biology.html
Blog + video: https://www.anthropic.com/research/tracing-thoughts-language-model