
The rapid spread of artificial intelligence (AI) across fields ranging from medicine to social systems has raised growing concerns about how AI makes decisions. Even experts admit that the internal workings of these systems often remain a “black box,” especially in critical applications.
To address this challenge, researchers are adopting methods inspired by biology and neuroscience. One key approach, known as mechanistic interpretability, aims to trace internal processes within AI models during task execution. Tools developed by Anthropic visualize neural network activity in a way comparable to MRI scans of the human brain.
Another approach mirrors biological organoid research by creating specialized networks such as sparse autoencoders, which are easier to analyze than large language models (LLMs).
Researchers are also using chain-of-thought monitoring, where AI systems explain the reasoning behind their actions. According to OpenAI researcher Bowen Baker, this method has proven effective in identifying undesirable model behaviors.
Scientists warn that future AI systems may become so complex—especially if developed by AI itself—that understanding their behavior could become nearly impossible.
Keywords