Can We Break Open AI’s Black Box?
Our understanding of how artificial intelligence ‘reasons’ is startlingly limited. Researchers are starting to fix that.
This is an excerpt from an article I wrote that was originally published in the Chicago Booth Review, a publication of the University of Chicago Booth School of Business.
Demis Hassabis is not a chemist, yet he was one of three recipients of the 2024 Nobel Prize in Chemistry. The prize recognized major contributions to the study of protein structures. Hassabis, a computer scientist who runs Google’s AI research lab DeepMind, and his fellow honoree John Jumper, who also works at DeepMind, developed an AI prediction model that the chair of the Nobel committee said fulfilled “a 50-year-old dream: predicting protein structures from their amino acid sequences.” Another committee member called it “one of the really first big scientific breakthroughs of AI.”
For decades, uncovering the shape of a single protein meant months, even years, of painstaking lab work and hundreds of thousands of dollars in research and development, with no guarantee of success.
With DeepMind’s deep-learning model AlphaFold2, revealing these structures takes minutes, not months. The DeepMind team trained AlphaFold2 with data from lab-determined protein shapes, along with extra examples it created on its own from patterns found in huge protein-sequence databases. The model examined protein shapes and amino acid sequences to determine the physical and evolutionary constraints dictating protein structure.
The team has since predicted more than 200 million protein structures and made them freely available, creating a global resource for scientific research.
AlphaFold2 is one in a growing list of scientific breakthroughs driven by AI. It also represents a new paradigm in scientific discovery: AI models that achieve breakthroughs in ways their creators can’t fully explain. While traditional science builds understanding through hypotheses we can test and verify, these AI systems are discovering solutions by finding patterns in data that remain opaque to human analysis.
There is currently no easy way to examine what AlphaFold2 learned about protein evolution. Its inner workings, and those of other AI systems making important contributions to science and society, remain hidden.
As these models get better, the gap between their performance and our understanding of them is only widening.
Nonetheless, AI adoption is racing ahead. Modern AI works incredibly well. The latest models can perform tasks that, 10 years ago, sounded like science fiction: generating movie-quality videos from a few lines of text, writing entire codebases for working apps, even driving cars without human input.
These advances have quickly entered our personal and professional lives. But this rapid deployment of black-box systems creates a fundamental tension in our relationship with AI: We’re becoming dependent on tools that have reasoning we can’t verify or build upon.
“You can go as crazy as you want and build the biggest, deepest neural network and still have interpretability baked in from the beginning.”
— Bryon Aragam
Even the architects of modern AI admit to being troubled by their lack of insight.
Dario Amodei, a cofounder of the AI lab Anthropic and the company’s CEO, wrote in April 2025: “People outside the field are often surprised and alarmed to learn that we do not understand how our own AI creations work. They are right to be concerned: this lack of understanding is essentially unprecedented in the history of technology.”
This has made interpretability, the science of cracking open AI’s “mind,” a pressing priority, and a new wave of research is taking a novel approach. AI-interpretability research has long been a form of detective work done after an AI system has already been trained and deployed. By then, the AI has already “decided” which data matter most when making its predictions.
Instead of trying to work backward to understand AI models after they’re built, scientists are now using new research frameworks to build interpretability into the training process from the start—a notion considered impossible just a few years ago.
Reverse engineering AI
When researchers try to parse the reasoning of an AI model after it has already been fully developed, they are essentially trying to reverse engineer a system that, in many ways, built itself, and attempting to uncover the internal patterns and definitions it formed along the way.
“People think these things are built systems, but they’re really not built per se,” says Ted Sumers, a researcher at Anthropic. “It’s much more like growing a plant than building a building.”
Understanding how a model “grows” has become a central focus for researchers.
One branch of this work, called mechanistic interpretability, maps which neurons activate when a user asks AI a question, and traces how information flows through the network’s intricate layers.
Anthropic, a rival to OpenAI, has been at the vanguard of this approach, dissecting neural networks by studying the roles of individual neurons and circuits.
This has yielded practical results. Teams can, without damaging overall performance, identify and remove specific circuits that lead to biased or unwanted outputs. They can also locate the exact parts of a model that enforce safety rules—like refusal to answer harmful queries—and adjust those directly. Since the techniques go down to the neuron level, they offer a way to audit whether a model is memorizing sensitive data. Together, these advances make models easier to edit, test, and trust as they continue to grow more capable.
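The idea of removing a circuit can be sketched in miniature. The two-unit network below is hand-built purely for illustration, with made-up weights and labels; real models have billions of weights and no such cleanly separated circuits. Zeroing out one hidden unit removes the behavior it carries while leaving the rest intact.

```python
# A toy "ablation": zero out one hidden unit to see which behavior it carries.
# This tiny network is hand-built for illustration, not a real
# mechanistic-interpretability pipeline.

def relu(x):
    return max(0.0, x)

# Hidden unit 0 responds to feature A, unit 1 to feature B (weights set by hand).
W_hidden = [[1.0, 0.0],
            [0.0, 1.0]]
W_out = [2.0, 3.0]  # the output mixes both hidden units

def forward(x, ablate=None):
    hidden = [relu(sum(w * xi for w, xi in zip(row, x))) for row in W_hidden]
    if ablate is not None:
        hidden[ablate] = 0.0  # surgically remove one "circuit"
    return sum(w * h for w, h in zip(W_out, hidden))

x = [1.0, 1.0]
print(forward(x))            # 5.0: both circuits contribute
print(forward(x, ablate=0))  # 3.0: circuit 0 removed, circuit 1 unaffected
```

In real systems the hard part is the step this sketch skips: discovering which of billions of weights form the circuit in the first place.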
Still, it’s like peering into a house through a keyhole.
The struggle to understand AI
Unlike traditional software, which relies on top-down, hard-coded rules, a neural network—a type of artificial-intelligence model that’s often described as resembling the structure of a human brain—learns from the bottom up, ingesting training data and making internal adjustments based on what it observes. Such models learn patterns from massive datasets, some with trillions of data points.
For example, to learn to identify pictures of dogs, a neural network reviews millions of labeled images of the animals rather than relying on a fixed set of definitions.
During training, the model guesses what each image shows and compares its answer to the correct label. If it guesses “not dog” for an image labeled “dog,” it recognizes the mistake and adjusts its internal settings to reduce the error. This process repeats again and again.
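That guess-compare-adjust cycle can be sketched in a few lines. The toy loop below trains a one-layer model on made-up "furriness" and "four legs" features; it illustrates the mechanism, not how any production model is actually trained.

```python
# A toy "guess, compare, adjust" training loop.
# Features and data are invented for illustration only.

import math

# Each example: (furriness, four-leggedness), label 1 = "dog", 0 = "not dog"
data = [
    ((0.9, 1.0), 1), ((0.8, 1.0), 1), ((0.7, 0.9), 1),
    ((0.1, 0.0), 0), ((0.2, 0.1), 0), ((0.0, 0.2), 0),
]

w = [0.0, 0.0]  # internal "dials" (weights), start neutral
b = 0.0
lr = 1.0        # learning rate: how far each mistake turns the dials

def predict(x):
    z = w[0] * x[0] + w[1] * x[1] + b
    return 1 / (1 + math.exp(-z))   # probability the image shows a dog

for epoch in range(1000):           # the process repeats again and again
    for x, y in data:
        p = predict(x)              # guess
        err = p - y                 # compare to the correct label
        w[0] -= lr * err * x[0]     # adjust each dial to reduce the error
        w[1] -= lr * err * x[1]
        b -= lr * err

print(round(predict((0.9, 1.0)), 2))  # high probability: "dog"
print(round(predict((0.1, 0.0)), 2))  # low probability: "not dog"
```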
After enough examples, it becomes very good at identifying dogs. But it does so in a way that's fundamentally different from how humans process and recall information: it relies on statistical analysis to identify patterns, not on mental imagery.
AI doesn’t “see” the way we do.
It sees the world through numerical representations of data. All types of data that AI works with—whether text, images, or audio—are converted into numbers that the system can mathematically manipulate. For example, the sentence, “AI sees the world through numerical representations of data” is converted into: [17527, 27432, 290, 2375, 1819, 57979, 63700, 328, 1238] according to OpenAI’s tool, which displays how a piece of text might be tokenized by a language model. (Different models tokenize the same words differently.)
Turning data into strings of numbers makes them usable by AI models. Computers may not be able to see or read in the traditional sense, but they can run mathematical operations on numbers. That’s how AI detects patterns, compares inputs, and ultimately learns from data instead of relying on fixed rules.
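A toy version of that conversion: real language models use learned subword tokenizers such as byte-pair encoding, but the basic idea of mapping text to integer IDs can be shown with a simple made-up word-level vocabulary.

```python
# A toy illustration of turning text into numbers. Real tokenizers are
# learned from data and split words into subword pieces; this word-level
# vocabulary is invented purely to show the idea.

vocab = {}  # word -> integer ID, assigned in order of first appearance

def tokenize(text):
    ids = []
    for word in text.lower().split():
        if word not in vocab:
            vocab[word] = len(vocab)  # a new word gets the next free ID
        ids.append(vocab[word])
    return ids

sentence = "AI sees the world through numerical representations of data"
print(tokenize(sentence))
# -> [0, 1, 2, 3, 4, 5, 6, 7, 8]
# Everything the model does afterward -- comparing inputs, detecting
# patterns, learning -- operates on these numbers, not on the text itself.
```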
These numbers aren’t stored in a database or an Excel spreadsheet. They exist in what’s called a high-dimensional space.
We can visualize and understand the difference between two and three dimensions. Schoolchildren are taught that a rectangle has two dimensions—length and width. A cube adds another dimension: depth. It’s much harder for us to grasp a fourth dimension.
But AI can understand hundreds, even thousands, of dimensions.
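One way to build intuition for such spaces is to measure distance in them. The sketch below uses random 300-dimensional vectors as stand-ins for learned features, which is an invented setup for illustration: in high dimensions, unrelated random directions end up nearly perpendicular, while a slightly perturbed copy of a vector stays close to the original.

```python
# Sketch: items live as points in a high-dimensional space, and "similar"
# means geometrically close. These random vectors are stand-ins for
# learned features, chosen purely to illustrate the geometry.

import math
import random

random.seed(0)
DIMS = 300  # far beyond the three dimensions we can visualize

def random_vector():
    return [random.gauss(0, 1) for _ in range(DIMS)]

def nudge(v, scale=0.1):
    # a slightly perturbed copy: a "nearby" point in the space
    return [x + random.gauss(0, scale) for x in v]

def cosine(a, b):
    # similarity of direction: 1.0 = identical, 0.0 = perpendicular
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

dog = random_vector()
another_dog = nudge(dog)        # clusters near the first dog
fire_hydrant = random_vector()  # an unrelated direction in the space

print(round(cosine(dog, another_dog), 2))   # close to 1.0: very similar
print(round(cosine(dog, fire_hydrant), 2))  # near 0.0: unrelated
```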
To navigate these vast high-dimensional spaces, a model learns during training which directions matter most by adjusting its internal weights—think of them as groups of dials that turn up or down to chart a course through this mathematical terrain. As training progresses, turning up the weights for "furry" and "four legs" steers it deeper into dog country, while dialing down irrelevant features such as "fire hydrants" keeps the model from wandering into dead ends.
Through training, the model groups together the features that typically appear in pictures of dogs without being told what exactly a dog is. However, there is no index, as you might find in the back of a book, that you can consult to find the exact “dial” or weight corresponding to doglike features; those features are intertwined across the model’s complex architecture. Researchers have to go find them.
This gets at the core challenge of AI interpretability. Researchers know how to build and train these models. But they often can’t see what, exactly, in an image causes the model to adjust one specific dial out of billions.
Understanding how a model makes its predictions can help illuminate how much we can trust it—or, if necessary, how to fix it when its behavior deviates from what we want or expect.

