
BY SEAN WILLIAMS, PhD
Santa Fe
Actually, I won’t use the term AI. It’s too broad: it can mean anything from zero-player tic-tac-toe up through Colonel Sanders in The Matrix Reloaded. Instead I’ll say deep learning, which is the technique that’s in vogue today.
Deep learning is based on an old idea called artificial neural networks (ANNs), which have nothing to do with neurons. They’re actually a funky notation for closed-form mathematical functions. In deep learning, these functions have very high dimension and are very nonlinear.
“Training” or “learning” is just regression, aka curve fitting. Stochastic gradient descent is an algorithm for searching a curve’s parameter space for an error minimum between candidate curves and some example data.
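For the concrete-minded, that search can be sketched in a few lines of Python. Everything here is invented for illustration: the line y = 3x + 1, the noise level, and the step size are my choices, not anything canonical.

```python
import numpy as np

# Invented toy data: noisy samples of the line y = 3x + 1
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 50)
y = 3 * x + 1 + rng.normal(0, 0.1, 50)

# Candidate curve: a line a*x + b. "Training" is a search for (a, b).
a, b = 0.0, 0.0
lr = 0.1  # step size, chosen by hand

for _ in range(200):
    for i in rng.permutation(len(x)):  # "stochastic": one sample at a time
        err = (a * x[i] + b) - y[i]    # residual at this sample
        a -= lr * 2 * err * x[i]       # step down the squared-error gradient
        b -= lr * 2 * err

# (a, b) now sits near an error minimum, roughly (3, 1)
```

No neurons required: the whole apparatus is a parameterized function and a loop that nudges the parameters toward smaller error.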
Regression can be fine, but only if it meets a surprisingly nebulous criterion. The assumption is that you’ve measured some phenomenon, and that phenomenon has a mathematical character to it. Curve fitting works great if the type of curve you’ve selected is equivalent to the mathematical nature of the underlying phenomenon.
How do you know if you’ve met this criterion?
Well, you don’t. That’s why philosophers of science talk about verification and falsification, and it turns out neither approach is up to the task.
We do know what happens if you make an obviously wrong guess when selecting your curve. If you pick something too rigid, your results look bad and we call it underfitting. If you pick something too flexible, your results look great, but they fall apart in the real world. We call this overfitting.
There’s a test for overfitting, except it’s also not very good. Use some of your data for fitting the curve, and use the leftovers to test the quality of the fit. This still leaves you vulnerable to overfitting, because you can only test over the data you thought to gather with the instruments you have available.
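A minimal sketch of that test, assuming an invented noisy parabola as the “phenomenon”:

```python
import numpy as np

# Invented "phenomenon": a noisy parabola
rng = np.random.default_rng(1)
x = np.sort(rng.uniform(-1, 1, 40))
y = x**2 + rng.normal(0, 0.05, 40)

# Use half the data for fitting, keep the leftovers for testing
x_fit, y_fit = x[::2], y[::2]
x_test, y_test = x[1::2], y[1::2]

c = np.polyfit(x_fit, y_fit, 2)  # fit a quadratic to the fitting half only
fit_mse = np.mean((np.polyval(c, x_fit) - y_fit) ** 2)
test_mse = np.mean((np.polyval(c, x_test) - y_test) ** 2)
# Comparable fit and test errors suggest, but cannot prove, a reasonable fit
```

Note what the test actually covers: forty points I chose to sample, over an interval I chose to sample them on. Nothing more.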
Now remember, the validity of curve fitting is really down to the correspondence between curve and phenomenon. So what’s the closed-form function for human language? Of course, that’s not actually how large language models (LLMs) work. They model the sequencing of words in (hopefully human-written) text that happened to find its way into their databases.
So what’s the closed-form function for the entire corpus of human writing that’s survived to the modern age? Still a bizarre question, huh?
Writing is a highly path dependent activity. We write from our knowledge and experiences, which we develop over our entire lives. And our lives take place in a world that’s shaped by thousands of years of human history, and billions of years of natural history.
A closed-form function is never going to be a good fit for a path-dependent phenomenon. Yet something is missing from this account, because the user experiences between Cleverbot and ChatGPT are enormously different. What really separates deep learning from prior ANN work?
Deep learning allows you to calibrate overfitting, which is a very strange, very bad idea.
The textbook example (the textbook being Deep Learning, Goodfellow et al., 2016) considers the standard case of having ten points that roughly lie on a parabola, and fitting a ninth-degree polynomial.
Mean-squared error (MSE) is 0, but if you find an eleventh point, error probably skyrockets.
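The scenario is easy to reproduce. This is a sketch, not the textbook’s code; the noise level and the location of the eleventh point are my choices.

```python
import numpy as np

# Ten points that roughly lie on a parabola
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 10)
y = x**2 + rng.normal(0, 0.05, 10)

c = np.polyfit(x, y, 9)  # ten coefficients, ten points: the fit is exact
mse = np.mean((np.polyval(c, x) - y) ** 2)  # numerically zero

# An "eleventh point" just past the data; we can score it only because
# we generated the data ourselves and know the true curve
x11 = 1.1
gap = abs(np.polyval(c, x11) - x11**2)  # typically dwarfs the noise level
```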
They noticed that the coefficient magnitudes of their ninth-degree polynomial are large, which they argue is symptomatic of overfitting.
Ironically, the only reason they actually know they’ve overfit is because they generated the data, so they know the data are quadratic. Which is also how they know their curve should look parabolic. Large coefficients are not, in fact, an indicator of correctness of fit.
Anyway, their solution to this overfitting is to keep coefficient values low.
So they redefine error within the regression algorithm as E(c) = MSE(c) + ψc·c.
c is a vector of coefficients that defines a candidate polynomial, i.e., p(x) = c1x^9 + c2x^8 + … + c9x + c10.
Curve fitting means finding the specific c that minimizes E(c).
ψ is a “hyperparameter,” which controls the extent to which large coefficient magnitudes are penalized while checking different coefficient values for an error-minimizing c. ψ is an input to the regression algorithm, meaning if you’re the one “training” the model, the value of ψ is completely up to you.
You can now fiddle with ψ and rerun the regression until you get a ninth-degree polynomial that locally masquerades as a parabola. The self-fulfilling prophecy has come true.
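Their penalized error has a closed-form minimizer, better known as ridge regression. Here is a sketch with invented data, showing the one thing the penalty actually guarantees: raising ψ shrinks the coefficient vector.

```python
import numpy as np

# Ten noisy points on a parabola, degree-nine candidate curves
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 10)
y = x**2 + rng.normal(0, 0.05, 10)

X = np.vander(x, 10)  # columns x^9, x^8, ..., x, 1
n = len(x)

def fit(psi):
    # Closed-form minimizer of E(c) = MSE(c) + psi * (c . c)
    return np.linalg.solve(X.T @ X / n + psi * np.eye(10), X.T @ y / n)

c0 = fit(0.0)    # plain least squares: exact interpolation, big coefficients
c1 = fit(1e-2)   # penalized: the coefficient vector shrinks
```

Shrunken coefficients make the curve flatter and smoother, which is what “locally masquerades as a parabola” cashes out to. Nothing in the procedure consults the phenomenon.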
You can add whatever and however many additional terms and hyperparameters you want into the regression algorithm. I’ve heard that current LLMs have hundreds of thousands of them, each an ad hoc barnacle that’s controlled by the model designers.
Since you can change hyperparameters and rerun the regression until you’re happy with the results, you’ve completely invalidated the already meager test for overfitting. Fiddling with hyperparameters is still part of the regression process, so the testing data are not being independently evaluated. Instead, the testing data act more like training data for your hyperparameter selection, and there are no true testing data.
You can even manipulate error calculations to produce results that look good in whatever by-hand testing you do. After all, we’re talking about billions of dimensions of nonlinearity, and changing error calculations gives you a completely different overfit. If you iterate enough, you’re bound to find one you like.
Which means that by-hand testing isn’t really testing, for exactly the same reasons.
The problem is, the “true” goal isn’t minimizing error or having a satisfactory testing experience. The goal is assessing how well the type of curve you selected lines up with the mathematical character of the phenomenon you’re studying.
Mixing model calibration with model validation means you’ve discarded the original phenomenon in favor of overfitting your testing protocols.
Sidestepping validation isn’t clever, it’s profoundly ignorant.
Issuing trillions of dollars of debt to sidestep validation at scale is crazy.
Since the true sign of overfitting is getting weird results when the model is deployed into the real world, we should now acknowledge that “hallucination” is just a self-serving term for overfitting error (aka generalization error). We should also acknowledge that hallucinations are here to stay, because deep learning is nothing more or less than a system for overfitting overfitting.
But here’s where we get to the real problem, or at least, the real problem that isn’t financial or environmental.
Because deep learning output is a spurious recombination of its input, it’s tremendously unreliable. Proponents of deep learning will often concede that you have to be an expert to be able to separate the wheat from the chaff.
In principle, that’s a Dunning-Kruger trap. How good are you at debugging spurious code? And if you are that good, how is it not faster to just write the code yourself? Why haven’t you already metaprogrammed the boilerplate away? Template Haskell is calling; will you pick up the phone?
In practice, just read about what’s been happening with Microsoft’s efforts to vibe code Windows 11.
As for theory, the negative answers to the decision and halting problems already rule out the possibility of writing a computer program that can prove the correctness of general computer programs. Vibe coding in particular has always been a doomed expedition.
Now imagine being a scientist and doing the thing they tell you not to do with regression, actually going further by becoming an active participant in model overfitting, and passing this off as research. Then imagine this work being given the Turing Award and the Nobel Prize.
This topic could grow to hundreds if not thousands of pages, because it opens onto the biggest philosophical problem of our times: the epistemology of complex systems.
That said, deep learning is the philosophical equivalent of crouching at the starting line, hearing the pistol fire, and then suffering an unrelated cerebral embolism and falling over dead.
