The Jagged Frontier of Medical AI

Anthropic published a report recently called “When AI builds itself,” signaling that they are seeing early signs of recursive self-improvement: AI systems starting to build the next AI systems and a possible speedup in the march toward AGI. But even in their most aggressive takeoff scenario, they make a concession about where all that speed runs out:

Achieving recursive improvement alone does not suggest an immediate change in how industrial production occurs, societies organize, or markets function. More intelligence can’t learn what a drug does over decades of use, can’t hold elections sooner than a constitution dictates, and can’t turn a stranger into an old friend in a weekend. For most people, the felt pace of this future will still be set by the bottlenecks, even if the laboratory upstream runs at the speed of compute.

The loop runs fast at Anthropic because software is the rare domain that hands intelligence everything it needs to improve: the work is text it can read, and the test says pass or fail in seconds. Try something, check it, adjust, repeat, at machine speed. Medicine offers neither half consistently, which is why even a system that builds its own successors will not run through clinics the way it runs through code.

Even if AGI arrives (a system that reasons across medicine and biology at least as well as the best clinicians and scientists), healthcare will not transform all at once. Suppose an AI system designs a drug tomorrow, something better than the best we have. We would not know it worked for years, because the proof is a clinical outcome that arrives on the body’s schedule, not the model’s. This is the wall of verifiability: in medicine, checking whether the answer was right is slow, expensive, and often years away. A model can be brilliant and still wait a decade to be proven so, and intelligence does not move that clock.

Verifiability is the wall most people see. The second one sits underneath it. To verify anything, the work has to exist somewhere a model can read it, and most clinical reasoning was never written down. The EHR note records that a patient was sent home, not the read on how she looked or the threshold that would have changed the plan. That missing layer, the legibility of the work, is the deeper constraint, and the two walls fall at very different speeds. Legibility gives way fast, because capturing and integrating is exactly what intelligence is good at. Verification gives way slowly, because biology keeps its own clock and some questions have no single right answer at all.

Put the two walls on a grid and you get four kinds of medical work: a map of where AI can be delegated, where it remains supervised, and where full ownership may never make sense. The question is not whether AI helps. It already helps nearly everywhere: in decision support, hypothesis generation, literature synthesis done in an afternoon. The map is about where it gets to act without us checking. That is the only place the recursive loop runs unbounded, because the loop compounds only where the work is legible and the answer verifiable. Everywhere else it stalls, no matter how much intelligence you pour in.
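The grid can be written down as a small lookup. The sketch below is only an illustration: the function name `classify_work` and the yes/no framing are assumptions made for the example (the essay treats both walls as matters of degree), while the four labels are the essay's own names for the corners.

```python
from enum import Enum


class Corner(Enum):
    """The four corners of the grid, named after the sections of this essay."""
    FAST_LOOP_AUTOMATION = "Fast-Loop Automation"
    HIDDEN_DECISION_LOGIC = "Hidden Decision Logic"
    DELAYED_VALIDATION = "Delayed Validation"
    VALUE_LADEN_JUDGMENT = "Value-Laden Judgment"


def classify_work(legible: bool, quickly_verifiable: bool) -> Corner:
    """Place a task on the grid (hypothetical helper, not from the report).

    legible            -- the work exists as a record a model can read
    quickly_verifiable -- a fast, external check says whether it was right
    """
    if legible and quickly_verifiable:
        return Corner.FAST_LOOP_AUTOMATION
    if quickly_verifiable:
        # Outcome is visible, but the reasoning was never written down.
        return Corner.HIDDEN_DECISION_LOGIC
    if legible:
        # Output is captured cleanly, but the real check is years away.
        return Corner.DELAYED_VALIDATION
    # Clears neither wall.
    return Corner.VALUE_LADEN_JUDGMENT


# Examples drawn from the essay's own cases.
billing_code = classify_work(legible=True, quickly_verifiable=True)
watchful_waiting = classify_work(legible=False, quickly_verifiable=True)
binding_affinity = classify_work(legible=True, quickly_verifiable=False)
fifteen_year_care = classify_work(legible=False, quickly_verifiable=False)
```

Only the first corner returns a task that can plausibly run without a person checking it; the other three keep someone in the loop for different reasons.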

Fast-Loop Automation

Where the work is highly legible and easily verifiable, AI will increasingly run on its own. This is the corner that looks like code. The task is explicit, captured as a record a model can read, and the check comes back fast: a billing code matches the rulebook or it does not, a flagged nodule matches the biopsy or it does not. Not quite a programmer’s unit test, but close enough that a wrong answer gets caught quickly and cheaply.

That fast, external check is what lets a person step back. In the other three corners someone has to stay in the loop, because the output cannot yet be trusted on its own. Here it can, even if the check is sometimes a rulebook or an adjudicated label rather than ultimate clinical truth. Coding, documentation, imaging triage, lab interpretation: this is where AI moves from helping a clinician to doing the task itself.

It is also where the AGI story gets oversold. The systems that look most autonomous here are often the least general: narrow tools in a forgiving environment that hands them a clean answer key. Early autonomy here tells you the work was verifiable, not that the model was wise. This is the corner of medicine that behaves like software, the one place the recursive loop Anthropic is watching can actually run. And when people say medicine is being automated, this is mostly the corner they mean. It is a small one.

Hidden Decision Logic

Much of the clinically consequential reasoning in medicine sits where the outcome is visible but the thinking behind it is not. A clinician sends a patient home with watchful waiting instead of admitting her. The chart says stable, return if worse. Whether that was the right call becomes clear within days. She either comes back sicker or she does not, so the verification exists. What is missing is the reasoning that produced it: the read on how she looked, the threshold that would have changed the plan. The model can see the outcome and still never learn the decision, because the inputs to it were never written down.

The cleanest way to picture this corner is a gap between world models. The clinician runs on a far richer view of the patient than the model has. She sees the patient, hears the conversation, carries years of history in her head, and goes looking for more when something feels off. The model starts with the note, the labs, and whatever imaging got shared. So an AI’s first job here is not reasoning. It is seeing: the clinical state, the context, the patient’s preferences, the uncertainty. It cannot reason over a patient it never perceives.

One of the biggest things AI is doing in healthcare right now is making the work legible, building a data foundation that was never possible before. Ambient systems capture the conversation that used to evaporate at the end of a visit. Multimodal models stitch together the labs, imaging, notes, and device data a clinician now assembles in her head. Medicine rarely fails to generate data. The data dies in separate systems built for billing, not for care, and a model that integrates across those seams is performing an act of legibility.

What it captures is not neutral. The record made medicine legible to billing once, and the next layer could repeat that at higher resolution, capturing what was said and missing what was meant. Whoever builds it to serve the reasoning rather than the invoice will own more than they realize. It does not exist yet because no one is paid to build it: the economics still reward capturing charges, not capturing reasoning.

And capturing the data is not the same as learning good medicine from it. A decision linked to an outcome is a training signal only if you can tell why the decision was made and whether it caused the outcome. Without that causal discipline, a model trained on the record learns to imitate practice patterns, not to practice well. Legibility is the precondition, not the finish line.

Once the gap closes, the model may outperform the average clinician in protocolized care, not because it is wiser, but because it is more consistent. It does not forget the guideline or let its follow-up drift with fatigue. Its failures run the other way: missing context, or a confident wrong answer where a person would have hesitated. But where the patient state is observable and the evidence is stable (hypertension, diabetes, preventive care, the sepsis bundle), consistency is a machine advantage, and a hard one for a tired human to match. What stays human is the judgment the evidence does not cover: the patient whose conditions pull the guidelines in opposite directions, the case where today’s consensus turns out to be wrong. It is smaller than the phrase “art of medicine” suggests, but it is real.

Delayed Validation

Other work is captured cleanly and still cannot be judged for years. A predicted binding affinity, a trial design, a treatment hypothesis: experts can read the output and often agree it looks strong. Whether it was right takes years, because the real test is a clinical outcome that turns on biology no quick check can stand in for. This is the corner where the “cure all diseases” hype keeps landing.

Drug discovery is the clearest case, and also where AI is doing some of its most exciting work, which is exactly why it is worth being precise about what is fast and what is not. Demis Hassabis and John Jumper shared half of the 2024 Nobel Prize in Chemistry for predicting protein structures, with David Baker taking the other half for protein design. AI-designed molecules dock beautifully against their targets, yet many that clear that check still fail when they meet a human body. Early molecules may pass Phase 1 at higher rates than usual, but Phase 2 success still looks like the industry average, on small numbers, and the late-stage curve has not clearly moved yet.

The reason is that discovery runs on a chain of checks, each one a stand-in for the next. Docking asks whether a molecule should bind, a binding assay whether it does, a cell assay whether that produces a useful effect, an animal model whether the effect survives in a living system, the trials whether it helps real people. A model can climb the early rungs fast and still fall at the top, because what makes a molecule look good at the bench is not what makes it work in a person.

So AI accelerates the early stack and compresses years of bench work: generating targets, designing assays, optimizing molecules, choosing whom to enroll. What it cannot compress is the final proof. That is why it spreads fast as help and slowly as something we trust alone, since a scientist reading an AI hypothesis checks it as she goes and the model need not be right by itself. An AI could be right here long before anyone could prove it. In software, intelligence shows itself by closing the loop; here it can outrun the loop, and a recursive engine running at the speed of compute gains nothing when the next real answer is five years out. Those years are not biology alone, either. The regulator and the liability built on top stretch the wait further still. Strip all of that away and the biological clock keeps running. That clock is the floor, and no amount of intelligence lifts it.

Value-Laden Judgment

The last corner clears neither wall. The work never becomes a clean artifact, and the outcome never reduces to a single number, because there is no single right answer, only better and worse versions of one. This is the corner where AI can help with everything and own nothing.

A primary care physician manages a patient with five chronic conditions across fifteen years. No one visit holds the work. The judgment lives in which problem to push this time and which to let ride, when to raise advance directives, how to titrate against preferences that shift as the patient’s life shifts, when to admit the plan is no longer about adding years and start preparing the family. The arc is the work, and today’s records capture it badly. Even with a perfect record, no score settles whether she did well. Days at home, medication burden, survival, cost: each is a partial signal, none is the answer, because the answer folds in values the patient may never have fully stated.

This kind of work does not yield to time the way a drug does. Waiting twenty years tells you whether a drug caused harm. It does not tell you whether a doctor handled a dying patient’s last year well, because there was never one right way to handle it. Vinod Khosla has argued for over a decade that the part of medicine AI cannot take is the human part, the empathy and the accountability, and the map gives that intuition a structure. It is not that people are mystically better at care. It is that values and trajectory do not reduce to a function anything can score, and accountability has to live somewhere when no score can decide. There is an institutional version of the same point: even when an answer is technically checkable, someone still has to own what counts as an acceptable error and who is liable when it happens. An AI can improve every input to this work and still not own the decision, because nothing would tell it whether it chose well, and someone has to answer for the choice.

Back to the Loop

The recursive loop is real, and it runs at the speed of compute in the place Anthropic measured it, because software clears both walls at once. Anthropic is not naive about bottlenecks, either. They name Amdahl’s law directly, they note that human code review has already become their constraint, and their most aggressive scenario puts humans on oversight and verification of an AI-run lab. But they only ever meet the bottleneck inside software, where verification can itself be automated or sped up. Medicine is where that same argument turns brutal, because the constraint moves to biology and to human values, and neither can be automated away.
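For reference, Amdahl’s law in its standard form is written out below. Reading p as the share of a pipeline’s elapsed time that faster intelligence can compress, and s as the speedup on that share, is an illustrative assumption applied to this essay’s argument, not a formula taken from the report.

```latex
% Amdahl's law: overall speedup S when a fraction p of the work
% is accelerated by a factor s and the remaining (1 - p) is not.
S(s) = \frac{1}{(1 - p) + \dfrac{p}{s}},
\qquad
\lim_{s \to \infty} S(s) = \frac{1}{1 - p}.
```

As a purely hypothetical example, if half of the elapsed time is a fixed biological wait (p = 0.5), the overall speedup can never exceed 2x, however large s becomes. The unaccelerated share sets the ceiling, which is the sense in which the clock, not the intelligence, becomes the constraint.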

The walls are not fixed, and the optimists are right that AI is pushing on both. It is tearing down legibility fastest, because capture feeds on itself: every gain makes the next layer cheaper, the same compounding the report describes. It is chipping at verification too, with better trials and surrogate endpoints that turn a five-year wait into months. But the verification wall has a floor the legibility wall does not, and the floor is biological time. You earn a surrogate by running the slow study once before you can trust the fast one. Intelligence moves the wall; it does not move the clock. So the two keep falling at different speeds, and the gap between them is the whole story.

None of this means AI does little in medicine. It will do an enormous amount, and soon: documentation, synthesis, triage, protocolized care, much of the early science, faster than the field expects. That is the legibility half of the loop, moving at something close to software speed. The verification half is the part that stays slow, because the patient’s future arrives on its own schedule and some questions never had a single answer to verify against.

The next decade of medical AI will turn not only on how smart the models get, but on how much of medicine we can make observable enough to learn from and verifiable enough to trust. Intelligence is becoming cheap. Clinical truth is not. The intelligence is coming. The question is whether reality will let us check it.
