Your Gut is a Value Function

Tim Cook once said the most important lesson he learned at Apple was to listen to his gut. That stuck with me, mostly because I had to learn it the hard way myself.

Early in my career, I thought “trust your gut” was code for “I don’t have the data.” But over time, I learned to listen to my gut, just like Cook describes doing at pivotal points of his career. The decisions I have regretted most were ones where I didn’t give voice to an uneasy feeling. And the best decisions have often been intuitive in the face of ambiguity and imperfect information.

It’s easy to dismiss this gut feeling as mystical, but now I realize it’s computational.

Your gut is a compressed summary of long‑horizon experiences that your conscious mind can’t read yet. In machine‑learning terms, it behaves a lot like a value function, the internal machinery that estimates how good or bad a situation is and where it’s likely to lead.
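
If the analogy feels abstract, here is a minimal sketch of a value function trained by temporal-difference updates; the states, rewards, and learning rate are invented for illustration and are not drawn from any neuroscience or RL paper cited here.

```python
# Minimal sketch of a value function (illustrative states and rewards only).
# V[s] estimates the long-run return from situation s; a temporal-difference
# update nudges the estimate toward "reward plus discounted next value"
# after every experience.

from collections import defaultdict

def td_update(V, state, reward, next_state, alpha=0.1, gamma=0.99):
    """One TD(0) update: move V[state] toward its bootstrapped target."""
    target = reward + gamma * V[next_state]
    V[state] += alpha * (target - V[state])

V = defaultdict(float)  # starts with no opinions, like an untrained gut
experiences = [("ignored-unease", -1.0, "regret"),
               ("spoke-up", +0.5, "relief")]
for state, reward, next_state in experiences:
    td_update(V, state, reward, next_state)

print(dict(V))  # a compressed summary of how good each situation tends to be
```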

That idea turns “trust your gut” from self‑help cliché into a serious claim about how intelligence works, a concept that clicked for me while listening to Ilya Sutskever talk about a patient who lost his emotions.

The man who couldn’t decide

In a fantastic recent interview, Ilya Sutskever recounts a famous case from neurologist Antonio Damasio. A patient suffered damage to the brain region that processes emotion. After his surgery, his IQ was normal. His memory was perfect. He could list the pros and cons of any option.

But his life fell apart.

He spent twenty minutes deciding which pen to use. He couldn’t prioritize. Without the machinery to feel the difference between “good” and “bad,” he was trapped in an infinite loop of reasoning.

Damasio’s conclusion was that we don’t use logic to value things. We use “somatic markers”—emotional tags attached to past experiences. When a similar situation arises, your body replays a trace of the feeling: regret, relief, shame. That physical response is a shortcut. It solves the stopping problem.

As Sutskever suggests: emotions are the value function. Remove them, and you don’t get a super-rational agent. You get a system that can’t land the plane.

Why AI can’t fake this

This is exactly where the gap between human judgment and current AI lies.

Models read oceans of text to learn patterns. Then we use Reinforcement Learning to tell them what “good” looks like. But the rewards are short-term and dense: Did you solve the puzzle? Did the user like the summary?

That is a completely different animal from the human loop.

Your emotional value function is trained on the messy, long-term reality of your actual life. It integrates feedback that arrives years later—as a broken relationship, a career derailment, or the quiet satisfaction of doing the right thing. It associates tones of voice, deal structures, and clinical smells with outcomes that haven’t happened yet.

It’s not infallible—it carries bias and trauma—but it is the only model you have that has been trained on reality at scale.

Trying to approximate that with current AI training methods is like trying to learn “good parenting” from a dataset of multiple-choice quizzes. We give models crude rules: don’t be toxic, be helpful. That’s useful. But it’s nowhere near a value system that understands that a decision can be technically correct and still be completely wrong.

The intelligence of the gut

This matters most in domains with slow feedback and high stakes—strategy, medicine, policy.

AI is already a powerful tool for reasoning. It can out-read and out-simulate us, dig through mountains of data, and catch patterns you’d miss. But we shouldn’t confuse reasoning with judgment.

The pattern-recognition parts of our jobs are being automated. The piece that remains scarce is the long-horizon, emotionally anchored sense of what is actually worth doing.

“Trust your gut” isn’t an abandonment of reason; it’s a reminder that there is a layer of intelligence we haven’t yet reproduced in silicon. Your emotional life is a value function continuously trained by reality over years, while today’s AI systems still optimize short‑term, narrow proxies on curated benchmarks. For the decisions that actually shape a life or an organization, that quiet hum in your chest is not something we’re going to outsource anytime soon. It is the distinction between calculating what we can do, and knowing what we should do.

The New Computer in the Clinic

Andrej Karpathy describes the current moment as the rise of a new computing paradigm he calls Software 3.0, in which large language models emerge not just as clever chatbots but as a “new kind of computer” (an “LLM OS”). In this model, the LLM is the processor, its context window is the RAM, and a suite of integrated tools are the peripherals. We program this new machine not with rigid code, but with intent, expressed in plain language. This vision is more than a technical curiosity; it is a glimpse into a future where our systems don’t just execute commands, but understand intent.

Healthcare is the perfect test bed for this paradigm. For decades, the story of modern medicine has been a paradox: we are drowning in data, yet starved for wisdom. The most vital clinical information—the physician’s reasoning, the patient’s narrative, the subtle context that separates a routine symptom from a looming crisis—is often trapped in the dark matter of unstructured data. An estimated 80% of all health data lives in notes, discharge summaries, pathology reports, and patient messages. This is the data that tells us not just what happened, but why.

For years, this narrative goldmine has been largely inaccessible to computers. The only way to extract its value was through the slow, expensive, and error-prone process of manual chart review, or a “scavenger hunt” through the patient chart. What changes now is that the new computer can finally read the story. An LLM can parse temporality, nuance, and jargon, turning long notes into concise, cited summaries and spotting patterns across documents no human could assemble in time.

But this reveals the central conflict of digital medicine. The “point-and-click” paradigm of the EHR, while a primary driver of burnout, wasn’t built merely for billing. It was a necessary, high-friction compromise. Clinical safety, quality reporting, and large-scale research depend on the deterministic, computable, and unambiguous nature of structured data. You need a discrete lab value to fire a kidney function alert. You need a specific ICD-10 code to find a patient for a clinical trial. The EHR forced clinicians to choose: either practice the art of medicine in the free-text narrative (which the computer couldn’t read) or serve as a data entry clerk for the science of medicine in the structured fields. Too often, the choice has been the latter, and much of that burnout traces back to it. This false dichotomy has defined the limits of healthcare IT for a generation.

The LLM, by itself, cannot solve this. As Karpathy points out, this new “computer” is deeply flawed. Its processor has a “jagged” intelligence profile—simultaneously “superhuman” at synthesis and “subhuman” at simple, deterministic tasks. More critically, it is probabilistic and prone to hallucination, making it unfit to operate unguarded in a high-stakes clinical environment. This is why we need what Karpathy calls an “LLM Operating System”. This OS is the architectural “scaffolding” designed to manage the flawed, probabilistic processor. It is a cognitive layer that wraps the LLM “brain” in a robust set of policy guardrails, connecting it to a library of secure, deterministic “peripheral” tools. And this new computer is fully under the control of the clinician who “programs” it in plain language.

This new architecture is what finally resolves the art/science conflict. It allows the clinician to return to their natural state: telling the patient’s story.

To see this in action, imagine the system reading a physician’s note: “Patient seems anxious about starting insulin therapy and mentioned difficulty with affording supplies.” The LLM “brain” reads this unstructured intent. The OS “policy layer” then takes over, translating this probabilistic insight into deterministic actions. It uses its “peripherals”—its secure APIs—to execute a series of discrete tasks: it queues a nursing call for insulin education, sends a referral to a social worker, and suggests adding a structured ‘Z-code’ for financial insecurity to the patient’s problem list. The art of the narrative is now seamlessly converted into the computable, structured science needed for billing, quality metrics, and future decision support.
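
A toy sketch of that hand-off, with a stand-in for the LLM call and a hard-coded policy table; none of the function names, APIs, or the confidence threshold come from a real EHR integration.

```python
# Toy sketch: a probabilistic LLM read of a note is translated by a
# deterministic policy layer into discrete, auditable actions.
# extract_intents() and the POLICY table are placeholders, not real APIs.

NOTE = ("Patient seems anxious about starting insulin therapy and "
        "mentioned difficulty with affording supplies.")

def extract_intents(note: str) -> list[dict]:
    """Stand-in for the LLM 'brain': structured intents with confidence scores."""
    return [
        {"intent": "insulin_education_needed", "confidence": 0.91},
        {"intent": "financial_insecurity", "confidence": 0.87},
    ]

POLICY = {  # deterministic mapping from intent to reviewable actions
    "insulin_education_needed": [("queue_task", "nursing_call_insulin_education")],
    "financial_insecurity": [("queue_task", "social_work_referral"),
                             ("suggest_code", "Z59.86")],  # financial-insecurity Z-code
}

def run_policy(note: str, threshold: float = 0.8) -> list[tuple[str, str]]:
    actions = []
    for item in extract_intents(note):
        if item["confidence"] >= threshold:  # guardrail: drop low-confidence reads
            actions.extend(POLICY.get(item["intent"], []))
    return actions

print(run_policy(NOTE))
```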

This hybrid architecture—a probabilistic mind guiding a deterministic body—is the key. It bridges the gap between the LLM’s reasoning and the high-stakes world of clinical action. It requires a healthcare-native data platform to feed the LLM reliable context and a robust action layer to ensure its outputs are safe. This design directly addresses what Karpathy calls the “spectrum of autonomy.” Rather than an all-or-nothing “agent,” the OS allows for a tunable “autonomy slider.” In a “co-pilot” setting, the OS can be set to only summarize, draft, and suggest, with a human clinician required for all approvals. In a more autonomous “agent” setting, the OS could be permitted to independently handle low-risk, predefined tasks, like queuing a routine follow-up.
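
One way to picture the autonomy slider is as configuration rather than code. The sketch below is illustrative only; the profile names, task names, and permissions are invented.

```python
# Illustrative "autonomy slider": the same OS under different permission
# profiles. Task names and permissions are invented for the example.

AUTONOMY_PROFILES = {
    "co_pilot": {
        "execute_without_review": set(),  # every action needs human approval
        "draft": {"summaries", "patient_messages", "orders"},
    },
    "agent": {
        "execute_without_review": {"routine_followup_scheduling"},  # low-risk, predefined
        "draft": {"summaries", "patient_messages", "orders"},
    },
}

def requires_human_approval(profile: str, task: str) -> bool:
    return task not in AUTONOMY_PROFILES[profile]["execute_without_review"]

assert requires_human_approval("co_pilot", "routine_followup_scheduling")
assert not requires_human_approval("agent", "routine_followup_scheduling")
```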

The journey is just beginning in healthcare, and my guess is that we will see different starting points for this across the ecosystem. But the path forward is illuminated by a clear thesis: the “new computer” for healthcare changes the very unit of clinical work. We are moving from a paradigm of clicks and codes—where the human serves the machine—to one of intent and oversight. The clinician’s job is no longer data entry. It is to practice the art of medicine, state their intent, and supervise an intelligent system that, for the first time, can finally understand the story.

America’s Patchwork of Laws Could Be AI’s Biggest Barrier in Care

AI is learning medicine, and early state rules read as if they were written for a risky human, not a new kind of software. That mindset could make sense in the first wave, but it might also freeze progress before we see what these agents can do. When we scaled operations at Carbon Health, the slowest parts were administrative and regulatory: months of licensure, credentialing, and payer enrollment that shifted at each state line. AI agents could inherit the same map, with fifty versions of permissions and disclosures layered on top of consumer‑protection rules. Without a federal baseline, the most capable tools might be gated by local paperwork rather than clinical outcomes, and what should scale nationally could move at the pace of the slowest jurisdiction.

What I see in state action so far is a conservative template built from human analogies and fear of unsafe behavior. One pattern centers on clinical authority. Any workflow that could influence what care a patient receives might trigger rules that keep a licensed human in the loop. In California, SB 1120 requires licensed professionals to make final utilization review decisions, and proposals in places like Minnesota and Connecticut suggest the same direction. If you are building automated prior authorization or claims adjudication, this likely means human review gates, on-record human accountability, and adverse‑action notices. It could also mean the feature ships in some states and stays dark in others.

A second pattern treats language itself as medical practice. Under laws like California’s AB 3030, if AI generates a message that contains clinical information for a patient, it is regulated as though it were care delivery, not just copy. Unless a licensed clinician reviews the message before it goes out, the provider must disclose to the patient that it came from AI. That carve-out becomes a design constraint. Teams might keep a human reviewer in the loop for any message that could be interpreted as advice — not because the model is incapable, but because the risk of missing a required disclosure could outweigh the convenience of full automation. In practice, national products may need state-aware disclosure UX and a tamper-evident log showing exactly where a human accepted or amended AI-generated output.
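
As a thought experiment, a state-aware disclosure gate might look something like the sketch below. The single rule encoded here is a simplification of the AB 3030 pattern described above, other states are omitted, and nothing in it is legal guidance.

```python
# Hypothetical state-aware disclosure gate for AI-generated clinical messages.
# The rule below is a simplification of the California pattern described in
# the text; it is not a complete or authoritative rule set.

from dataclasses import dataclass

@dataclass
class Message:
    state: str                    # patient's state, e.g. "CA"
    contains_clinical_info: bool
    reviewed_by_clinician: bool

def needs_ai_disclosure(msg: Message) -> bool:
    # AB 3030-style rule: disclose AI involvement unless a licensed
    # clinician reviewed the message before it went out.
    if msg.state == "CA" and msg.contains_clinical_info:
        return not msg.reviewed_by_clinician
    return False  # other states intentionally omitted in this sketch

def audit_record(msg: Message, accepted_by: str) -> dict:
    # A real system would sign/hash this to make the log tamper-evident.
    return {"state": msg.state,
            "reviewed": msg.reviewed_by_clinician,
            "accepted_by": accepted_by,
            "disclosure_shown": needs_ai_disclosure(msg)}

print(audit_record(Message("CA", True, False), accepted_by="unreviewed-auto-send"))
```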

A third pattern treats AI primarily as a consumer-protection risk rather than a medical tool. Colorado’s law is the clearest example: any system that is a “substantial factor” in a consequential healthcare decision is automatically classified as high risk. Read broadly, that could pull in far more than clinical judgment. Basic functions like triage routing, benefit eligibility recommendations, or even how an app decides which patients get faster service could all be considered “consequential.” The worry here is that this lens doesn’t just layer on to FDA oversight — it creates a parallel stack of obligations: impact assessments, formal risk programs, and state attorney general enforcement. For teams that thought FDA clearance would be the governing hurdle, this is a surprise second regime. If more states follow Colorado’s lead, we could see dozens of slightly different consumer-protection regimes, each demanding their own documentation, kill switches, and observability. That is not just regulatory friction — it could make it nearly impossible to ship national products that influence care access in any way.

Mental health could face the tightest constraints. Utah requires conspicuous disclosure that a user is engaging with AI rather than a licensed counselor and limits certain data uses. Illinois has barred AI systems from delivering therapeutic communications or making therapeutic decisions while permitting administrative support. If interpreted as drafted, “AI therapist” positioning might need to be turned off or re‑scoped in Illinois.

Taken together, these state patterns set the core product constraints for now: keep a human in the loop for determinations, label or obtain sign‑off for clinical communications, and treat any system that influences access as high risk unless proven otherwise.

Against that backdrop, the missed opportunity becomes clear if we keep regulating by analogy to a fallible human. Properly designed agents could be safer than average human performance because they do not fatigue, they do not skip checklists, they can run differential diagnoses consistently, cite evidence and show their work, auto‑escalate when confidence drops, and support audit after the fact. They might be more intelligent on specific tasks, like guideline‑concordant triage or adverse drug interaction checks, because they can keep every rule current. They could even be preferred by some patients who value privacy, speed, or a nonjudgmental tone. None of that is guaranteed, but the path to discover it should not be blocked by rules that assume software will behave like a reckless intern forever.

For builders, the practical reality today is uneven. In practice, this means three operating assumptions: human review on decisions; clinician sign‑off or labeling on clinical messages; and heightened scrutiny whenever your output affects access. The same agent might be acceptable if it drafts a clinician note, but not acceptable if it reroutes a patient around a clinic queue because that routing could be treated as a consequential decision. A diabetes coach that nudges adherence could require a disclosure banner in California unless a clinician signs off, and that banner might not be enough if the conversation drifts into therapy‑like territory in Illinois. A payer that wants automation could still need on‑record human reviewers in California, and might need to turn automation off if Minnesota’s approach advances. Clinicians will likely remain accountable to their boards for outcomes tied to AI they use, which suggests that a truly autonomous AI doctor does not fit into today’s licensing box and could collide with Corporate Practice of Medicine doctrines in many states.

We should adopt a federal framework that separates assistive from autonomous agents, and regulate each with the right tool. Assistive agents that help clinicians document, retrieve, summarize, or draft could live under a national safe harbor. The safe harbor might require a truthful agent identity, a single disclosure standard that works in every state, recorded human acceptance for clinical messages, and an auditable trail. Preemption matters here. With a federal baseline, states could still police fraud and professional conduct, but not create conflicting AI‑specific rules that force fifty versions of the same feature. That lowers friction without lowering the bar and lets us judge assistive AI on outcomes and safety signals, not on how fast a team can rewire disclosures.

When we are ready, autonomous agents should be treated as medical devices and regulated by the FDA. Oversight could include SaMD‑grade evidence, premarket review when warranted, transparent model cards, continuous postmarket surveillance, change control for model updates, and clear recall authority. Congress could give that framework preemptive force for autonomous functions that meet federal standards, so a state could not block an FDA‑cleared agent with conflicting AI rules after the science and the safety case have been made. This is not deregulation. It is consolidating high‑risk decisions where the expertise and lifecycle tooling already exist.

Looking a step ahead, we might also license AI agents, not just clear them. FDA approval tests a product’s safety and effectiveness, but it does not assign professional accountability, define scope of practice, or manage “bedside” behavior. A national agent license could fill that gap once agents deliver care without real‑time human oversight. Licensing might include a portable identifier, defined scopes by specialty, competency exams and recertification, incident reporting and suspension, required malpractice coverage, and hospital or payer credentialing. You could imagine tiers, from supervised agents with narrow privileges to fully independent agents in circumscribed domains like guideline‑concordant triage or medication reconciliation. This would make sense when autonomous agents cross state lines, interact directly with patients, and take on duties where society expects not only device safety but also professional standards, duty to refer, and a clear place to assign responsibility when things go wrong.

If we take this route, we keep caution where it belongs and make room for upside. Assistive tools could scale fast under a single national rulebook. Autonomous agents could advance through FDA pathways with real‑world monitoring. Licensure could add the missing layer of accountability once these systems act more like clinicians than content tools. Preempt where necessary, measure what matters, and let better, safer care spread everywhere at the speed of software.

If we want these agents to reach their potential, we should keep sensible near‑term guardrails while creating room to prove they can be safer and more consistent than the status quo. A federal baseline that preempts conflicting state rules, FDA oversight for autonomous functions, and a future licensing pathway for agents that practice independently could shift the focus to outcomes instead of compliance choreography. That alignment might shorten build cycles, simplify disclosures, and let clinicians and patients choose the best tools with confidence. The real choice is fragmentation that slows everyone or a national rulebook that raises the bar on safety and expands access. Choose the latter, and patients will feel the benefits first.

AI Can’t “Cure All Diseases” Until It Beats Phase 2

One of the big dreams of AI researchers is that AI will soon solve drug discovery and unleash a boom in new life-saving therapies. Alphabet committed $600 million in new capital to Isomorphic Labs on the strength of that rhetoric, with a promise to “cure all diseases” as its first AI‑designed molecules head to humans next year. And the first wave of AI molecules is moving quickly, with Insilico, Recursion, Exscientia, Nimbus, DeepCure, and others all touting pipelines flush with AI‑generated candidates.

I can’t help but step back and ask if these AI efforts are focused on the right problem. We have no doubt increased the shots on goal upstream in the drug discovery process and (hopefully) have improved the quality of drug candidates being prosecuted.  

But have we solved the Phase 2 problem with AI yet? I think the jury is still out.  

As a young McKinsey consultant, I was staffed on several projects benchmarking pharma R&D, analyzing the probability of success for molecules to graduate from Phase 1 through Phase 3 and achieve regulatory approval. Two decades and billions of dollars in R&D later, the brutal statistic that is impossible to ignore is that more than 70 percent of development programs still die in Phase 2.

Phase 2 timelines, meanwhile, have stretched from 23.1 to 29.4 months between 2020 and 2023 as narrower inclusion criteria collided with stagnant site productivity. Dose‑finding missteps and operational glitches matter, but lack of efficacy still explains most Phase 2 failures, and that comes down to the limits of our understanding of human biology.

Human‑biology validation 1.0 — population genetics and its ceiling

When Amgen bought deCODE in 2012, it placed a billion‑dollar bet that large‑scale germ‑line sequencing could de‑risk targets by exploiting “experiments of nature.” I remember hearing the puzzlement in the industry around why a drug company would acquire a genomics company with an Icelandic cohort, but Amgen’s leadership had an inspired vision around human genetics. The purchase was less about PCSK9—whose genetic validation and clinical program were already well advanced—and more about institutionalizing that genetics-first playbook for the next wave of targets. PCSK9 showed the concept works; deCODE was Amgen’s bet that lightning could strike again, this time in-house rather than through the literature. Regeneron followed a cleaner genetics-first path: its in-house Genetics Center linked ANGPTL3 loss-of-function to ultra-low lipids and later developed evinacumab, now approved for homozygous familial hypercholesterolaemia.

Yet even these success stories expose the model’s constraints. The deCODE Icelandic cohort is 94 percent Scandinavian; it produces brilliant cardiovascular signals but scant power in oncology, auto‑immune disease, or psych. Variants of large effect are vanishingly rare; deCODE’s 400,000 individuals yielded only thirty high‑confidence loss‑of‑function genes with drug‑like tractability in its first decade. More importantly, germ‑line data are static and de‑identified. Researchers cannot pull a fresh sample or biopsy from a knock‑out when a resistance mechanism appears, nor can they prospectively route those carriers into an adaptive arm without new consent and ethics review.

National mega‑registries were meant to fix that scale problem. The UK Biobank now pairs half‑a‑million exomes with three decades of clinical metrics, All of Us has over 450,000 electronic health records, and Singapore’s SG100K is sequencing a hundred‑thousand diverse genomes. Each has already contributed massively to science—UKB linked Lp‑a to coronary risk; All of Us resolved ancestry‑specific HDL loci—yet they remain fundamentally retrospective with high latency. Access to UK Biobank takes a median fifteen weeks from application to data release, and physical samples trigger an additional governance review whose queue exceeded 2,000 requests in 2024. All of Us explicitly bars direct re‑contact of participants except under a separate ancillary‑study board, adding six to nine months before a living cohort can be re‑surveyed. SG100K requires separate negotiation with every contributing hospital before a single tube can leave the freezer. None of these infrastructures were built for real‑time iteration, and so they do not break the Phase 2 bottleneck.

Twenty years after deCODE, the first hint that real‑time human biology could collapse development timelines came from Penn Medicine. By keeping leukapheresis, viral‑vector engineering, cytokine assays, and the clinic within one building, the Abramson group iterated through more than a hundred vector designs in four years and delivered CTL019, later commercialized by Novartis as Kymriah. In an earlier era, that triumph proved proximity and feedback loops matter.  

Human‑biology validation 2.0 — live tissue, live data, live patients

I believe the next generation of translational engines should be built around a simple rule: test the drug on the same biology it is meant to treat, while that biology is still evolving inside the patient. Academic hubs and data‑first companies can now collect biopsies and blood draws in real time, run single‑cell or organoid assays rapidly, and stream the results into AI and ML models that sit on the same network as the electronic health record. Because the material is fresh, the read‑outs still carry the stromal, immune and epigenetic signals that drive clinical response. In controlled comparisons, patient-derived organoid (PDO) assays explain roughly two-thirds of clinical response variance; immortalized cell lines barely crack ten percent. The effect is practical, not academic: drugs that light up fresh tissue advance into enriched cohorts with a much higher chance of clinical benefit.

The loop does more than accelerate timelines. Serial sampling turns the platform into a resistance radar: if an AML clone abandons BCL‑2 dependence and switches to CD70, the lab confirms whether a CD70 antibody kills the new population and, if it does, the inclusion criteria change before the next enrollment wave. What begins as rapid failure avoidance quickly translates into higher positive‑predictive value for efficacy—fewer false starts, more shots on goal that land.

Put simply, live‑biology platforms might do for Phase 2 what human genetics did for target selection: they raise the pre‑test odds. Only this time the bet is placed at the moment of clinical proof‑of‑concept, when the stakes are highest and the cost of guessing wrong is measured in nine figures.

The academic medical center’s moment

Academic medical centers already hold the raw ingredients for this 21st century learning healthcare system: biobanks, CLIA labs, petabytes of historical EHR data, and a captive patient population. What they typically lack is integration. Tissue flows into siloed freezers; governance teams treat every data pull as bespoke; pathologists and computational scientists report to different deans. Institutions that solder those pieces into a single engine are becoming indispensable to AI chemists and to capital.

Privacy is no longer the show‑stopper; the tools to protect it—tokenized patient IDs, one‑time broad consent, and secure cloud pipelines—already work in practice. The real lift is technical and operational. A live‑biology hub needs a single ethics board that can clear new assays in days, a Part 11–compliant cloud that crunches multi‑omic data at AI scale, and a wet‑lab team able to turn a fresh biopsy into single‑cell or spatial read‑outs before the patient’s next visit. Just as important, it needs a funding model in partnership with pharma that pays for translational speed and clinical impact, not for papers or posters.

From hype to human proof

The next leap in drug development will come when AI‑driven chemistry meets the living biology that only hospitals can provide. Molecules generated overnight will matter only if they are tested, refined, and validated in the same patients whose samples inspire them. Almost every academic medical center already holds the raw materials—tissue, data, expertise—to close that loop. What we need now is the ambition to connect the pieces and the partnerships to keep the engine running at clinical speed. If you are building, funding, regulating, or championing this kind of “live‑biology” platform, I want to hear from you. Let’s compare notes and turn today’s proof points into tomorrow’s standard of care.

Level‑5 Healthcare: Why Prescribing Will Decide When AI Becomes a Real Doctor

Every week seems to bring another paper or podcast trumpeting the rise of diagnostic AI. Google DeepMind’s latest pre‑print on its Articulate Medical Intelligence Explorer (AMIE) is a good example: the model aced a blinded OSCE against human clinicians, but its researchers still set restrictive guardrails, forbidding any individualized medical advice and routing every draft plan to an overseeing physician for sign‑off. In other words, even one of the most advanced AI clinical systems stops at Level 3–4 autonomy—perception, reasoning, and a recommended differential—then hands the wheel back to the doctor before the prescription is written.

Contrast that with the confidence you hear from Dr. Brian Anderson, CEO of the Coalition for Health AI (CHAI), on the Heart of Healthcare podcast. Asked whether software will soon go the full distance, he answers without hesitation: we’re “on the cusp of autonomous AI doctors that prescribe meds” (18:34 of the episode), and the legal questions are now “when,” not “if”. His optimism highlights a gap in today’s conversation and research. Much like the self‑driving‑car world, where Level 4 robo‑taxis still require a remote safety driver, clinical AI remains stuck below Level 5 because the authority to issue a lawful e‑script is still tethered to a human medical license.

Prescribing is the last mile of autonomy. Triage engines and diagnostic copilots already cover the cognitive tasks of gathering symptoms, ruling out red flags, and naming the likely condition. But until an agent can both calculate the lisinopril uptitration and transmit the order across NCPDP rails—instantly, safely, and under regulatory blessing—it will remain an impressive co‑pilot rather than a self‑driving doctor.

During my stint at Carbon Health, I saw that ~20 percent of urgent‑care encounters (and upwards of 60-70% during the pandemic) boiled down to a handful of low‑acuity diagnoses (upper‑respiratory infections, UTIs, conjunctivitis, rashes), each ending with a first‑line medication. External data echo the pattern: acute respiratory infections alone account for roughly 60 percent of all retail‑clinic visits. These are the encounters that a well‑trained autonomous agent could resolve end‑to‑end if it were allowed to both diagnose and prescribe.

Where an AI Doctor Could Start

Medication titration is a beachhead.

Chronic-disease dosing already follows algorithms baked into many guidelines. The ACC/AHA hypertension playbook, for instance, tells clinicians to raise an ACE-inhibitor dose when average home systolic pressure stays in or above the 130–139 mm Hg range, or diastolic above 80 mm Hg, despite adherence. In practice, those numeric triggers languish until a patient returns to the clinic or a provider happens to review them weeks later. An autonomous agent that reads Bluetooth cuffs and recent labs could issue a 10-mg uptick the moment two out-of-range readings appear—no inbox ping, no phone tag. Because the input variables are structured and the dose boundaries are narrow, titration in theory aligns with FDA’s draft “locked algorithm with guardrails” pathway.
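
To make the “locked algorithm with guardrails” idea concrete, here is a stripped-down sketch of that titration logic. The blood-pressure thresholds mirror the text; the dose ceiling, dose step, and structure are assumptions for illustration, not a clinical protocol.

```python
# Stripped-down titration sketch: structured inputs, narrow dose bounds,
# and an escalation path. Thresholds follow the text above; the dose
# ceiling and return format are illustrative assumptions.

MAX_LISINOPRIL_MG = 40  # assumed ceiling for the example

def suggest_titration(readings_mmhg, current_dose_mg, adherent):
    """Propose an action from recent home BP readings [(systolic, diastolic), ...]."""
    out_of_range = [(s, d) for s, d in readings_mmhg if s >= 130 or d >= 80]
    if not adherent or len(out_of_range) < 2:
        return {"action": "no_change"}
    if current_dose_mg + 10 > MAX_LISINOPRIL_MG:
        return {"action": "escalate_to_clinician", "reason": "dose ceiling reached"}
    return {"action": "propose_uptitration", "new_dose_mg": current_dose_mg + 10}

print(suggest_titration([(136, 84), (141, 82), (128, 76)],
                        current_dose_mg=20, adherent=True))
```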

Refills are administrative drag begging for automation.

Refill requests plus associated messages occupy about 20 % of primary care inbox items. Safety checks—labs, allergy lists, drug–drug interactions—are deterministic database look-ups. Pharmacist-run refill clinics already demonstrate that protocol-driven renewal can cut clinician workload without harming patients. An AI agent integrated with the EHR and a PBM switch can push a 90-day refill when guardrails pass; if not, route a task to the care team. Because the agent is extending an existing prescription rather than initiating therapy, regulators might view the risk as modest and amenable to a streamlined 510(k) or enforcement-discretion path, especially under the FDA’s 2025 draft guidance that explicitly calls out “continuation of established therapy” as a lower-risk SaMD use.
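
The refill loop is simple enough to sketch end to end. Every lookup below is a stub standing in for a deterministic EHR or PBM query; the specific checks and return strings are illustrative.

```python
# Refill guardrail sketch. Each lookup is a stub for a deterministic
# EHR / PBM query; pass -> queue the 90-day refill, fail -> route a task
# to the care team.

def labs_current(pid, drug): return True           # stub: required labs on file
def allergy_conflict(pid, drug): return False      # stub: allergy list check
def interaction_conflict(pid, drug): return False  # stub: drug-drug interactions
def prescription_active(pid, drug): return True    # stub: extending, not initiating

def guardrails_pass(pid, drug):
    return (labs_current(pid, drug)
            and not allergy_conflict(pid, drug)
            and not interaction_conflict(pid, drug)
            and prescription_active(pid, drug))

def handle_refill_request(pid, drug):
    if guardrails_pass(pid, drug):
        return f"queued 90-day refill of {drug}"
    return f"routed {drug} refill to care team for review"

print(handle_refill_request("pt-001", "metformin"))
```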

Minor‑Acute Prescriptions

Uncomplicated cystitis is an ideal condition for an autonomous prescriber because, in women aged 18–50, diagnosis rests on symptoms alone. Dysuria and frequency with no vaginal discharge yield >90 % post‑test probability, high enough that first‑line antibiotics are routinely prescribed without a urine culture.

Because the diagnostic threshold is symptom‑based and the therapy a narrow‑spectrum drug with well‑known contraindications, a software agent can capture the entire workflow: collect the symptom triad, confirm the absence of red‑flag modifiers such as pregnancy or flank pain, run a drug‑allergy check, and write the 100 mg nitrofurantoin script, escalating when red flags (flank pain, recurrent UTI) appear.
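
That workflow is narrow enough to write down. The sketch below mirrors the symptom triad, red flags, and first-line drug named above; it is an illustration of the bounded logic, not clinical software.

```python
# Symptom-driven UTI workflow sketch. Inputs, red flags, and the first-line
# drug follow the text above; everything else is illustrative.

RED_FLAGS = {"pregnancy", "flank_pain", "recurrent_uti", "vaginal_discharge"}

def uti_workflow(age, sex, symptoms, red_flags, allergies):
    if sex != "female" or not (18 <= age <= 50):
        return "escalate: outside the symptom-only diagnostic population"
    if red_flags & RED_FLAGS:
        return "escalate: red-flag modifier present"
    if not {"dysuria", "frequency"} <= symptoms:
        return "escalate: symptom triad not met"
    if "nitrofurantoin" in allergies:
        return "escalate: first-line drug contraindicated"
    return "prescribe: nitrofurantoin 100 mg"

print(uti_workflow(29, "female", {"dysuria", "frequency"}, set(), set()))
```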

Amazon Clinic already charges $29 for chat‑based UTI visits, but every case still ends with a clinician scrolling through a template and clicking “Send.” Replace that final click with an FDA‑cleared autonomous prescriber and the marginal cost collapses to near-zero.

What unites titrations, refills, and symptom‑driven UTI care is bounded variance and digital exhaust. Each fits a rules engine wrapped with machine‑learning nuance and fenced by immutable safety stops—the very architecture the new FDA draft guidance and White House AI Action Plan envision. If autonomous prescribing cannot begin here, it is hard to see where it can begin at all.

The Emerging Regulatory On‑Ramp

When software merely flags disease, it lives in the “clinical‑decision support” lane: the clinician can still read the chart, double‑check the logic, and decide whether to act. The moment the same code pushes an order straight down the NCPDP SCRIPT rail it graduates to a therapeutic‑control SaMD, and the bar rises. FDA’s draft guidance on AI‑enabled device software (issued 6 January 2025) spells out the higher bar. It asks sponsors for a comprehensive risk file that itemizes hazards such as “wrong drug, wrong patient, dose miscalculation” and explains the guard‑rails that block them. It also demands “objective evidence that the device performs predictably and reliably in the target population.” For an autonomous prescriber, that likely means a prospective, subgroup‑powered study that looks not just at diagnostic accuracy but at clinical endpoints—blood‑pressure control, adverse‑event rates, antibiotic stewardship—because the software has taken over the act that actually changes the patient’s physiology.

FDA already reviews closed‑loop dossiers, thanks to insulin‑therapy‑adjustment devices. The insulin rule at 21 CFR 862.1358 classifies these controllers as Class II but layers them with special controls: dose ceilings, automatic shut-off if data disappear, and validation that patients understand the algorithm’s advice. A triage‑diagnose‑prescribe agent could follow the same “closed-loop” logic. The draft AI guidance even offers a regulatory escape hatch for the inevitable updates: sponsors may file a Predetermined Change Control Plan so new drug‑interaction tables or revised dose caps can roll out without a fresh 510(k) as long as regression tests and a live dashboard show no safety drift.

Federal clearance, however, only opens the front gate. State practice acts govern who may prescribe. Idaho’s 2018 pharmacy‑practice revision lets pharmacists both diagnose influenza and prescribe oseltamivir on the spot, proving lawmakers will grant new prescriptive authority when access and safety align. California has gone the other way, passing AB 3030, which forces any clinic using generative AI for patient‑specific communication to declare the fact and provide a human fallback, signaling that state boards expect direct oversight of autonomous interactions. The 50-state mosaic, not the FDA, may be the hardest regulatory hurdle to cross.

Why It Isn’t Science Fiction

Skeptics argue that regulators will never let software write a prescription. But autonomous medication control is already on the market—inside every modern diabetes closed‑loop system. I have come to appreciate this technology as a board member of Tandem Diabetes Care over the last few years. Tandem’s t:slim X2 pump with Control‑IQ links a CGM to a dose‑calculating algorithm that micro‑boluses insulin every five minutes. The system runs unsupervised once prescribed: fenced autonomy in a narrowly characterized domain, enforced machine‑readable guardrails, and continuous post‑market telemetry to detect drift.

Translate that paradigm to primary‑care prescribing and the lift could be more incremental than radical. Adjusting lisinopril involves far fewer variables than real‑time insulin dosing. Refilling metformin after a clean creatinine panel is a lower‑risk call than titrating rapid‑acting insulin. If regulators were satisfied that a closed‑loop algorithm could make life‑critical dosing decisions, it is reasonable to believe that, with equivalent evidence, they will approve an AI that nudges antihypertensives quarterly or issues amoxicillin when a CLIA‑waived strep test flashes positive. The path is the same: bounded indication, prospective trials, immutable guardrails, and a live data feed back to the manufacturer and FDA.

Closed‑loop diabetes technology did not replace endocrinologists; it freed them from alert fatigue and let them focus on edge cases. A prescribing‑capable AI agent could do the same for primary care, starting with the arithmetic medicine that dominates chronic management and low‑acuity urgent care, and expanding only as real‑world data prove its worth. Once the first agent crosses that regulatory bridge, the remaining span may feel as straightforward as the insulin pump’s development and adoption looked in retrospect.

The diagnostic revolution has taught machines to point at the problem. The next leap is letting them reach for the prescription pad within carefully coded guardrails. Titrations, refills, and simple infections are the logical, high‑volume footholds. With Washington signaling an interest in AI for healthcare, the biggest barriers may be other downstream issues like medical liability and reimbursement. That said, once the first FDA‑cleared AI issues a legitimate prescription on its own, it may only be a matter of time before waiting rooms and wait lists shrink to fit the care that truly requires a human touch.

Apple Watch: From Activity Rings to an AI-Powered Check-Engine Light

I had a front-row seat to the evolution of the Apple Watch as a health device. In the early days, it was clear that activity tracking was the killer use case, and the Apple Watch hit its stride with millions of users closing their three Activity Rings every day. Over time, Apple added more sensors and algorithms, with the FDA clearances of the irregular rhythm notification and the ECG app standing out as huge milestones.

The “Dear Tim” letters about how the Apple Watch had saved someone’s life were of course anecdotal but hugely motivating. It made sense that the Apple Watch should be your ever-present intelligent health guardian. In marketing clips, a gentle haptic warns someone watching TV of an erratic pulse, or a fall alert summons help from a lonely farmhouse. Those use cases are real achievements, yet they always felt reactive: the Watch tells you what just happened. What many of us want is something closer to a car’s check-engine light—a quiet, always-on sentinel that notices when the machinery is drifting out of spec long before it stalls on the highway.

I was excited to read a research paper from Apple’s machine-learning group that nudges the Watch in that direction. The team trained what they call a Wearable Behavior Model (WBM) on roughly 2.5 billion hours of Apple Watch data collected from 162,000 volunteer participants in the Apple Heart & Movement Study. Instead of feeding the model raw second-by-second sensor traces, they distilled each day into the same high-level measures users already see in the Health app—steps, active energy, resting and walking heart rates, HRV, sleep duration, VO₂ max, gait metrics, blood-oxygen readings, even the time you spend standing. Those twenty-plus signals were aggregated into hourly slots, giving the AI a time-lapse view of one’s physiology and lifestyle: 168 composite “frames” per week, each summarizing a full hour of living.
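
The aggregation step is easy to picture in code. The sketch below uses two invented signals and random data purely to show the shape of the result; the actual WBM feature set and pipeline are described in Apple’s paper, not reproduced here.

```python
# Rough sketch of turning raw wearable samples into hourly behavioral
# "frames" (168 per week). Signals and values are invented; only the
# shape of the output mirrors the description above.

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
minutes = 7 * 24 * 60
samples = pd.DataFrame({
    "timestamp": pd.date_range("2025-01-06", periods=minutes, freq="min"),
    "heart_rate": rng.normal(72, 8, minutes),
    "steps": rng.poisson(1.2, minutes),
})

hourly = (samples.set_index("timestamp")
                 .resample("1h")
                 .agg({"heart_rate": "mean", "steps": "sum"}))

print(hourly.shape)  # (168, 2): one composite frame per hour of the week
```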

Why behavior, not just biosignals?

We have years of studies on single sensors—pulse oximetry for sleep apnea, ECG for atrial fibrillation—yet most diseases emerge from slow drifts rather than sudden spikes. Infection, for example, pushes resting heart rate up and HRV down across several nights; pregnancy alters nightly pulse and sleep architecture months before a visible bump. Hour-scale summaries capture those patterns far better than any isolated waveform snippet. Apple’s scientists therefore looked for a sequence model able to ingest week-long, imperfectly sampled series without collapsing under missing values. Standard Transformers, the current favorite in modern AI, turned out to be brittle here; the winning architecture was a state-space network called Mamba-2, which treats the timeline as a continuous signal and scales linearly with sequence length. After training for hundreds of GPU-days the network could compress a week of life into a single 256-dimensional vector—a behavioral fingerprint of how your body has been operating.

What can the model see?

Apple put that fingerprint to the test on 57 downstream tasks. Some were relatively stable attributes (beta-blocker usage, smoking status, history of hypertension); others were genuinely dynamic (current pregnancy, recent infection, low-quality sleep this week). With nothing more than a linear probe on top of the frozen embedding, WBM often equaled or outperformed Apple’s earlier foundation model that had been trained on raw pulse waveforms. Pregnancy was a headline result: by itself the behavior model scored in the high 80s for area-under-ROC; when its embedding was concatenated with the pulse-wave embedding the combined system topped 0.92. Infection detection cleared 0.75, again without any fine-tuned, disease-specific engineering. In simpler language: after watching how your activity, heart rate, and sleep ebb and flow over a fortnight, the Watch can hint that you are expecting—or fighting a virus—days before you would otherwise know.
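
A “linear probe on a frozen embedding” is simple enough to sketch: keep the vectors fixed and fit only a logistic regression on top. The embeddings and labels below are random stand-ins, not study data, and the resulting AUROC means nothing beyond showing the mechanics.

```python
# Linear-probe sketch: frozen 256-dim embeddings, logistic regression on top.
# Data here are random stand-ins; only the evaluation recipe is the point.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 256))                                   # "frozen" weekly embeddings
y = (X[:, 0] + rng.normal(scale=1.0, size=5000) > 0).astype(int)   # toy downstream label

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("AUROC:", round(roc_auc_score(y_te, probe.predict_proba(X_te)[:, 1]), 3))
```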

Diabetes was the notable exception. Here the low-level PPG model remained stronger, suggesting that some conditions leave their first footprints in waveform micro-shape rather than in day-scale behavior. That finding is encouraging rather than disappointing: it implies that the right strategy is fusion. The Watch already holds continuous heart-rate curves and daily summaries; passing both through complementary models delivers a richer early-warning signal than either alone.

The check-engine light: why this matters beyond Apple

The study hints at the check-engine light, but WBM is a research prototype and carries caveats. Its ground truth labels lean on user surveys and app logs, which can be subjective and were irregularly collected among the larger user base. Pregnancy was defined as the nine months preceding a birth recorded in HealthKit; infection relied on self-reported symptoms. Those proxies are good enough for academic metrics, yet they fall short of the clinical ground truth that regulators call “fit‑for‑purpose.” Before any watch lights a real check-engine lamp, it must be calibrated against electronic health-record diagnoses, lab tests, maybe even physician adjudication. Likewise, the Apple study cohort tilts young, U.S.-based, and tech-forward, the very people who volunteer for an app-based research registry. We do not yet know how the model fares for a seventy-year-old on multiple medications or for communities under-represented in the dataset.

Even with those limits, WBM pushes the industry forward for one simple reason: it proves that behavioral aggregates—the mundane counts and averages every wearable records—carry untapped clinical signal. Because step counts and sleep hours are universal across consumer wearables these days, this invites a larger ambition: a cross-platform health-embedding standard, a lingua franca into which any device can translate its metrics and from which any clinic can decode risk. If Apple’s scientists can infer pregnancy from Watch data, there is little reason a Whoop band, Oura ring, or a Fitbit could not do the same once models are trained on diverse hardware.

The translational gap

Turning WBM‑style science into a shipping Apple Watch feature is less about GPUs and more about the maze that begins the moment an algorithm claims to recognize disease. Whoop learned that the hard way this month. Its “Blood‑Pressure Insights” card told members whether their estimated systolic and diastolic trend was drifting upward. Those trend arrows felt innocuous (just context, the company argued) yet the FDA sent a warning letter anyway, noting that anything purporting to detect or manage hypertension is a regulated medical device unless cleared. The agency’s logic was brutal in its simplicity: blood pressure has long been a clinical sign of disease, therefore software that assesses it needs the same evidentiary burden as a cuff. Consumers may have loved the insight; regulators hated the leap.

Apple faces a similar fork in the road if it ever wants the Watch to say, “Your metrics suggest early pregnancy” or “You’re trending toward an infection.”

The regulated‑product pathway treats each prediction as a stand‑alone medical claim. The company submits analytical and clinical evidence—typically through a 510(k) if a predicate exists or a De Novo petition if it does not. Because there is no earlier device that estimates pregnancy probability from wrist metrics or viral‑infection likelihood from activity patterns, every WBM condition would start life as a De Novo. Each requires gold‑standard comparators (hCG tests for pregnancy, PCR labs for infection, auscultatory cuffs for blood pressure), multi‑site generalizability data, and post‑market surveillance. For a single target such as atrial fibrillation that process is feasible; for fifty targets it becomes an assembly line of studies, submissions and annual reports—slow, expensive, and unlikely to keep pace with the model’s evolution. (The predicate gap may narrow if FDA finalizes its draft guidance on “AI/ML Predetermined Change Control Plans,” but no such framework exists today.)

The clinical‑decision‑support (CDS) pathway keeps the algorithm inside the chart, not on the wrist. WBM generates a structured risk score that lands in the electronic health record; a nurse or physician reviews the underlying metrics and follows a protocol. Because the human can understand the basis for the suggestion and retains authority, FDA does not classify the software as a device. What looks like a regulatory compromise could, in practice, be a pragmatic launch pad. Health‑system command centers already triage hundreds of thousands of automated vitals by routing most through pre‑approved branching pathways. A WBM score could enter the same stream with a playbook that says, in effect, “probability > 0.80, send flu self‑test; probability > 0.95, escalate to telehealth call.”
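
The routing playbook itself is almost trivially codable. The thresholds below come from the hypothetical quoted above; the action names are placeholders.

```python
# Sketch of the CDS routing playbook quoted above. Thresholds follow the
# text; action names are placeholders for command-center protocols.

def route_infection_score(probability: float) -> str:
    if probability > 0.95:
        return "escalate_to_telehealth_call"
    if probability > 0.80:
        return "send_flu_self_test"
    return "no_action"

for p in (0.70, 0.85, 0.97):
    print(p, "->", route_infection_score(p))
```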

Building evidence inside a CDS loop—richer labels, stricter safeguards

A nurse clicking “yes, likely flu” is not the gold‑standard label FDA will accept in a De Novo file. At best it is the first hop in a chain that must culminate in an objective endpoint: a positive PCR, an ICD‑coded diagnosis, a prescription, an ED visit. A scalable CDS implementation therefore needs an automated reconciliation layer—a software platform that, over the ensuing week or month, checks the EHR and other databases for those confirmatory signals and links them back to the original wearable vector. With the right scaffolding, CDS becomes a scalable pathway for generating clinically verified labels and evidence that regulators may accept versus more traditional standalone clinical studies.
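
A sketch of that reconciliation layer, with placeholder record shapes and a hard-coded stand-in for the EHR query; the confirmatory event types are examples from the text, not a real interface.

```python
# Reconciliation-layer sketch: link each wearable-derived alert to a later
# objective endpoint found in the EHR. Record shapes and the EHR lookup
# are placeholders.

from dataclasses import dataclass
from datetime import date, timedelta

@dataclass
class Alert:
    patient_id: str
    fired_on: date
    embedding_id: str          # points back to the original wearable vector

EHR_EVENTS = {                 # stand-in for an EHR / claims query
    "pt-001": [("positive_flu_pcr", date(2025, 2, 3))],
    "pt-002": [],
}

def reconcile(alert: Alert, window_days: int = 30) -> dict:
    window_end = alert.fired_on + timedelta(days=window_days)
    confirmations = [(kind, when)
                     for kind, when in EHR_EVENTS.get(alert.patient_id, [])
                     if alert.fired_on <= when <= window_end]
    return {"embedding_id": alert.embedding_id,
            "label": "confirmed" if confirmations else "unconfirmed",
            "evidence": confirmations}

print(reconcile(Alert("pt-001", date(2025, 2, 1), "vec-123")))
```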


MAHA horizons

If the MAHA initiative succeeds in driving greater adoption of wearables and brokering de‑identified links to national EHR networks, the reconciliation layer sketched above could operate at population scale from day one (as long as consumers consent to sharing and linking their wearable data with their medical records). In that world an Apple, Fitbit, or Whoop would not merely ship a consumer feature; they would participate in a nationwide learning system where every alert, every lab, every diagnosis sharpens the next iteration of a check‑engine light on every wrist.

DOGE & #MakeAmericaHealthyAgain

Some personal opinions on what DOGE and #MakeAmericaHealthyAgain could do to reshape U.S. healthcare.

The HHS budget is ~$1.8 trillion, ~23% of the federal budget as proposed for FY25; CMS makes up the lion’s share of the budget (>80%). Hard to imagine reducing the deficit without major changes in how healthcare is funded and delivered.

In the table from the December 2022 CBO analysis of options to reduce the deficit, it is notable that most of the options involve cuts to Medicare and Medicaid.

Beyond these kinds of cuts, resetting the market dynamics and incentives of all players in the system, including individuals, could improve health outcomes in the long run.

Insurance and consumer-driven healthcare. One of the original sins of our healthcare system is the tax subsidy for employer-sponsored health insurance (IRC § 106). This preferential tax treatment is a major distortion of the market, splitting the buyer of healthcare from its consumer and encouraging overconsumption.

It is also one of the federal government’s largest tax expenditures, projected at $641 billion in 2032.

The alternative would be to decouple employment from health insurance and give people more choice, accountability, and portability in how they get healthcare. Dr. Oz has written about shifting towards a Medicare Advantage (MA) for All system, funded by a 20% employer payroll tax combined with a 10% individual income tax, to extend the MA program to all Americans not on Medicaid. These dollars could be pooled into an expanded set of tax-advantaged individual & family accounts that could be used to buy insurance, similar in concept to Singapore’s healthcare system.

MA has grown in popularity with >50% of eligible beneficiaries opting for it over the traditional program, and it is one of the few pockets of our healthcare system that operates on the basis of consumer choice, where payers have a long-term incentive to keep their members healthy, and where value-based care arrangements with providers are common. This is the opposite of commercial insurance, where consumers have few choices beyond those picked for them, where employers-as-payers have little long-term incentive given the short tenure of most employees, and where fee-for-service arrangements are common.

If MA were to become a default option for most Americans, it would need to come with further reform as outlined in the CBO options to become more efficient (e.g., reducing MA benchmarks, increasing Part B premiums), along with other initiatives to standardize processes and reduce administrative waste.

The other option would be to expand ICHRA (which came about during the first Trump administration) by preserving the tax favorability of providing a monthly allowance for employees to purchase individual health insurance while removing the tax subsidy for traditional employer-sponsored health insurance.

In both cases, MA for All or an ICHRA expansion, there would be a greater emphasis on choice, increased plan portability, and greater individual accountability for and ownership of one’s health, with the biggest differences lying in the provider networks and in how the plans are purchased and administered. The more disruptive move would be to make a cleaner break from employer-administered and employer-sponsored healthcare. I would put my bet on MA for All given its greater scale and maturity compared to ICHRA/ACA.

Providers. The U.S. healthcare provider market is not one market but dozens of local oligopolistic health systems that have enormous pricing power. This market dominance has been compounded by the likes of large insurers that have vertically integrated to acquire provider networks. With the price transparency data that has only recently emerged, it is pretty eye-popping to see that the price dispersion for the same exact service between a large dominant health system and a smaller provider in a local market can be upwards of 300-500%.

The source of market power comes from being the only game(s) in town with respect to inpatient care — hospitals require large pools of capital, and certificate-of-need laws are considerable barriers to entry. The prior Trump administration advocated for repealing CON laws, a position that could be re-raised in this go-around.

Site neutral payment policies can help neutralize the market advantage and pricing power of incumbents, reducing the pricing arbitrage that directs patients currently to higher-cost care settings.

Provider capacity needs to be expanded and liberalized to serve a better-functioning marketplace. The byzantine rules and regulations of state licensure authorities unnecessarily restrict provider entry and ability to practice, in particular for allied health professionals. The inane process for provider enrollment and credentialing restricts the labor market by delaying the process for onboarding a new provider into a practice by months.

Telemedicine and digital health more broadly need to be woven into the fabric of the broader provider marketplace rather than force-fit into the antiquated rules of traditional brick-and-mortar care to maximize available provider capacity. Patients should be able to establish care with telemedicine providers without needing to be seen in person first. Telehealth providers should be able to provide care for patients wherever they might be. And reimbursement for digital health and telemedicine should be at parity to traditional care. With AI-driven care models on the horizon, payment models should incentivize the utilization of these technologies to increase access and reduce the cost of care.

Empowering consumers. Even with the changes above, a consumer-driven healthcare system won’t be possible without greater price transparency and data liquidity. The existence of Costplusdrugs or GoodRx shows the demand for transparent pricing models, and despite the push for greater transparency of hospital prices, many hospitals are not compliant and the average consumer has no chance of looking up prices on their own. Giving consumers portability of and greater control over their own medical data has been a long-term goal of the federal government and was a Trump-45 era initiative that should be revived: https://cms.gov/newsroom/press-releases/trump-administration-announces-myhealthedata-initiative-put-patients-center-us-healthcare-system

Wellness. Wellness incentives and initiatives represent a tiny fraction of total spending by the U.S. healthcare system — likely in the low single digit percentages. Our system is known for its orientation around sick care, but what if reimbursement could be shifted dramatically towards prevention and wellbeing? A small absolute investment could be a big lever for longer-term health outcomes. The Singapore National Steps Challenge is a good example of providing modest financial incentives to encourage exercise. Could this concept be extended to healthy eating and other preventative measures?