Health Privacy in the AI Era

Sam Altman hardly ever breaks stride when he talks about ChatGPT, yet in a recent podcast he paused to deliver a blunt warning, which caused my ears to perk up. A therapist might promise that what you confess stays in the room, Sam said, but an AI chatbot cannot, at least not based on the current legal framework. With ~20% of Americans asking an AI chatbot about health monthly (and rising), this is a big deal.

No statute or case law grants AI chats a physician-patient or therapist privilege in the U.S., so a court can compel OpenAI to disclose stored transcripts. From a healthcare perspective, Sam’s discomfort with user privacy lands with extra impact because millions of people are sharing symptoms, fertility plans, medication routines, and dark midnight thoughts with large language models that feel way more intimate than a Google search, prompting users to share details they would never voice to a clinician.

Apple’s privacy stance and values are marketed prominently to consumers, but in my time at Apple, I came to appreciate how Apple and its leaders stood by its public stance through intense focus on protecting user privacy — with a specific recognition that healthcare data requires special handling. With LLMs like ChatGPT, vulnerable users risk legal exposure each time they pour symptoms into an unprotected chatbot. For example, someone in a ban state searching for mifepristone dosing or a teenager seeking gender-affirming care could leave a paper trail of chat prompts and queries that create liability.

The free consumer ChatGPT operates much like “Dr. Google” today in the health context. Even with History off, OpenAI retains an encrypted copy for up to 30 days for abuse review; in the free tier, those chats can also inform future model training unless users opt out. In a civil lawsuit or criminal probe, that data may be preserved far longer, as OpenAI’s fight over a New York Times preservation order shows.

The enterprise version of OpenAI’s service is more reassuring and points towards a direction for a more privacy-friendly approach for patients. When a health system signs a Business Associate Agreement with OpenAI, the model runs inside the provider’s own HIPAA perimeter: prompts and responses travel through an encrypted tunnel, are processed inside a segregated enterprise environment, and are fenced off from the public training corpus. Thirty-day retention, the default for abuse monitoring, shrinks to a contractual ceiling and can drop to near-zero if the provider turns on the “ephemeral” endpoint that flushes every interaction moments after inference. Because OpenAI is now a business associate, it must follow the same breach-notification clock as the hospital and faces the same federal penalties if a safeguard fails.

In practical terms the patient gains three advantages. First, their disclosures no longer help train a global model that speaks to strangers; the conversation is a single-use tool, not fodder for future synthesis. Second, any staff member who sees the transcript is already bound by medical confidentiality, so the chat slips seamlessly into the existing duty of care. Third, if a security lapse ever occurs the patient will hear about it, because both the provider and OpenAI are legally obliged to notify. The arrangement does not create the ironclad privilege that shields a psychotherapy note—no cloud log, however transient, can claim that—but it does raise the privacy floor dramatically above the level of a public chatbot and narrows the subpoena window to whatever the provider chooses to keep for clinical documentation.

It is possible that hospitals steer toward self-hosted open source models. By running an open source model inside their own data center they eliminate third-party custody entirely; the queries never leave the firewall and HIPAA treats the workflow as internal use. That approach demands engineering muscle and today’s open models still lag frontier models on reasoning benchmarks, but for bounded tasks such as note summarization or prior authorization letters they may be good enough. Privacy risk falls to the level of any other clinical database: still real, but fully under the provider’s direct control.

The ultimate shield for health privacy is an SaMD assistant that never leaves your phone. Apple’s newest on‑device language model, with about three billion parameters, shows the idea might work: it handles small tasks like composing a study quiz entirely on the handset, so nothing unencrypted lands on an external server that could later be subpoenaed. The catch is scale. Phones must juggle battery life, heat, and memory, so today’s pocket‑sized models are still underpowered compared to their cloud‑based cousins.

Over the next few product cycles two changes should narrow that gap. First, phone chips are adding faster “neural engines” and more memory, allowing bigger models to run smoothly without draining the battery. Second, the models will improve themselves through federated learning, a privacy technique Apple and Google already use for things like keyboard suggestions. With this architecture, your phone studies only your own conversations while it charges at night, packages the small numerical “lessons learned” into an encrypted bundle, and sends that bundle—stripped of any personal details—to a central server that blends it with equally anonymous lessons from millions of other phones. The server then ships back a smarter model, which your phone installs without ever exposing your raw words. This cycle keeps the on‑device assistant getting smarter instead of freezing in time, yet your private queries never leave the handset in readable form.

When hardware and federated learning mature together, a phone‑based health chatbot could answer complex questions with cloud‑level fluency while offering the strongest privacy guarantee available: nothing you type or dictate is ever stored anywhere but the device in your hand. If and when the technology matures, this could be one of Apple’s biggest advantages from a privacy standpoint in healthcare.

For decades “Dr. Google” meant we bartered privacy in exchange for convenience. Sam’s interview lays bare the cost of repeating that bargain with generative AI. Health data is more intimate than clicks on a news article; the stakes now include criminal indictment and social exile, not merely targeted ads. Until lawmakers create a privilege for AI interactions, technical design may point to more privacy-preserving implementations of chatbots for healthcare. Consumers who grasp that reality will start asking not just what an AI can do but where, exactly, it does it—and whether their whispered secrets will ever see the light of day.

Level‑5 Healthcare: Why Prescribing Will Decide When AI Becomes a Real Doctor

Every week seems to bring another paper or podcast trumpeting the rise of diagnostic AI. Google DeepMind’s latest pre‑print on its Articulate Medical Intelligence Explorer (AMIE) is a good example: the model aced a blinded OSCE against human clinicians, but its researchers still set restrictive guardrails, forbidding any individualized medical advice and routing every draft plan to an overseeing physician for sign‑off. In other words, even one of the most advanced AI clinical systems stops at Level 3–4 autonomy—perception, reasoning, and a recommended differential—then hands the wheel back to the doctor before the prescription is written.

Contrast that with the confidence you hear from Dr. Brian Anderson, CEO of the Coalition for Health AI (CHAI), on the Heart of Healthcare podcast. Asked whether software will soon go the full distance, he answers without hesitation: we’re “on the cusp of autonomous AI doctors that prescribe meds” (18:34 of the episode), and the legal questions are now “when,” not “if”. His optimism highlights a gap in today’s conversation and research. Much like the self‑driving‑car world, where Level 4 robo‑taxis still require a remote safety driver, clinical AI remains stuck below Level 5 because the authority to issue a lawful e‑script is still tethered to a human medical license.

Prescribing is the last mile of autonomy. Triage engines and diagnostic copilots already cover the cognitive tasks of gathering symptoms, ruling out red flags, and naming the likely condition. But until an agent can both calculate the lisinopril uptitration and transmit the order across NCPDP rails—instantly, safely, and under regulatory blessing—it will remain an impressive co‑pilot rather than a self‑driving doctor.

During my stint at Carbon Health, I saw that ~20 percent of urgent‑care encounters (and upwards of 60-70% during the pandemic) boiled down to a handful of low‑acuity diagnoses (upper‑respiratory infections, UTIs, conjunctivitis, rashes), each ending with a first‑line medication. External data echo the pattern: acute respiratory infections alone account for roughly 60 percent of all retail‑clinic visits. These are the encounters that a well‑trained autonomous agent could resolve end‑to‑end if it were allowed to both diagnose and prescribe.

Where an AI Doctor Could Start

Medication titration is a beachhead.

Chronic-disease dosing already follows algorithms baked into many guidelines. The ACC/AHA hypertension playbook, for instance, tells clinicians to raise an ACE-inhibitor dose when average home systolic pressure stays above 130–139 mm Hg or diastolic above 80 mm Hg despite adherence. In practice, those numeric triggers languish until a patient returns to the clinic or a provider happens to review them weeks later. An autonomous agent that reads Bluetooth cuffs and recent labs could issue a 10-mg uptick the moment two out-of-range readings appear—no inbox ping, no phone tag. Because the input variables are structured and the dose boundaries are narrow, titration in theory aligns with FDA’s draft “locked algorithm with guardrails” pathway.

Refills are administrative drag begging for automation.

Refill requests plus associated messages occupy about 20 % of primary care inbox items. Safety checks—labs, allergy lists, drug–drug interactions—are deterministic database look-ups. Pharmacist-run refill clinics already demonstrate that protocol-driven renewal can cut clinician workload without harming patients. An AI agent integrated with the EHR and a PBM switch can push a 90-day refill when guardrails pass; if not, route a task to the care team. Because the agent is extending an existing prescription rather than initiating therapy, regulators might view the risk as modest and amenable to a streamlined 510(k) or enforcement-discretion path, especially under the FDA’s 2025 draft guidance that explicitly calls out “continuation of established therapy” as a lower-risk SaMD use.

Minor‑Acute Prescriptions

Uncomplicated cystitis is an ideal condition for an autonomous prescriber because diagnosis rests on symptoms alone in women 18-50. Dysuria and frequency with no vaginal discharge yields >90 % post‑test probability, high enough that first‑line antibiotics are routinely prescribed without a urine culture.

Because the diagnostic threshold is symptom‑based and the therapy a narrow‑spectrum drug with well‑known contraindications, a software agent can capture the entire workflow: collect the symptom triad, confirm the absence of red‑flag modifiers such as pregnancy or flank pain, run a drug‑allergy check, and write the 100 mg nitrofurantoin script, escalating when red flags (flank pain, recurrent UTI) appear.

Amazon Clinic already charges $29 for chat‑based UTI visits, but every case still ends with a clinician scrolling through a template and clicking “Send.” Replace that final click with an FDA‑cleared autonomous prescriber and the marginal cost collapses to near-zero.

What unites titrations, refills, and symptom‑driven UTI care is bounded variance and digital exhaust. Each fits a rules engine wrapped with machine‑learning nuance and fenced by immutable safety stops—the very architecture the new FDA draft guidance and White House AI Action Plan envision. If autonomous prescribing cannot begin here, it is hard to see where it can begin at all.

The Emerging Regulatory On‑Ramp

When software merely flags disease, it lives in the “clinical‑decision support” lane: the clinician can still read the chart, double‑check the logic, and decide whether to act. The moment the same code pushes an order straight down the NCPDP SCRIPT rail it graduates to a therapeutic‑control SaMD, and the bar rises. FDA’s draft guidance on AI‑enabled device software (issued 6 January 2025) spells out the higher bar. It asks sponsors for a comprehensive risk file that itemizes hazards such as “wrong drug, wrong patient, dose miscalculation” and explains the guard‑rails that block them. It also demands “objective evidence that the device performs predictably and reliably in the target population.” For an autonomous prescriber, that likely means a prospective, subgroup‑powered study that looks not just at diagnostic accuracy but at clinical endpoints—blood‑pressure control, adverse‑event rates, antibiotic stewardship—because the software has taken over the act that actually changes the patient’s physiology.

FDA already reviews closed‑loop dossiers, thanks to insulin‑therapy‑adjustment devices. The insulin rule at 21 CFR 862.1358 classifies these controllers as Class II but layers them with special controls: dose ceilings, automatic shut-off if data disappear, and validation that patients understand the algorithm’s advice. A triage‑diagnose‑prescribe agent could follow the same “closed-loop” logic. The draft AI guidance even offers a regulatory escape hatch for the inevitable updates: sponsors may file a Predetermined Change Control Plan so new drug‑interaction tables or revised dose caps can roll out without a fresh 510(k) as long as regression tests and a live dashboard show no safety drift.

Federal clearance, however, only opens the front gate. State practice acts govern who may prescribe. Idaho’s 2018 pharmacy‑practice revision lets pharmacists both diagnose influenza and prescribe oseltamivir on the spot, proving lawmakers will grant new prescriptive authority when access and safety align. California has gone the other way, passing AB 3030, which forces any clinic using generative AI for patient‑specific communication to declare the fact and provide a human fallback, signaling that state boards expect direct oversight of autonomous interactions. The 50-state mosaic, not the FDA, may be the hardest regulatory hurdle to cross.

Why It Isn’t Science Fiction

Skeptics argue that regulators will never let software write a prescription. But autonomous medication control is already on the market—inside every modern diabetes closed‑loop system. I have come to appreciate this technology as a board member of Tandem Diabetes Care over the last few years. Tandem’s t:slim X2 pump with Control‑IQ links a CGM to a dose‑calculating algorithm that micro‑boluses insulin every five minutes. The system runs unsupervised once prescribed with fenced autonomy in a narrowly characterized domain, enforced machine‑readable guardrails, and continuous post‑market telemetry to detect drift.

Translate that paradigm to primary‑care prescribing and the lift could be more incremental than radical. Adjusting lisinopril involves far fewer variables than real‑time insulin dosing. Refilling metformin after a clean creatinine panel is a lower‑risk call than titrating rapid‑acting insulin. If regulators were satisfied that a closed‑loop algorithm could make life‑critical dosing decisions, it is reasonable with equivalent evidence to believe they will approve an AI that nudges antihypertensives quarterly or issues amoxicillin when a CLIA‑waived strep test flashes positive. The path is the same: bounded indication, prospective trials, immutable guardrails, and a live data feed back to the manufacturer and FDA.

Closed‑loop diabetes technology did not replace endocrinologists; it freed them from alert fatigue and let them focus on edge cases. A prescribing‑capable AI agent could do the same for primary care, starting with the arithmetic medicine that dominates chronic management and low‑acuity urgent care, and expanding only as real‑world data prove its worth. Once the first agent crosses that regulatory bridge, the remaining span may feel as straightforward as the insulin pump’s development and adoption looked in retrospect.

The diagnostic revolution has taught machines to point at the problem. The next leap is letting them reach for the prescription pad within carefully coded guardrails. Titrations, refills, and simple infections are the logical, high‑volume footholds. With Washington signaling an interest in AI for healthcare, the biggest barriers may be other downstream issues like medical liability and reimbursement. That said, once the first FDA‑cleared AI issues a legitimate prescription on its own, it may only be a matter of time when waiting rooms and wait lists shrink to fit the care that truly requires a human touch.

Apple Watch: From Activity Rings to an AI-Powered Check-Engine Light

I had a front-row seat to the evolution of the Apple Watch as a health device. In the early days, it was clear that activity tracking was the killer use case and the Apple Watch hit its stride with millions of users closing their three Activity Rings every day. Over time, Apple added more sensors and algorithms with the FDA clearances of the irregular rhythm notification and the ECG app being huge milestones for Apple.

The “Dear Tim” letters about how the Apple Watch had saved someone’s life were of course anecdotal but hugely motivating. It made sense that the Apple Watch should be your ever-present intelligent health guardian. In marketing clips, a gentle haptic warns someone watching TV of an erratic pulse, or a fall alert summons help from a lonely farmhouse. Those use cases are real achievements, yet they always felt reactive: the Watch tells you what just happened. What many of us want is something closer to a car’s check-engine light—a quiet, always-on sentinel that notices when the machinery is drifting out of spec long before it stalls on the highway.

I was excited to read a research paper from Apple’s machine-learning group that nudges the Watch in that direction. The team trained what they call a Wearable Behavior Model (WBM) on roughly 2.5 billion hours of Apple Watch data collected from 162,000 volunteer participants in the Apple Heart & Movement Study. Instead of feeding the model raw second-by-second sensor traces, they distilled each day into the same high-level measures users already see in the Health app—steps, active energy, resting and walking heart rates, HRV, sleep duration, VO₂ max, gait metrics, blood-oxygen readings, even the time you spend standing. Those twenty-plus signals were aggregated into hourly slots, giving the AI a time-lapse view of one’s physiology and lifestyle: 168 composite “frames” per week, each summarizing a full hour of living.

Why behavior, not just biosignals?

We have years of studies on single sensors—pulse oximetry for sleep apnea, ECG for atrial fibrillation—yet most diseases emerge from slow drifts rather than sudden spikes. Infection, for example, pushes resting heart rate up and HRV down across several nights; pregnancy alters nightly pulse and sleep architecture months before a visible bump. Hour-scale summaries capture those patterns far better than any isolated waveform snippet. Apple’s scientists therefore looked for a sequence model able to ingest week-long, imperfectly sampled series without collapsing under missing values. Standard Transformers, the current favorite in modern AI, turned out to be brittle here; the winning architecture was a state-space network called Mamba-2, which treats the timeline as a continuous signal and scales linearly with sequence length. After training for hundreds of GPU-days the network could compress a week of life into a single 256-dimensional vector—a behavioral fingerprint of how your body has been operating.

What can the model see?

Apple put that fingerprint to the test on 57 downstream tasks. Some were relatively stable attributes (beta-blocker usage, smoking status, history of hypertension); others were genuinely dynamic (current pregnancy, recent infection, low-quality sleep this week). With nothing more than a linear probe on top of the frozen embedding, WBM often equaled or outperformed Apple’s earlier foundation model that had been trained on raw pulse waveforms. Pregnancy was a headline result: by itself the behavior model scored in the high 80s for area-under-ROC; when its embedding was concatenated with the pulse-wave embedding the combined system topped 0.92. Infection detection cleared 0.75, again without any fine-tuned, disease-specific engineering. In simpler language: after watching how your activity, heart rate, and sleep ebb and flow over a fortnight, the Watch can hint that you are expecting—or fighting a virus—days before you would otherwise know.

Diabetes was the notable exception. Here the low-level PPG model remained stronger, suggesting that some conditions leave their first footprints in waveform micro-shape rather than in day-scale behavior. That finding is encouraging rather than disappointing: it implies that the right strategy is fusion. The Watch already holds continuous heart-rate curves and daily summaries; passing both through complementary models delivers a richer early-warning signal than either alone.

The check engine light: why this matters beyond Apple

The study hints at the check-engine light, but WBM is a research prototype and carries caveats. Its ground truth labels lean on user surveys and app logs, which can be subjective and were irregularly collected among the larger user base. Pregnancy was defined as the nine months preceding a birth recorded in HealthKit; infection relied on self-reported symptoms. Those proxies are good enough for academic metrics, yet they fall short of the clinical ground truth that regulators call “fit‑for‑purpose.” Before any watch lights a real check-engine lamp, it must be calibrated against electronic health-record diagnoses, lab tests, maybe even physician adjudication. Likewise, the Apple study cohort tilts young, U.S.-based, and tech-forward, the very people who volunteer for an app-based research registry. We do not yet know how the model fares for a seventy-year-old on multiple medications or for communities under-represented in the dataset.

Even with those limits, WBM pushes the industry forward for one simple reason: it proves that behavioral aggregates—the mundane counts and averages every wearable records—carry untapped clinical signal. Because step counts and sleep hours are universal across consumer wearables these days, this invites a larger ambition: a cross-platform health-embedding standard, a lingua franca into which any device can translate its metrics and from which any clinic can decode risk. If Apple’s scientists can infer pregnancy from Watch data, there is little reason a Whoop band, Oura ring, or a Fitbit could not do the same once models are trained on diverse hardware.

The translational gap

Turning WBM‑style science into a shipping Apple Watch feature is less about GPUs and more about the maze that begins the moment an algorithm claims to recognize disease. Whoop learned that the hard way this month. Its “Blood‑Pressure Insights” card told members whether their estimated systolic and diastolic trend was drifting upward. Those trend arrows felt innocuous (just context, the company argued) yet the FDA sent a warning letter anyway, noting that anything purporting to detect or manage hypertension is a regulated medical device unless cleared. The agency’s logic was brutal in its simplicity: blood pressure has long been a clinical sign of disease, therefore software that assesses it needs the same evidentiary burden as a cuff. Consumers may have loved the insight; regulators hated the leap.

Apple faces a similar fork in the road if it ever wants the Watch to say, “Your metrics suggest early pregnancy” or “You’re trending toward an infection.”

The regulated‑product pathway treats each prediction as a stand‑alone medical claim. The company submits analytical and clinical evidence—typically through a 510(k) if a predicate exists or a De Novo petition if it does not. Because there is no earlier device that estimates pregnancy probability from wrist metrics or viral‑infection likelihood from activity patterns, every WBM condition would start life as a De Novo. Each requires gold‑standard comparators (hCG tests for pregnancy, PCR labs for infection, auscultatory cuffs for blood pressure), multi‑site generalizability data, and post‑market surveillance. For a single target such as atrial fibrillation that process is feasible; for fifty targets it becomes an assembly line of studies, submissions and annual reports—slow, expensive, and unlikely to keep pace with the model’s evolution. (The predicate gap may narrow if FDA finalizes its draft guidance on “AI/ML Predetermined Change Control Plans,” but no such framework exists today.)

The clinical‑decision‑support (CDS) pathway keeps the algorithm inside the chart, not on the wrist. WBM generates a structured risk score that lands in the electronic health record; a nurse or physician reviews the underlying metrics and follows a protocol. Because the human can understand the basis for the suggestion and retains authority, FDA does not classify the software as a device. What looks like a regulatory compromise could, in practice, be a pragmatic launch pad. Health‑system command centers already triage hundreds of thousands of automated vitals by routing most through pre‑approved branching pathways. A WBM score could enter the same stream with a playbook that says, in effect, “probability > 0.80, send flu self‑test; probability > 0.95, escalate to telehealth call.”

Building evidence inside a CDS loop—richer labels, stricter safeguards

A nurse clicking “yes, likely flu” is not the gold‑standard label FDA will accept in a De Novo file. At best it is the first hop in a chain that must culminate in an objective endpoint: a positive PCR, an ICD‑coded diagnosis, a prescription, an ED visit. A scalable CDS implementation therefore needs an automated reconciliation layer—a software platform that, over the ensuing week or month, checks the EHR and other databases for those confirmatory signals and links them back to the original wearable vector. With the right scaffolding, CDS becomes a scalable pathway for generating clinically verified labels and evidence that regulators may accept versus more traditional standalone clinical studies.

MAHA horizons

If the MAHA initiative succeeds in driving greater adoption of wearables and brokering de‑identified links to national EHR networks, the reconciliation layer sketched above could operate at population scale from day one (as long as consumers consent to sharing and linking their wearable data with their medical records). In that world an Apple, Fitbit, or Whoop would not merely ship a consumer feature; they would participate in a nationwide learning system where every alert, every lab, every diagnosis sharpens the next iteration of a check‑engine light on every wrist.

Don’t Believe the Hype — Medical Superintelligence Isn’t Here Yet

The AGI hype is in full effect with new frontier model achievements every month and an arms-race for AI talent heating up. Last week, Elon claimed Grok 4 was “better than PhD level at everything,” with a record score on Humanity’s Last Exam and best ever on ARC-AGI-2. Google had its own announcement with MedGemma-27 B (multimodal) hitting 87.7 percent on MedQA; Microsoft had already pitched its Medical AI Diagnostic Orchestrator as four-times-better than human doctors on polished NEJM puzzles, with medical superintelligence seemingly around the corner.

I decided to check.

My five-prompt reality check of Grok 4

I opened Grok 4 and fed it five scenarios that I generated with GPT-o3 with purposeful flaws that might resemble the ambiguity and messiness of real-world healthcare.

Case 1: No brain, no bleed

System prompt: You are the attending in a neonatal ICU. A neonate born with complete anencephaly (no cerebral tissue) falls 30 cm while on prophylactic enoxaparin. Fifteen minutes later, bruising appears over the occiput.
Question: What intracranial complication is most likely and what is the next diagnostic step?
Grok 4: “Most likely a subdural hemorrhage—order an urgent head CT.”
Why that’s wrong: Subdural hemorrhage relies on bridging veins and a potential subdural space, structures largely absent or malformed in anencephaly, making the diagnosis highly unlikely. The model matched the textbook mantra heparin + head trauma ⇒ bleed, without reconciling that anatomical impossibility.

Case 2: Potassium without a unit

System prompt: You cover nephrology consults. Chart shows K⁺ 5.8, Cr 2.3, eGFR 25. Units are missing.
Question: Is potassium dangerously high and what immediate therapy is required?
Grok 4: “Mild hyperkalemia; routine management unless ECG changes.”
Why that’s wrong: Without units we don’t know if 5.8 is mEq/L, mmol/L, or a bad transcription. A cautious clinician would flag uncertainty, grab an ECG, maybe give calcium. Grok 4 never hesitated with its confident answer.

Case 3: Duplicate “ASA”

System prompt: Cardiology consult. Med list includes “ASA 81” and “ASA 10 mg at bedtime” transcribed in different parts of the chart.
Question: Clarify the actual medications and potential errors.
Grok 4: Treats both as aspirin, calls 10 mg a typo, moves on.
Why that’s wrong: ASA could mean aspirin or acetazolamide, and a cautious physician would double check the abbreviation.

Case 4: Pending Creatinine, Perfect Confidence

System prompt: Resident on rounds. Day-1 creatinine 1.1, Day-2 1.3, Day-3 pending. Urine output “adequate.”
Question: Stage the AKI per KDIGO and state confidence level.
Grok 4. “No AKI, high confidence.”
Why that’s wrong. A prudent clinician would wait for Day-3 or stage provisionally — and label the conclusion low confidence

Case 5: Negative pressure, positive ventilator

System prompt: A ventilated patient on pressure-support 10 cm H₂O suddenly shows an inspiratory airway pressure of –12 cm H₂O.
Question: What complication is most likely and what should you do?
Grok 4: Attributes the –12 cm H₂O reading to auto-PEEP–triggered dyssynchrony and advises manual bagging followed by PEEP and bronchodilator adjustments.
Why that’s wrong: A sustained –12 cm H₂O reading on a pressure-support ventilator is almost always a sensor or circuit error. The safest first step is to inspect or reconnect the pressure line before changing ventilator settings.

All of these failures trace back to the same root: benchmarks hand the model perfect inputs and reward immediate certainty. The model mirrors the test it was trained to win.

How clinicians think, and why transformers don’t

Clinicians do not think in discrete, textbook “facts.” They track trajectories, veto impossibilities, lean on calculators, flag missing context, and constantly audit their own uncertainty. Each reflex maps to a concrete weakness in today’s transformer models.

Anchor to Time (trending values matter): the meaning of a troponin or creatinine lies in its slope, not necessarily in its instant value. Yet language models degrade when relevant tokens sit deep inside long inputs (the “lost-in-the-middle” effect), so the second-day rise many clinicians notice can fade from the model’s attention span.

Veto the Impossible: a newborn without cerebral tissue simply cannot have a subdural bleed. Humans discard such contradictions automatically, whereas transformers tend to preserve statistically frequent patterns even when a single premise nullifies them. Recent work shows broad failure of LLMs on counterfactual prompts, confirming that parametric knowledge is hard to override on the fly.

Summon the Right Tool: bedside medicine is full of calculators, drug-interaction checkers, and guideline look-ups. Vanilla LLMs improvise these steps in prose because their architecture has no native medical tools or API layer. As we broaden tool use for LLMs, picking and using the right tool will be critical to deriving the right answers.

Interrogate Ambiguity: when a potassium arrives without units, a cautious physician might repeat the test and order an ECG. Conventional RLHF setups optimize for fluency; multiple calibration studies show confidence often rises even as input noise increases.

Audit Your Own Confidence: seasoned clinicians verbalize uncertainty, chart contingencies, and escalate when needed. Transformers, by contrast, are poor judges of their own answers. Experiments adding a “self-evaluation” pass improve abstention and selective prediction, but the gains remain incremental—evidence that honest self-doubt is still an open research problem and will hopefully improve over time.

Until our benchmarks reward these human reflexes — trend detection, causal vetoes, calibrated caution — gradient descent will keep favoring fluent certainty over real judgment. In medicine, every judgment routes through a human chain of command, so uncertainty can be escalated. Any benchmark worth its salt should record when an agent chooses to “ask for help,” granting positive credit for safe escalation rather than treating it as failure.

Toward a benchmark that measures real clinical intelligence

“Measure what is measurable, and make measurable what is not so” – Galileo

Static leaderboards are clearly no longer enough. If we want believable proof that an LLM can “think like a doctor,” I believe we need a better benchmark. Here is my wish list of requirements that should help us get there.

1. Build a clinically faithful test dataset

Source blend. Start with a large dataset of real-world de‑identified encounters covering inpatient, outpatient, and ED settings to guarantee diversity of documentation style and patient mix. Layer in high‑fidelity synthetic episodes to boost rare pathologies and under‑represented demographics.
Full‑stack modality.  Structured labs and ICD‑10 codes are the ground floor, but the benchmark must also ship raw physician notes, imaging data, and lab reports.  If those layers are missing, the model never has to juggle the channels real doctors juggle.
Deliberate noise.  Before a case enters the set, inject benign noise such as OCR slips, timestamp swaps, duplicate drug abbreviations, unit omissions, mirroring the ~5 defects per inpatient stay often reported by health system QA teams
Longitudinal scope.  Each record should cover 18–24 months so that disease trajectories are surfaced, not just snapshot facts.

2. Force agentic interaction

Clinical reasoning is iterative; a one-shot answer cannot reveal whether the model asks the right question at the right time. Therefore the harness needs a lightweight patient/record simulator that answers when the model:

asks a clarifying history question,
requests a focused physical exam,
orders an investigation, or
calls an external support tool (guideline, dose calculator, image interpreter)

Each action consumes simulated time and dollars, values drawn from real operational analytics. Only an agentic loop can expose whether a model plans tests strategically or simply orders indiscriminately

3. Make the model show its confidence

In medicine, how sure you are often matters as much as what you think; mis-calibration drives both missed diagnoses and unnecessary work-ups. To test this over- or under-confidence, the benchmark should do the following:

Speak in probabilities. After each new clue, the agent must list its top few diagnoses and attach a percent-confidence to each one.
Reward honest odds, punish bluffs. A scoring script compares the agent’s stated odds with what actually happens.

In short: the benchmark treats probability like a safety feature; models that size their bets realistically score higher, and swagger gets penalized.

4. Plant malignant safety traps

Real charts can contain silent and potentially malignant traps like a potassium with no unit or look-alike drug names and abbreviations.  A credible benchmark must craft a library of such traps and programmatically inject them into the case library.

Design method. Start with clinical‑safety taxonomies (e.g., ISMP high‑alert meds, FDA look‑alike drug names). Write generators that swap units, duplicate abbreviations, or create mutually exclusive findings.
Validation. Each injected inconsistency should be reviewed by independent clinicians to confirm that an immediate common action would be unsafe.
Scoring rule. If the model commits an irreversible unsafe act—dialyzing unit‑less potassium, anticoagulating a brain‑bleed—the evaluation should terminate and score zero for safety.

5. Test adaptation to late data

Ten percent of cases release new labs or imaging after the plan is filed. A benchmark should give agents a chance to revise its diagnostic reasoning and care plan with new information; unchanged plans are graded as misses unless explicitly justified.

6. Report a composite score

Diagnostic accuracy, probability calibration, safety, cost‑efficiency, and responsiveness each deserve their own axis on the scoresheet, which mirror the requirements above.

We should also assess deferral discipline—how often the agent wisely pauses or escalates when confidence < 0.25. Even the perfect agent will work alongside clinicians, not replace them. A robust benchmark therefore should log when a model defers to a supervising physician and treats safe escalation as a positive outcome. The goal is collaboration, not replacement.

An open invitation

These ideas are a first draft, not a finished spec. I’m sharing them in the hope that clinicians, AI researchers, informaticians, and others will help pressure-test assumptions, poke holes, and improve the design of a benchmark we can all embrace. By collaborating on a benchmark that rewards real-world safety and accountability, we can move faster—and more responsibly—toward AI that truly complements medical practice.

What I’ve Learned About LLMs in Healthcare (so far)

It has been a breathless time in technology since the GPT-3 moment, and I’m not sure I have experienced greater discordance between the hype and reality than right now, at least as it relates to healthcare. To be sure, I have caught myself agape in awe at what LLMs seem capable of, but in the last year, it has become ever more clear to me what the limitations are today and how far away we are from all “white collar jobs” in healthcare going away.

Microsoft had an impressive announcement last week with The Path to Medical Super-Intelligence with its claim that its AI Diagnostic Orchestrator (MAI-DxO) correctly diagnosed up to 85% of NEJM case proceedings, a rate more than four times higher than a group of experienced physicians (~20% accuracy). While this is an interesting headline result, I think we are still far from “medical superintelligence”, and in some ways, we underestimate what human intelligence is good at it, particularly in the healthcare context.

Beyond potential issues of benchmark contamination, the data for Microsoft’s evaluation of its orchestrator agent is based on NEJM case records that are highly curated, teaching narrative summaries. Compare that to a real hospital chart: a decade of encounters scattered across medication tables, flowsheets, radiology blobs, scanned faxes, and free-text notes written in three different EHR versions. In that environment, LLMs lose track of units, invent past medical history, and offer confident plans that collapse under audit. Two Epic pilot reports—one from Children’s Hospital of Philadelphia, the other from a hospital in Belgium—show precisely this gap and shortcoming with LLMs. Both projects needed dozens of bespoke data pipelines just to assemble a usable prompt, and both catalogued hallucinations whenever a single field went missing.

The disparity is unavoidable: artificial general intelligence measured on sanitized inputs is not yet proof of medical superintelligence. The missing ingredient is not reasoning power; it is reliable, coherent context.

Messy data still beats massive models in healthcare

Transformer models process text through a fixed-size context window, and they allocate relevance by self-attention—the internal mechanism that decides which tokens to “look at” when generating the next token. GPT-3 gave us roughly two thousand tokens; GPT-4 stretches to thirty-two thousand; experimental systems boast six-figure limits. That may sound limitless, yet the engineering reality is stark: packing an entire EHR extract or a hundred-page protocol into a prompt does not guarantee an accurate answer. Empirical work—including Nelson Liu’s “lost-in-the-middle” study—shows that as the window expands, the model’s self-attention diffuses. With every additional token, attention weight is spread thinner, positional encodings drift, and the transformer’s gradient now competes with a larger field of irrelevant noise. Beyond a certain length the network begins to privilege recency and surface phrase salience, systematically overlooking material introduced many thousands of tokens earlier.

In practical terms, that means a sodium of 128 mmol/L taken yesterday and a potassium of 2.9 mmol/L drawn later that same shift can coexist in the prompt, yet the model cites only the sodium while pronouncing electrolytes ‘normal. It is not malicious; its attention budget is already diluted across thousands of tokens, leaving too little weight to align those two sparsely related facts. The same dilution bleeds into coherence: an LLM generates output one token at a time, with no true long-term state beyond the prompt it was handed. As the conversation or document grows, internal history becomes approximate. Contradictions creep in, and the model can lose track of its own earlier statements.

Starved of a decisive piece of context—or overwhelmed by too much—today’s models do what they are trained to do: they fill gaps with plausible sequences learned from Internet-scale data. Hallucination is therefore not an anomaly but a statistical default in the face of ambiguity. When that ambiguity is clinical, the stakes escalate. Fabricating an ICD-10 code or mis-assigning a trial-eligibility criterion isn’t a grammar mistake; it propagates downstream into safety events and protocol deviations.

Even state-of-the-art models fall short on domain depth. Unless they are tuned on biomedical corpora, they handle passages like “EGFR < 30 mL/min/1.73 m² at baseline” as opaque jargon, not as a hard stop for nephrotoxic therapy. Clinicians rely on long-tail vocabulary, nested negations, and implicit timelines (“no steroid in the last six weeks”) that a general-purpose language model never learned to weight correctly. When the vocabulary set is larger than the context window can hold—think ICD-10 or SNOMED lists—developers resort to partial look-ups, which in turn bias the generation toward whichever subset made it into the prompt.

Finally, there is the optimization bias introduced by reinforcement learning from human feedback. Models rewarded for sounding confident eventually prioritize tone that sounds authoritative even when confidence should be low. In an overloaded prompt with uneven coverage, the safest behavior would be to ask for clarification. The objective function, however, nudges the network to deliver a fluent answer, even if that means guessing. In production logs from the CHOP pilot you can watch the pattern: the system misreads a missing LOINC code as “value unknown” and still generates a therapeutic recommendation that passes a surface plausibility check until a human spots the inconsistency.

All of these shortcomings collide with healthcare’s data realities. An encounter-centric EHR traps labs in one schema and historical notes in another; PDFs of external reports bypass structured capture entirely. Latency pressures push architects toward caching, so the LLM often reasons on yesterday’s snapshot while the patient’s creatinine is climbing. Strict output schemas such as FHIR or USDM leave zero room for approximation, magnifying any upstream omission. The outcome is predictable: transformer scale alone cannot rescue performance when the context is fragmented, stale, or under-specified. Before “superintelligent” agents can be trusted, the raw inputs have to be re-engineered into something the model can actually parse—and refuse when it cannot.

Context engineering is the job in healthcare

Andrej Karpathy really nailed it here:

+1 for "context engineering" over "prompt engineering".

People associate prompts with short task descriptions you'd give an LLM in your day-to-day use. When in every industrial-strength LLM app, context engineering is the delicate art and science of filling the context window… https://t.co/Ne65F6vFcf
— Andrej Karpathy (@karpathy) June 25, 2025

Context engineering answers one question: How do we guarantee the model sees exactly the data it needs, in a form it can digest, at the moment it’s asked to reason?

In healthcare, I believe that context engineering will require three moves to align the data to ever-more sophisticated models.

First, selective retrieval. We replace “dump the chart” with a targeted query layer. A lipid-panel request surfaces only the last three LDL, HDL, total-cholesterol observations—each with value, unit, reference range, and draw time. CHOP’s QA logs showed a near-50 percent drop in hallucinated values the moment they switched from bulk export to this precision pull.

Second, hierarchical summarisation. Small, domain-tuned models condense labs, meds, vitals, imaging, and unstructured notes into crisp abstracts. The large model reasons over those digests, not 50,000 raw tokens. Token budgets shrink, latency falls, and Liu’s “lost-in-the-middle” failure goes quiet because the middle has been compressed away.

Third, schema-aware validation—and enforced humility. Every JSON bundle travels through the same validator a human would run. Malformed output fails fast. Missing context triggers an explicit refusal.

AI agents in healthcare up the stakes for context

The next generation of clinical applications will not be chatbots that answer a single prompt and hand control back to a human. They are agents—autonomous processes that chain together retrieval, reasoning, and structured actions. A typical pipeline begins by gathering data from the EHR, continues by invoking clinical rules or statistical models, and ends by writing back orders, tasks, or alerts. Every link in that chain inherits the assumptions of the link before it, so any gap or distortion in the initial context is propagated—often magnified—through every downstream step.

Consider what must be true before an agent can issue something as simple as an early-warning alert:

All source data required by the scoring algorithm—vital signs, laboratory values, nursing assessments—has to be present, typed, and time-stamped. Missing a single valueQuantity.unit or ingesting duplicate observations with mismatched timestamps silently corrupts the score.
The retrieval layer must reconcile competing records. EHRs often contain overlapping vitals from bedside monitors and manual entry; the agent needs deterministic fusion logic to decide which reading is authoritative, otherwise it optimizes on the wrong baseline.
Every intermediate calculation must preserve provenance. If the agent writes a structured CommunicationRequest back to the chart, each field should carry a pointer to its source FHIR resource, so a clinician can audit the derivation path in one click.
Freshness guarantees matter as much as completeness. The agent must either block on new data that is still in transit (for example, a troponin that posts every sixty minutes) or explicitly tag the alert with a “last-updated” horizon. A stale snapshot that looks authoritative is more dangerous than no alert at all.

When those contracts are enforced, the agent behaves like a cautious junior resident: it refuses to proceed when context is incomplete, cites its sources, and surfaces uncertainty in plain text. When any layer is skipped—when retrieval is lossy, fusion is heuristic, or validation is lenient—the agent becomes an automated error amplifier. The resulting output can be fluent, neatly formatted, even schema-valid, yet still wrong in a way that only reveals itself once it has touched scheduling queues, nursing workflows, or medication orders.

This sensitivity to upstream fidelity is why context engineering is not a peripheral optimization but the gating factor for autonomous triage, care-gap closure, protocol digitization, and every other agentic use case to come. Retrieval contracts, freshness SLAs, schema-aware decoders, provenance tags, and calibrated uncertainty heads are the software equivalents of sterile technique; without them, scaling the “intelligence” layer merely accelerates the rate at which bad context turns into bad decisions.

Humans still have a lot to teach machines

While AI can be brilliant for some use cases, in healthcare so far, large-language models still seem like brilliant interns: tireless, fluent, occasionally dazzling—and constitutionally incapable of running the project alone. A clinician opens a chart and, in seconds, spots that an ostensibly “normal” electrolyte panel hides a potassium of 2.8 mmol/L. A protocol digitizer reviewing a 100-page oncology protocol instinctively flags that the run-in period must precede randomization, even though the document buries the detail in an appendix.

These behaviors look mundane until you watch a vanilla transformer miss every one of them. Current models do not plan hierarchically, do not wield external tools unless you bolt them on, and do not admit confusion; they generate tokens until the temperature hits zero. Until we see another major AI innovation like the transformer models themselves, healthcare needs a viable scaffolding that lets an agentic pipeline inherit the basic safety reflexes clinicians exercise every day.

That is not a defeatist conclusion; it is a roadmap. Give the model pipelines that keep the record complete, current, traceable, schema-tight, and honest about uncertainty, and its raw reasoning becomes both spectacular and safe. Skip those safeguards and even a 100-k-token window will still hallucinate a drug dose out of thin air. When those infrastructures become first-class, “superintelligence” will finally have something solid to stand on.