During my time at Apple, I developed deep respect for the engineers and data scientists working on algorithms for the Apple Watch. Watching them develop features like irregular rhythm detection for atrial fibrillation gave me an appreciation for just how hard this work is. The Apple Watch generates continuous, high-frequency data streams: heart rate sampled throughout the day, HRV measured during sleep, motion data at sub-second intervals. The sheer volume creates an illusion of richness. But buried in that firehose is tremendous noise—motion artifacts, inconsistent wear, environmental confounders, biological variation that has nothing to do with disease.
The teams that made progress did so by ruthlessly constraining the problem. They picked specific hypotheses. They designed algorithms tuned for particular signals, with explicit handling of known noise sources. They validated against clinical ground truth and went through regulatory review to ensure their claims matched the evidence. The result was something like the irregular rhythm notification, a narrow, bounded capability with known performance characteristics.
I thought about those lessons when I read Geoffrey Fowler’s recent Washington Post piece about ChatGPT Health. He connected a decade of Apple Watch data—29 million steps, 6 million heartbeats—and asked the AI to grade his cardiac health. It gave him an F. He panicked. His actual doctor told him he was fine. When he asked again, his grade jumped to a C, then a B. The AI, it turned out, was guessing.
This isn’t necessarily a story about ChatGPT being bad at health. It’s a story about a fundamental mismatch between what large language models reliably do and what longitudinal health analysis requires.
What OpenAI and Anthropic Launched
Earlier this month, OpenAI released ChatGPT Health, allowing users to connect Apple HealthKit data and medical records to their conversations. The pitch: “understand patterns over time—not just moments of illness.” Anthropic followed with Claude’s health integration, promising to help users “detect patterns across fitness and health metrics.”
To their credit, both companies include disclaimers that these tools support rather than replace medical professionals. OpenAI describes physician collaboration in developing the product and emphasizes privacy isolation. These are reasonable positions for an early product.
Yet the interface still happily emits letter grades, an affordance users will naturally interpret as medical assessment. And when Fowler’s grades swung wildly across repeated queries, no disclaimer prevented the anxiety of that initial F.
Why This Is Hard: The Humbling Math
The scale of longitudinal health data defies intuition, perhaps even for AI. A single day of Apple Watch wear might generate hundreds of heart rate samples, dozens of HRV measurements, continuous motion data, sleep stage classifications, and workout metrics. Multiply by ten years and you’re looking at millions of individual data points. Any interface to an LLM must select and aggregate this data, and those aggregation choices become the product. Summarize wrong, and you’ve lost the signal before the model sees anything.
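To make that concrete, here is a minimal sketch of the aggregation step, assuming hypothetical per-sample heart rate readings rather than the actual HealthKit export format. Every summarization choice in a function like this decides what the model can and cannot see.

```python
# A minimal sketch (not OpenAI's or Anthropic's pipeline) of the aggregation
# choice: collapsing millions of raw samples into daily summaries before any
# model sees them. Field names are illustrative, not the HealthKit schema.
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime
from statistics import mean, median

@dataclass
class Sample:
    timestamp: datetime
    bpm: float  # a single heart rate reading

def daily_summary(samples: list[Sample]) -> dict[str, dict[str, float]]:
    """Group raw readings by calendar day and keep a few robust statistics.

    Every choice here (median vs. mean, which extremes to keep) changes
    what any downstream analysis can ever recover.
    """
    by_day: dict[str, list[float]] = defaultdict(list)
    for s in samples:
        by_day[s.timestamp.date().isoformat()].append(s.bpm)

    return {
        day: {
            "n_samples": len(vals),
            "median_bpm": median(vals),
            "mean_bpm": round(mean(vals), 1),
            "min_bpm": min(vals),
            "max_bpm": max(vals),
        }
        for day, vals in by_day.items()
    }
```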
Here’s the deeper problem: when you export Apple Health data, you often get the raw firehose, including noisy readings that Apple’s own algorithms discard. Apple’s “High Heart Rate” notification isn’t just a database query; it’s a filtered, debounced, validated signal built on sensor fusion. An LLM analyzing a raw export might see everything, but it doesn’t automatically apply the judgment that engineers and clinicians have already encoded to put the data in the right context.
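As a toy illustration (not Apple’s actual algorithm), here is the kind of filtering and debouncing a raw export lacks: implausible readings are dropped, and a flag only fires after a sustained run of elevated samples, not a single motion-artifact spike.

```python
# A toy illustration (not Apple's algorithm) of why a raw export differs from
# a validated signal: drop implausible readings, then require a sustained run
# of elevated samples before flagging anything. All constants are assumptions.
PLAUSIBLE_BPM = (30, 220)   # assumed plausibility bounds; real sensor fusion is richer
THRESHOLD_BPM = 120         # illustrative "high heart rate at rest" cutoff
SUSTAINED_SAMPLES = 5       # require several consecutive elevated readings

def high_hr_alert(raw_bpm: list[float]) -> bool:
    cleaned = [b for b in raw_bpm if PLAUSIBLE_BPM[0] <= b <= PLAUSIBLE_BPM[1]]
    run = 0
    for b in cleaned:
        run = run + 1 if b > THRESHOLD_BPM else 0
        if run >= SUSTAINED_SAMPLES:
            return True
    return False

# One motion-artifact spike to 210 bpm does not trigger an alert here,
# while a sustained run of genuinely elevated readings does.
print(high_hr_alert([72, 210, 75, 74, 73]))            # False
print(high_hr_alert([130, 132, 128, 131, 135, 129]))   # True
```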
Where LLMs Struggle by Default
Large language models can describe trends, explain medical concepts, and even write Python scripts to analyze data. But they don’t guarantee statistical correctness unless paired with explicit computation, data-quality checks, and calibrated uncertainty. The issue isn’t that the underlying transformer architecture can’t process sequential data—researchers have adapted transformers for time series with considerable success. The issue is that general-purpose LLMs like GPT and Claude lack the domain-specific inductive biases that reliable health analysis requires.
LLMs don’t have built-in numeric priors for trend estimation, measurement uncertainty, or statistical significance. They can imitate statistical reasoning, but the reliability is jagged. More critically, they have no native concept of clinical recency (how to weight last month versus last year), personal baselines (your normal versus population normal), seasonal patterns, or device-change artifacts (did your HRV shift because of your health, or because you upgraded watches?). Unless you explicitly encode these priors into the system, the model treats your health history as a bag of facts rather than a structured time series.
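A rough sketch of what encoding two of those priors might look like, with constants that are illustrative assumptions rather than clinical recommendations: an exponentially weighted personal baseline that favors recent data, and a crude check for a level shift that coincides with a device change.

```python
# A hedged sketch of priors a general chat interface does not apply by default:
# a recency-weighted personal baseline, and a simple test for whether an
# apparent shift lines up with a device change. Constants are assumptions.
from datetime import date

def weighted_baseline(daily_values: list[tuple[date, float]],
                      today: date,
                      half_life_days: float = 60.0) -> float:
    """Personal baseline in which a reading from `half_life_days` ago
    counts half as much as one from today."""
    num = den = 0.0
    for day, value in daily_values:
        age = (today - day).days
        w = 0.5 ** (age / half_life_days)
        num += w * value
        den += w
    return num / den

def shift_coincides_with_device_change(daily_values: list[tuple[date, float]],
                                       device_change: date,
                                       min_shift: float = 5.0) -> bool:
    """Flag when the before/after means differ by more than `min_shift`,
    a hint that the 'trend' may be the hardware, not the person."""
    before = [v for d, v in daily_values if d < device_change]
    after = [v for d, v in daily_values if d >= device_change]
    if not before or not after:
        return False
    return abs(sum(after) / len(after) - sum(before) / len(before)) >= min_shift
```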
Even when the model attempts to use tools like Python for analysis, it can fail to structure the problem correctly or hallucinate inappropriate statistical thresholds. And when it falls back to native token processing for conversational requests, it generates plausible-sounding assessments without the statistical grounding those assessments imply. In Fowler’s test, the system produced grades despite obvious data-quality ambiguity, without showing confidence intervals, source weighting, device-change handling, or transparency about exactly what data it used and how.
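For contrast, here is a minimal sketch of the kind of statistical grounding a grade implies: a trend estimate with a bootstrap confidence interval, reported alongside the number. This is purely illustrative, not a description of how either product works.

```python
# A minimal sketch of "statistical grounding": estimate a trend in resting
# heart rate with a percentile-bootstrap confidence interval, and report the
# interval rather than a bare letter grade. Illustrative only.
import random

def trend_with_ci(values: list[float], n_boot: int = 2000,
                  alpha: float = 0.05) -> tuple[float, float, float]:
    """Least-squares slope per time step, with a percentile bootstrap CI."""
    def slope(points: list[tuple[int, float]]) -> float:
        n = len(points)
        mx = sum(x for x, _ in points) / n
        my = sum(y for _, y in points) / n
        sxx = sum((x - mx) ** 2 for x, _ in points)
        sxy = sum((x - mx) * (y - my) for x, y in points)
        return sxy / sxx if sxx else 0.0

    points = list(enumerate(values))
    est = slope(points)
    boots = sorted(slope(random.choices(points, k=len(points)))
                   for _ in range(n_boot))
    lo = boots[int(alpha / 2 * n_boot)]
    hi = boots[int((1 - alpha / 2) * n_boot) - 1]
    return est, lo, hi

est, lo, hi = trend_with_ci([62, 63, 61, 64, 65, 63, 66, 67, 65, 68])
print(f"resting HR trend: {est:+.2f} bpm/day (95% CI {lo:+.2f} to {hi:+.2f})")
```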
The Network-to-Product Gap
In a recent talk at Y Combinator, Andrej Karpathy introduced a concept that captures what’s happening here: the “network-to-product gap.”
He tells a story about riding in a self-driving car in 2013. For 40 minutes around Palo Alto, the drive was flawless. He thought autonomous vehicles were imminent. Twelve years later, we’re still working on it, and perhaps we’re almost there. The demo worked; the product took a while.
His insight: “Demos only require that anything works. But in many cases, lots of things must work for people to actually use the product. This is especially the case in high-reliability areas.”
Healthcare is most definitely high-reliability. An AI that gives you an F for cardiac health when you’re fine causes real harm—anxiety, unnecessary medical visits, erosion of trust. And the inverse is equally dangerous: reassuring someone at genuine risk.
Karpathy advocates for “partial autonomy products” with an “autonomy slider”—systems where the AI does more as tasks get simpler and reliability gets higher, but humans stay in the loop for high-stakes decisions. His exemplar is Cursor, the coding assistant, which does extensive context management, orchestrates multiple specialized model calls, provides application-specific UIs for fast verification, and crucially, keeps the AI “on a leash”—constraining outputs based on what it can reliably do.
The health products that OpenAI and Anthropic shipped are somewhat unfinished in this regard. In Fowler’s experience, the system happily produced cardiac grades despite data-quality issues that should have triggered uncertainty, without the visual interfaces or confidence bounds that would let a user verify the assessment.
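One way to picture Karpathy’s leash in this setting is an output gate: the system checks data quality before it is allowed to pass an assessment through, and otherwise returns the caveats instead of a grade. The thresholds below are assumptions for illustration, not a claim about how either product is built.

```python
# A hedged sketch of "keeping the AI on a leash" for health output: gate the
# model's answer behind data-quality checks and surface the caveats instead
# of a grade when they fail. Thresholds are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class DataQuality:
    days_covered: int      # days with any usable samples
    days_total: int        # days in the requested window
    device_changes: int    # hardware swaps in the window

def cardiac_assessment(quality: DataQuality, model_answer: str) -> str:
    coverage = quality.days_covered / max(quality.days_total, 1)
    caveats = []
    if coverage < 0.6:
        caveats.append(f"only {coverage:.0%} of days have usable data")
    if quality.device_changes > 0:
        caveats.append(f"{quality.device_changes} device change(s) may shift baselines")
    if caveats:
        return ("I can't responsibly grade this. Issues: " + "; ".join(caveats)
                + ". Consider reviewing the underlying numbers with your doctor.")
    return model_answer  # pass the model's assessment through only when checks pass

print(cardiac_assessment(
    DataQuality(days_covered=1200, days_total=3650, device_changes=3),
    "Your cardiac metrics look stable."))
```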
The Unbounded Magic Box Problem
Both companies pitched their health products as nearly unbounded—connect your data, ask any health question, get personalized insights. The marketing implied these systems could do for health what ChatGPT does for writing.
But “write me a blog post” and “analyze a decade of my biometric data for concerning patterns” are categorically different tasks. The first plays to LLM strengths: synthesizing information, generating coherent text. The second requires capabilities that don’t emerge reliably from language model training: rigorous trend detection, individual baseline computation, noise filtering, uncertainty quantification.
This is jagged intelligence in action. LLMs are superhuman at some tasks and make mistakes no human would make at others. In healthcare, the jaggedness is particularly dangerous because conversational fluency creates false confidence. The model sounds authoritative because it genuinely knows a lot about health in general. But reasoning about your specific longitudinal data requires statistical discipline the model doesn’t reliably apply.
The self-driving analogy holds. A perfect demo drive doesn’t mean the technology works everywhere. A plausible health assessment doesn’t mean the assessment is reliable. The gap between “looks like it works” and “actually works across real-world cases” can take years of focused engineering to close.
Where LLMs Genuinely Help
None of this means LLMs are useless for health. They’re transformative in the right contexts. Explaining what VO2 max means, how HRV works, what lab values indicate. Synthesizing information across sources, helping someone understand a new diagnosis, preparing questions for a doctor. Answering bounded analytical questions like “how did my step count change after I had kids?” These tasks play to the model’s actual strengths in language understanding, information synthesis, and education. Fowler noted he liked using ChatGPT Health for plots and narrow questions. That’s exactly right—the tool helps when the task matches its capabilities.
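That bounded step-count question reduces to a well-posed computation. A minimal sketch, assuming daily step totals keyed by date and a user-supplied cutoff:

```python
# A minimal sketch of answering a bounded question ("how did my step count
# change after date X?"), assuming daily totals keyed by date.
from datetime import date
from statistics import mean

def step_change(daily_steps: dict[date, int], cutoff: date) -> str:
    before = [v for d, v in daily_steps.items() if d < cutoff]
    after = [v for d, v in daily_steps.items() if d >= cutoff]
    if not before or not after:
        return "Not enough data on one side of the cutoff."
    b, a = mean(before), mean(after)
    return f"Average daily steps went from {b:,.0f} to {a:,.0f} ({(a - b) / b:+.0%})."
```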
The problem is when the interface invites unbounded health questions and the model confidently answers them anyway.
The Path Forward
Karpathy ended his talk with an image from Iron Man: the suit can be augmentation (human in control, AI assisting) or autonomous agent (AI acting independently). His advice: “It’s less Iron Man robots and more Iron Man suits. Less flashy demos of autonomous agents, more partial autonomy products.”
For health AI, this means building systems where the LLM helps you understand your data and prepare better questions for your actual doctor, not yet a system that can autonomously give you a letter grade for cardiac health. The underlying technology is powerful. But power without appropriate constraints isn’t helpful in high-reliability domains. The teams I watched succeed at Apple had the discipline to narrow the problem until the signal was clear and the claims matched the evidence.
ChatGPT Health and Claude’s health integration, as Fowler tested them, represent a different approach: broad capability claims, minimal output constraints, reliability that varies query to query. The scale of consumer use of these LLMs for health questions is already undeniable. There is hard work ahead for all companies, including OpenAI and Anthropic, to build systems that match the seriousness of the questions being asked. The demand is real, the technology is genuinely powerful, and the potential to democratize health understanding is significant. The question is whether we do the harder work of earning trust through reliability, or settle for the appearance of capability.