I had a front-row seat to the evolution of the Apple Watch as a health device. In the early days, it was clear that activity tracking was the killer use case, and the Apple Watch hit its stride with millions of users closing their three Activity Rings every day. Over time, Apple added more sensors and algorithms, with the FDA clearances of the irregular rhythm notification and the ECG app standing out as huge milestones.
The “Dear Tim” letters about how the Apple Watch had saved someone’s life were of course anecdotal but hugely motivating. It made sense that the Apple Watch should be your ever-present intelligent health guardian. In marketing clips, a gentle haptic warns someone watching TV of an erratic pulse, or a fall alert summons help from a lonely farmhouse. Those use cases are real achievements, yet they always felt reactive: the Watch tells you what just happened. What many of us want is something closer to a car’s check-engine light—a quiet, always-on sentinel that notices when the machinery is drifting out of spec long before it stalls on the highway.
I was excited to read a research paper from Apple’s machine-learning group that nudges the Watch in that direction. The team trained what they call a Wearable Behavior Model (WBM) on roughly 2.5 billion hours of Apple Watch data collected from 162,000 volunteer participants in the Apple Heart & Movement Study. Instead of feeding the model raw second-by-second sensor traces, they distilled each day into the same high-level measures users already see in the Health app—steps, active energy, resting and walking heart rates, HRV, sleep duration, VO₂ max, gait metrics, blood-oxygen readings, even the time you spend standing. Those twenty-plus signals were aggregated into hourly slots, giving the AI a time-lapse view of one’s physiology and lifestyle: 168 composite “frames” per week, each summarizing a full hour of living.
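As a rough sketch of that aggregation step (the field names and the choice of per-signal reducers are my assumptions, not the paper's published schema), hourly frames might be built from per-minute samples like this:

```python
from statistics import mean

# Hypothetical sketch: collapse per-minute sensor samples into hourly
# summary "frames"; 168 such frames make up one week-long model input.
HOURS_PER_WEEK = 7 * 24  # 168

def hourly_frames(minute_samples):
    """minute_samples: dicts with 'hour', 'steps', 'hr' keys
    (illustrative field names, not the study's actual schema)."""
    by_hour = {}
    for s in minute_samples:
        by_hour.setdefault(s["hour"], []).append(s)
    frames = []
    for hour in sorted(by_hour):
        rows = by_hour[hour]
        frames.append({
            "hour": hour,
            "steps": sum(r["steps"] for r in rows),   # totals for counts
            "mean_hr": mean(r["hr"] for r in rows),   # averages for rates
            "coverage": len(rows) / 60.0,             # fraction of minutes observed
        })
    return frames

samples = [
    {"hour": 9, "steps": 40, "hr": 72},
    {"hour": 9, "steps": 55, "hr": 76},
    {"hour": 10, "steps": 0, "hr": 61},
]
frames = hourly_frames(samples)
print(frames[0])
```

The `coverage` field is one plausible way to represent the missingness that any wrist-worn device produces when it is off the wrist or charging.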
Why behavior, not just biosignals?
We have years of studies on single sensors—pulse oximetry for sleep apnea, ECG for atrial fibrillation—yet most diseases emerge from slow drifts rather than sudden spikes. Infection, for example, pushes resting heart rate up and HRV down across several nights; pregnancy alters nightly pulse and sleep architecture months before a visible bump. Hour-scale summaries capture those patterns far better than any isolated waveform snippet. Apple’s scientists therefore looked for a sequence model able to ingest week-long, imperfectly sampled series without collapsing under missing values. Standard Transformers, the current favorite in modern AI, turned out to be brittle here; the winning architecture was a state-space network called Mamba-2, which treats the timeline as a continuous signal and scales linearly with sequence length. After training for hundreds of GPU-days, the network could compress a week of life into a single 256-dimensional vector—a behavioral fingerprint of how your body has been operating.
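To make that interface concrete, here is a shape-level sketch only: a week of hourly frames goes in, a fixed 256-dimensional vector comes out. A toy mean-pool plus random projection stands in for the actual Mamba-2 network, and the feature count is illustrative:

```python
import numpy as np

# Shape-level sketch of the encoder contract, not the real model:
# (168 hours x FEATURES signals) in, one 256-dim embedding out.
HOURS, FEATURES, EMBED_DIM = 168, 27, 256  # FEATURES is illustrative

rng = np.random.default_rng(0)
W = rng.normal(size=(FEATURES, EMBED_DIM)) / np.sqrt(FEATURES)

def embed_week(frames: np.ndarray) -> np.ndarray:
    """frames: (168, FEATURES) hourly summaries; NaNs mark missing hours."""
    pooled = np.nanmean(frames, axis=0)  # tolerate gaps in the record
    return pooled @ W                    # (256,) behavioral fingerprint

week = rng.normal(size=(HOURS, FEATURES))
week[40:50] = np.nan                     # e.g. the watch was off the wrist
vec = embed_week(week)
print(vec.shape)
```

The point of the sketch is the contract: downstream consumers only ever see the fixed-size fingerprint, never the raw, gap-riddled timeline.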
What can the model see?
Apple put that fingerprint to the test on 57 downstream tasks. Some were relatively stable attributes (beta-blocker usage, smoking status, history of hypertension); others were genuinely dynamic (current pregnancy, recent infection, low-quality sleep this week). With nothing more than a linear probe on top of the frozen embedding, WBM often equaled or outperformed Apple’s earlier foundation model that had been trained on raw pulse waveforms. Pregnancy was a headline result: on its own, the behavior model achieved an area under the ROC curve (AUROC) in the high 0.80s; when its embedding was concatenated with the pulse-wave embedding, the combined system topped 0.92. Infection detection cleared 0.75, again without any fine-tuned, disease-specific engineering. In simpler language: after watching how your activity, heart rate, and sleep ebb and flow over a fortnight, the Watch can hint that you are expecting—or fighting a virus—days before you would otherwise know.
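The evaluation recipe itself—a linear probe on frozen embeddings, alone or concatenated—is straightforward to sketch. The following uses synthetic stand-in vectors and scikit-learn; the numbers it produces are not the paper's:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Synthetic stand-ins for frozen embeddings: 256-dim behavioral vectors
# (WBM) and pulse-waveform vectors (PPG model). Labels are fabricated
# so the probe has something to find.
n = 1000
wbm = rng.normal(size=(n, 256))
ppg = rng.normal(size=(n, 256))
y = (wbm[:, 0] + 0.5 * ppg[:, 0] + rng.normal(scale=0.5, size=n) > 0).astype(int)

def probe_auroc(X, y):
    """Linear probe: logistic regression on frozen features, AUROC on a held-out half."""
    split = n // 2
    clf = LogisticRegression(max_iter=1000).fit(X[:split], y[:split])
    scores = clf.predict_proba(X[split:])[:, 1]
    return roc_auc_score(y[split:], scores)

auroc_wbm = probe_auroc(wbm, y)
auroc_fused = probe_auroc(np.concatenate([wbm, ppg], axis=1), y)
print(f"WBM alone: {auroc_wbm:.3f}, fused: {auroc_fused:.3f}")
```

Because the foundation models stay frozen, adding a new downstream task costs only a logistic regression, which is what makes a 57-task evaluation practical.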
Diabetes was the notable exception. Here the low-level PPG model remained stronger, suggesting that some conditions leave their first footprints in waveform micro-shape rather than in day-scale behavior. That finding is encouraging rather than disappointing: it implies that the right strategy is fusion. The Watch already holds continuous heart-rate curves and daily summaries; passing both through complementary models delivers a richer early-warning signal than either alone.
The check-engine light: why this matters beyond Apple
The study hints at the check-engine light, but WBM is a research prototype and carries caveats. Its ground-truth labels lean on user surveys and app logs, which can be subjective and were collected unevenly across the cohort. Pregnancy was defined as the nine months preceding a birth recorded in HealthKit; infection relied on self-reported symptoms. Those proxies are good enough for academic metrics, yet they fall short of the clinical ground truth that regulators call “fit‑for‑purpose.” Before any watch lights a real check-engine lamp, it must be calibrated against electronic health-record diagnoses, lab tests, maybe even physician adjudication. Likewise, the Apple study cohort tilts young, U.S.-based, and tech-forward, the very people who volunteer for an app-based research registry. We do not yet know how the model fares for a seventy-year-old on multiple medications or for communities under-represented in the dataset.
Even with those limits, WBM pushes the industry forward for one simple reason: it proves that behavioral aggregates—the mundane counts and averages every wearable records—carry untapped clinical signal. Because step counts and sleep hours are universal across consumer wearables, this invites a larger ambition: a cross-platform health-embedding standard, a lingua franca into which any device can translate its metrics and from which any clinic can decode risk. If Apple’s scientists can infer pregnancy from Watch data, there is little reason a Whoop band, an Oura ring, or a Fitbit could not do the same once models are trained on diverse hardware.
The translational gap
Turning WBM‑style science into a shipping Apple Watch feature is less about GPUs and more about the maze that begins the moment an algorithm claims to recognize disease. Whoop learned that the hard way this month. Its “Blood‑Pressure Insights” card told members whether their estimated systolic and diastolic trend was drifting upward. Those trend arrows felt innocuous (just context, the company argued), yet the FDA sent a warning letter anyway, noting that anything purporting to detect or manage hypertension is a regulated medical device unless cleared. The agency’s logic was brutal in its simplicity: blood pressure has long been a clinical sign of disease, therefore software that assesses it needs the same evidentiary burden as a cuff. Consumers may have loved the insight; regulators hated the leap.
Apple faces a similar fork in the road if it ever wants the Watch to say, “Your metrics suggest early pregnancy” or “You’re trending toward an infection.”
The regulated‑product pathway treats each prediction as a stand‑alone medical claim. The company submits analytical and clinical evidence—typically through a 510(k) if a predicate exists or a De Novo petition if it does not. Because there is no earlier device that estimates pregnancy probability from wrist metrics or viral‑infection likelihood from activity patterns, every WBM condition would start life as a De Novo. Each requires gold‑standard comparators (hCG tests for pregnancy, PCR labs for infection, auscultatory cuffs for blood pressure), multi‑site generalizability data, and post‑market surveillance. For a single target such as atrial fibrillation that process is feasible; for fifty targets it becomes an assembly line of studies, submissions and annual reports—slow, expensive, and unlikely to keep pace with the model’s evolution. (The predicate gap may narrow if FDA finalizes its draft guidance on “AI/ML Predetermined Change Control Plans,” but no such framework exists today.)
The clinical‑decision‑support (CDS) pathway keeps the algorithm inside the chart, not on the wrist. WBM generates a structured risk score that lands in the electronic health record; a nurse or physician reviews the underlying metrics and follows a protocol. Because the human can understand the basis for the suggestion and retains authority, FDA does not classify the software as a device. What looks like a regulatory compromise could, in practice, be a pragmatic launch pad. Health‑system command centers already triage hundreds of thousands of automated vitals by routing most through pre‑approved branching pathways. A WBM score could enter the same stream with a playbook that says, in effect, “probability > 0.80, send flu self‑test; probability > 0.95, escalate to telehealth call.”
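Such a playbook reduces to a few lines of branching logic. The thresholds below are the hypothetical ones from the paragraph above, not a clinically validated protocol:

```python
# Illustrative triage routing for a CDS command center, using the
# hypothetical cutoffs mentioned in the text (not a validated protocol).
def route_flu_risk(probability: float) -> str:
    if probability > 0.95:
        return "escalate_telehealth"  # route to a clinician call
    if probability > 0.80:
        return "send_self_test"       # mail a flu self-test kit
    return "monitor"                  # no action; keep watching the trend

print(route_flu_risk(0.85))
```

The clinical substance lives in validating those cutoffs and the human review around them; the software itself is deliberately boring, which is exactly what keeps it on the CDS side of the line.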
Building evidence inside a CDS loop—richer labels, stricter safeguards
A nurse clicking “yes, likely flu” is not the gold‑standard label FDA will accept in a De Novo file. At best it is the first hop in a chain that must culminate in an objective endpoint: a positive PCR, an ICD‑coded diagnosis, a prescription, an ED visit. A scalable CDS implementation therefore needs an automated reconciliation layer—a software platform that, over the ensuing week or month, checks the EHR and other databases for those confirmatory signals and links them back to the original wearable vector. With the right scaffolding, CDS becomes a scalable pathway for generating clinically verified labels and evidence that regulators may accept in place of more traditional standalone clinical studies.
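A minimal sketch of such a reconciliation layer, assuming a simple follow-up window and illustrative patient IDs and event types:

```python
from datetime import date, timedelta

# Hypothetical reconciliation layer: link each wearable alert to a
# confirmatory EHR event (e.g. positive PCR, ICD-coded diagnosis)
# observed within a follow-up window, yielding an objective label.
WINDOW = timedelta(days=30)

alerts = [  # (patient_id, alert_date) emitted by the model
    ("p1", date(2025, 1, 3)),
    ("p2", date(2025, 1, 5)),
]
ehr_events = [  # (patient_id, event_date, kind) found in the record
    ("p1", date(2025, 1, 7), "pcr_positive"),
    ("p2", date(2025, 4, 1), "icd_flu"),  # outside the window; does not count
]

def reconcile(alerts, events, window=WINDOW):
    """Return {(patient_id, alert_date): 1 if confirmed within window else 0}."""
    labels = {}
    for pid, alert_day in alerts:
        confirmed = any(
            eid == pid and alert_day <= day <= alert_day + window
            for eid, day, _kind in events
        )
        labels[(pid, alert_day)] = int(confirmed)
    return labels

labels = reconcile(alerts, ehr_events)
print(labels)
```

A production system would of course run against live EHR feeds with identity resolution and audit trails; the sketch only shows the shape of the join that turns nurse clicks into regulator-grade labels.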
MAHA horizons
If the MAHA initiative succeeds in driving greater adoption of wearables and brokering de‑identified links to national EHR networks, the reconciliation layer sketched above could operate at population scale from day one (as long as consumers consent to sharing and linking their wearable data with their medical records). In that world an Apple, Fitbit, or Whoop would not merely ship a consumer feature; they would participate in a nationwide learning system where every alert, every lab, every diagnosis sharpens the next iteration of a check‑engine light on every wrist.