Strategy in the Age of Infinite Slop

“AI is going to replace McKinsey.”

It’s a popular dunk on AI Twitter. The logic is seductive: if a model can generate a Porter’s Five Forces diagram and a perfectly serviceable deck in seconds, why pay millions for a team of human analysts to take six weeks?

I spent more than ten years at McKinsey working on the exact problems assumed to be next on the chopping block: the ultimate open-ended questions of “where to play” and “how to win.” But looking at those problems through the lens of Andrej Karpathy’s concept of “verifiability,” I’ve come to the opposite conclusion: the closer you get to real strategy, the harder it is for AI to replace it.

The closer you get to real strategy, the less it resembles the tasks AI is good at.

Karpathy’s “Software 2.0” thesis is simple: AI mastery relies on a loop. Attempt a task, get a score, reset, repeat. If you can verify the outcome cheaply (Did the code compile? Did the math hold? Did you win the game?), the model can practice its way to superhuman performance.

This explains why AI is crushing coding and math. These are “high verifiability” domains. The reward signal is crisp, binary, and instant.

Corporate strategy lives at the opposite extreme.

As a strategy consultant, when you advise a client to enter China, divest a legacy unit, or sell the company, you don’t get a clean error message immediately. You get a noisy stream of signals over five years. A competitor takes share with a new product. The macro environment shifts. A new CEO gets hired.

You cannot reset the world. You cannot run the A/B test. There is only one realized future and a graveyard of unknowable counterfactuals. And from the perspective of an AI training loop, that means the “reward signal” for any one decision is sparse, delayed, and hopelessly entangled with everything else. The pattern recognition for “good strategy” develops slowly, over years of studies and their outcomes.

So, is AI useless in the boardroom?

Absolutely not. While AI cannot verify a strategy, it is unparalleled at generating the raw material for one.

Strategy is fundamentally a game of connecting dots across a massive, messy board. It requires looking at a mountain of proprietary data, market reports, and competitive intelligence, and spotting the pattern that others miss.

This is where modern LLMs shine. They act as a force multiplier for reasoning by analogy. A partner can ask a model to look at a B2B logistics problem and apply the “physics” of a consumer marketplace, or to search for historical parallels for the AI infrastructure buildout in 19th-century rail monopolies.

In this phase, the AI is not an oracle; it is a Disciplined Hallucinator. It provides the expanse. It widens the aperture from three conventional options to twenty wild ones. It does the “grinder” work of synthesis that used to burn out armies of business analysts. A lot of those options will be wrong, implausible, or “slop” in the eyes of critics, but in strategy, exploring wrong futures is often how you discover the few worth betting on.

But options are not decisions.

There is a distinct limit to how far this can go. As AI researchers like Yann LeCun argue, current LLMs are not “World Models.” They predict the next token in a sequence; they do not understand the underlying causal physics of reality. They cannot reason about cause and effect in a chaotic environment because they have no internal representation of how the world actually works.

They can simulate the text of a strategy, but they cannot simulate the reality of its execution.

This means the “Silicon Partner” isn’t arriving anytime soon. Until AI creates a true internal model of the world—one that understands human psychology, political friction, and temporal consequences—it remains a statistical engine, not a strategic one in the strong sense.

The Shift: From Processing to Judgment

As AI automates the verifiable layer of intelligence—the analysis, the synthesis, the slide-making—the value of the remaining bottleneck skyrockets.

That bottleneck is Judgment.

Judgment is the ability to look at the AI’s twenty generated options and intuitively know which three will survive contact with reality. It is the ability to stare down an irreversible decision where the “right” answer is mathematically unknowable—and act anyway.

We aren’t paying consultants to process information anymore. We are paying them to use these new instruments to hallucinate a better future and to have the courage to speak truth to power and own the risk of being wrong.

Health Privacy in the AI Era

Sam Altman hardly ever breaks stride when he talks about ChatGPT, yet in a recent podcast he paused to deliver a blunt warning, which caused my ears to perk up. A therapist might promise that what you confess stays in the room, Sam said, but an AI chatbot cannot, at least not based on the current legal framework. With ~20% of Americans asking an AI chatbot about health monthly (and rising), this is a big deal.

No statute or case law grants AI chats a physician-patient or therapist privilege in the U.S., so a court can compel OpenAI to disclose stored transcripts. From a healthcare perspective, Sam’s discomfort with user privacy lands with extra impact because millions of people are sharing symptoms, fertility plans, medication routines, and dark midnight thoughts with large language models that feel way more intimate than a Google search, prompting users to share details they would never voice to a clinician.

Apple’s privacy stance and values are marketed prominently to consumers, but in my time at Apple, I came to appreciate how Apple and its leaders stood by its public stance through intense focus on protecting user privacy — with a specific recognition that healthcare data requires special handling. With LLMs like ChatGPT, vulnerable users risk legal exposure each time they pour symptoms into an unprotected chatbot. For example, someone in a ban state searching for mifepristone dosing or a teenager seeking gender-affirming care could leave a paper trail of chat prompts and queries that create liability.

The free consumer ChatGPT operates much like “Dr. Google” today in the health context. Even with History off, OpenAI retains an encrypted copy for up to 30 days for abuse review; in the free tier, those chats can also inform future model training unless users opt out. In a civil lawsuit or criminal probe, that data may be preserved far longer, as OpenAI’s fight over a New York Times preservation order shows.

The enterprise version of OpenAI’s service is more reassuring and points toward a more privacy-friendly approach for patients. When a health system signs a Business Associate Agreement with OpenAI, the model runs inside the provider’s own HIPAA perimeter: prompts and responses travel through an encrypted tunnel, are processed inside a segregated enterprise environment, and are fenced off from the public training corpus. Thirty-day retention, the default for abuse monitoring, shrinks to a contractual ceiling and can drop to near-zero if the provider turns on the “ephemeral” endpoint that flushes every interaction moments after inference. Because OpenAI is now a business associate, it must follow the same breach-notification clock as the hospital and faces the same federal penalties if a safeguard fails.

In practical terms the patient gains three advantages. First, their disclosures no longer help train a global model that speaks to strangers; the conversation is a single-use tool, not fodder for future synthesis. Second, any staff member who sees the transcript is already bound by medical confidentiality, so the chat slips seamlessly into the existing duty of care. Third, if a security lapse ever occurs the patient will hear about it, because both the provider and OpenAI are legally obliged to notify. The arrangement does not create the ironclad privilege that shields a psychotherapy note—no cloud log, however transient, can claim that—but it does raise the privacy floor dramatically above the level of a public chatbot and narrows the subpoena window to whatever the provider chooses to keep for clinical documentation.

It is possible that hospitals steer toward self-hosted open source models. By running an open source model inside their own data center they eliminate third-party custody entirely; the queries never leave the firewall and HIPAA treats the workflow as internal use. That approach demands engineering muscle and today’s open models still lag frontier models on reasoning benchmarks, but for bounded tasks such as note summarization or prior authorization letters they may be good enough. Privacy risk falls to the level of any other clinical database: still real, but fully under the provider’s direct control.

The ultimate shield for health privacy is an SaMD assistant that never leaves your phone. Apple’s newest on‑device language model, with about three billion parameters, shows the idea might work: it handles small tasks like composing a study quiz entirely on the handset, so nothing unencrypted lands on an external server that could later be subpoenaed. The catch is scale. Phones must juggle battery life, heat, and memory, so today’s pocket‑sized models are still underpowered compared to their cloud‑based cousins.

Over the next few product cycles two changes should narrow that gap. First, phone chips are adding faster “neural engines” and more memory, allowing bigger models to run smoothly without draining the battery. Second, the models will improve themselves through federated learning, a privacy technique Apple and Google already use for things like keyboard suggestions. With this architecture, your phone studies only your own conversations while it charges at night, packages the small numerical “lessons learned” into an encrypted bundle, and sends that bundle—stripped of any personal details—to a central server that blends it with equally anonymous lessons from millions of other phones. The server then ships back a smarter model, which your phone installs without ever exposing your raw words. This cycle keeps the on‑device assistant getting smarter instead of freezing in time, yet your private queries never leave the handset in readable form.
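
To make the federated cycle above concrete, here is a minimal sketch of federated averaging on a toy linear model. It is an illustration under stated assumptions, not a description of how Apple or Google implement it: the function names `local_update` and `federated_average`, the learning rate, and the linear model are all hypothetical, and real deployments typically add protections such as secure aggregation and added noise that are omitted here.

```python
from typing import List, Tuple

Weights = List[float]
Example = Tuple[List[float], float]  # (features, target)


def local_update(global_weights: Weights,
                 examples: List[Example],
                 lr: float = 0.1) -> Tuple[Weights, int]:
    """Train on-device and return only the weight delta plus an example count.

    The raw examples never leave this function; only the numerical
    "lessons learned" (the delta) are handed to the server.
    """
    w = list(global_weights)
    for x, y in examples:
        pred = sum(wi * xi for wi, xi in zip(w, x))
        err = pred - y
        # One stochastic-gradient step for squared error on a linear model.
        w = [wi - lr * err * xi for wi, xi in zip(w, x)]
    delta = [wi - gi for wi, gi in zip(w, global_weights)]
    return delta, len(examples)


def federated_average(global_weights: Weights,
                      updates: List[Tuple[Weights, int]]) -> Weights:
    """Blend device deltas into the global model, weighted by example count."""
    total = sum(n for _, n in updates)
    if total == 0:
        return list(global_weights)
    new_weights = list(global_weights)
    for delta, n in updates:
        for i, d in enumerate(delta):
            new_weights[i] += d * n / total
    return new_weights


if __name__ == "__main__":
    global_model = [0.0, 0.0]
    phone_a = local_update(global_model, [([1.0, 0.0], 1.0)])
    phone_b = local_update(global_model, [([0.0, 1.0], 2.0), ([0.0, 1.0], 2.0)])
    print(federated_average(global_model, [phone_a, phone_b]))
```

The server only ever sees the deltas and counts, which is the property the paragraph above relies on. Whether a delta can leak information about the underlying conversations is a separate question this sketch does not address.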

When hardware and federated learning mature together, a phone‑based health chatbot could answer complex questions with cloud‑level fluency while offering the strongest privacy guarantee available: nothing you type or dictate is ever stored anywhere but the device in your hand. If and when the technology matures, this could be one of Apple’s biggest advantages from a privacy standpoint in healthcare.

For decades “Dr. Google” meant we bartered privacy in exchange for convenience. Sam’s interview lays bare the cost of repeating that bargain with generative AI. Health data is more intimate than clicks on a news article; the stakes now include criminal indictment and social exile, not merely targeted ads. Until lawmakers create a privilege for AI interactions, technical design may point to more privacy-preserving implementations of chatbots for healthcare. Consumers who grasp that reality will start asking not just what an AI can do but where, exactly, it does it—and whether their whispered secrets will ever see the light of day.

What I’ve Learned About LLMs in Healthcare (so far)

It has been a breathless time in technology since the GPT-3 moment, and I’m not sure I have experienced greater discordance between the hype and reality than right now, at least as it relates to healthcare. To be sure, I have caught myself agape in awe at what LLMs seem capable of, but in the last year, it has become ever more clear to me what the limitations are today and how far away we are from all “white collar jobs” in healthcare going away.

Microsoft made an impressive announcement last week, The Path to Medical Super-Intelligence, claiming that its AI Diagnostic Orchestrator (MAI-DxO) correctly diagnosed up to 85% of NEJM case proceedings, a rate more than four times higher than a group of experienced physicians (~20% accuracy). While this is an interesting headline result, I think we are still far from “medical superintelligence”, and in some ways, we underestimate what human intelligence is good at, particularly in the healthcare context.

Beyond potential issues of benchmark contamination, the data for Microsoft’s evaluation of its orchestrator agent is based on NEJM case records that are highly curated, teaching narrative summaries. Compare that to a real hospital chart: a decade of encounters scattered across medication tables, flowsheets, radiology blobs, scanned faxes, and free-text notes written in three different EHR versions. In that environment, LLMs lose track of units, invent past medical history, and offer confident plans that collapse under audit. Two Epic pilot reports—one from Children’s Hospital of Philadelphia, the other from a hospital in Belgium—show precisely this gap and shortcoming with LLMs. Both projects needed dozens of bespoke data pipelines just to assemble a usable prompt, and both catalogued hallucinations whenever a single field went missing.

The disparity is unavoidable: artificial general intelligence measured on sanitized inputs is not yet proof of medical superintelligence. The missing ingredient is not reasoning power; it is reliable, coherent context.


Messy data still beats massive models in healthcare

Transformer models process text through a fixed-size context window, and they allocate relevance by self-attention, the internal mechanism that decides which tokens to “look at” when generating the next token. GPT-3 gave us roughly two thousand tokens; GPT-4 stretches to thirty-two thousand; experimental systems boast six-figure limits. That may sound limitless, yet the engineering reality is stark: packing an entire EHR extract or a hundred-page protocol into a prompt does not guarantee an accurate answer. Empirical work, including Nelson Liu’s “lost-in-the-middle” study, shows that as the window expands, the model’s self-attention diffuses. With every additional token, attention weight is spread thinner, and the relevant tokens compete with a larger field of irrelevant noise. Beyond a certain length the network begins to privilege recency and surface phrase salience, systematically overlooking material introduced many thousands of tokens earlier.
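
The dilution argument can be illustrated with a toy softmax calculation. This is a sketch under strong simplifying assumptions, not a model of any real transformer: it considers a single attention head, one relevant token with a fixed score advantage, and distractor tokens that all score the same. The function name and the default scores are hypothetical.

```python
import math


def attention_weight_on_relevant(context_len: int,
                                 relevant_score: float = 5.0,
                                 noise_score: float = 0.0) -> float:
    """Softmax weight given to one relevant token among context_len tokens.

    Every other token is treated as a distractor with the same score, so the
    weight on the relevant token is exp(s_rel) / (exp(s_rel) + (n - 1) * exp(s_noise)).
    """
    if context_len < 1:
        raise ValueError("context_len must be at least 1")
    relevant = math.exp(relevant_score)
    noise = (context_len - 1) * math.exp(noise_score)
    return relevant / (relevant + noise)


if __name__ == "__main__":
    # Context sizes loosely matching the window sizes mentioned in the text.
    for n in (2_000, 32_000, 100_000):
        print(f"{n:>7} tokens -> weight {attention_weight_on_relevant(n):.5f}")
```

With a fixed score advantage, the weight on the relevant token falls roughly in proportion to the number of distractors. Real models learn their scores, so the effect is less mechanical than this, but the normalization pressure is the same.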

In practical terms, that means a sodium of 128 mmol/L taken yesterday and a potassium of 2.9 mmol/L drawn later that same shift can coexist in the prompt, yet the model cites only the sodium while pronouncing electrolytes “normal.” It is not malicious; its attention budget is already diluted across thousands of tokens, leaving too little weight to align those two sparsely related facts. The same dilution bleeds into coherence: an LLM generates output one token at a time, with no true long-term state beyond the prompt it was handed. As the conversation or document grows, internal history becomes approximate. Contradictions creep in, and the model can lose track of its own earlier statements.

Starved of a decisive piece of context—or overwhelmed by too much—today’s models do what they are trained to do: they fill gaps with plausible sequences learned from Internet-scale data. Hallucination is therefore not an anomaly but a statistical default in the face of ambiguity. When that ambiguity is clinical, the stakes escalate. Fabricating an ICD-10 code or mis-assigning a trial-eligibility criterion isn’t a grammar mistake; it propagates downstream into safety events and protocol deviations.

Even state-of-the-art models fall short on domain depth. Unless they are tuned on biomedical corpora, they handle passages like “eGFR < 30 mL/min/1.73 m² at baseline” as opaque jargon, not as a hard stop for nephrotoxic therapy. Clinicians rely on long-tail vocabulary, nested negations, and implicit timelines (“no steroid in the last six weeks”) that a general-purpose language model never learned to weight correctly. When the vocabulary set is larger than the context window can hold (think ICD-10 or SNOMED lists), developers resort to partial look-ups, which in turn bias the generation toward whichever subset made it into the prompt.

Finally, there is the optimization bias introduced by reinforcement learning from human feedback. Models rewarded for sounding confident eventually prioritize tone that sounds authoritative even when confidence should be low. In an overloaded prompt with uneven coverage, the safest behavior would be to ask for clarification. The objective function, however, nudges the network to deliver a fluent answer, even if that means guessing. In production logs from the CHOP pilot you can watch the pattern: the system misreads a missing LOINC code as “value unknown” and still generates a therapeutic recommendation that passes a surface plausibility check until a human spots the inconsistency.

All of these shortcomings collide with healthcare’s data realities. An encounter-centric EHR traps labs in one schema and historical notes in another; PDFs of external reports bypass structured capture entirely. Latency pressures push architects toward caching, so the LLM often reasons on yesterday’s snapshot while the patient’s creatinine is climbing. Strict output schemas such as FHIR or USDM leave zero room for approximation, magnifying any upstream omission. The outcome is predictable: transformer scale alone cannot rescue performance when the context is fragmented, stale, or under-specified. Before “superintelligent” agents can be trusted, the raw inputs have to be re-engineered into something the model can actually parse—and refuse when it cannot.


Context engineering is the job in healthcare

Andrej Karpathy really nailed it here:

Context engineering answers one question: How do we guarantee the model sees exactly the data it needs, in a form it can digest, at the moment it’s asked to reason?

In healthcare, I believe that context engineering will require three moves to align the data to ever-more sophisticated models.

First, selective retrieval. We replace “dump the chart” with a targeted query layer. A lipid-panel request surfaces only the last three LDL, HDL, total-cholesterol observations—each with value, unit, reference range, and draw time. CHOP’s QA logs showed a near-50 percent drop in hallucinated values the moment they switched from bulk export to this precision pull.
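
A targeted query layer of this kind might look like the following minimal sketch. Everything here is illustrative: the `Observation` record, the short codes in `LIPID_CODES` (stand-ins, not real LOINC codes), and `select_lipid_context` are hypothetical names, and this is not a description of CHOP's actual pipeline.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List


@dataclass(frozen=True)
class Observation:
    code: str             # hypothetical short code such as "LDL"
    value: float
    unit: str
    reference_range: str
    drawn_at: datetime


# Assumption: the three analytes a lipid-panel request cares about.
LIPID_CODES = ("LDL", "HDL", "TOTAL_CHOL")


def select_lipid_context(chart: List[Observation],
                         per_code: int = 3) -> Dict[str, List[Observation]]:
    """Return only the most recent `per_code` results for each lipid code.

    Everything else in the chart is deliberately left out of the prompt.
    """
    selected: Dict[str, List[Observation]] = {}
    for code in LIPID_CODES:
        matches = [obs for obs in chart if obs.code == code]
        matches.sort(key=lambda obs: obs.drawn_at, reverse=True)  # newest first
        selected[code] = matches[:per_code]
    return selected
```

An empty list for a code is itself useful information: the prompt builder can state that no result exists rather than leaving the model to guess.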

Second, hierarchical summarization. Small, domain-tuned models condense labs, meds, vitals, imaging, and unstructured notes into crisp abstracts. The large model reasons over those digests, not 50,000 raw tokens. Token budgets shrink, latency falls, and Liu’s “lost-in-the-middle” failure goes quiet because the middle has been compressed away.
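
The two-level shape of this step can be sketched as below. The summarizer is a deliberately crude placeholder: `truncate_summarizer` simply keeps the first few words and stands in for whatever small domain-tuned model would be used in practice. All names are hypothetical.

```python
from typing import Callable, Dict, List

# A summarizer takes a category name and its raw records, and returns one digest line.
Summarizer = Callable[[str, List[str]], str]


def truncate_summarizer(category: str, records: List[str], max_words: int = 12) -> str:
    """Placeholder for a small domain-tuned model: keeps only the first few words."""
    words = " ".join(records).split()
    return f"{category}: " + " ".join(words[:max_words])


def build_digest(chart_by_category: Dict[str, List[str]],
                 summarize: Summarizer = truncate_summarizer) -> str:
    """Condense each non-empty category separately, then join the digests.

    The large model would receive this digest instead of the raw records.
    """
    sections = [
        summarize(category, records)
        for category, records in chart_by_category.items()
        if records
    ]
    return "\n".join(sections)
```

Passing the summarizer in as a parameter keeps the structure testable without a model, and lets each category use a different summarizer later if needed.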

Third, schema-aware validation—and enforced humility. Every JSON bundle travels through the same validator a human would run. Malformed output fails fast. Missing context triggers an explicit refusal.
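
As a minimal sketch of the fail-fast and refusal behavior, assuming a simplified observation payload rather than a real FHIR validator: the field list in `REQUIRED_FIELDS`, the `ContextRefusal` exception, and `validate_observation` are hypothetical names chosen for illustration.

```python
import json

# Assumption: a simplified schema, far smaller than a real FHIR Observation.
REQUIRED_FIELDS = ("code", "value", "unit", "effective_time")


class ContextRefusal(Exception):
    """Raised when the payload is well-formed but lacks context needed to proceed."""


def validate_observation(raw: str) -> dict:
    """Parse and check one model-produced JSON object.

    Malformed output fails fast with ValueError; missing context triggers
    an explicit refusal instead of a guess.
    """
    try:
        obs = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"malformed output: {exc}") from exc
    if not isinstance(obs, dict):
        raise ValueError("malformed output: expected a JSON object")
    missing = [field for field in REQUIRED_FIELDS if obs.get(field) in (None, "")]
    if missing:
        raise ContextRefusal("refusing to proceed; missing: " + ", ".join(missing))
    return obs
```

Keeping the two failure types distinct lets the caller treat them differently: a malformed payload can be retried, while a refusal should be surfaced to a human.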


AI agents in healthcare up the stakes for context

The next generation of clinical applications will not be chatbots that answer a single prompt and hand control back to a human. They are agents—autonomous processes that chain together retrieval, reasoning, and structured actions. A typical pipeline begins by gathering data from the EHR, continues by invoking clinical rules or statistical models, and ends by writing back orders, tasks, or alerts. Every link in that chain inherits the assumptions of the link before it, so any gap or distortion in the initial context is propagated—often magnified—through every downstream step.

Consider what must be true before an agent can issue something as simple as an early-warning alert:

  • All source data required by the scoring algorithm—vital signs, laboratory values, nursing assessments—has to be present, typed, and time-stamped. Missing a single valueQuantity.unit or ingesting duplicate observations with mismatched timestamps silently corrupts the score.
  • The retrieval layer must reconcile competing records. EHRs often contain overlapping vitals from bedside monitors and manual entry; the agent needs deterministic fusion logic to decide which reading is authoritative, otherwise it optimizes on the wrong baseline.
  • Every intermediate calculation must preserve provenance. If the agent writes a structured CommunicationRequest back to the chart, each field should carry a pointer to its source FHIR resource, so a clinician can audit the derivation path in one click.
  • Freshness guarantees matter as much as completeness. The agent must either block on new data that is still in transit (for example, a troponin that posts every sixty minutes) or explicitly tag the alert with a “last-updated” horizon. A stale snapshot that looks authoritative is more dangerous than no alert at all.
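
The four preconditions above can be sketched as a single gate that runs before any score is computed. This is a hedged illustration, not a clinical implementation: the `Reading` record, the required vitals in `REQUIRED_KINDS`, the fusion rule (newest reading wins, bedside monitor breaks ties), and the sixty-minute `MAX_AGE` are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Reading:
    resource_id: str        # pointer to the source record, e.g. "Observation/1"
    kind: str               # e.g. "heart_rate"
    value: float
    unit: Optional[str]
    taken_at: datetime
    source: str             # "monitor" or "manual"


REQUIRED_KINDS = ("heart_rate", "resp_rate", "systolic_bp")  # assumed score inputs
SOURCE_PRIORITY = {"monitor": 0, "manual": 1}                # assumed tie-break order
MAX_AGE = timedelta(minutes=60)                              # assumed freshness limit


def _rank(r: Reading):
    # Newest timestamp wins; on a tie the lower priority number wins.
    return (r.taken_at, -SOURCE_PRIORITY.get(r.source, 99))


def prepare_alert_inputs(readings: List[Reading], now: datetime) -> Dict:
    """Return scoring inputs with provenance, or an explicit refusal."""
    chosen: Dict[str, Reading] = {}
    for r in readings:
        if r.kind not in REQUIRED_KINDS:
            continue
        if not r.unit:
            return {"status": "refused", "reason": f"{r.resource_id} has no unit"}
        best = chosen.get(r.kind)
        if best is None or _rank(r) > _rank(best):  # deterministic fusion
            chosen[r.kind] = r
    missing = [k for k in REQUIRED_KINDS if k not in chosen]
    if missing:
        return {"status": "refused", "reason": "missing: " + ", ".join(missing)}
    stale = [k for k, r in chosen.items() if now - r.taken_at > MAX_AGE]
    if stale:
        return {"status": "refused", "reason": "stale: " + ", ".join(stale)}
    return {
        "status": "ready",
        "inputs": {k: r.value for k, r in chosen.items()},
        "provenance": {k: r.resource_id for k, r in chosen.items()},
        "last_updated": min(r.taken_at for r in chosen.values()),
    }
```

The `last_updated` field is the oldest timestamp among the chosen inputs, so a downstream alert can state how far back its view of the patient reaches.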

When those contracts are enforced, the agent behaves like a cautious junior resident: it refuses to proceed when context is incomplete, cites its sources, and surfaces uncertainty in plain text. When any layer is skipped—when retrieval is lossy, fusion is heuristic, or validation is lenient—the agent becomes an automated error amplifier. The resulting output can be fluent, neatly formatted, even schema-valid, yet still wrong in a way that only reveals itself once it has touched scheduling queues, nursing workflows, or medication orders.

This sensitivity to upstream fidelity is why context engineering is not a peripheral optimization but the gating factor for autonomous triage, care-gap closure, protocol digitization, and every other agentic use case to come. Retrieval contracts, freshness SLAs, schema-aware decoders, provenance tags, and calibrated uncertainty heads are the software equivalents of sterile technique; without them, scaling the “intelligence” layer merely accelerates the rate at which bad context turns into bad decisions.


Humans still have a lot to teach machines

While AI can be brilliant for some use cases, in healthcare so far, large language models still seem like brilliant interns: tireless, fluent, occasionally dazzling, and constitutionally incapable of running the project alone. A clinician opens a chart and, in seconds, spots that an ostensibly “normal” electrolyte panel hides a potassium of 2.8 mmol/L. A protocol digitizer reviewing a 100-page oncology protocol instinctively flags that the run-in period must precede randomization, even though the document buries the detail in an appendix.

These behaviors look mundane until you watch a vanilla transformer miss every one of them. Current models do not plan hierarchically, do not wield external tools unless you bolt them on, and do not admit confusion; they keep generating tokens until they hit a stop token or a length limit. Until we see another major AI innovation like the transformer models themselves, healthcare needs a viable scaffolding that lets an agentic pipeline inherit the basic safety reflexes clinicians exercise every day.

That is not a defeatist conclusion; it is a roadmap. Give the model pipelines that keep the record complete, current, traceable, schema-tight, and honest about uncertainty, and its raw reasoning becomes both spectacular and safe. Skip those safeguards and even a model with a 100k-token window will still hallucinate a drug dose out of thin air. When those infrastructures become first-class, “superintelligence” will finally have something solid to stand on.