Friday, April 17, 2026

The Architecture of Accuracy: Why Parallel Task Decomposition is the New Gold Standard for Clinical AI

In the high-stakes corridors of modern healthcare—from the private suites of Mount Elizabeth to the bustling wards of Singapore General Hospital—time is the only currency that truly matters. This briefing explores a paradigm shift in clinical AI: moving away from slow, iterative correction loops toward a parallel, decomposed architecture. By narrowing the cognitive context of individual AI agents, developers are achieving a 5x increase in speed while simultaneously hardening clinical accuracy. It is a lesson in "context engineering" that suggests the best way to fix a complex AI problem is not to ask the model to try harder, but to ask it to do less.


A walk through the Singapore Central Business District at 7:00 PM reveals a telling tableau. Look through the glass of the multi-disciplinary clinics in Raffles Place, and you will see clinicians hunched over glowing screens, long after the last patient has departed. This is "pajama time"—the administrative debt of a day spent in service to others. In the Singaporean context, where the Ministry of Health is aggressively pushing the Smart Nation initiative to combat an ageing population and a tightening healthcare labour market, the "pajama time" problem isn't just a nuisance; it is a systemic risk to the sustainability of the national healthcare model.

Artificial Intelligence was promised as the antidote to this administrative malaise. Yet, for many early adopters, the cure has proven nearly as taxing as the disease. When a clinical AI system fabricates a medication dosage or misattributes a patient’s symptom, the cost is measured in the most precious resource a doctor has: cognitive bandwidth. If a physician must audit an AI-generated note line-by-line, the system has failed. Speed is a requirement, but accuracy is the non-negotiable floor.

The industry is currently witnessing a silent revolution in how these systems are built. The "monolithic" approach—one giant model processing one giant transcript—is being dismantled in favour of a sophisticated, modular architecture. It is a shift from brute force to elegant decomposition.

The Fallacy of the Refinement Loop

For the past two years, the standard "best practice" for high-accuracy AI followed a predictable pattern: Generate, Judge, and Refine. You ask a frontier model to write a clinical note, you have a second model (the "Judge") check it for errors, and if it finds any, a third pass (the "Refinement") attempts a fix.

On paper, this sounds like a robust safety net. In practice, it is a serial bottleneck that often compounds the very errors it seeks to erase.
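To make the serial bottleneck concrete, here is a minimal sketch of the Generate-Judge-Refine pattern. The three functions are hypothetical stand-ins for real model calls (the post does not name an API); the point is the structure: each pass depends on the previous one, so latency accumulates linearly.

```python
# Hypothetical stand-ins for real model calls; in production each of
# these would be a 10-15 second LLM request.
def generate_note(transcript):
    # First pass: draft the full clinical note.
    return {"text": f"note for: {transcript}", "violations": 2}

def judge_note(note):
    # The "Judge" flags rule violations it finds in the draft.
    return note["violations"]

def refine_note(note):
    # The "Refinement" pass fixes some violations, but it runs
    # serially and can introduce new errors of its own.
    return {"text": note["text"] + " (refined)", "violations": note["violations"] - 1}

def generate_judge_refine(transcript, max_passes=3):
    passes = 0
    note = generate_note(transcript)
    while judge_note(note) > 0 and passes < max_passes:
        note = refine_note(note)  # serial dependency: cannot parallelise
        passes += 1
    return note, passes

note, passes = generate_judge_refine("patient reports chest pain")
print(passes)  # each pass would add 10-15s of real wall-clock time
```

Because the loop cannot begin a correction until the Judge has finished reading the whole note, every extra cycle lands directly on the clinician's wait time.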

Our internal data revealed a sobering reality. In a controlled ablation study, we removed this iterative loop entirely. Predictably, quality dropped by 11% across clinical dimensions, and clinical safety scores in particular plummeted. The loop was clearly "working," but at a devastating cost: each correction cycle added 10 to 15 seconds of serial computation, so a note needing several cycles could take a minute to appear. In a high-volume clinic in Jurong or Bedok, where a GP might see forty patients a day, a 60-second delay for every note is an eternity.

More concerning was the "refinement tax." We analysed fifty production traces and found that while the refinement agent resolved roughly 45% of identified violations, it introduced new errors at nearly the same rate. A missing ICD-10 code might get fixed, but in the process, the model might hallucinate a dosage for a medication the patient never mentioned. Only 8% of traces were fully resolved after refinement; 39% showed no improvement at all.

Asking a model to correct its own complex output is like asking a tired intern to proofread their own thesis at 3:00 AM—they are likely to miss the same nuances they missed the first time. The problem wasn't the quality of the "Judge"; it was the fact that the first pass was fundamentally unreliable.

The Cognitive Load of the Monolithic Agent

Why is the first pass so often weak? To understand this, we must look at the "attention" mechanisms of Large Language Models (LLMs) through a design lens.

When a single agent generates a comprehensive clinical note, it isn't performing one task. It is performing at least six competing cognitive tasks simultaneously:

  1. Diarisation & Parsing: Identifying who spoke (doctor vs. patient) in a messy, unstructured transcript.

  2. Information Routing: Deciding if a mention of "chest pain" belongs in the History of Present Illness (HPI) or the Review of Systems (ROS).

  3. Rule Application: Remembering that an orthopaedic note requires specific surgical site details that a psychiatric note does not.

  4. Specialty Logic: Applying the nuanced documentation standards of, say, the Singapore Medical Council for specific procedures.

  5. Hallucination Suppression: Constantly cross-referencing every claim against the raw transcript.

  6. Schema Adherence: Ensuring the final output fits a rigid, multi-key JSON structure for EHR integration.

Recent research supports what we observed: instruction-following accuracy drops from 92% at 200 tokens to a dismal 60% when instructions reach 4,000 tokens (Gupta et al., 2025). Even the most advanced models struggle when asked to follow more than 500 simultaneous instructions.

In Singapore’s context, where bilingualism is common and patients often switch between English and "Singlish," the parsing task alone is significant. When you layer on the complex administrative requirements of our national healthtech infrastructure under the Integrated Health Information Systems (now Synapxe), the monolithic model simply "redlines." It is overloaded.

Task Decomposition: The "Specialist" Architecture

The solution we pioneered is a move toward "Context Engineering." If context is a finite resource with diminishing marginal returns, the most effective lever is to narrow the window.

Instead of one agent writing fifteen sections of a note, we deploy a fleet of parallel "Specialist Agents." Each agent is assigned a radically narrower slice of the document. An HPI Agent sees only the HPI instructions and the relevant transcript sections. An Assessment & Plan Agent focuses solely on the clinical reasoning and coding.

What Changes in the Context?

By decomposing the task, we alter the fundamental physics of the model's attention:

  • Shared Context: Every agent receives the raw transcript and core safety rules.

  • Focused Context: Each agent receives only the instructions relevant to its specific section and a simplified output schema.
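The shared/focused split above can be sketched as a small context-assembly function. The section names, rules, and schemas here are illustrative assumptions, not the production prompt set; the point is that each agent's window contains the transcript plus only its own instructions.

```python
# Illustrative shared context: every specialist agent receives this.
SHARED_CONTEXT = {
    "transcript": "...raw consultation transcript...",
    "safety_rules": ["never invent dosages", "ground every claim in the transcript"],
}

# Illustrative per-section instructions and simplified schemas
# (hypothetical names, not the real prompt library).
SECTION_INSTRUCTIONS = {
    "hpi": "Summarise the history of present illness only.",
    "assessment_plan": "Produce the clinical reasoning and coding only.",
}

SECTION_SCHEMAS = {
    "hpi": {"type": "string"},
    "assessment_plan": {"type": "object", "keys": ["reasoning", "plan", "codes"]},
}

def build_context(section):
    """Assemble one specialist agent's context: the shared transcript and
    safety rules, plus only this section's instructions and schema."""
    return {
        **SHARED_CONTEXT,
        "instructions": SECTION_INSTRUCTIONS[section],
        "schema": SECTION_SCHEMAS[section],
    }

hpi_context = build_context("hpi")
print(sorted(hpi_context.keys()))
```

Note what the HPI agent never sees: the Assessment & Plan instructions, the coding rules, or the full fifteen-section schema. That narrowing is the whole mechanism.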

The results are transformative. By reducing the task complexity per agent by 5x to 7x, the "first pass" accuracy becomes reliable enough to make the iterative loop obsolete. In our production environment—which has now processed over 100,000 notes—latency dropped from a p50 of 37 seconds to just 7.5 seconds. For shorter consultations, it often falls under 5 seconds.

Crucially, this speed is a structural consequence of being accurate. Because the agents are focused, they don't hallucinate as often. Because they run in parallel, the total wall-clock time is only as long as the slowest individual agent (usually the complex Assessment & Plan section).
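The wall-clock claim is easy to demonstrate. The sketch below, with made-up per-agent latencies standing in for real model calls, shows that total elapsed time tracks the slowest agent rather than the sum of all agents.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-agent latencies in seconds; in production these would
# be real model calls, with Assessment & Plan typically the slowest.
AGENT_LATENCIES = {"hpi": 0.2, "ros": 0.1, "assessment_plan": 0.5}

def run_agent(section):
    time.sleep(AGENT_LATENCIES[section])  # stands in for a model call
    return section, f"draft {section} section"

start = time.perf_counter()
with ThreadPoolExecutor() as pool:
    # All specialist agents are dispatched at once.
    results = dict(pool.map(run_agent, AGENT_LATENCIES))
elapsed = time.perf_counter() - start

# Elapsed time is roughly max(latencies), not sum(latencies).
print(f"{elapsed:.2f}s for {len(results)} sections")
```

Here the sum of latencies is 0.8s but the run finishes in roughly 0.5s; scale those numbers up to real model calls and you recover the 37-second-to-7.5-second drop described above.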

The Singapore Lens: Efficiency as a National Imperative

For Singapore’s healthcare ecosystem, this architectural shift is timely. The nation is currently undergoing a "Healthier SG" transformation, moving toward preventive care and long-term population health management. This requires more documentation, not less.

If we rely on monolithic AI architectures that are slow and prone to "drift," we risk alienating the very clinicians we need to empower. A GP at a SingHealth polyclinic cannot wait 40 seconds for an AI to "think." They need the documentation to be there the moment they finish the consultation.

Furthermore, the decomposed architecture allows for better "Localisation." Singapore has unique clinical coding requirements and specific medication brand names (e.g., Panadol rather than Tylenol). In a decomposed system, we can update the "Medication Specialist Agent" with local pharmaceutical databases without needing to retrain or re-prompt the entire system. It makes the AI as adaptable as the city-state itself.

The Rise of Smaller, Greener Models

An unexpected benefit of decomposition is that it changes the economics of AI. In a monolithic system, you almost always need the largest, most expensive "frontier" models (like GPT-4o or Claude 3.5 Sonnet) to handle the complexity.

In a decomposed system, "Small Language Models" (SLMs) suddenly become viable. A highly fine-tuned 8B or 14B parameter model can perform focused extraction (like pulling out symptoms) with the same accuracy as a giant model, but at a fraction of the cost and energy. In a world increasingly concerned with the carbon footprint of data centres—a key consideration for Singapore’s Green Plan 2030—this "Small Agent" strategy is both economically and ethically superior.

Navigating the Trade-offs

This is not to say that decomposition is a silver bullet. It introduces new challenges that require a sophisticated engineering hand:

  1. The Token Tax: Running multiple agents means the same transcript is processed several times. This increases the total token count. However, since the calls are parallel and can often use cheaper models, the "latency-to-value" ratio remains heavily in favour of decomposition.

  2. Cross-Section Coherence: When agents work in isolation, there is a risk that the "History" section might slightly contradict the "Plan" section. We solve this with a single, lightweight "QA Agent" that reviews the final assembled note. It is one pass, not an iterative loop.

  3. Prompt Infrastructure: In this new world, prompts are not just text files; they are load-bearing clinical infrastructure. We have had to build versioned, composable prompt blocks that can be updated for different medical specialties (e.g., Cardiology vs. Paediatrics) instantly.
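One way to treat prompts as load-bearing infrastructure is to store them as pinned, composable blocks. This is a minimal sketch under assumed names (the registry keys and block texts are invented for illustration); the design point is that a cardiology update can never silently change the paediatrics prompt, because every agent composes explicit versions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PromptBlock:
    name: str
    version: str
    text: str

# Illustrative registry of versioned blocks; a real system would back
# this with source control or a prompt-management store.
REGISTRY = {
    ("safety_rules", "v2"): PromptBlock("safety_rules", "v2", "Never invent dosages."),
    ("hpi_core", "v1"): PromptBlock("hpi_core", "v1", "Summarise the HPI."),
    ("cardiology_addendum", "v3"): PromptBlock(
        "cardiology_addendum", "v3", "Document ejection fraction when stated."
    ),
}

def compose(*refs):
    """Assemble a specialist agent's prompt from explicitly pinned
    (name, version) block references."""
    return "\n\n".join(REGISTRY[ref].text for ref in refs)

cardiology_hpi_prompt = compose(
    ("safety_rules", "v2"), ("hpi_core", "v1"), ("cardiology_addendum", "v3")
)
print(cardiology_hpi_prompt.splitlines()[0])
```

Rolling out a new cardiology rule then means publishing `("cardiology_addendum", "v4")` and updating only the agents that opt into it, exactly as one would manage a versioned library dependency.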

Conclusion & Practical Takeaways

The transition from iterative correction to parallel decomposition marks the end of the "experimental" phase of clinical AI and the beginning of its "industrial" phase. We have learned that if you want a machine to be reliable, you must give it a clear, narrow field of vision.

For technology leaders and clinical directors in Singapore and beyond, the path forward is clear:

  • Audit Your Latency: If your AI takes more than 15 seconds to return a note, you likely have a serial dependency problem.

  • Decompose Complexity: Identify the most error-prone sections of your output and isolate them into dedicated specialist agents.

  • Focus on Context Engineering: Stop trying to find a "smarter" model; instead, find a way to give your current model a "smaller" task.

  • Invest in Prompt Versioning: Treat your agent instructions with the same rigour as your source code.

  • Prioritize First-Pass Accuracy: A system that requires a "Judge" to be safe is a system that isn't ready for the clinical front lines.

The future of the Smart Nation's healthcare doesn't depend on AI doing more; it depends on AI doing fewer things, much better.


Frequently Asked Questions

1. Does using multiple agents increase the cost of AI generation?

Yes, in terms of raw token usage, decomposition is more "expensive" because the transcript is processed by multiple agents. However, because these agents can often be smaller, more efficient models, and because the reduction in "pajama time" for clinicians provides a massive ROI, the net economic benefit is significantly positive.

2. How do you ensure the note sounds like it was written by one person if different agents write different parts?

We use a final "Stylistic Harmoniser" or a "QA Agent" that reviews the assembled document. This agent doesn't re-write the clinical facts but ensures consistent terminology and tone, maintaining the "voice" of the clinician while preserving the accuracy of the specialists.

3. Is this architecture only useful for medical documentation?

Not at all. The principle of "Context Engineering over Iteration" applies to any complex, structured generation task—such as legal contract review, multi-file code generation, or comprehensive financial reporting. Wherever an output has independently addressable sections and the input is long, decomposition is the superior strategy.
