Christopher Lovejoy introduces himself as a medical doctor turned AI engineer with experience building AI systems for healthcare.
He outlines his work at Anterior, a clinician-led company whose clinical reasoning tools automate and accelerate health insurance and healthcare administration, serving providers covering around 50 million lives in the US.
The focus is on building domain-native LLM applications, where a system that incorporates domain insights matters more than the sophistication of the underlying model or pipeline.
Applying large language models (LLMs) successfully in specialized industries runs into the “last mile problem”: getting the AI to understand domain-specific context and workflows.
An example from healthcare shows the complexity of determining what constitutes "unsuccessful conservative therapy for at least six weeks" for a knee operation, illustrating ambiguities and nuances in clinical reasoning.
The challenge lies in embedding subtle, situational knowledge into the system, not just in raw model performance.
System over Model: Achieving High Performance 05:09
Achieving a high baseline accuracy (around 95%) is feasible with strong models, but further gains require integrating domain insights at the system level.
Through iterations focused on incorporating domain insights, the system at Anterior improved performance to about 99%.
The approach centers on an adaptive domain intelligence engine that translates customer-specific domain insights into measurable performance improvements.
The process divides into two main parts: Measurement (quantifying current pipeline performance) and Improvement (iteratively enhancing it).
The first measurement step is defining the key metrics users care about (e.g., minimizing false approvals in medical reviews).
Collaborate with domain experts and customers to distill the most vital metrics (often just one or two per domain).
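As a minimal sketch of how such a metric might be computed (illustrative only, not Anterior's actual schema; the `ReviewedCase` fields are assumptions):

```python
from dataclasses import dataclass

@dataclass
class ReviewedCase:
    ai_decision: str      # "approve" or "deny", as produced by the pipeline
    expert_decision: str  # ground-truth label from a domain expert

def false_approval_rate(cases: list[ReviewedCase]) -> float:
    """Share of AI approvals that a domain expert says should have been denials."""
    approvals = [c for c in cases if c.ai_decision == "approve"]
    if not approvals:
        return 0.0
    false_approvals = sum(1 for c in approvals if c.expert_decision == "deny")
    return false_approvals / len(approvals)
```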
Develop a failure mode ontology to categorize types of AI errors (e.g., medical record extraction, clinical reasoning, rules interpretation).
Domain experts should take the lead in evaluating AI outputs, judging whether results are correct and labeling failure modes, often using a bespoke dashboard.
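The ontology and the expert's labels might be captured with structures like the following; the three categories come from the examples above, while the record layout is an assumption of this sketch:

```python
from dataclasses import dataclass
from enum import Enum

class FailureMode(Enum):
    # Top-level categories mentioned in the talk; a real ontology would be finer-grained.
    RECORD_EXTRACTION = "medical_record_extraction"
    CLINICAL_REASONING = "clinical_reasoning"
    RULES_INTERPRETATION = "rules_interpretation"

@dataclass
class ExpertReview:
    case_id: str
    is_correct: bool
    failure_mode: FailureMode | None = None  # set only when is_correct is False
    notes: str = ""                          # free-text context from the reviewer
```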
This dual approach (metrics plus failure mode labeling) helps prioritize the improvements with the greatest impact on core metrics.
Failure mode labeling creates production-derived datasets for targeted iteration, enabling engineers to focus on the primary causes of incorrect outcomes.
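Continuing the types from the sketch above, grouping labeled reviews into per-failure-mode datasets could be as simple as:

```python
from collections import defaultdict

def build_failure_datasets(
    reviews: list[ExpertReview],
) -> dict[FailureMode, list[ExpertReview]]:
    """Turn production reviews into one targeted eval dataset per failure mode."""
    datasets: dict[FailureMode, list[ExpertReview]] = defaultdict(list)
    for review in reviews:
        if not review.is_correct and review.failure_mode is not None:
            datasets[review.failure_mode].append(review)
    return dict(datasets)
```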
Pipeline versions are tracked for performance against each failure mode, ensuring targeted improvements and preventing regressions.
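Per-version tracking could then look something like this sketch, where the per-mode accuracy scores are assumed to come from running each pipeline version against the datasets above:

```python
def regression_report(
    old_scores: dict[FailureMode, float],  # per-mode accuracy of the live version
    new_scores: dict[FailureMode, float],  # per-mode accuracy of the candidate version
    tolerance: float = 0.01,
) -> dict[FailureMode, str]:
    """Flag any failure mode where the candidate performs worse than the live version."""
    report = {}
    for mode in set(old_scores) | set(new_scores):
        old, new = old_scores.get(mode, 0.0), new_scores.get(mode, 0.0)
        status = "REGRESSION" if new < old - tolerance else "ok"
        report[mode] = f"{status}: {old:.1%} -> {new:.1%}"
    return report
```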
Tooling empowers domain experts (non-technical users) to suggest changes or add new domain knowledge that feeds directly into the application pipeline.
Data-driven evaluations assess the impact of these suggestions, with fast feedback loops; validated corrections can often go live within a day.
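One way such a gate might work, as a hedged sketch: run the pipeline on the relevant failure-mode dataset with and without the expert's suggested knowledge, and promote the suggestion only if accuracy improves. Here `run_case` is a hypothetical stand-in for the actual pipeline invocation:

```python
from typing import Callable

def evaluate_suggestion(
    cases: list[ExpertReview],
    run_case: Callable[[str, str | None], bool],  # (case_id, extra_knowledge) -> correct?
    suggestion: str,
    min_gain: float = 0.0,
) -> bool:
    """Accept an expert's suggestion only if it measurably helps on the target dataset."""
    if not cases:
        return False  # nothing to validate against
    baseline = sum(run_case(c.case_id, None) for c in cases) / len(cases)
    candidate = sum(run_case(c.case_id, suggestion) for c in cases) / len(cases)
    return candidate - baseline > min_gain
```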
In this cycle, a single domain expert review simultaneously generates performance metrics, failure mode labels, and improvement suggestions.
The level of domain expertise required varies by use case; for complex clinical reasoning, experienced doctors are ideal, but simpler tasks may need only junior clinical staff.
Tooling for the domain intelligence engine is typically custom-built so it integrates tightly with the application platform.
Initially, in-house domain experts perform reviews and iteration, but there is potential for customer organizations to use these tools themselves for validation.
Production applications generate AI outputs, which domain experts review to provide insights on metrics and failure modes.
A domain expert PM uses this feedback to prioritize improvement tasks for engineers, specifying target thresholds.
Engineers iterate using ready-made failure mode datasets, updating the PM with results for go-live decisions.
The process ensures a self-improving, data-driven system managed by a domain expert PM.
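Tying the sketches together, one turn of that loop might be orchestrated roughly as follows, with the PM's prioritization reduced to "largest failure-mode dataset first" purely for illustration:

```python
def next_engineering_target(reviews: list[ExpertReview]) -> FailureMode | None:
    """Pick the failure mode causing the most incorrect outputs as the next task."""
    datasets = build_failure_datasets(reviews)
    if not datasets:
        return None  # nothing to fix this cycle
    worst = max(datasets, key=lambda mode: len(datasets[mode]))
    print(f"Next target: {worst.value} ({len(datasets[worst])} labeled cases)")
    return worst
```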
Final takeaway: solving the last mile in domain LLM applications is about embedding adaptive domain intelligence and leveraging domain experts for ongoing improvement, not just improving model sophistication.
Lovejoy recommends further reading on his website and invites contact for those interested in vertical AI applications, evaluations, or roles at Anterior.