Evaluating AI Search: A Practical Framework for Augmented AI Systems — Quotient AI + Tavily
Introduction & Challenges of AI Search Evaluation 00:01
Modern AI agents operate in dynamic, unpredictable environments that differ from those of traditional software systems.
These agents face multiple failure modes simultaneously, including hallucinations, retrieval failures, and reasoning errors, which are often interconnected.
Quotient AI monitors live AI agents, enabling detection of objective system failures without waiting for ground truth data or benchmarks.
Real-World Use Cases & Evaluation Principles 02:30
Tavily provides real-time web data integration for AI systems, supporting use cases like legal AI assistants, live sports updates, and credit card fraud detection.
Evaluation must accommodate rapidly changing data and subjective truths dependent on timing, sources, and user needs.
Evaluation methods should strive for fairness and minimize bias, as correctness can be contextual.
Static datasets like SimpleQA (single-answer fact questions) and HotpotQA (multi-hop reasoning) are common for offline evaluation.
Static benchmarks are limited when dealing with real-time systems and evolving information, as they do not capture subjectivity or the lack of a single truth.
Dynamic datasets, regularly refreshed with real-world data, provide broader coverage and continuous relevance for evaluating retrieval-augmented generation (RAG) systems.
An open-source agent was developed to create dynamic evaluation sets for web-based RAG systems, leveraging the LangGraph framework.
The process involves generating broad web queries for targeted domains, aggregating grounding documents from multiple real-time AI search providers, and creating evidence-based question-answer pairs with source traceability.
Evaluation experiments are tracked using LangSmith for observability and reproducibility.
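As a rough illustration of this pipeline, a LangGraph graph of that shape might look like the sketch below; the state schema, node names, prompts, judge model, and the llm() helper are illustrative assumptions, not the team's actual open-source agent.

```python
# Minimal sketch of a LangGraph pipeline for dynamic eval-set generation.
# Node names, the EvalSetState schema, prompts, and model choice are assumptions.
from typing import List, TypedDict

from langchain_openai import ChatOpenAI
from langgraph.graph import END, START, StateGraph
from tavily import TavilyClient

tavily = TavilyClient(api_key="...")          # real-time web search provider
chat = ChatOpenAI(model="gpt-4o-mini")        # any chat model works here (assumption)

def llm(prompt: str) -> str:
    return chat.invoke(prompt).content

class EvalSetState(TypedDict):
    domain: str
    queries: List[str]
    documents: List[dict]   # grounding docs with source URLs
    qa_pairs: List[dict]    # evidence-based Q/A pairs with source traceability

def generate_queries(state: EvalSetState) -> dict:
    # Ask an LLM for broad, current web queries covering the target domain.
    text = llm(f"List 10 diverse, up-to-date web search queries about {state['domain']}, one per line.")
    return {"queries": [q.strip("- ").strip() for q in text.splitlines() if q.strip()]}

def gather_documents(state: EvalSetState) -> dict:
    # Aggregate grounding documents from a real-time search provider.
    docs = []
    for q in state["queries"]:
        for r in tavily.search(q)["results"]:
            docs.append({"query": q, "url": r["url"], "content": r["content"]})
    return {"documents": docs}

def build_qa_pairs(state: EvalSetState) -> dict:
    # Turn each grounding document into an evidence-backed Q/A pair,
    # keeping the source URL for traceability.
    pairs = []
    for doc in state["documents"]:
        qa = llm(f"Write one question answerable only from this text, then its answer:\n{doc['content']}")
        pairs.append({"qa": qa, "source": doc["url"]})
    return {"qa_pairs": pairs}

graph = StateGraph(EvalSetState)
graph.add_node("generate_queries", generate_queries)
graph.add_node("gather_documents", gather_documents)
graph.add_node("build_qa_pairs", build_qa_pairs)
graph.add_edge(START, "generate_queries")
graph.add_edge("generate_queries", "gather_documents")
graph.add_edge("gather_documents", "build_qa_pairs")
graph.add_edge("build_qa_pairs", END)

agent = graph.compile()
# dataset = agent.invoke({"domain": "credit card fraud"})
# With LANGCHAIN_TRACING_V2=true and LANGCHAIN_API_KEY set, runs are typically traced in LangSmith.
```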
The team aims to support a wider range of question types (from simple to multi-hop) and proactively address bias by ensuring fairness and broad coverage.
Plans include adding supervisor nodes for coordination in multi-agent architectures to enhance the quality of generated data.
Evaluation frameworks should measure accuracy, source diversity, source relevancy, and hallucination rates, with unsupervised methods to scale evaluations and address subjectivity.
Experiment: Static vs. Dynamic Benchmarks & Reference-Free Metrics 10:21
An experiment was conducted on six AI search providers using both static (SimpleQA) and dynamic benchmarks covering similar topic distributions.
Correctness scores on dynamic benchmarks were significantly lower and provider rankings shifted, highlighting that static benchmarks are not comprehensive.
Issues were identified where evaluation metrics did not fully capture accuracy or hallucinations in model responses.
Three reference-free metrics were used: answer completeness, document relevance, and hallucination detection.
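As a rough sketch of how such reference-free metrics can be scored without ground-truth answers, the snippet below uses an LLM judge for each dimension; the prompts, 0–1 scoring scheme, and judge model are assumptions, not Quotient AI's actual detection pipeline.

```python
# Illustrative reference-free metrics scored with an LLM judge (assumptions throughout).
from langchain_openai import ChatOpenAI

judge = ChatOpenAI(model="gpt-4o-mini", temperature=0)

def _score(prompt: str) -> float:
    # Ask the judge for a single number between 0 and 1.
    reply = judge.invoke(prompt + "\nRespond with only a number between 0 and 1.").content
    try:
        return max(0.0, min(1.0, float(reply.strip())))
    except ValueError:
        return 0.0

def answer_completeness(question: str, answer: str) -> float:
    # Does the answer fully address every part of the question? No reference answer needed.
    return _score(f"Question: {question}\nAnswer: {answer}\nHow completely does the answer address the question?")

def document_relevance(question: str, documents: list[str]) -> float:
    # Average relevance of the retrieved grounding documents to the question.
    if not documents:
        return 0.0
    scores = [_score(f"Question: {question}\nDocument: {d}\nHow relevant is the document to the question?")
              for d in documents]
    return sum(scores) / len(scores)

def hallucination_rate(answer: str, documents: list[str]) -> float:
    # Share of the answer's claims that are NOT supported by the grounding documents.
    context = "\n---\n".join(documents)
    supported = _score(f"Documents:\n{context}\nAnswer: {answer}\n"
                       "What fraction of the answer's claims are supported by the documents?")
    return 1.0 - supported
```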
Answer completeness closely correlated with overall provider performance.
Only three providers returned the full grounding documents, limiting the broader applicability of the document relevance and hallucination metrics.
Findings showed a strong inverse correlation between document relevance and unknown answers; more relevant documents reduced the rate of the model saying “I don’t know.”
Unexpectedly, higher document relevance sometimes corresponded to higher hallucination rates, suggesting a trade-off between answer completeness and hallucination risk.
Depending on the application, different metrics may be prioritized, as each measures a distinct dimension of response quality.
Jointly analyzing these metrics helps diagnose issues (e.g., incomplete answers despite relevant documents may indicate insufficient retrieval coverage) and points to targeted improvements.
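A minimal sketch of that kind of joint analysis is shown below, aggregating per-query metrics by provider and applying simple diagnostic rules; the results file, column names, and thresholds are hypothetical.

```python
# Joint metric analysis over per-query results (column names and thresholds are assumptions).
import pandas as pd

# One row per (provider, question): the metrics above plus a flag for "I don't know" responses.
results = pd.read_csv("eval_results.csv")  # hypothetical file

# How do the metrics move together, e.g., document relevance vs. unknown-answer rate?
per_provider = results.groupby("provider")[
    ["completeness", "doc_relevance", "hallucination_rate", "answered_unknown"]
].mean()
print(per_provider.corr())

# Simple diagnostic rules that turn metric combinations into actionable hints.
def diagnose(row: pd.Series) -> str:
    if row["doc_relevance"] >= 0.7 and row["completeness"] < 0.5:
        return "relevant docs but incomplete answers -> retrieve more / broaden coverage"
    if row["doc_relevance"] < 0.5:
        return "low document relevance -> improve query formulation or search provider"
    if row["hallucination_rate"] > 0.3:
        return "grounding present but answers drift -> tighten generation prompting"
    return "no dominant failure mode"

print(per_provider.apply(diagnose, axis=1))
```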
Effective evaluation should extend beyond rankings to actionable strategies for improving system performance.
The Future: Continuous Self-Improving AI Systems 19:32
The ultimate goal is self-improving AI agents that learn from usage patterns, adapt to outdated or unreliable information, and proactively correct hallucinations during interactions, all without human intervention.
Dynamic datasets, holistic evaluation, and reference-free metrics are foundational steps toward achieving robust, continuously improving augmented AI systems.