The evaluator assesses how well the RAG pipeline answers questions, returning a score along with the reasons behind low scores.
An LLM-as-a-judge approach was chosen because it is easy to build and deploy, whereas classic NLP metrics often require ground-truth answers.
The speaker chose a ground-truth-based LLM judge, built on 20 example questions, each paired with three facts expected in the answer.
The evaluator checks whether these facts are present in the RAG-generated answer, returning a pass/fail verdict and a reason for each fact, which roll up into a numerical score (total correct facts out of 60, i.e. 20 questions × 3 facts).
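A minimal sketch of such a fact-checking evaluator is shown below, assuming an OpenAI-style chat client; the model name, prompt wording, and example data are illustrative assumptions, not the speaker's actual setup.

```python
# Sketch of a ground-truth-based LLM judge: for each expected fact, ask a judge
# model whether the fact appears in the RAG answer, then aggregate into a score.
import json
from openai import OpenAI

client = OpenAI()

EXAMPLES = [
    {
        "question": "What does the returns policy cover?",  # illustrative example
        "expected_facts": [
            "Returns are accepted within 30 days",
            "Items must be unused",
            "Refunds go to the original payment method",
        ],
    },
    # ... 19 more questions, each with three expected facts
]

JUDGE_PROMPT = """You are grading a RAG answer.
Question: {question}
Expected fact: {fact}
Answer: {answer}
Reply with JSON: {{"pass": true/false, "reason": "..."}}"""


def judge_fact(question: str, fact: str, answer: str) -> dict:
    """Ask the judge LLM whether a single expected fact appears in the answer."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any capable judge model works here
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, fact=fact, answer=answer)}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)


def evaluate(rag_answer_fn) -> tuple[float, list[str]]:
    """Score the pipeline: fraction of expected facts found, plus failure reasons."""
    passed, failures = 0, []
    total = sum(len(ex["expected_facts"]) for ex in EXAMPLES)  # 20 x 3 = 60
    for ex in EXAMPLES:
        answer = rag_answer_fn(ex["question"])
        for fact in ex["expected_facts"]:
            verdict = judge_fact(ex["question"], fact, answer)
            if verdict["pass"]:
                passed += 1
            else:
                failures.append(verdict["reason"])
    return passed / total, failures
```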
The agent's role is to optimize the prompt by researching online prompting guides and combining their advice with the failure reasons reported by the evaluator.
It first runs the evaluator to get a baseline score, then iteratively generates new prompts from that feedback and feeds each candidate back to the evaluator.
The speaker compares this loop to classic machine-learning training, specifically gradient ascent on the evaluation score.
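The loop itself can be sketched as a simple hill climb over the evaluator score. Here `propose_prompt` and `build_rag_fn` are hypothetical helpers standing in for the agent's prompt-rewriting step and for rebuilding the RAG pipeline with a candidate prompt; the structure, not the names, is the point.

```python
def optimize_prompt(initial_prompt: str, build_rag_fn, iterations: int = 3) -> str:
    """Hill-climb on the evaluator score, loosely analogous to gradient ascent."""
    best_prompt = initial_prompt
    best_score, failures = evaluate(build_rag_fn(best_prompt))
    print(f"baseline score: {best_score:.2f}")

    for step in range(iterations):
        # The failure reasons act as the "gradient": they point at what to fix.
        candidate = propose_prompt(best_prompt, failures)  # hypothetical LLM call
        score, new_failures = evaluate(build_rag_fn(candidate))
        print(f"iteration {step + 1}: score {score:.2f}")
        if score > best_score:
            best_prompt, best_score, failures = candidate, score, new_failures
    return best_prompt
```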
Using CrewAI, the agent started with an initial score of 0.44 and, after two iterations, achieved a score of 0.9, meaning 90% of expected facts were correct.
The agent successfully generated a detailed, optimized prompt without manual prompt engineering.
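For orientation, a minimal CrewAI setup for such an optimizer agent might look like the sketch below; the role, goal, and task wording are assumptions, and the speaker's actual agent would additionally be wired to a web-search tool and to the evaluator.

```python
from crewai import Agent, Task, Crew

# Assumed agent definition: an optimizer that studies prompting guides and
# evaluator feedback to propose better prompts.
optimizer = Agent(
    role="Prompt optimizer",
    goal="Improve the RAG system prompt until the evaluator score stops rising",
    backstory="An engineer who applies prompting best practices and evaluator feedback.",
)

optimize_task = Task(
    description=(
        "Research prompting best practices online, review the evaluator's "
        "failure reasons, and propose an improved system prompt for the RAG pipeline."
    ),
    expected_output="A revised system prompt.",
    agent=optimizer,
)

crew = Crew(agents=[optimizer], tasks=[optimize_task])
result = crew.kickoff()
print(result)
```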