The evaluator assesses how well the RAG pipeline answers questions, returning a score along with the reasons behind low scores.
An LLM-as-a-judge approach was chosen because it is easy to build and deploy, whereas classic NLP metrics often require ground-truth answers.
The speaker chose a ground-truth-based LLM judge, built on 20 example questions, each paired with three facts expected in the answer.
The evaluator checks whether these facts are present in the RAG-generated answer, returning a pass/fail verdict and a reason for each fact, which roll up into a numerical score (total correct facts out of 60, i.e. 20 questions × 3 facts).
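A minimal sketch of such a fact-checking evaluator is shown below, assuming an OpenAI-style chat client; the model name, prompt wording, and example data are illustrative assumptions, not the speaker's actual setup.

```python
# Sketch of a ground-truth-based LLM judge: for each expected fact, ask a judge
# model whether the fact appears in the RAG answer, then aggregate into a score.
import json
from openai import OpenAI

client = OpenAI()

EXAMPLES = [
    {
        "question": "What does the returns policy cover?",  # illustrative example
        "expected_facts": [
            "Returns are accepted within 30 days",
            "Items must be unused",
            "Refunds go to the original payment method",
        ],
    },
    # ... 19 more questions, each with three expected facts
]

JUDGE_PROMPT = """You are grading a RAG answer.
Question: {question}
Expected fact: {fact}
Answer: {answer}
Reply with JSON: {{"pass": true/false, "reason": "..."}}"""


def judge_fact(question: str, fact: str, answer: str) -> dict:
    """Ask the judge LLM whether a single expected fact appears in the answer."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any capable judge model works here
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, fact=fact, answer=answer)}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)


def evaluate(rag_answer_fn) -> tuple[float, list[str]]:
    """Score the pipeline: fraction of expected facts found, plus failure reasons."""
    passed, failures = 0, []
    total = sum(len(ex["expected_facts"]) for ex in EXAMPLES)  # 20 x 3 = 60
    for ex in EXAMPLES:
        answer = rag_answer_fn(ex["question"])
        for fact in ex["expected_facts"]:
            verdict = judge_fact(ex["question"], fact, answer)
            if verdict["pass"]:
                passed += 1
            else:
                failures.append(verdict["reason"])
    return passed / total, failures
```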
The agent's role is to optimize the prompt by researching online prompting guides and combining their advice with the failure reasons reported by the evaluator.
It first runs the evaluator to get a baseline score, then iteratively generates new prompts from that feedback and feeds each candidate back to the evaluator.
The speaker compares this loop to classic machine-learning training, specifically gradient ascent on the evaluation score.
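The loop itself can be sketched as a simple hill climb over the evaluator score. Here `propose_prompt` and `build_rag_fn` are hypothetical helpers standing in for the agent's prompt-rewriting step and for rebuilding the RAG pipeline with a candidate prompt; the structure, not the names, is the point.

```python
def optimize_prompt(initial_prompt: str, build_rag_fn, iterations: int = 3) -> str:
    """Hill-climb on the evaluator score, loosely analogous to gradient ascent."""
    best_prompt = initial_prompt
    best_score, failures = evaluate(build_rag_fn(best_prompt))
    print(f"baseline score: {best_score:.2f}")

    for step in range(iterations):
        # The failure reasons act as the "gradient": they point at what to fix.
        candidate = propose_prompt(best_prompt, failures)  # hypothetical LLM call
        score, new_failures = evaluate(build_rag_fn(candidate))
        print(f"iteration {step + 1}: score {score:.2f}")
        if score > best_score:
            best_prompt, best_score, failures = candidate, score, new_failures
    return best_prompt
```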
Using CrewAI, the agent started with an initial score of 0.44 and, after two iterations, achieved a score of 0.9, meaning 90% of expected facts were correct.
The agent successfully generated a detailed, optimized prompt without manual prompt engineering.
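For orientation, a minimal CrewAI setup for such an optimizer agent might look like the sketch below; the role, goal, and task wording are assumptions, and the speaker's actual agent would additionally be wired to a web-search tool and to the evaluator.

```python
from crewai import Agent, Task, Crew

# Assumed agent definition: an optimizer that studies prompting guides and
# evaluator feedback to propose better prompts.
optimizer = Agent(
    role="Prompt optimizer",
    goal="Improve the RAG system prompt until the evaluator score stops rising",
    backstory="An engineer who applies prompting best practices and evaluator feedback.",
)

optimize_task = Task(
    description=(
        "Research prompting best practices online, review the evaluator's "
        "failure reasons, and propose an improved system prompt for the RAG pipeline."
    ),
    expected_output="A revised system prompt.",
    agent=optimizer,
)

crew = Crew(agents=[optimizer], tasks=[optimize_task])
result = crew.kickoff()
print(result)
```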