Prompt Engineering is Dead — Nir Gazit, Traceloop

The Problem with Prompt Engineering 00:00

  • The speaker argues that "prompt engineering is dead" and never truly existed as a discipline, since it usually amounts to manually tinkering with prompts to make LLMs behave.
  • A personal story is shared about improving a RAG-based chatbot by 5x without traditional prompt engineering.
  • The chatbot initially struggled to meet its requirements: answering only Traceloop-related questions, being genuinely useful, and avoiding mistakes.
  • The speaker's goal was to automate the improvement process rather than manually iterating on prompts.

Building an Auto-Improving System 01:55

  • The vision is an "automatically improving machine" or agent that researches and applies prompt engineering techniques.
  • This system requires an evaluator to assess improvements and a dataset of questions for the chatbot.
  • The overall architecture consists of a RAG pipeline, an evaluator, and an auto-improving agent.
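
As a rough illustration of that architecture, here is a minimal sketch of how the three components might be wired together. The interfaces, names, and signatures are hypothetical, not taken from the Traceloop demo code.

```python
# Illustrative interfaces for the three components; names and signatures
# are hypothetical, not taken from the Traceloop demo code.
from dataclasses import dataclass
from typing import Protocol


@dataclass
class EvalResult:
    score: float                 # fraction of expected facts found in the answers
    failure_reasons: list[str]   # explanations for the facts that were missing


class RagPipeline(Protocol):
    def answer(self, question: str, system_prompt: str) -> str: ...


class Evaluator(Protocol):
    def evaluate(self, pipeline: RagPipeline, system_prompt: str) -> EvalResult: ...


class AutoImprovingAgent(Protocol):
    def improve(self, current_prompt: str, result: EvalResult) -> str: ...
```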

RAG Pipeline Overview 02:55

  • The RAG pipeline is described as simple: a Chroma database retrieves the relevant documents, and OpenAI generates the answers.
  • A demonstration shows the pipeline answering a question and displaying its internal trace, including calls to OpenAI and the Chroma database.
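
A minimal version of such a pipeline might look like the sketch below: retrieve context from a Chroma collection, then ask OpenAI to answer from it. The model name, collection name, and prompt wording are assumptions, not the demo's exact code.

```python
# Minimal RAG pipeline sketch: Chroma for retrieval, OpenAI for generation.
import chromadb
from openai import OpenAI

chroma = chromadb.Client()
# Documents are assumed to have been added to this collection beforehand.
collection = chroma.get_or_create_collection("traceloop_docs")
openai_client = OpenAI()


def answer(question: str, system_prompt: str, n_results: int = 3) -> str:
    # Retrieve the most relevant documents from the Chroma collection.
    results = collection.query(query_texts=[question], n_results=n_results)
    context = "\n\n".join(results["documents"][0])

    # Ask the LLM to answer using only the retrieved context.
    response = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```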

The Evaluator Component 04:06

  • The evaluator assesses how well the RAG pipeline responds to questions, providing a score and reasons for low scores.
  • An LLM-as-a-judge was chosen for its ease of building and deployment, over classic NLP metrics, which often require ground-truth answers.
  • The speaker chose a ground-truth-based LLM judge, using 20 example questions, each with three expected facts in the answer.
  • The evaluator checks whether each fact is present in the RAG-generated answer, returning a pass/fail verdict and a reason, and aggregates these into a numerical score: total correct facts out of 60 (20 questions × 3 facts each).
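
A hedged sketch of this kind of ground-truth-based judge is shown below. The judge prompt, JSON shape, model name, and the example question and facts are placeholders, not the talk's exact implementation.

```python
# Sketch of an LLM-as-a-judge: for each question, check whether each
# expected fact appears in the generated answer.
import json
from openai import OpenAI

judge = OpenAI()

# 20 entries in practice; abbreviated here with placeholder content.
dataset = [
    {
        "question": "What does Traceloop do?",
        "expected_facts": [
            "Traceloop provides LLM observability",
            "It is built on OpenTelemetry",
            "It can trace RAG pipelines",
        ],
    },
    # ...
]


def judge_fact(answer: str, fact: str) -> dict:
    """Ask the judge LLM whether a single expected fact is present in the answer."""
    response = judge.chat.completions.create(
        model="gpt-4o-mini",
        response_format={"type": "json_object"},
        messages=[
            {
                "role": "user",
                "content": (
                    "Does the following answer contain this fact?\n"
                    f"Fact: {fact}\nAnswer: {answer}\n"
                    'Reply as JSON: {"pass": true or false, "reason": "..."}'
                ),
            }
        ],
    )
    return json.loads(response.choices[0].message.content)


def evaluate(rag_answer_fn, system_prompt: str) -> tuple[float, list[str]]:
    """Score a prompt: fraction of expected facts found, plus failure reasons."""
    correct, failures = 0, []
    for example in dataset:
        answer = rag_answer_fn(example["question"], system_prompt)
        for fact in example["expected_facts"]:
            verdict = judge_fact(answer, fact)
            if verdict["pass"]:
                correct += 1
            else:
                failures.append(verdict["reason"])
    total_facts = sum(len(e["expected_facts"]) for e in dataset)  # 60 with 20 questions
    return correct / total_facts, failures
```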

The Auto-Improving Agent in Practice 09:08

  • The agent's role is to optimize the prompt by researching online prompting guides and combining them with failure reasons from the evaluator.
  • It runs the evaluator to get an initial score, then iteratively generates new prompts based on feedback, feeding them back to the evaluator.
  • This process is compared to classic machine learning training, specifically gradient ascent.
  • Using CrewAI, the agent started with an initial score of 0.44 and, after two iterations, achieved a score of 0.9, meaning 90% of expected facts were correct.
  • The agent successfully generated a detailed, optimized prompt without manual prompt engineering.
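
The talk's agent is built with CrewAI; the sketch below is a simplified stand-in using plain OpenAI calls that captures the same loop (and omits the step where the agent researches prompting guides online): score the current prompt, ask an LLM to rewrite it using the failure reasons, and keep the best-scoring version. It assumes an `evaluate_prompt(prompt)` callable returning a score and failure reasons, such as the evaluator sketched earlier partially applied to the RAG pipeline.

```python
# Simplified stand-in for the auto-improving agent: iterative prompt
# rewriting driven by evaluator feedback, loosely analogous to gradient ascent.
from openai import OpenAI

client = OpenAI()


def optimize_prompt(initial_prompt: str, evaluate_prompt, iterations: int = 3) -> str:
    best_prompt = initial_prompt
    best_score, failures = evaluate_prompt(best_prompt)

    for _ in range(iterations):
        # Ask an LLM to rewrite the prompt using the evaluator's feedback.
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {
                    "role": "user",
                    "content": (
                        "You are optimizing a system prompt for a RAG chatbot.\n"
                        f"Current prompt:\n{best_prompt}\n\n"
                        f"Current score: {best_score:.2f}\n"
                        "Evaluator failure reasons:\n- " + "\n- ".join(failures) + "\n\n"
                        "Apply known prompt-engineering techniques to fix these "
                        "failures and return only the improved prompt."
                    ),
                }
            ],
        )
        candidate = response.choices[0].message.content
        score, candidate_failures = evaluate_prompt(candidate)
        if score > best_score:
            best_prompt, best_score, failures = candidate, score, candidate_failures

    return best_prompt
```

With the earlier sketches, this could be invoked as `optimize_prompt(initial_prompt, lambda p: evaluate(answer, p))`, where `answer` is the RAG pipeline function.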

Future Considerations and Conclusion 11:55

  • A concern about overfitting was raised, as using only 20 examples for optimization might make the prompt specific to those examples.
  • The ideal solution would be more examples, split into training, testing, and evaluation sets, similar to classic machine learning (see the sketch after this list).
  • The speaker acknowledges the irony of stating "prompt engineering is dead" while having to prompt-engineer the agent itself.
  • A future idea is to have the agent optimize its own prompts (for the evaluator or itself).
  • The Traceloop/autoprompting-demo repository is provided for users to try out the system.
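
As a small illustration of the overfitting fix suggested above, the snippet below splits the ground-truth examples into training, testing, and evaluation sets so the agent optimizes on one subset and gets its final score on data it never saw. The placeholder examples and the split ratios are arbitrary assumptions.

```python
# Toy illustration of splitting the ground-truth examples, as in classic ML.
import random

# Placeholder for the 20 ground-truth examples used by the evaluator.
examples = [{"question": f"question {i}", "expected_facts": []} for i in range(20)]

random.seed(42)
random.shuffle(examples)

n = len(examples)
training = examples[: int(0.6 * n)]              # used by the agent to optimize the prompt
testing = examples[int(0.6 * n): int(0.8 * n)]   # used to pick between candidate prompts
evaluation = examples[int(0.8 * n):]             # held out for the final, unbiased score
```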