Prompt Engineering and AI Red Teaming — Sander Schulhoff, HackAPrompt/LearnPrompting

Introduction and Background 00:00

  • Sander Schulhoff introduces himself as CEO of Learn Prompting and HackAPrompt, with a background in AI research, NLP, and deep reinforcement learning.
  • Early involvement in prompt engineering, writing the first internet guide, and expanding into prompt injection and AI security.
  • Organized the first prompt injection/AI red teaming competition, leading to the creation of a 600,000-prompt dataset now widely used for benchmarking.
  • Goals for the session: explain why prompt engineering remains relevant, discuss security deployments, and highlight the challenges of securing generative AI.

Story and Context: Path to Prompt Engineering 04:09

  • Gained initial experience through AI deception research in the Diplomacy board game, which later tied into relevance for modern AI systems.
  • Contributed to the MinRL (Minecraft RL) project—connecting reinforcement learning research to the emerging trend of AI “agents."
  • Created Learn Prompting as a college project, scaling it into a major resource cited by OpenAI, Google, BCG, US Gov, and others.

Fundamentals of Prompt Engineering 09:16

  • Definition: A "prompt" is simply a message sent to a generative AI; prompt engineering is the process of improving that prompt for better results.
  • Prompt engineering can increase AI task accuracy significantly, but poorly crafted prompts can drop accuracy to zero.
  • Prompting, as a concept, goes back years under various names (e.g., control codes), but "prompt engineering" only became a widely used term around 2021.
  • Two main types of prompt engineering users: non-technical (iterative, conversational mode with chatbots) and technical (static, system-level prompts for tasks).

Systematic Literature Review and Techniques 17:26

  • Schulhoff led a large-scale literature review (the “prompt report”) cataloguing 200 prompting techniques, including about 58 text-based ones.
  • Defined key prompt parts (e.g., role, examples) and clarified which components are most effective across real-world usages.
  • Role prompting (assigning AI a "role") was thought to improve accuracy (e.g., "math professor" for solving math), but evidence showed it's largely ineffective for accuracy-based tasks and more urban myth than fact. For open-ended tasks like writing, it can still help.

Advanced Prompting Techniques 29:26

  • Thought Inducement: Chain of thought prompting, where AI is instructed to show step-by-step reasoning, is vital for accuracy and inspired reasoning-model development. The AI’s "explanations" do not always reflect its internal process but improve outcomes.
  • Decomposition-Based Prompting: Techniques like least-to-most prompting split complex problems into solvable subproblems.
  • Ensembling: Using multiple prompts or models to reach a consensus answer; less used today.
  • In-Context Learning and Few-Shot Prompting: Providing task examples during prompt creation remains a cornerstone technique. The number, order, balance, and quality of examples can majorly impact performance, but optimal parameters are highly variable and often trial-and-error.
  • Prompt performance can fluctuate based on model fine-tuning and even prompt mining—choosing prompt styles that match model training data yields better results.

Practical Challenges and Open Questions in Prompting 43:50

  • Exemplar ordering (order of examples) can shift accuracy dramatically; the field lacks consensus on optimal organization.
  • Balancing labels (class distribution) and checking quality is as important as in classical ML but subject to unusual quirks in LLMs.
  • Similarity between prompt examples and target instances may help, but results can also conflict between studies.
  • Prompt length and format matter; too lengthy prompts may degrade results, but evidence isn’t definitive.

Human vs. Automated Prompt Engineering 60:02

  • Schulhoff and his team compared dozens of prompt engineering techniques on tasks like detecting indicators of suicidal intent in Reddit comments.
  • Manual prompt engineering plateaued in performance, with automated prompt engineering tools (e.g., DSPY) outperforming or enhancing results when combined with human input.

Issues with Benchmarks and Reasoning Models 69:59

  • Benchmarks are often convoluted by unclear methodology and prompt strategies, undermining direct model comparison.
  • For latest reasoning models, explicit chain-of-thought prompting is usually unnecessary—and may even hinder performance—though most general prompt advice still applies.

Towards Automated Prompt Technique Selection 73:06

  • Meta-prompting (using LLMs to optimize prompts) exists as product features, but without clear reward functions, their effectiveness is limited.
  • No robust cross-model prompt transfer methodology exists; prompts that work on one model may or may not succeed elsewhere.
  • Red teaming experience shows prompt attacks have some transferability (e.g., 40% from GPT-3 to GPT-4).

Introduction to AI Red Teaming and Security 82:01

  • AI red teaming = “getting AIs to do or say bad things;” jailbreaking is a subset, using intentional manipulative prompts.
  • Many creative attack strategies exist (e.g., role-based, multilingual, encoding tricks), such as the "grandmother" or "Stan" jailbreaks.
  • Prompt injection involves bypassing developer instructions; historically shown to easily defeat simple system prompts.

Real-World Red Teaming Harms and Incidents 90:28

  • Discussed real incidents: chatbots tricked into making hazardous statements or performing unauthorized actions (e.g., car dealership bot, crypto payout bots, math-solving apps leaking secrets).
  • These incidents usually stem from classical security oversights, although prompt injection remains an unsolved threat.

Classical Cybersecurity vs. AI Security 94:13

  • Classical cybersecurity is binary (threats can be fully patched); AI security is probabilistic and never fully closed, due to the nature of LLMs (non-determinism and prompt flexibility).
  • Prompt injection vulnerability is inherent and intractable—no guarantee of full defense, only statistical mitigation.

Philosophies and Observations from AI Red Teaming 98:35

  • Jailbreaks are easily and quickly found in new models despite security claims.
  • Automated red teaming and improved datasets are essential for raising the bar in AI security, but perfect defense is unachievable.
  • Defensive strategies like improved system prompts or filter models are largely ineffective; obfuscation and encoding can bypass most current protections.

Challenges with Agents and Agentic Security 107:20

  • True agent security is unsolved: agents acting in the real world (physical or digital) remain vulnerable to prompt-based exploits (“adversarial robustness”).
  • Humans can manipulate or coerce agents into harmful or unintended behaviors, endangering deployment at scale.
  • Companies are deploying insecure agents, risking financial loss and customer harm.

HackAPrompt Competition and Live Red Teaming 116:44

  • HackAPrompt offers ongoing AI red teaming challenges, with realistic tasks such as extracting harmful instructions or bypassing policy restrictions.
  • Dataset and challenges from the competition are now broadly used by major labs for model testing and improvement.
  • Participants are encouraged to experiment with advanced prompts and techniques to expose weaknesses.

Final Q&A and Closing 113:15 / 116:44 / 120:45

  • Prompt filters (input/output) can often be circumvented with encoding/translation tricks.
  • There are psychological and subtle manipulation threats (e.g., priming users via LLM output), shown to have real effects and raise ethical issues.
  • Attendees are invited to try the competition and reach out for further questions or collaboration.