AI is becoming widely adopted, but it comes with risks, as seen in headlines where chatbots can be tricked into saying unwanted things or leaking information.
AI models can be bypassed by prompt engineering tricks, such as prefacing a malicious question with a "life story" or spelling words backward (e.g., "Naba tow" for "how to loot a bank").
With the rise of AI agents, businesses are concerned about potential risks and malfunctions.
Engineers, who build trusted structures like bridges and dams, need to apply similar principles of iteration and testing to AI engineering to build public trust.
Building trustworthy AI systems is a "team sport," requiring collaboration with experts in security and AI risk, such as the Microsoft AI Red Team.
Microsoft Azure AI Foundry has partnered with the Microsoft AI Red Team, pioneers in identifying AI and LLM risks, to offer a solution for AI engineers.
This solution is a hosted version of the pyate (pyrit) Python package, wrapped in an easy-to-use SDK, and includes a hosted dashboard for evaluations.
The demo showcases a simple RAG (Retrieval-Augmented Generation) application running locally, interacting with a local model.
The system uses a Semantic Kernel agent and a red team plugin exposed by the SDK, allowing an agent to call into a red team agent for assistance.
In interactive mode, the agent can be prompted to find harmful prompts (e.g., in the violence category), generate them, send them to the target application, and analyze the response.
The agent can also apply transformation strategies, such as Base64 encoding, to prompts before sending them to the target.
For full end-to-end scanning, users can set up an AI project, define risk categories (four available, all by default), specify the number of objectives (questions), and choose attack strategies (e.g., string reversal, simple converters, or composing multiple strategies).
Demo results showed that GPT-4o with Azure AI Foundry's inbuilt guardrails performed well, with no successful attacks in a small sample size of 160.
In contrast, GPT-3.5 showed successful attacks (5 out of 40 in hate and fairness categories).
Direct model scans against Azure OpenAI configurations can also be performed, demonstrating that GPT-4.1 without guardrails had 25% successful attacks in the violence category and 20% in difficult complexity attacks, including decoding Cesar encoding.
Applying guardrails with GPT-4.1 significantly reduced attack success.
Integrating Red Teaming into an Overall AI Strategy 14:24
AI red teaming is a critical component of an overall strategy for developing and deploying trustworthy AI systems.
The recommended engineering framework involves:
Mapping out potential risks (e.g., agent type, data usage) before developing a production application.
Planning and implementing guardrails and controls from the outset.
Performing comprehensive evaluations, with red teaming being one key method.
Azure AI Foundry provides a suite of evaluators for both quality and risk/safety, including the AI red teaming agent, content classifiers for input and output, and evaluators for agentic applications.
Users can also integrate custom evaluators and apply mitigation strategies like content filters and prompt shields, which are available within Azure AI Foundry.
Guardrails function by filtering both input (e.g., preventing malicious queries) and output (e.g., preventing harmful content generation) and are implemented outside the raw AI model.