The workshop aims to teach participants about AI agents and multimodality, culminating in building a multimodal agent from scratch using Python.
The lead instructor, Apoorva Joshi, is an AI-focused developer advocate at MongoDB with six years of prior experience as a data scientist in cybersecurity.
The session is approximately 1 hour and 20 minutes, with about 55 minutes to an hour dedicated to hands-on coding or understanding the code.
An AI agent is defined as a system that uses a Large Language Model (LLM) to reason through a problem, create a plan, and execute/iterate on that plan with tools.
Three main paradigms for interacting with LLMs are simple prompting, Retrieval Augmented Generation (RAG), and agents.
Simple prompting relies on the LLM's pre-trained knowledge, limiting its ability to answer questions outside its parametric knowledge, handle complex queries, or provide personalized responses.
RAG augments the LLM's knowledge with external data sources, addressing some limitations but still not equipping the LLM for complex multi-step tasks or self-refinement.
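A minimal sketch of the RAG flow, using hypothetical `embed`, `vector_store`, and `llm` helpers rather than any specific library, illustrates the retrieve-then-augment pattern:

```python
# Minimal RAG sketch: retrieve relevant context, then augment the prompt.
# `embed`, `vector_store`, and `llm` are hypothetical placeholders for
# whichever embedding model, vector database, and LLM client you use.
def answer_with_rag(query: str, embed, vector_store, llm, k: int = 3) -> str:
    # 1. Retrieval: find the k chunks most similar to the query.
    query_vector = embed(query)
    chunks = vector_store.search(query_vector, top_k=k)

    # 2. Augmentation: place the retrieved context into the prompt.
    context = "\n\n".join(chunk.text for chunk in chunks)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )

    # 3. Generation: the LLM answers grounded in the retrieved context.
    return llm.generate(prompt)
```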
AI agents are suitable for complex, multi-step tasks, deep personalization, and adaptive learning, as they give the LLM agency to determine action sequences using tools and reasoning.
Agents come with higher cost and latency due to the LLM's extensive processing, so they should be reserved for tasks that genuinely need them, can tolerate higher latency, and accept non-deterministic outputs.
An agent typically has four main components: Perception, Planning and Reasoning, Tools, and Memory.
Perception is how agents gather information from their environment through inputs like text, images, voice, or video; this workshop focuses on text and images.
Planning and Reasoning is handled by LLMs, which determine how to solve a problem with guidance from prompts.
Planning without feedback (e.g., Chain of Thought) involves prompting the LLM to think step-by-step without modifying its initial plan based on tool outcomes.
Planning with feedback (e.g., ReAct, short for Reasoning and Acting) involves prompting the LLM to generate verbal reasoning traces and actions, making an observation after each action to inform the next step until a final answer is reached.
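In practice, ReAct is usually implemented as a prompt format in which the model interleaves thoughts, actions, and observations; the template below is an illustrative sketch, not the workshop's exact wording:

```python
# Illustrative ReAct-style prompt template; tool names, format markers,
# and the stop condition vary between implementations.
REACT_PROMPT = """Answer the question using the tools available to you.

Available tools: {tool_descriptions}

Use this format:
Thought: reason about what to do next
Action: the tool call to make, e.g. get_weather("San Francisco")
Observation: the result returned by the tool
... (repeat Thought/Action/Observation as many times as needed)
Final Answer: the answer to the original question

Question: {question}
"""
```

The agent code then loops: it lets the LLM generate until an Action line appears, executes that tool, appends the result as an Observation, and repeats until the LLM emits a Final Answer.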
Tools are external interfaces (e.g., APIs, vector stores, ML models) that agents use to interact with the external world and achieve their objectives. LLMs are trained to recognize when to call a tool and to extract its arguments, but the agent's code must execute the function; tools are typically described to the LLM via a JSON schema.
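As an illustration, a weather tool might be described with a schema like the one below (an OpenAI-style function format; field names and exact shape differ between LLM providers):

```python
# Illustrative JSON tool schema for a weather tool. The LLM sees this schema
# and, when appropriate, emits a structured call such as
# {"name": "get_weather", "arguments": {"city": "San Francisco"}};
# the agent's own code is still responsible for executing the function.
weather_tool_schema = {
    "name": "get_weather",
    "description": "Get the current weather for a given city.",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {
                "type": "string",
                "description": "City name, e.g. 'San Francisco'",
            }
        },
        "required": ["city"],
    },
}
```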
Memory allows agents to store and recall past conversations, enabling learning from interactions.
Short-term memory deals with information from a single conversation.
Long-term memory stores and retrieves information over multiple conversations, facilitating personalization. This lab implements short-term memory.
A user query (e.g., "What's the weather in San Francisco today?") is sent to the agent's LLM, which has access to tools (like a weather API) and memory.
The LLM decides which tool is most suitable (e.g., weather API), extracts arguments (e.g., "San Francisco"), and the agent's code executes the tool.
The tool's response is forwarded back to the LLM, which then decides if more information is needed or if it has the final answer, generating a natural language response for the user if complete.
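The sketch below shows that loop with a hypothetical `llm.chat` client method that returns either a structured tool call or a final text answer; the real call shape depends on the provider's SDK, and the weather function is a stand-in for a real API:

```python
# Sketch of the agent loop: the LLM picks a tool and its arguments,
# the agent code executes it, and the result is fed back to the LLM
# until it produces a final natural-language answer.
WEATHER_TOOL = {
    "name": "get_weather",
    "description": "Get the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

def get_weather(city: str) -> str:
    return f"Sunny and 18°C in {city}"  # placeholder for a real weather API call

TOOLS = {"get_weather": get_weather}

def run_agent(llm, user_query: str, max_steps: int = 5) -> str:
    messages = [{"role": "user", "content": user_query}]
    for _ in range(max_steps):
        reply = llm.chat(messages, tools=[WEATHER_TOOL])  # hypothetical client method
        if reply.tool_call is None:
            return reply.text  # the LLM has the final answer
        # The LLM chose a tool and extracted its arguments; the agent executes it.
        result = TOOLS[reply.tool_call.name](**reply.tool_call.arguments)
        messages.append({"role": "tool", "name": reply.tool_call.name, "content": result})
    return "Stopped after reaching the step limit."
```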
Multimodality in AI refers to machine learning models' ability to process, understand, and generate different data types like text, images, audio, and video.
Real-world data often combines images and text, such as graphs, tables, research papers, financial reports, and healthcare documents.
Two classes of multimodal machine learning models exist:
Multimodal embedding models take various data types as input and generate embeddings for unified search and retrieval.
Multimodal LLMs (e.g., Google DeepMind's Gemini, Anthropic's Claude, OpenAI's GPT models) can take diverse data types as input and generate outputs in different formats.
A multimodal agent is formed when a multimodal LLM is given tools to search multimodal data and uses its reasoning capabilities to solve complex problems.
Building a Multimodal Agent: Data Preparation and Architecture 18:23
The workshop's agent aims to answer questions about a large corpus of documents and explain charts/diagrams, even when documents have mixed modalities (text, images, tables).
Retrieving information from mixed-modality documents is challenging because traditional text-chunking methods don't work for images and tables.
Previous techniques for mixed-modality data involved extracting elements, chunking text, summarizing images/tables, and then embedding everything into the text domain, or embedding all elements using a multimodal embedding model.
Limitations of these processing pipelines include context loss at chunk boundaries and increased complexity due to multiple processing steps (object recognition, summarization).
Older multimodal embedding models (e.g., CLIP) suffered from a "modality gap" where text and images were processed through separate networks, leading to irrelevant items of the same modality appearing closer in vector space than relevant items of different modalities.
Vision Language Model (VLM) based architectures overcome this by vectorizing both modalities using the same encoder, creating a unified representation and preserving contextual relationships between text and visual data.
The data preparation pipeline for the agent converts each document page into a screenshot, stores the screenshots locally (or in blob storage), and then stores their embeddings (generated by a VLM-based embedding model such as Voyage multimodal 3) along with path references as metadata in a vector database (MongoDB in this lab); the raw screenshots themselves are not stored in the vector database.
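A hedged sketch of that pipeline, assuming PyMuPDF for page rendering, the voyageai Python client for Voyage multimodal 3 embeddings, and pymongo for storage (connection string, database, and collection names are illustrative):

```python
# Sketch of the ingestion pipeline: render each PDF page to a screenshot,
# embed it with a multimodal embedding model, and store the embedding plus
# the screenshot's path (not the image itself) in MongoDB.
import os

import fitz  # PyMuPDF
import voyageai
from PIL import Image
from pymongo import MongoClient

vo = voyageai.Client()  # reads VOYAGE_API_KEY from the environment
collection = MongoClient("<atlas-connection-string>")["agent_db"]["pages"]  # illustrative names

def ingest_pdf(pdf_path: str, screenshot_dir: str = "screenshots") -> None:
    os.makedirs(screenshot_dir, exist_ok=True)
    for page_number, page in enumerate(fitz.open(pdf_path)):
        # 1. Render the page as a screenshot and save it locally.
        image_path = os.path.join(screenshot_dir, f"page_{page_number}.png")
        page.get_pixmap(dpi=150).save(image_path)

        # 2. Embed the screenshot (method name and shape per Voyage AI's
        #    documented Python client; verify against the current SDK).
        embedding = vo.multimodal_embed(
            inputs=[[Image.open(image_path)]],
            model="voyage-multimodal-3",
            input_type="document",
        ).embeddings[0]

        # 3. Store the embedding and a path reference, not the raw image.
        collection.insert_one({"embedding": embedding, "screenshot_path": image_path})
```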
Multimodal Agent Workflow and Memory Management 31:40
The agent's workflow involves a query being forwarded to a multimodal LLM (e.g., Gemini 2.0 Flash experimental), which has access to a vector search tool and memory.
Based on the query, the LLM may call the vector search tool, which returns references to screenshots (not the raw images).
The agent then retrieves the actual screenshots from local/blob storage using these references.
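A sketch of that vector search tool against MongoDB Atlas Vector Search follows; the index name and fields are illustrative and must match the Atlas index definition, and the Voyage and MongoDB clients are set up as in the ingestion sketch above:

```python
# Sketch of the vector search tool: embed the query, run an Atlas Vector Search
# aggregation, return screenshot *paths*, then load the images from disk.
import voyageai
from PIL import Image
from pymongo import MongoClient

vo = voyageai.Client()
collection = MongoClient("<atlas-connection-string>")["agent_db"]["pages"]  # illustrative names

def vector_search_tool(query: str, top_k: int = 2) -> list[str]:
    query_embedding = vo.multimodal_embed(
        inputs=[[query]], model="voyage-multimodal-3", input_type="query"
    ).embeddings[0]
    pipeline = [
        {
            "$vectorSearch": {
                "index": "vector_index",  # illustrative Atlas index name
                "path": "embedding",
                "queryVector": query_embedding,
                "numCandidates": 150,
                "limit": top_k,
            }
        },
        {"$project": {"_id": 0, "screenshot_path": 1}},
    ]
    return [doc["screenshot_path"] for doc in collection.aggregate(pipeline)]

def load_screenshots(paths: list[str]) -> list[Image.Image]:
    return [Image.open(p) for p in paths]
```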
These images, along with the original query and past conversational history, are passed to the multimodal LLM for final answer generation.
If a query doesn't require a tool (e.g., "summarize this image"), the LLM generates the answer directly.
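A hedged sketch of the final generation step with the google-genai SDK, which accepts PIL images alongside text in `contents` per its documentation; the prompt assembly and history format here are illustrative:

```python
# Sketch of final answer generation: pass past history, retrieved screenshots,
# and the current query to a multimodal LLM (Gemini 2.0 Flash experimental).
from google import genai
from PIL import Image

client = genai.Client()  # or genai.Client(api_key=...); reads the API key from the environment

def generate_answer(query: str, images: list[Image.Image], history: list[str]) -> str:
    contents = []
    if history:
        contents.append("Conversation so far:\n" + "\n".join(history))
    contents.extend(images)  # screenshots retrieved by the agent, if any
    contents.append(query)
    response = client.models.generate_content(
        model="gemini-2.0-flash-exp", contents=contents
    )
    return response.text
```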
For short-term memory, each user query is associated with a session ID, and the agent queries a database for past chat history for that session. This history is passed to the LLM as additional context.
After the LLM generates a response, the current response and query are added back to the database to update the session's history. At a minimum, the LLM's response and user queries should be logged.
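A minimal sketch of this session-scoped memory with pymongo (collection name and document shape are illustrative):

```python
# Sketch of short-term memory: messages are keyed by session_id so that a
# session's history can be retrieved and passed to the LLM as extra context.
from datetime import datetime, timezone
from pymongo import MongoClient

history = MongoClient("<atlas-connection-string>")["agent_db"]["chat_history"]  # illustrative names

def get_history(session_id: str, limit: int = 20) -> list[dict]:
    cursor = history.find({"session_id": session_id}).sort("timestamp", 1).limit(limit)
    return [{"role": doc["role"], "content": doc["content"]} for doc in cursor]

def log_message(session_id: str, role: str, content: str) -> None:
    history.insert_one(
        {
            "session_id": session_id,
            "role": role,  # "user" or "assistant"
            "content": content,
            "timestamp": datetime.now(timezone.utc),
        }
    )
```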
Participants can access the hands-on lab via a provided GitHub URL, which contains a readme for setup (approx. 10 minutes).
Two notebooks are available: lab.ipynb, for writing the code yourself with inline documentation, and solutions.ipynb, with pre-filled code for reference or to run through.