Building Multimodal AI Agents From Scratch — Apoorva Joshi, MongoDB

Introduction and Workshop Overview 01:49

  • The workshop aims to teach participants about AI agents and multimodality, culminating in building a multimodal agent from scratch using Python.
  • The lead instructor, Apoorva Joshi, is an AI-focused developer advocate at MongoDB with six years of prior experience as a data scientist in cybersecurity.
  • The session runs approximately 1 hour and 20 minutes, of which roughly 55–60 minutes are dedicated to hands-on coding and walking through the code.

Understanding AI Agents 02:33

  • An AI agent is defined as a system that uses a Large Language Model (LLM) to reason through a problem, create a plan, and execute/iterate on that plan with tools.
  • Three main paradigms for interacting with LLMs are simple prompting, Retrieval Augmented Generation (RAG), and agents.
  • Simple prompting relies on the LLM's pre-trained knowledge, limiting its ability to answer questions outside its parametric knowledge, handle complex queries, or provide personalized responses.
  • RAG augments the LLM's knowledge with external data sources, addressing some limitations but still not equipping the LLM for complex multi-step tasks or self-refinement.
  • AI agents are suitable for complex, multi-step tasks, deep personalization, and adaptive learning, as they give the LLM agency to determine action sequences using tools and reasoning.
  • Agents come with higher cost and latency because the LLM does far more processing, so they should be reserved for tasks that genuinely need them and can tolerate higher latency or non-deterministic outputs.

Components of AI Agents 07:10

  • An agent typically has four main components: Perception, Planning and Reasoning, Tools, and Memory.
  • Perception is how agents gather information from their environment through inputs like text, images, voice, or video; this workshop focuses on text and images.
  • Planning and Reasoning is handled by LLMs, which determine how to solve a problem with guidance from prompts.
    • Planning without feedback (e.g., Chain of Thought) involves prompting the LLM to think step-by-step without modifying its initial plan based on tool outcomes.
    • Planning with feedback (e.g., ReAct, short for Reasoning and Acting) involves prompting the LLM to generate interleaved reasoning traces and actions, making an observation after each action to inform the next step until a final answer is reached.
  • Tools are external interfaces (e.g., APIs, vector stores, ML models) that agents use to interact with the external world and achieve objectives. LLMs are trained to decide when to call a tool and with which arguments, but the agent's code must actually execute the function; tools are typically described to the LLM via a JSON schema (a sample schema is sketched after this list).
  • Memory allows agents to store and recall past conversations, enabling learning from interactions.
    • Short-term memory deals with information from a single conversation.
    • Long-term memory stores and retrieves information over multiple conversations, facilitating personalization. This lab implements short-term memory.
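
As a concrete illustration of the JSON tool schema mentioned above, here is a minimal sketch for a weather-lookup tool (the example used in the workflow section below). The name, description, and parameter fields are illustrative and follow the common JSON-Schema-style function-calling format; this is not the workshop's actual code.

```python
# Illustrative JSON-Schema-style tool description for a weather lookup.
# The LLM uses this to decide when to call the tool and which arguments to extract;
# the agent's own code is still responsible for executing the function.
get_weather_tool = {
    "name": "get_weather",
    "description": "Get the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {
                "type": "string",
                "description": "City name, e.g. 'San Francisco'",
            },
        },
        "required": ["city"],
    },
}
```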

Example Agent Workflow 14:13

  • A user query (e.g., "What's the weather in San Francisco today?") is sent to the agent's LLM, which has access to tools (like a weather API) and memory.
  • The LLM decides which tool is most suitable (e.g., weather API), extracts arguments (e.g., "San Francisco"), and the agent's code executes the tool.
  • The tool's response is forwarded back to the LLM, which then decides whether it needs more information or has the final answer, generating a natural language response for the user once it does (a minimal sketch of this loop follows).
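
A minimal sketch of this decide-execute-observe loop in Python. `call_llm` is a hypothetical stand-in for whatever chat-completion client is used (assumed to return either a tool-call dict or a final answer), and `get_weather` is a toy tool; neither is taken from the workshop code.

```python
import json

def get_weather(city: str) -> str:
    """Toy stand-in tool; a real agent would call a weather API here."""
    return json.dumps({"city": city, "forecast": "sunny", "temp_f": 68})

TOOLS = {"get_weather": get_weather}   # name -> callable, mirroring the tool schema above

def run_agent(query: str, call_llm, max_steps: int = 5) -> str:
    """Generic agent loop: ask the LLM, run any tool it requests, feed the result back, repeat."""
    messages = [{"role": "user", "content": query}]
    for _ in range(max_steps):
        decision = call_llm(messages)   # assumed to also be given the tool schemas
        if "tool" in decision:          # the model asked for a tool call with extracted arguments
            result = TOOLS[decision["tool"]](**decision["args"])
            messages.append({"role": "tool", "content": result})   # observation fed back to the LLM
        else:                           # the model produced its final natural language answer
            return decision["answer"]
    return "Stopped: step limit reached."
```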

Multimodality in AI 16:01

  • Multimodality in AI refers to machine learning models' ability to process, understand, and generate different data types like text, images, audio, and video.
  • Real-world data often combines images and text, such as graphs, tables, research papers, financial reports, and healthcare documents.
  • Two classes of multimodal machine learning models exist:
    • Multimodal embedding models take various data types as input and embed them into a shared vector space, enabling unified search and retrieval (a small similarity sketch follows after this list).
    • Multimodal LLMs (e.g., Google DeepMind's Gemini, Anthropic's Claude, OpenAI's GPT models) can take diverse data types as input and generate outputs in different formats.
  • A multimodal agent is formed when a multimodal LLM is given tools to search multimodal data and uses its reasoning capabilities to solve complex problems.
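
A small sketch of why a shared embedding space enables unified retrieval, as noted in the first bullet above. `embed` is a hypothetical stand-in for a multimodal embedding client that maps both text and images into the same vector space; the ranking itself is plain cosine similarity.

```python
import numpy as np

def cosine_similarity(a, b) -> float:
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_mixed_corpus(query: str, items: list, embed) -> list:
    """Rank a mixed list of images and text snippets against a text query.

    Works only because `embed` (hypothetical) places every modality in one space.
    """
    query_vec = embed(query)
    scored = [(cosine_similarity(query_vec, embed(item)), item) for item in items]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

# Hypothetical usage: a chart screenshot and a text snippet ranked against one query.
# rank_mixed_corpus("revenue growth in Q3", [Image.open("chart.png"), "Q3 revenue rose 12%"], embed)
```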

Building a Multimodal Agent: Data Preparation and Architecture 18:23

  • The workshop's agent aims to answer questions about a large corpus of documents and explain charts/diagrams, even when documents have mixed modalities (text, images, tables).
  • Retrieving information from mixed-modality documents is challenging because traditional text-chunking methods don't work for images and tables.
  • Previous techniques for mixed-modality data involved extracting elements, chunking text, summarizing images/tables, and then embedding everything into the text domain, or embedding all elements using a multimodal embedding model.
  • Limitations of these processing pipelines include context loss at chunk boundaries and increased complexity due to multiple processing steps (object recognition, summarization).
  • Older multimodal embedding models (e.g., CLIP) suffered from a "modality gap" where text and images were processed through separate networks, leading to irrelevant items of the same modality appearing closer in vector space than relevant items of different modalities.
  • Vision Language Model (VLM) based architectures overcome this by vectorizing both modalities using the same encoder, creating a unified representation and preserving contextual relationships between text and visual data.
  • The data preparation pipeline for the agent involves converting each document page into a screenshot, storing these screenshots locally (or in blob storage), and then storing their embeddings (generated by a VLM-based embedding model such as Voyage's voyage-multimodal-3) and path references as metadata in a vector database (MongoDB in this lab); raw screenshots are not stored in the vector database (a sketch of this pipeline follows).
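
A sketch of this ingestion pipeline under some assumptions: pdf2image for rendering pages, pymongo for the vector database, and a hypothetical `embed_page` callable standing in for the multimodal embedding client (e.g., voyage-multimodal-3). Database, collection, and field names are illustrative, not the lab's exact ones.

```python
from pathlib import Path

from pdf2image import convert_from_path   # renders each PDF page as a PIL image
from pymongo import MongoClient

client = MongoClient("mongodb+srv://<user>:<password>@<cluster>/")  # placeholder URI
collection = client["multimodal_agent"]["pages"]

def ingest_pdf(pdf_path: str, screenshot_dir: str, embed_page) -> None:
    """Screenshot every page, embed it, and store only the vector plus a path reference."""
    Path(screenshot_dir).mkdir(parents=True, exist_ok=True)
    for page_num, page_image in enumerate(convert_from_path(pdf_path), start=1):
        # 1. Save the screenshot locally (or push it to blob storage) -- not to MongoDB.
        image_path = f"{screenshot_dir}/{Path(pdf_path).stem}_page{page_num}.png"
        page_image.save(image_path)
        # 2. Store the embedding and the path reference as metadata in the vector database.
        collection.insert_one({
            "pdf": pdf_path,
            "page": page_num,
            "image_path": image_path,             # used later to fetch the raw screenshot
            "embedding": embed_page(page_image),  # vector from the multimodal embedding model
        })
```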

Multimodal Agent Workflow and Memory Management 31:40

  • The agent's workflow involves a query being forwarded to a multimodal LLM (e.g., Gemini 2.0 Flash experimental), which has access to a vector search tool and memory.
  • Based on the query, the LLM may call the vector search tool, which returns references to screenshots (not the raw images).
  • The agent then retrieves the actual screenshots from local/blob storage using these references.
  • These images, along with the original query and past conversational history, are passed to the multimodal LLM for final answer generation.
  • If a query doesn't require a tool (e.g., "summarize this image"), the LLM generates the answer directly.
  • For short-term memory, each user query is associated with a session ID, and the agent queries a database for past chat history for that session. This history is passed to the LLM as additional context.
  • After the LLM generates a response, the current query and response are added back to the database to update the session's history; at a minimum, the user queries and the LLM's responses should be logged (a sketch of this retrieval-and-memory flow follows).
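
A sketch of the retrieval and short-term-memory flow under some assumptions: pymongo with a MongoDB Atlas Vector Search index named `vector_index` on an `embedding` field, plus hypothetical `embed_query` and `generate_answer` callables standing in for the embedding model and the multimodal LLM (e.g., Gemini 2.0 Flash). In the real agent the LLM decides whether to call the search tool; this sketch hard-wires the call for brevity, and names are not the lab's exact ones.

```python
from datetime import datetime, timezone

from PIL import Image
from pymongo import MongoClient

client = MongoClient("mongodb+srv://<user>:<password>@<cluster>/")  # placeholder URI
db = client["multimodal_agent"]
pages, history = db["pages"], db["chat_history"]

def vector_search(query: str, embed_query, k: int = 3) -> list:
    """Return screenshot path references (not raw images) for the top-k pages."""
    results = pages.aggregate([{
        "$vectorSearch": {                       # Atlas Vector Search aggregation stage
            "index": "vector_index",
            "path": "embedding",
            "queryVector": embed_query(query),   # list of floats from the embedding model
            "numCandidates": 100,
            "limit": k,
        }
    }])
    return [doc["image_path"] for doc in results]

def answer(session_id: str, query: str, embed_query, generate_answer) -> str:
    # 1. Pull past turns for this session (short-term memory).
    past_turns = list(history.find({"session_id": session_id}).sort("timestamp", 1))
    # 2. The search tool returns references; load the actual screenshots from local storage.
    images = [Image.open(path) for path in vector_search(query, embed_query)]
    # 3. Query, retrieved images, and chat history go to the multimodal LLM.
    response = generate_answer(query=query, images=images, history=past_turns)
    # 4. Log at least the user query and the LLM's response back to the session.
    now = datetime.now(timezone.utc)
    history.insert_many([
        {"session_id": session_id, "role": "user", "content": query, "timestamp": now},
        {"session_id": session_id, "role": "assistant", "content": response, "timestamp": now},
    ])
    return response
```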

Hands-on Lab Instructions 35:21

  • Participants can access the hands-on lab via a provided GitHub URL, which contains a README with setup instructions (approximately 10 minutes).
  • Two notebooks are available: lab.ipynb, for writing the code yourself with inline documentation, and solutions.ipynb, with pre-filled code for reference or to run end-to-end.