Foundry Local: Cutting-Edge AI experiences on device with ONNX Runtime/Olive — Emma Ning, Microsoft

Introduction to Foundry Local 00:20

  • Foundry Local enables developers to easily build cross-platform applications powered by on-device AI.
  • Local AI is crucial in low-bandwidth and offline scenarios, ensuring applications keep working without an internet connection.
  • Privacy and security are key drivers, as sensitive data like legal documents and patient information can be processed entirely on-device.
  • Cost efficiency is a major benefit, as millions of daily inference calls for applications like games are unsustainable with cloud AI.
  • Real-time responsiveness is critical for many AI applications and is hard to achieve when every request waits on a round trip to the cloud.

The Rise of Local AI 02:27

  • Decades of progress in computing hardware have made client devices powerful enough to run advanced AI models with modern GPUs and NPUs.
  • Model companies are publishing leaner, faster, more optimized models suited to local inference, such as the Phi family and distilled DeepSeek variants.
  • State-of-the-art optimization techniques at the runtime level are making local AI a practical reality.

Foundry Local's Foundation 03:19

  • Foundry Local leverages existing Microsoft assets like Azure AI Foundry, which is trusted by over 70,000 organizations and offers over 1,900 models.
  • ONNX Runtime, a cross-platform high-performance on-device inference engine, sees over 10 million downloads per month and provides significant performance acceleration.
  • The massive scale and reach of Windows on client devices are crucial for democratizing AI.
  • Foundry Local is an optimized end-to-end solution for seamless on-device AI, using ONNX Runtime for performance acceleration across various hardware.
  • It includes a new Foundry Local management service to host and manage models on client devices, connecting to Azure AI Foundry for on-demand open-source model downloads.
  • A Foundry Local CLI allows easy exploration of models on device, and SDKs are provided for developers to integrate it into their applications.
  • Announced at the Microsoft Build conference, Foundry Local is available on Windows and macOS, with deep integration into the Windows platform for AI developers.
  • Microsoft collaborates closely with hardware vendors like Nvidia, Intel, AMD, and Qualcomm to integrate their accelerators for best-in-class performance.
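
A minimal sketch of checking on that management service through the CLI (assuming the documented `foundry service` subcommands; exact output and flags may differ):

```
# Check whether the Foundry Local management service is running
foundry service status

# Start the service that hosts and manages models on the device
foundry service start
```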

Customer Experiences 06:19

  • Over 100 customers joined a private preview before the official announcement, providing valuable feedback on ease of use and performance.
  • Pieces, a company building artificial long-term memory for developers, found Foundry Local solved frustrations with versioning, performance, and end-user experience for edge AI.
  • Pieces reported noticeable improvements in memory management, time to first token, and tokens per second after adopting Foundry Local.
  • Another customer noted that Foundry Local is a perfect fit for scenarios requiring local processing of sensitive data, making it easy to run GenAI models locally.
  • The simplicity of installation and ease of using models were highlighted, enabling hybrid setups where part of the AI workload runs locally.

CLI and Model Demonstrations 08:47

  • Foundry Local can be installed using winget on Windows and Homebrew on macOS (see the command sketch after this list).
  • The foundry model list command shows the supported generative AI models, with variants optimized for CPU, CUDA, integrated GPU, and NPU.
  • Models can be pre-downloaded, or Foundry Local will download them from the cloud if needed.
  • Running the Qwen 2.5 1.5B model demonstrated quick loading and an inference speed of around 90 tokens per second.
  • Running the Phi-4 Mini model showed a larger, more capable model: slightly slower than the Qwen model, but producing more detailed answers.
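
A sketch of the CLI workflow described above, using the package identifiers and model aliases from the public Foundry Local documentation (these are assumptions; the talk's exact commands may differ):

```
# Install Foundry Local
winget install Microsoft.FoundryLocal                          # Windows
brew tap microsoft/foundrylocal && brew install foundrylocal   # macOS

# Browse the model catalog with its hardware-specific variants
foundry model list

# Optionally pre-download a model; `foundry model run` downloads on demand
foundry model download qwen2.5-1.5b

# Load the model and chat with it interactively in the terminal
foundry model run qwen2.5-1.5b
```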

Building a Cross-Platform Local AI Application 13:00

  • A use case for Foundry Local is building a cross-platform application to summarize long internal project documents, addressing privacy concerns and supporting mixed OS teams.
  • A demo application was shown on Windows, allowing users to input a URL or a local file for summarization and to select the model (e.g., Phi-4 Mini for detailed summaries).
  • The application successfully summarized a project document, highlighting Foundry Local's utility for building cross-platform on-device AI applications.
  • The application code uses either the Python or the JavaScript SDK, initializing a FoundryLocalManager with a model name and sending chat-completion requests to the local endpoint it exposes (see the Python sketch after this list).
  • The same application was shown running on macOS, demonstrating its cross-platform compatibility with identical UI and functionality.
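
A minimal Python sketch of that pattern, following the published foundry-local-sdk quickstart (the model alias and prompt are placeholders; the JavaScript SDK follows the same shape):

```python
# pip install foundry-local-sdk openai
import openai
from foundry_local import FoundryLocalManager

alias = "phi-4-mini"  # model alias as listed by `foundry model list` (assumed)

# Starts the Foundry Local service if needed, then downloads/loads the model
manager = FoundryLocalManager(alias)

# Foundry Local exposes an OpenAI-compatible endpoint, so the standard
# OpenAI client can talk to the local model
client = openai.OpenAI(base_url=manager.endpoint, api_key=manager.api_key)

response = client.chat.completions.create(
    model=manager.get_model_info(alias).id,
    messages=[{"role": "user", "content": "Summarize this project document: ..."}],
)
print(response.choices[0].message.content)
```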

Exploring Local AI Agents 17:58

  • Foundry Local enables creating, building, and running local agents that combine local models with MCP (Model Context Protocol) servers, a feature currently in private preview.
  • The foundry agent list command shows available sample agents, and foundry agent info provides details about a specific agent.
  • An agent in Foundry Local consists of one model and one or more MCP servers.
  • An OCR agent demo was presented that extracts text from images on the local device, using the Phi-4 Mini model, a file-system MCP server, and an OCR Mini MCP server.
  • The agent successfully executed a task to find a receipt, process it, and extract the total amount, intelligently chaining the file-search and OCR tools.
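
A sketch of the agent commands the talk names (the feature is in private preview, so only the commands mentioned above are shown; the agent name is a hypothetical placeholder):

```
# List the sample agents available on the device
foundry agent list

# Show an agent's model and its MCP servers (agent name is hypothetical)
foundry agent info ocr-agent
```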

Summary and Future Potential 21:54

  • Foundry Local empowers developers to build applications powered by local AI.
  • While local models may not be as capable as large cloud models, they unlock significant potential for on-device AI solutions.
  • Users interested in the agent feature can sign up for the private preview.