Foundry Local: Cutting-Edge AI experiences on device with ONNX Runtime/Olive — Emma Ning, Microsoft

Introduction to Foundry Local 00:20

  • Foundry Local enables developers to easily build cross-platform applications powered by on-device AI.
  • Local AI is crucial in low-bandwidth and offline scenarios, ensuring applications keep working without an internet connection.
  • Privacy and security are key drivers, as sensitive data like legal documents and patient information can be processed entirely on-device.
  • Cost efficiency is a major benefit, as millions of daily inference calls for applications like games are unsustainable with cloud AI.
  • Real-time responsiveness is critical for many AI applications and is hard to achieve when every request waits on a round trip to the cloud.

The Rise of Local AI 02:27

  • Decades of progress in computing hardware have made client devices powerful enough to run advanced AI models with modern GPUs and NPUs.
  • Model companies are publishing leaner, faster, more optimized models suited to local inference, such as the Phi family and distilled DeepSeek variants.
  • State-of-the-art optimization techniques at the runtime level are making local AI a practical reality.

Foundry Local's Foundation 03:19

  • Foundry Local leverages existing Microsoft assets like Azure AI Foundry, which is trusted by over 70,000 organizations and offers over 1,900 models.
  • ONNX Runtime, a cross-platform high-performance on-device inference engine, sees over 10 million downloads per month and provides significant performance acceleration.
  • The massive scale and reach of Windows on client devices are crucial for democratizing AI.
  • Foundry Local is an optimized end-to-end solution for seamless on-device AI, using ONNX Runtime for performance acceleration across various hardware.
  • It includes a new Foundry Local management service to host and manage models on client devices, connecting to Azure AI Foundry for on-demand open-source model downloads.
  • A Foundry Local CLI allows easy exploration of models on device, and SDKs are provided for developers to integrate it into their applications.
  • Announced at the Microsoft Build conference, Foundry Local is available on Windows and macOS, with deep integration into the Windows platform for AI developers.
  • Microsoft collaborates closely with hardware vendors like Nvidia, Intel, AMD, and Qualcomm to integrate their accelerators for best-in-class performance.
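
A minimal sketch of checking on that management service through the CLI (assuming the documented `foundry service` subcommands; exact output and flags may differ):

```
# Check whether the Foundry Local management service is running
foundry service status

# Start the service that hosts and manages models on the device
foundry service start
```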

Customer Experiences 06:19

  • Over 100 customers joined a private preview before the official announcement, providing valuable feedback on ease of use and performance.
  • Pieces, a company building artificial long-term memory for developers, found Foundry Local solved frustrations with versioning, performance, and end-user experience for edge AI.
  • Pieces reported noticeable improvements in memory management, time to first token, and tokens per second after adopting Foundry Local.
  • Another customer noted that Foundry Local is a perfect fit for scenarios requiring local processing of sensitive data, making it easy to run GenAI models locally.
  • The simplicity of installation and ease of using models were highlighted, enabling hybrid setups where part of the AI workload runs locally.

CLI and Model Demonstrations 08:47

  • Foundry Local can be installed using winget on Windows and Homebrew on macOS (see the command sketch after this list).
  • The foundry model list command shows the supported generative AI models, with variants optimized for CPU, CUDA, integrated GPU, and NPU.
  • Models can be pre-downloaded, or Foundry Local will download them from the cloud if needed.
  • Running the Qwen 2.5 1.5B model demonstrated quick loading and an inference speed of around 90 tokens per second.
  • Running the Phi-4 Mini model showed a larger, more capable model: slightly slower than the Qwen model, but producing more detailed answers.
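
A sketch of the CLI workflow described above, using the package identifiers and model aliases from the public Foundry Local documentation (these are assumptions; the talk's exact commands may differ):

```
# Install Foundry Local
winget install Microsoft.FoundryLocal                          # Windows
brew tap microsoft/foundrylocal && brew install foundrylocal   # macOS

# Browse the model catalog with its hardware-specific variants
foundry model list

# Optionally pre-download a model; `foundry model run` downloads on demand
foundry model download qwen2.5-1.5b

# Load the model and chat with it interactively in the terminal
foundry model run qwen2.5-1.5b
```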

Building a Cross-Platform Local AI Application 13:00

  • A use case for Foundry Local is building a cross-platform application to summarize long internal project documents, addressing privacy concerns and supporting mixed OS teams.
  • A demo application was shown on Windows, allowing users to input a URL or a local file for summarization and to select the model (e.g., Phi-4 Mini for detailed summaries).
  • The application successfully summarized a project document, highlighting Foundry Local's utility for building cross-platform on-device AI applications.
  • The application code uses either the Python or the JavaScript SDK, initializing a FoundryLocalManager with a model name and sending chat-completion requests to the local endpoint it exposes (see the Python sketch after this list).
  • The same application was shown running on macOS, demonstrating its cross-platform compatibility with identical UI and functionality.
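
A minimal Python sketch of that pattern, following the published foundry-local-sdk quickstart (the model alias and prompt are placeholders; the JavaScript SDK follows the same shape):

```python
# pip install foundry-local-sdk openai
import openai
from foundry_local import FoundryLocalManager

alias = "phi-4-mini"  # model alias as listed by `foundry model list` (assumed)

# Starts the Foundry Local service if needed, then downloads/loads the model
manager = FoundryLocalManager(alias)

# Foundry Local exposes an OpenAI-compatible endpoint, so the standard
# OpenAI client can talk to the local model
client = openai.OpenAI(base_url=manager.endpoint, api_key=manager.api_key)

response = client.chat.completions.create(
    model=manager.get_model_info(alias).id,
    messages=[{"role": "user", "content": "Summarize this project document: ..."}],
)
print(response.choices[0].message.content)
```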

Exploring Local AI Agents 17:58

  • Foundry Local enables creating, building, and running local agents that combine local models with MCP (Model Context Protocol) servers, a feature currently in private preview.
  • The foundry agent list command shows available sample agents, and foundry agent info provides details about a specific agent.
  • An agent in Foundry Local consists of one model and one or more MCP servers.
  • An OCR agent demo was presented that extracts text from images on the local device, using the Phi-4 Mini model, a file-system MCP server, and an OCR Mini MCP server.
  • The agent successfully executed a task to find a receipt, process it, and extract the total amount, intelligently chaining the file-search and OCR tools.
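
A sketch of the agent commands the talk names (the feature is in private preview, so only the commands mentioned above are shown; the agent name is a hypothetical placeholder):

```
# List the sample agents available on the device
foundry agent list

# Show an agent's model and its MCP servers (agent name is hypothetical)
foundry agent info ocr-agent
```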

Summary and Future Potential 21:54

  • Foundry Local empowers developers to build applications powered by local AI.
  • While local models may not be as capable as large cloud models, they unlock significant potential for on-device AI solutions.
  • Users interested in the agent feature can sign up for the private preview.