SmolLM3 - A Local Agents Winner?

Introduction & Model Overview 00:00

  • The video introduces SmolLM3, a new 3B parameter model released by Hugging Face, with base, instruct, and ONNX versions available.
  • SmolLM3 targets agentic use cases, including function calling and running local agents, aiming to reduce reliance on proprietary models.
  • The 3B size is highlighted as suitable for many mobile devices, outperforming previous models like Qwen 2.5 3B and Llama 3.2 3B.

Model Features and Architecture 01:09

  • SmolLM3 was trained on 11 trillion tokens, which is unusually large for the 3B parameter segment.
  • Hugging Face claims state-of-the-art results among 3B models and competitiveness with some 4B models.
  • The model supports a long context window of up to 128K tokens (potentially up to 256K).
  • Noted as "multilingual" with support for six European languages.
  • The architecture draws from Llama 3 and includes features like grouped-query attention (GQA) and NoPE (no positional embeddings in a subset of layers) for position handling.
  • Includes architectural insights from other models, e.g., removing weight decay from embeddings for training stability, inspired by OLMo 2.
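The grouped-query attention mentioned above can be sketched in a few lines: several query heads share a smaller set of key/value heads, shrinking the KV cache. This is a toy pure-Python illustration with made-up head counts and dimensions, not SmolLM3's actual configuration.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def grouped_query_attention(q, k, v, n_q_heads, n_kv_heads):
    """Toy GQA: q is [head][pos][dim]; k and v have only n_kv_heads
    heads, each shared by a group of query heads."""
    group = n_q_heads // n_kv_heads
    d = len(q[0][0])
    out = []
    for h in range(n_q_heads):
        kv = h // group  # query head h reads from this shared KV head
        head_out = []
        for qi in q[h]:
            scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d)
                      for kj in k[kv]]
            w = softmax(scores)
            head_out.append([sum(wi * vj[dim] for wi, vj in zip(w, v[kv]))
                             for dim in range(d)])
        out.append(head_out)
    return out

# 4 query heads sharing 2 KV heads (illustrative sizes only)
q = [[[1.0, 0.0], [0.0, 1.0]] for _ in range(4)]
k = [[[1.0, 0.0], [0.0, 1.0]] for _ in range(2)]
v = [[[1.0, 2.0], [3.0, 4.0]] for _ in range(2)]
out = grouped_query_attention(q, k, v, 4, 2)
print(len(out), len(out[0]), len(out[0][0]))  # 4 2 2
```

The payoff is memory, not math: with 4 query heads and 2 KV heads, the KV cache is half the size of full multi-head attention.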

Training Process and Blueprint Release 03:02

  • Hugging Face released a comprehensive "blueprint" detailing all training phases, from data selection, distributed setup, long context handling, to post-training.
  • The pre-training used 384 H100 GPUs over 24 days (~220,000 GPU hours), suggesting training costs in the several-hundred-thousand-dollar range.
  • The model underwent a three-phase pre-training, initially web-heavy, then increasing code and math data in later phases.
  • Relies on DeepSeek R1 and Qwen 3 for generating synthetic reasoning traces.
  • Utilized a new alignment variant of DPO (Anchored Preference Optimization, APO) and employed model checkpoint merging to produce the final model.
  • Hugging Face published the datasets and training methodology openly, fostering transparency compared to proprietary labs.
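The compute figures above check out with simple arithmetic; the dollar estimate assumes a hypothetical ~$2 per H100-hour cloud rate, which is not stated in the video.

```python
# Back-of-envelope check of the reported pre-training compute.
gpus = 384          # H100s, per the blueprint
days = 24
gpu_hours = gpus * days * 24
print(gpu_hours)    # 221184 -> the ~220,000 GPU-hours cited

rate = 2.0          # assumed $/H100-hour; actual rates vary widely
cost = gpu_hours * rate
print(f"${cost:,.0f}")  # ~$442,368 -> "several hundred thousand dollars"
```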

Model Usage and Reasoning ("Thinking") 07:31

  • SmolLM3 is easy to use with Hugging Face Transformers, as well as with SGLang and vLLM.
  • Basic prompt control allows toggling the "thinking" process by adjusting the system prompt (e.g., appending a "/no_think" flag).
  • When "thinking" is enabled, answers are often prefixed by a detailed reasoning chain, though the reasoning is not partitioned into enumerated steps.
  • The chain-of-thought is generally lengthy but sometimes outputs empty "thinking" sections, especially for code generation.
  • Compared to outputs with "no thinking," answers with reasoning are more thorough and justified.
  • The reasoning abilities are considered strong for a model of this size, rivaling older, much larger models.
  • The model sometimes provides unexpected outputs, such as omitting "thinking" on certain prompts or over-generating justification.
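The system-prompt toggle described above can be sketched as follows. The "/no_think" flag follows the SmolLM3 model card convention; treat the exact string as an assumption for other checkpoints, and note the generation step is shown only in comments since it requires downloading the model.

```python
# Build a chat that enables or suppresses SmolLM3's extended thinking.
def build_messages(question, thinking=True):
    system = "You are a helpful assistant."
    if not thinking:
        system += "\n/no_think"  # suppresses the reasoning chain
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": question},
    ]

msgs = build_messages("Why is the sky blue?", thinking=False)
print(msgs[0]["content"])

# In a real run the messages go through the usual Transformers flow:
#   tok = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM3-3B")
#   inputs = tok.apply_chat_template(msgs, add_generation_prompt=True,
#                                    return_tensors="pt")
#   model.generate(inputs, max_new_tokens=512)
```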

Tool Use and Agentic Abilities 11:41

  • SmolLM3 supports function calling/tool use, relevant for building local AI agents.
  • Tested by defining tools (with schemas) and prompting the model to invoke them (e.g., weather lookup, web search).
  • The model generates accurate tool calls based on prompts and context.
  • For multiple tools present, the model sometimes correctly refrains from calling a tool when unnecessary, but not always—behavior depends on prompt and tool descriptions.
  • The model uses search tools to answer questions beyond its knowledge cutoff, seen as a positive trait for its intended applications.
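The tool-use loop tested in the video can be sketched like this. The schema uses the common OpenAI-style JSON accepted by Transformers chat templates; `get_weather` is a hypothetical tool, and the "model output" below is a hand-written stand-in, not a real SmolLM3 completion.

```python
import json

# Hypothetical tool definition, OpenAI-style JSON schema.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

def get_weather(city):
    return f"Sunny in {city}"  # stub implementation

# In a real run, the schemas are passed via
# tokenizer.apply_chat_template(messages, tools=tools, ...) and the
# model emits a JSON tool call; here we simulate that output.
model_output = '{"name": "get_weather", "arguments": {"city": "Berlin"}}'
call = json.loads(model_output)
if call["name"] == "get_weather":
    result = get_weather(**call["arguments"])
print(result)  # Sunny in Berlin
```

The tool result would then be appended as a "tool" role message and the model prompted again to produce the final answer.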

Overall Assessment and Openness 14:27

  • SmolLM3 is noted for its open release, including both the model weights and an extensive blueprint of its creation.
  • Further releases might include intermediate checkpoints for more granular user experimentation.
  • The model is already being converted for use in various runtime environments (e.g., GGUF, LM Studio).
  • The video closes with questions to viewers about their preferences for local versus proprietary models and notes increasing use cases for both.