OpenAI’s open source models are finally here

Introduction and Model Overview 00:00

  • OpenAI released two open-weight models: a 120 billion parameter (120B) model and a 20 billion parameter (20B) model.
  • The 20B model runs on most devices with a modern GPU, including some smartphones.
  • The 120B model runs on more powerful gaming hardware and cloud services.
  • Cerebras demonstrated the new model running at 3,000 tokens per second, about 30x faster than similar models from traditional providers.
  • These models exhibit intelligence close to o3 and o4-mini, though with variability across tasks.

Running the Models: Hardware & Performance 03:50

  • The 120B model is around 60GB in size; the 20B model is about 11GB.
  • Both are mixture-of-experts (MoE) models; each token activates only a fraction of the parameters—5.1 billion active parameters for the 120B model.
  • The 20B model runs smoothly on laptops and likely on phones; the 120B model can severely tax a laptop’s memory and is much better suited for desktops or the cloud.
  • Cloud services like T3 Chat offer fast generation: the 20B model is free to use, and the 120B model is available with a subscription.
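The mixture-of-experts idea behind those "active parameter" numbers can be sketched in a few lines. This is a toy illustration only — the expert count, top-k value, and dimensions below are made up for demonstration and are not the actual GPT-OSS configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy MoE layer: 8 experts, but each token is routed to only the top-2,
# so only 2/8 of the expert parameters do any work for that token.
NUM_EXPERTS, TOP_K, DIM = 8, 2, 16
experts = [rng.standard_normal((DIM, DIM)) for _ in range(NUM_EXPERTS)]
router = rng.standard_normal((DIM, NUM_EXPERTS))

def moe_forward(x):
    # Router scores the experts, and only the top-k are selected.
    scores = x @ router
    top = np.argsort(scores)[-TOP_K:]
    # Softmax over just the selected experts' scores.
    weights = np.exp(scores[top] - scores[top].max())
    weights /= weights.sum()
    # Weighted sum of the chosen experts' outputs; the other experts'
    # weights are never read for this token.
    out = sum(w * (x @ experts[i]) for w, i in zip(weights, top))
    return out, top

token = rng.standard_normal(DIM)
out, chosen = moe_forward(token)
print(f"experts used: {sorted(chosen.tolist())} of {NUM_EXPERTS}")
```

This is why a 120B-parameter model can respond with the latency of a much smaller one: the memory footprint is the full 120B, but the compute per token is closer to the 5.1B active slice.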

Unique Characteristics & Model Architecture 04:29

  • The mixture-of-experts setup leads to efficiency: only the necessary “experts” are activated for a given token.
  • The models use a specific bracket-bar syntax (the “Harmony” format) for tool calling and message formatting.
  • The reference model code is open sourced in Rust and Python.
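The bracket-bar syntax looks roughly like the sketch below. This is an illustrative rendering based on a reading of the published Harmony format — the exact special tokens and channel names should be checked against OpenAI's harmony spec rather than taken from here:

```python
def render_message(role, content, channel=None):
    """Render one chat message in a Harmony-style bracket-bar syntax.

    Assumed shape: <|start|>role[<|channel|>name]<|message|>content<|end|>
    (an approximation of the format, not an authoritative template).
    """
    header = role + (f"<|channel|>{channel}" if channel else "")
    return f"<|start|>{header}<|message|>{content}<|end|>"

convo = [
    render_message("system", "You are a helpful assistant."),
    render_message("user", "What is 2 + 2?"),
    # Reasoning output goes on a separate channel from the
    # user-visible final answer.
    render_message("assistant", "The user wants simple arithmetic.",
                   channel="analysis"),
    render_message("assistant", "2 + 2 = 4.", channel="final"),
]
print("\n".join(convo))
```

The channel distinction matters for tool calling: a provider that renders or parses these delimiters slightly differently will see tool calls fail in ways the model itself can't fix, which is one explanation for the provider-to-provider variability discussed below.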

Safety, Release Delay, and Licensing 09:22

  • Unlike OpenAI’s hosted models, these open weights ship with no server-side safety layer—raising concerns about content control and misuse.
  • The release was delayed to ensure safety, given that once weights are published they can’t be retracted.
  • The models are Apache 2.0 licensed, allowing unrestricted commercial and private use.

Coding, Tool Use, and Real-world Benchmarks 12:46

  • These are not the “Horizon” models (which are especially strong at coding), and they perform less impressively on some code generation tasks.
  • Tool-calling reliability varies widely across providers due to differences in how each one implements the custom “Harmony” chat format.
  • Error rates and edge cases (like missing required JSON fields) appear more frequently with the 20B model.
  • On niche benchmarks like “SnitchBench” (whether a model reports users to authorities) and “SkateBench” (knowledge of skateboard trick names), the GPT-OSS models perform better than many Chinese models but below the top proprietary models.

Benchmark Results & Comparison to Other Models 20:24

  • OpenAI claims the 120B model matches o3/o4-mini on reasoning benchmarks and the 20B model is comparable to o3-mini.
  • In specialized tests (e.g., HealthBench), the OSS 120B model outperforms o4-mini.
  • The 20B model offers performance comparable to o3-mini despite being much smaller.
  • Benchmarks by Artificial Analysis place the 120B model between Qwen 3 (235B) and Gemini 2.5 Flash in general intelligence.
  • Cost is a significant advantage: the open models are much cheaper than proprietary ones, at roughly 15–25 cents per million input tokens for the larger model.
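At those rates the savings are easy to quantify. A quick back-of-the-envelope calculation — the open-model price is the rough range quoted above, and the proprietary figure is a hypothetical placeholder for comparison, not a real price sheet:

```python
# Rough input-token cost comparison (USD per million tokens).
OPEN_PER_M = 0.20         # midpoint of the 15-25 cent range above
PROPRIETARY_PER_M = 2.00  # assumed placeholder for a frontier model

def input_cost(tokens, price_per_million):
    """Cost of processing `tokens` input tokens at a per-million rate."""
    return tokens / 1_000_000 * price_per_million

tokens = 50_000_000  # e.g. a month of heavy usage
open_cost = input_cost(tokens, OPEN_PER_M)
prop_cost = input_cost(tokens, PROPRIETARY_PER_M)
print(f"open: ${open_cost:.2f}  proprietary: ${prop_cost:.2f}  "
      f"ratio: {prop_cost / open_cost:.0f}x")
```

With these placeholder numbers the gap is an order of magnitude; the exact multiple depends on which proprietary model and tier you compare against.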

Technical Insights: Architecture and Capabilities 28:07

  • Both models are text-only and sparsely activate parameters for efficiency.
  • The 120B model has 36 layers with 64 query heads per layer; the 20B model has 24 layers.
  • Rotary position embeddings (RoPE) with YaRN extend the context window to 128K tokens.
  • The models activate a small percentage of parameters per token, with higher sparsity at the larger size.
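The sparsity claim follows directly from the figures quoted earlier in this summary (the 120B total is approximate; OpenAI's exact parameter counts differ slightly):

```python
# Fraction of parameters active per token for the larger model,
# using the approximate figures from this summary.
TOTAL_PARAMS = 120e9   # ~120B total parameters
ACTIVE_PARAMS = 5.1e9  # 5.1B active per token

fraction = ACTIVE_PARAMS / TOTAL_PARAMS
print(f"~{fraction:.1%} of parameters active per token")
```

Roughly 4% of the weights do the work for any one token, which is the "higher sparsity at the larger size" point: the 20B model's active fraction is considerably larger.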

Strengths, Weaknesses & Use Cases 29:44

  • These are not the same as the Horizon models; they show mixed performance in coding (especially front-end work and CSS).
  • They outperform specific open models (e.g., some Chinese models), but DeepSeek, Kimi, and GLM may sometimes do better depending on the task.
  • Particularly strong in science, health, and privacy-sensitive applications.
  • The models tend to overuse tables in their responses and aren’t optimized for friendly or conversational output.
  • Safety work was a release bottleneck, aiming to ensure the models are safe even without an intermediary filtering layer.

Conclusion 30:27

  • OpenAI's open models are significant for open source AI: they provide competitive intelligence, reasonable size, affordability, and strong instruction following.
  • They can run on consumer hardware, making advanced AI more accessible.
  • Future comparisons with upcoming models (like GPT-5) and further benchmarking are expected.