This might be bigger than DeepSeek

Introduction & Kimi K2 Overview 00:00

  • Kimi K2 is a new open-weight AI model from Moonshot AI in China, focused on advancing agentic (tool-using) models.
  • The model is a mixture-of-experts with one trillion total parameters; only a small fraction of them is activated for any given inference.
  • Download size is 960GB, highlighting its massive scale.
  • Released under a modified MIT license, which restricts some high-scale commercial uses unless attribution is given.
  • The model excels in tool/function calling, representing a significant step forward for open models in this area.
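The mixture-of-experts idea above can be sketched in a few lines. This is a toy illustration of generic top-k expert routing, not Kimi K2's actual architecture; the gating vectors and "experts" here are made-up stand-ins:

```python
import math
import random

def moe_forward(x, gates, experts, top_k=2):
    """Toy mixture-of-experts layer: score every expert, but run only
    the top_k highest-scoring ones and mix their outputs."""
    scores = [sum(g_i * x_i for g_i, x_i in zip(g, x)) for g in gates]
    top = sorted(range(len(scores)), key=scores.__getitem__)[-top_k:]
    exps = [math.exp(scores[i]) for i in top]     # softmax over the
    weights = [e / sum(exps) for e in exps]       # selected experts only
    out = [0.0] * len(x)
    for w, i in zip(weights, top):                # only top_k experts run, so
        for j, v in enumerate(experts[i](x)):     # compute scales with top_k,
            out[j] += w * v                       # not with total parameters
    return out

random.seed(0)
dim, num_experts = 8, 16
gates = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_experts)]
# Each "expert" here is just a fixed elementwise scaling, for illustration.
scales = [random.uniform(0.5, 2.0) for _ in range(num_experts)]
experts = [lambda x, s=s: [s * v for v in x] for s in scales]

y = moe_forward([1.0] * dim, gates, experts, top_k=2)
```

This is why a trillion-parameter MoE can still be served: per token, only the routed experts do any work, so inference cost tracks the active subset rather than the full download size.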

Model License and Limitations 03:10

  • The modified MIT license requires prominent attribution (“Kimi K2”) in the UI of commercial products exceeding 100 million monthly users or $20 million in monthly revenue.
  • There are concerns about the legal enforceability and open-source compatibility of the license.
  • The license ambiguity raises questions about how it applies to derivative works and distillations.

Benchmarks and Performance 04:29

  • Kimi K2 achieves state-of-the-art results among open models on SWE-bench, Tau-bench, and AceBench.
  • It rivals or exceeds closed models (e.g., Claude 4 Opus, GPT-4.1) in specific benchmarks for code and agentic tasks.
  • Current limitations: no support for multimodal inputs or dedicated reasoning mode (planned for the future).
  • Its API is cheaper than comparable models from competitors like Anthropic.

DeepSeek & Reasoning Era Background 06:43

  • DeepSeek V3 inspired the presenter to build T3 Chat, an interface for better user experience with AI models.
  • DeepSeek R1 was a watershed open model that introduced reasoning by exposing intermediate “reasoning tokens” and methodology, allowing others to train and distill similar models.

Tool Calling in AI Models – The Market Context 16:08

  • Before DeepSeek R1, only OpenAI’s o1 model offered effective reasoning; similarly, until recently, Anthropic’s Claude models set the standard for reliable tool/function calling.
  • Tool calling allows AI to trigger external functions for more context-rich and interactive responses.
  • Anthropic’s Claude 3.5, 3.7, and 4 are the benchmarks for tool-calling reliability; their dominance stems from consistently high accuracy (e.g., even a 98% per-call success rate compounds into substantial failure odds over long multi-step tasks, so small reliability gaps multiply).
  • Despite strong raw intelligence, competitors like Gemini and Grok struggle with tool-call reliability and adherence to the required syntax.
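The compounding effect of per-call accuracy is easy to make concrete. The 98% figure comes from the discussion above; the step counts are illustrative:

```python
def chain_success(per_call_accuracy, steps):
    """Probability that every tool call in a multi-step agentic chain
    succeeds, assuming independent calls at a fixed per-call accuracy."""
    return per_call_accuracy ** steps

# Even 98% per-call accuracy fails almost half the time over 30 steps
# (0.98 ** 30 is roughly 0.55), while 90% accuracy collapses entirely.
for p in (0.98, 0.90):
    for n in (10, 30):
        print(f"accuracy {p:.2f}, {n} steps -> {chain_success(p, n):.2f}")
```

This is why a few points of per-call accuracy separate a usable agent from an unusable one, and why Claude's reliability edge mattered so much.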

Kimi K2’s Tool Calling Revolution 24:24

  • Kimi K2 is the first open model to rival Anthropic’s Claude models in tool-calling reliability and agentic capability.
  • Demonstrated success in complex benchmarks (e.g., automatically building 3D scenes, running particle simulations, effective API calls).
  • Outperforms peers on MC-Bench (a Minecraft building benchmark) by using tools methodically and without random errors.
  • Consistently avoids malformed outputs and errors in controlled tests—much higher reliability than previous open models.

Practical Drawbacks and Distillation Potential 28:32

  • Major drawback: Kimi K2 is slow, with throughput (tokens per second) significantly below competitors.
  • Like DeepSeek R1, the large model’s best use may be to generate synthetic data for training smaller, faster “distilled” models.
  • K2’s capability to output vast amounts of high-quality tool call data could benefit the entire AI model ecosystem by enabling better agentic models via distillation.
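A distillation pipeline of the kind described could start by harvesting teacher outputs as supervised training pairs. This is a minimal sketch; `teacher_generate` and the tool-call JSON shape are hypothetical stand-ins, not K2's actual API:

```python
import json

def collect_distillation_pairs(prompts, teacher_generate):
    """Collect (prompt, completion) pairs from a large teacher model so a
    smaller, faster student can be fine-tuned on them. `teacher_generate`
    stands in for a real call to a big model like K2."""
    return [{"prompt": p, "completion": teacher_generate(p)} for p in prompts]

def stub_teacher(prompt):
    # Stand-in for the teacher: emits one well-formed tool call as JSON.
    return json.dumps({"tool": "search", "arguments": {"query": prompt}})

pairs = collect_distillation_pairs(
    ["weather in Oslo", "latest AI news"], stub_teacher
)
```

The point is that the teacher's speed barely matters here: generation runs offline and in bulk, so a slow-but-reliable model is a perfectly good data source.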

License Implications and Industry Impact 32:06

  • The ambiguous license complicates using K2-derivative data/models in commercial products above certain thresholds.
  • Enforcement and interpretation remain uncertain, especially relating to multi-layered usage (e.g., using a third-party API or distilling data into new models).
  • Even with these caveats, K2 unlocks large-scale generation of tool call training data—something previously only feasible with access to closed model APIs (like Anthropic's), which are restrictive.

Synthetic Data and Future Model Development 42:00

  • Synthetic data, especially for formal or structured tasks, is proving at least as effective as real data for AI training, and sometimes more so.
  • Previous research (e.g., DeepSeek’s theorem-proving paper) shows that massive amounts of synthetic data enhance model capabilities.
  • K2’s high-quality outputs and reliability make it a valuable source for synthetic dataset creation.
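One reason structured synthetic data works so well is that it can be validated mechanically, with no human labeling. A minimal filter for synthetic tool-call samples, assuming a hypothetical `{"tool", "arguments"}` JSON shape:

```python
import json

REQUIRED_KEYS = {"tool", "arguments"}  # hypothetical tool-call schema

def is_valid_tool_call(text):
    """Accept a synthetic sample only if it parses as JSON and matches the
    expected tool-call shape; malformed teacher outputs are dropped."""
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict) and REQUIRED_KEYS <= obj.keys()

raw_samples = [
    '{"tool": "search", "arguments": {"query": "weather"}}',  # valid
    '{"tool": "search"}',                                     # missing key
    'call search(weather)',                                   # not JSON
]
clean = [s for s in raw_samples if is_valid_tool_call(s)]
```

A check like this is why K2's reliability matters for dataset creation: the higher the teacher's well-formed-output rate, the less generated data gets thrown away by the filter.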

Conclusion: K2’s Lasting Significance 44:55

  • While K2’s slow performance makes it less suitable for direct daily use, its real value lies in accelerating the wider ecosystem’s progress—especially around tool calling in agentic models.
  • Its open weights should fuel better models and distillations, with broad downstream benefits.
  • K2 marks a fundamental advance for open models, matching or surpassing closed models in functionality that was previously exclusive.
  • The proliferation of open-weight models should compound over time, each strong release expanding the training data and model capabilities available to the field as a whole.