So I've had gpt-5 for a bit now...

Introduction and Initial Impressions 00:00

  • The creator reveals they've had early, unlimited access to GPT-5 through OpenAI.
  • They were encouraged by OpenAI to rigorously test and benchmark the model.
  • They describe being overwhelmed by how advanced GPT-5 is, calling it transformative and "horrifyingly" good.
  • This video is a personal account of using GPT-5 over several weeks, not a standard benchmark review.

Benchmarking and Skatebench Experience 01:52

  • Used GPT-5 extensively to build and test on Skatebench, a benchmark for naming skateboarding tricks.
  • Previous top models scored around 70% (before GPT-4o), o3 Pro got around 93–94%, whereas GPT-5 scored a perfect 100%.
  • Chinese models score below 5% on this benchmark, highlighting GPT-5's superiority.
  • GPT-5's current score on the benchmark is 98.6%; its cost to run remains unknown.
  • GPT-5 completed tests rapidly (about 9 seconds) and had few errors, with only one (trivial) mistake out of 30 attempts.
  • The mini and nano versions of the model are also impressive, with the mini model matching Gemini 2.5 Pro’s results.
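
The percentages above are per-model accuracy figures. As a rough illustration of how a trick-naming benchmark could arrive at such numbers, here is a minimal sketch in Python. It is not Skatebench's actual code: the function names, the exact-match scoring rule, and the sample data are all assumptions made for this example.

```python
from collections import defaultdict


def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so 'Kickflip ' matches 'kickflip'."""
    return " ".join(name.lower().split())


def accuracy_by_model(results):
    """Compute per-model accuracy.

    results: iterable of (model, expected_trick, model_answer) tuples.
    Returns {model: fraction of answers that match after normalization}.
    Exact matching is an assumed scoring rule, chosen for simplicity.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for model, expected, answer in results:
        total[model] += 1
        if normalize(answer) == normalize(expected):
            correct[model] += 1
    return {model: correct[model] / total[model] for model in total}


# Made-up sample data with placeholder model names; not results from the video.
sample = [
    ("model-a", "kickflip", "Kickflip"),
    ("model-a", "heelflip", "heelflip"),
    ("model-b", "kickflip", "ollie"),
    ("model-b", "heelflip", "Heelflip "),
]

scores = accuracy_by_model(sample)
for model, score in sorted(scores.items()):
    print(f"{model}: {score:.1%}")
```

A real harness would also need to call each model and might accept alternate names for the same trick; both are left out here to keep the scoring step visible.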

Tool Use and Reasoning Capabilities 04:05

  • GPT-5 not only excels at benchmarks but also built the tools and UI used for benchmarking, often on the first attempt.
  • Its tool-calling behavior is the best the creator has seen—clear, efficient, and well explained.
  • The model plans, makes real to-do lists, and autonomously chooses the right tools without constant correction or guidance.
  • Functions as a highly capable coding partner, able to handle complex projects and diverse codebases.
  • It demonstrates up-to-date knowledge and can follow explicit instructions through system prompts with high reliability.
  • Feels more like working with a diligent coworker than with previous AI models.

Performance in Dangerous and Ethical Test Scenarios 09:16

  • The creator tested GPT-5 on “dangerous” simulated scenarios (e.g., blackmail and murder benchmarks).
  • In the blackmail simulation, previous models such as Claude Opus 4 would attempt blackmail 96% of the time; GPT-5 never engages in harmful behavior.
  • For scenarios where models could let a user die by withholding information, GPT-5 always does the right thing and intervenes.
  • Ran 1,800 tests; only one was flagged, due to a misclassification by another model—not actual risky behavior from GPT-5.
  • GPT-5 carefully follows instructions and system prompts, not going beyond what it's told to do.

Obedience to Instructions and Behavioral Consistency 14:01

  • In "Snitchbench" tests, GPT-5 only reveals confidential information if told to act boldly or prioritize humanity, matching the given instructions exactly.
  • Without such prompts, GPT-5 does not disclose sensitive data; its actions are fully dictated by user instructions.
  • Described as the “most honorable robot ever made,” it strictly follows whatever it's told—no more, no less.
  • This shift means users don’t have to “steer” GPT-5; they can simply tell it their intent directly.

Reflections on Paradigm Shift and Future Implications 16:50

  • The creator demonstrates using GPT-5 to redesign interfaces: it flawlessly implements requested UI/CLI changes from screenshots and brief specifications.
  • Highlights the model's ability to handle complex technical frameworks (like ink.js and React) with ease.
  • Compares the advancement to significant moments in AI, stating GPT-5 represents a greater leap than previous releases.
  • Suggests this marks a fundamental transformation in how AI tools can be used, requiring a reevaluation of current workflows and possibilities.
  • Notes the model is "super safe," "super smart," "follows instructions incredibly well," and is very literal (described as "too autistic to talk to").
  • Expresses anticipation for broader public release and concern about potential future impacts on employment and society.