The Industry Reacts to GPT-5 (Confusing...)

Industry Reaction Overview 00:00

  • The GPT-5 launch has been polarizing, with reactions ranging from praise to disappointment.
  • Some users prefer GPT-4o and are unhappy about its retirement, valuing its personality and familiarity.
  • Sam Altman acknowledges the need for model customization for different users.
  • Simplicity benefits novices, but advanced users want to select models based on use case.
  • OpenAI plans to focus on stability, then adjust GPT-5 to have a "warmer" personality.

Independent Benchmarks & Model Performance 01:26

  • Artificial Analysis, given early access, ran eight independent evaluations on GPT-5.
  • GPT-5 offers multiple "reasoning effort" configurations: high, medium, low, minimal.
  • These configurations greatly affect intelligence, token usage, speed, and cost.
  • Across the full evaluation run, high reasoning effort consumed roughly 82 million tokens versus about 3.5 million for minimal, making the minimal setting far cheaper and faster (a code sketch follows this list).
  • GPT-5 scores 68 on Artificial Analysis' intelligence index, setting a new standard and topping most benchmarks.
  • GPT-5 is particularly strong in long-context reasoning, aiding agentic coding and large codebase analysis.
  • On the overall Artificial Analysis index, GPT-5 leads at 69, ahead of all other models and restoring OpenAI to the number-one spot.
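
For developers, reasoning effort is a per-request knob rather than a separate model. Below is a minimal sketch of how selecting an effort level might look with the OpenAI Python SDK's Responses API; the `reasoning={"effort": ...}` parameter is how reasoning depth is exposed for OpenAI's reasoning models, but treat the exact field names and their availability for GPT-5 as assumptions to verify against current docs.

```python
# Minimal sketch: comparing GPT-5 reasoning-effort settings via the
# OpenAI Python SDK's Responses API. The reasoning={"effort": ...}
# option and usage fields are assumptions; check current docs.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = "Summarize the trade-offs of microservices in three bullets."

for effort in ("minimal", "low", "medium", "high"):
    response = client.responses.create(
        model="gpt-5",
        reasoning={"effort": effort},
        input=PROMPT,
    )
    # Higher effort spends more reasoning tokens on harder problems.
    print(f"{effort:>8}: {response.usage.total_tokens} total tokens")
```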

Controversial Presentation & GraphGate 04:22

  • Some graphs shown during the GPT-5 launch were inaccurate, leading to "GraphGate."
  • Bar heights did not correspond to the printed numbers, which quickly became a meme within the community.
  • Despite this, the model's capabilities are generally acknowledged as strong.

Additional Features & All-in-One Platforms 05:13

  • Sponsor segment highlights ChatLLM by Abacus.AI: a unified platform for testing multiple models, including GPT-5.
  • Features include automatic model routing, document chat, image/video generation, and a multi-LLM agent for various tasks.

Further Evaluations: LM Arena 06:55

  • LM Arena also ranks GPT-5 number one for text, web development, vision, coding, math, and creativity tasks.
  • GPT-5, tested under the codename "summit", achieves an Elo of 1481, leading competitors by about 20 points.
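
For context on what a 20-point lead means, arena scores follow the Elo convention, where the expected head-to-head win rate is 1 / (1 + 10^((R_B - R_A)/400)). A quick back-of-the-envelope check (the runner-up rating of 1461 is inferred from the stated 20-point gap, not a reported figure):

```python
# Expected win probability under the standard Elo model used by
# arena-style leaderboards. The 1461 runner-up rating is inferred
# from the stated 20-point gap.
def elo_win_prob(r_a: float, r_b: float) -> float:
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

print(f"{elo_win_prob(1481, 1461):.3f}")  # ~0.529: a modest per-matchup edge
```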

Beyond Benchmarks – Post-Eval Sentiment 07:48

  • Some in the industry argue that saturated benchmarks make further comparisons meaningless.
  • Theo (t3.gg) and others emphasize the importance of real-world instruction-following and "vibes."
  • GPT-5 saturates certain tests such as AIME 2025, scoring 100%.
  • Some maintain that benchmarks can be updated for new qualities, while others value feel over data.

Polarized User and Developer Feedback 09:33

  • Opinions differ sharply: some see GPT-5 as a breakthrough, others call it a flop or worse than alternatives.
  • Stagehand's API tests show GPT-5 trailing Opus 4.1 on both speed and accuracy for browser-automation use.
  • Open-source models (e.g., GPT-OSS 120B) sometimes perform surprisingly well.

Model Practicality, Code, and Routing 10:30

  • Reviewers praise GPT-5's personality: direct, less sycophantic, and better at pushing back.
  • Latency and hallucinations have improved, but some still prefer Claude Code with Opus for coding.
  • GPT-5 introduces a model router that dynamically selects the best sub-model for each prompt.
  • Manual override options allow for faster answers when needed (an illustrative sketch of the routing pattern follows this list).
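
OpenAI has not published the router's internals, so purely as an illustration of the pattern, a hypothetical dispatcher might score prompt difficulty and let a manual override win out. The model names and heuristics below are placeholders, not OpenAI's actual logic:

```python
# Hypothetical illustration of the model-router pattern described above.
# Heuristics and thresholds are invented for the sketch; the real
# router is a learned component, not a keyword match.
def route(prompt: str, override: str | None = None) -> str:
    if override:  # manual override always wins, e.g. for faster answers
        return override
    hard_markers = ("prove", "refactor", "step by step", "debug")
    if len(prompt) > 2000 or any(m in prompt.lower() for m in hard_markers):
        return "gpt-5-thinking"  # slower, deeper reasoning path
    return "gpt-5-main"          # fast path for everyday prompts

print(route("What's the capital of France?"))  # -> gpt-5-main
print(route("Debug this stack trace: ..."))    # -> gpt-5-thinking
```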

Jailbreaking, Customization, and Model Weaknesses 11:49

  • Jailbreak attempts on GPT-5 still succeed with some effort and tricks, demonstrating persistent vulnerabilities.
  • Customization has increased (e.g., chat color options), but some see this as a shift toward mainstream appeal over innovation.

Rival Models and Competitive Landscape 13:34

  • xAI's Tony Wu claims Grok 4 leads on certain benchmarks such as ARC-AGI.
  • OpenAI is seen as a respectful competitor, but rival labs emphasize what their smaller teams accomplish and their rapid release cadence.

Cost, Accessibility, and Adoption 14:28

  • Pricing for GPT-5 significantly undercuts competitors: $1.25 per million input tokens and $10 per million output tokens.
  • The lower price encourages adoption and ecosystem growth; Opus 4.1 is far pricier at $15/$75 per million (a quick cost comparison follows this list).
  • Computer-use agents have improved; GPT-5 passes tasks that stymied GPT-4o.
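
Plugging the list prices above into a rough per-job calculation shows how the gap compounds; the 50k-input / 10k-output workload here is an illustrative guess, not a measured figure:

```python
# Back-of-the-envelope cost comparison using the per-million-token
# list prices cited above. The workload size is illustrative.
PRICES = {  # model: (input $/M tokens, output $/M tokens)
    "GPT-5": (1.25, 10.00),
    "Opus 4.1": (15.00, 75.00),
}

IN_TOK, OUT_TOK = 50_000, 10_000  # hypothetical single-job usage

for model, (p_in, p_out) in PRICES.items():
    cost = IN_TOK / 1e6 * p_in + OUT_TOK / 1e6 * p_out
    print(f"{model:>8}: ${cost:.2f} per job")
# Roughly $0.16 for GPT-5 vs $1.50 for Opus 4.1, about a 9x difference.
```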

Anecdotes, Practical Use, and Critiques 16:17

  • GPT-5 shines on state-of-the-art evals and draws praise for writing quality and affordability.
  • Some large code tasks (automatic refactoring) result in non-functional but "beautiful" code, indicating practical limits.
  • GPT-5's handling of medical queries is improving, which raises concerns and is changing doctor-patient dynamics.
  • Discussion turns to AGI and the socioeconomic impact of compute costs, including debates about future leverage and societal classes.

Model Naming, Backward Compatibility & Disappointment 18:15

  • The new GPT-5 lineup simplifies naming, mapping old model names to new versions.
  • Some users and industry experts consider GPT-5 disappointing, preferring Claude 3.5 or feeling progress is stalling.
  • Critique centers on diminishing returns; future progress may depend on better software scaffolding around raw model power.
  • Memes and humor abound on industry forums about job security and AI’s advancing capabilities.

Final Thoughts & Industry Competition 20:34

  • On some benchmarks, Grok 4 remains ahead (e.g., ARC-AGI), but GPT-5 leads most others.
  • Ongoing competition among AI labs is viewed as beneficial to users.
  • Video ends with a call to like and subscribe.