The Industry Reacts to GPT-5 (Confusing...)
Industry Reaction Overview 00:00
- GPT-5 has been a polarizing launch, with reactions ranging from praise to disappointment.
- Some users prefer GPT-4o and are unhappy about its retirement, valuing its personality and familiarity.
- Sam Altman acknowledges the need for model customization for different users.
- Simplicity benefits novices, but advanced users want to select models based on use case.
- OpenAI plans to focus on stability, then adjust GPT-5 to have a "warmer" personality.
Independent Benchmarks & Model Performance 01:26
- Artificial Analysis ran eight independent evaluations on GPT-5 with early access.
- GPT-5 offers multiple "reasoning effort" configurations: high, medium, low, and minimal.
- These settings strongly affect intelligence, token usage, speed, and cost (see the API sketch after this list).
- Across the benchmark suite, high reasoning effort consumed 82 million tokens versus just 3.5 million at minimal, roughly a 23x reduction, making the minimal setting far cheaper to run.
- GPT-5 scores 68 on the Artificial Analysis Intelligence Index, setting a new standard and topping most benchmarks.
- GPT-5 is particularly strong in long-context reasoning, aiding agentic coding and large codebase analysis.
- At high reasoning effort, GPT-5 reaches 69 on the index, leading all other models and restoring OpenAI to the number-one spot.
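As a minimal sketch of how those configurations are selected, assuming the OpenAI Python SDK's Responses API and its `reasoning` parameter (verify names against current OpenAI docs):

```python
# Minimal sketch: sweeping GPT-5's reasoning-effort levels via the
# OpenAI Python SDK's Responses API. Exact parameter names may differ
# across SDK versions; check the current documentation.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

for effort in ("minimal", "low", "medium", "high"):
    response = client.responses.create(
        model="gpt-5",
        reasoning={"effort": effort},  # trades intelligence for tokens, speed, cost
        input="Summarize the trade-offs of quicksort vs. mergesort.",
    )
    # usage shows how many tokens each effort level actually consumed
    print(effort, response.usage.total_tokens)
```

Running a fixed prompt set across all four levels is how the token-usage gap above (82M vs. 3.5M) would surface in practice.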
Controversial Presentation & GraphGate 04:22
- Some graphs shown during the GPT-5 launch were inaccurate, leading to "GraphGate."
- Bar heights did not correspond to the printed numbers, which quickly became a meme within the community.
- Despite this, the model's capabilities are generally acknowledged as strong.
Additional Features & All-in-One Platforms 05:13
- Sponsor segment highlights ChatLLM by Abacus.AI: a unified platform for testing multiple models, including GPT-5.
- Features include automatic model routing, document chat, image/video generation, and a multi-LLM agent for various tasks.
Further Evaluations: LM Arena 06:55
- LM Arena also ranks GPT-5 as number one for text, web development, vision, coding, math, and creativity tasks.
- Tested under the codename "Summit," GPT-5 achieves an Elo of 1481, leading competitors by about 20 points (see the Elo math below).
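For context on what a 20-point lead means, the standard Elo expected-score formula (LM Arena's actual Bradley-Terry fit is a close relative) gives:

```latex
E = \frac{1}{1 + 10^{-\Delta/400}}, \qquad
E\big|_{\Delta = 20} = \frac{1}{1 + 10^{-20/400}} \approx 0.529
```

So a 20-point gap implies winning roughly 53% of head-to-head votes: a real but modest edge.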
Beyond Benchmarks – Post-Eval Sentiment 07:48
- Some in the industry argue that saturated benchmarks make further comparisons meaningless.
- Theo (t3.gg) and others emphasize the importance of real-world instruction following and "vibes."
- GPT-5 saturates certain tests, such as AIME 2025, where it scores 100%.
- Some maintain that benchmarks can be updated for new qualities, while others value feel over data.
Polarized User and Developer Feedback 09:33
- Opinions differ sharply: some see GPT-5 as a breakthrough, others call it a flop or worse than alternatives.
- Stagehand's API tests show GPT-5 trailing Opus 4.1 on both speed and accuracy for browser-automation use.
- Open-weight models (e.g., OpenAI's gpt-oss-120b) sometimes perform surprisingly well.
Model Practicality, Code, and Routing 10:30
- Reviewers praise GPT-5's personality: direct, less sycophantic, and better at pushing back.
- Latency and hallucinations have improved, but some still prefer Claude Code with Opus for coding.
- GPT-5 introduces a model router that dynamically selects the best sub-model for each prompt (a conceptual sketch follows this list).
- Manual override options let users force a faster, non-reasoning answer when needed.
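OpenAI has not published the router's internals, so the following is purely a conceptual sketch: the keyword heuristic and routing logic are hypothetical, while `gpt-5` and `gpt-5-chat-latest` are real model names at launch.

```python
# Conceptual sketch only: GPT-5's real router is not public.
# The heuristic below is hypothetical; the model names are real.
from openai import OpenAI

client = OpenAI()

def route(prompt: str, force_fast: bool = False) -> str:
    """Pick a sub-model per prompt; force_fast mimics the manual override."""
    hard = any(k in prompt.lower() for k in ("prove", "debug", "step by step"))
    if force_fast or not hard:
        return "gpt-5-chat-latest"  # fast, non-reasoning path
    return "gpt-5"                  # slower, full-reasoning path

def answer(prompt: str, force_fast: bool = False) -> str:
    model = route(prompt, force_fast)
    resp = client.responses.create(model=model, input=prompt)
    return resp.output_text

# Manual override: get a quick answer even for a "hard-looking" prompt
print(answer("Debug this off-by-one error", force_fast=True))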
Jailbreaking, Customization, and Model Weaknesses 11:49
- Jailbreak attempts on GPT-5 still succeed with some effort and tricks, demonstrating persistent vulnerabilities.
- Customization has increased (e.g., chat color options), but some see this as a shift toward mainstream appeal over innovation.
Rival Models and Competitive Landscape 13:34
- xAI's Tony Wu claims Grok 4 still leads on certain benchmarks, such as ARC-AGI.
- OpenAI is seen as a respectful competitor, but other players emphasize what smaller teams have accomplished and their rapid cadence of model releases.
Cost, Accessibility, and Adoption 14:28
- GPT-5's pricing is significantly lower than competitors': $1.25 per million input tokens and $10 per million output tokens.
- The lower price drives adoption and ecosystem growth; Claude Opus 4.1 is far pricier at $15/$75 per million (see the cost comparison after this list).
- Computer-use agents have improved; GPT-5 passes tasks that stymied GPT-4o.
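A back-of-envelope comparison using the per-million-token list prices quoted above (the workload is an arbitrary example; check current pricing pages before relying on these numbers):

```python
# Cost comparison from the launch list prices quoted above.
PRICES = {                      # (input $/M tokens, output $/M tokens)
    "gpt-5":    (1.25, 10.0),
    "opus-4.1": (15.0, 75.0),
}

def cost(model: str, input_toks: int, output_toks: int) -> float:
    inp, out = PRICES[model]
    return (input_toks * inp + output_toks * out) / 1_000_000

# Example workload: 2M input tokens, 500k output tokens
for m in PRICES:
    print(f"{m}: ${cost(m, 2_000_000, 500_000):.2f}")
# gpt-5: $7.50  vs  opus-4.1: $67.50 (9x more for the same workload)
```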
Anecdotes, Practical Use, and Critiques 16:17
- GPT-5 shines in state-of-the-art evals and is hyped for writing quality and affordability.
- Some large code tasks (e.g., automatic refactoring) produce non-functional but "beautiful" code, indicating practical limits.
- GPT-5's answers to medical queries keep improving, raising concerns and changing doctor-patient dynamics.
- Discussion turns to AGI and the socioeconomic impact of compute costs, including debates about future leverage and societal classes.
Model Naming, Backward Compatibility & Disappointment 18:15
- The new GPT-5 lineup simplifies naming, mapping old model names onto new variants (see the mapping sketch after this list).
- Some users and industry experts consider GPT-5 disappointing, preferring Claude 3.5 or feeling progress is stalling.
- Critique centers on diminishing returns; future progress may depend on better software scaffolding around raw model power.
- Memes and humor abound on industry forums about job security and AI’s advancing capabilities.
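The exact correspondence is OpenAI's to define; the lookup below reflects the old-to-new pairings reported around launch in OpenAI's GPT-5 system card, and should be treated as illustrative rather than authoritative:

```python
# Illustrative mapping of the old lineup onto GPT-5 variants, as
# reported around launch; verify against current OpenAI documentation.
PREDECESSORS = {
    "gpt-5-main":          "GPT-4o",
    "gpt-5-main-mini":     "GPT-4o mini",
    "gpt-5-thinking":      "OpenAI o3",
    "gpt-5-thinking-mini": "OpenAI o4-mini",
    "gpt-5-thinking-nano": "GPT-4.1 nano",
    "gpt-5-thinking-pro":  "OpenAI o3 Pro",
}

for new, old in PREDECESSORS.items():
    print(f"{old:>14}  ->  {new}")
```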
Final Thoughts & Industry Competition 20:34
- On some benchmarks, Grok 4 remains ahead (e.g., ARC-AGI), but GPT-5 leads most others.
- Ongoing competition among AI labs is viewed as beneficial to users.
- Video ends with a call to like and subscribe.