
@simonw
Created February 27, 2025 22:36

Naming Conventions and Model Labelling Confusion

A humorous, sarcastic thread runs through the conversation, focusing on the confusing and seemingly arbitrary naming conventions of newer LLM versions.

Users mocked the variability and unpredictability of naming schemes:

  • "At this point I think the ultimate benchmark for any new LLM is whether or not it can come up with a coherent naming scheme for itself. Call it 'self awareness.'" — throwup238.
  • "The people naming them really took the 'just give the variable any old name, it doesn't matter' advice from Programming 101 to heart." — lenerdenator.
  • "Still more coherent than the OpenAI lineup." — smallmancontrov.

These humorous observations underscore community frustration with opaque naming conventions that convey little information about the capabilities or differences between models. This resonates especially given the seemingly arbitrary version increments within OpenAI's lineup.

Pricing Concerns and Model Feasibility

Participants express strong skepticism regarding the extremely high pricing structure of GPT-4.5, showing concern and astonishment:

  • "GPT 4.5 pricing is insane: Input: $75.00 / 1M tokens; Output: $150.00 / 1M tokens" compared with "GPT 4o: Input: $2.50 / 1M tokens; Output: $10.00 / 1M tokens". — zaptrem.
  • "I don't understand the value here." — MattSayar.

These reactions reflect broad disbelief about the model's cost-effectiveness, questioning whether such a large pricing differential over previous models and competitor offerings can be justified.
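The per-million-token prices quoted above can be turned into a concrete comparison. This is a minimal sketch: the prices come from the comments, but the example workload (10,000 input tokens, 2,000 output tokens) is an illustrative assumption, not anything from the thread.

```python
# USD per 1M tokens, as quoted in the thread: (input, output)
PRICES = {
    "gpt-4.5": (75.00, 150.00),
    "gpt-4o": (2.50, 10.00),
}

def cost(model, input_tokens, output_tokens):
    """Cost in USD for one request with the given token counts."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# Hypothetical workload: 10k input tokens, 2k output tokens per request.
for model in PRICES:
    print(f"{model}: ${cost(model, 10_000, 2_000):.4f} per request")
```

At these rates a workload that is heavier on input than output lands at roughly a 23x cost multiple for GPT-4.5 over GPT-4o, which is the kind of gap driving the disbelief above.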

Skepticism and Criticism of LLM Performance Improvement Claims

The announcement of GPT-4.5 was met largely with skepticism about genuine leaps forward in capability or practical usefulness:

  • "Looking at pricing, I am frankly astonished.... How could they justify that asking price?... I fail to see how that wasn't possible by prompting GPT-4o." — Topfi.
  • "For example, there are now a bunch of vendors that sell 'respond to RFP' AI products... paying 30x for marginally better performance makes perfect sense." — hn_throwaway_99 (an uncommon opinion supporting possible niche high-cost uses).

Many commenters came away disappointed and underwhelmed after reading OpenAI's release post and watching the livestream, fueling doubts that the product had advanced beyond an incremental step.

"Considering both this blog post and the livestream demos, I am underwhelmed. Having just finished the stream, I had a real 'was that all' moment..." — Topfi.

Model Evaluation and Benchmarking Debate

Discussion around benchmarks, coding performance, and evaluation methods occupied a substantial part of the conversation:

  • "That's not a fair comparison as o3-mini is significantly cheaper." — logicchains, responding to a benchmark comparing Claude 3.7 to o3-mini.
  • "Agentic coders (e.g. aider, Claude-code, mycoder, codebuff, etc.) use a lot more tokens, but they write whole features for you and debug your code." — bhouston.

This reflects an ongoing debate about how models should be evaluated (by cost, speed, capability, or some combination) and highlights continued contention over the fairness and practical relevance of different benchmark results.

Dystopian and Social Concerns about AI Features

Several users raised broader societal concerns stemming from increased human-like interactions and AI integration in personal and social contexts:

  • "It’s interesting that they are focusing a large part of this release on the model having a higher 'EQ'... We're far from the days of 'this is not a person', and getting a firm foot on the territory of 'here's your new AI friend'... Personally I find this worrying." — sebastiennight.
  • "Is it just me or is having the AI help you self censor... pretty dystopian?" — nickreese.

These comments reflect unease not just with technical capabilities but also with the potential social ramifications and negative psychological effects of overly humanized interaction patterns or manipulative emotional engagement, adding a cautious, ethically attentive dimension to the evaluation of new AI.

Questioning the Limits of Scaling and Future Model Directions

Several comments speculate that OpenAI may be reaching the practical limits of current scaling approaches:

  • "Finally a scaling wall? This is apparently using about an order of magnitude more compute, and is only maybe 10% more intelligent." — erulabs.
  • "Anthropic appears to be making a bet that reasoning can create a model excellent for all use cases. OpenAI seems to be betting on multiple specialized models working together." — eightysixfour.

These types of reflections point to deeper industry-wide questioning of traditional scaling approaches, as companies push the computational envelope with increasingly unclear returns.
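The "order of magnitude more compute for maybe 10% more intelligence" observation is consistent with power-law scaling, where each multiplicative increase in compute buys a shrinking absolute improvement. The sketch below illustrates this under an assumed relationship loss ∝ compute^(-alpha); the exponent is a made-up placeholder chosen to roughly match the commenter's numbers, not a published figure.

```python
# Illustration of diminishing returns under an assumed power law:
# loss is proportional to compute ** (-alpha). alpha is hypothetical.
ALPHA = 0.05

def relative_loss(compute_multiplier, alpha=ALPHA):
    """Loss relative to baseline after multiplying compute by `compute_multiplier`."""
    return compute_multiplier ** -alpha

# 10x the compute reduces loss by only ~11% under this assumed exponent.
improvement = 1 - relative_loss(10)
print(f"{improvement:.1%} loss reduction for 10x compute")
```

The point of the sketch is the shape, not the specific numbers: under any power law with a small exponent, each additional order of magnitude of compute yields roughly the same modest relative gain, which is why "10x compute, ~10% better" reads to commenters like a wall.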


Illustrative Quotes of Less Common Opinions:

  • sebastiennight presents a strategy-oriented understanding, "GPT-4.5 is 2.5x more expensive. I think they announced this as their last non-reasoning model... it was maybe with the goal of stretching pre-training as far as they could, just to see what new capabilities would show up."
  • Amid many negative takes, antirez sees scientific value in the exploration itself: "people are missing what they tried to do with GPT 4.5: it was needed to explore the pre-training scaling law in that direction. A gift to science, however selfish."

These perspectives show that there is still nuance, and even appreciation, for exploratory model advancement that is not solely focused on immediate commercial viability.


In summary, user opinions coalesce around themes of humorous criticism of naming conventions, frustration at confusion and lack of major capability leaps, disbelief towards pricing, unease with AI's emotional integration, skepticism of benchmarks, and speculation on long-term scaling limitations. While most commentary is skeptical or negative about GPT-4.5, minority voices recognize value in scientific exploration or niche applications.

@simonw

simonw commented Feb 27, 2025

@andyrosa2

Thank you Simon for your tireless, timely, and thorough efforts.

@clemsau

clemsau commented Feb 28, 2025

Thanks for sharing this, Simon!
This is a great use case for LLMs.

@youngbrioche

Thanks, @simonw, for everything!

@Silverbullet069

Silverbullet069 commented Mar 1, 2025

To know and learn from @simonw is the best gift that a tech worker can receive.

@nixilb

nixilb commented Mar 12, 2025

To know and learn from @simonw is the best gift that a tech worker can receive.

And we wait for your posts like kids wait for gifts.
