Recent and related:

Ask HN: Is it just me or GPT-4's quality has significantly deteriorated lately? - https://news.ycombinator.com/item?id=36134249 - May 2023 (711 comments)

I think we don't notice our expectations have gone up, and we don't notice that we remember the hits and then expect all hits.

We didn't notice the misses at first, because it's what we expected to begin with, and we very strongly noticed the hits because they were unexpected. Now we notice the misses and expect the hits.

"GPT-4 hasn't gotten worse since March" can be 100% true at the same time OpenAI puts more rules and limiters keeping more interesting answers from being said.

I've noticed it quit giving answers as detailed and thorough. It's also refused to do more complex programming tasks where it used to accept those questions.

OpenAI can artificially limit it without the model itself getting "worse". But it is effectively worse for us users.

I look at everything someone from OpenAI says as if a politician is saying it. Sam Altman especially is fond of statements that are deceptive but technically true. His employees appear to be following his lead.

GPT 4 isn’t ChatGPT 4, which is what most people use.

There is also the “system prompt”, which is also likely to be changing but not part of GPT 4.


Recent research showed that RLHF/censoring the model hurts its performance. This is intuitively obvious: the censorship isn't real, the data is (moral issues aside), so it hurts the integrity of the weights. The future is open-sourced uncensored models; capitalism will demand the high performance. There's a huge discussion on it on Reddit right now:


My experience using GPT-4 for coding is that it's got the knowledge and skill of a senior engineer, with the high maintenance of a junior engineer. You can get it to output small sections of quality code, but the amount of prodding it takes to piece it all together means you may as well have spent the time writing it yourself. But the future of GPT as a coding assistant is definitely bright. It just needs more chaining, so I feel less like its assistant, asking it to come up with prompts and then pasting them back to it after iterating on the code.
People have just fallen out of love with it and are realizing how it really isn't that good.

Sorry "prompt engineers" but papers on arXiv show that when you give it fairly sampled problems it struggles to get the right answer more than 70-80% of the time. When you are under its spell you will keep making excuses but when you are looking at it objectively you'll realize the emperor is naked.

If you give it very conventional problems it seems to do better than that because it is a mass of biases and shortcuts, and of course it will sentence Tyrone to life in prison because it's a running gag that "Tyrone is a thug"... That's how neurotypicals think, and no wonder many of them think ChatGPT is so smart... It mirrors them perfectly.
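For what it's worth, the kind of fair-sampling evaluation those arXiv papers run is conceptually simple. A minimal sketch in Python, with `model_fn` standing in for whatever actually calls the model (the harness and names are illustrative, not from any particular paper):

```python
def accuracy(model_fn, problems):
    """Score a model on a fairly sampled problem set.

    problems: list of (prompt, expected_answer) pairs.
    model_fn: any callable mapping a prompt string to an answer string.
    """
    correct = sum(
        1
        for prompt, expected in problems
        if model_fn(prompt).strip() == expected
    )
    return correct / len(problems)
```

The point of sampling problems fairly rather than cherry-picking is that the denominator includes the misses you would otherwise forget.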

In the thread yesterday it was brought up that if you use the API it feels considerably less hamstrung than the ChatGPT client, and this tweet seems to fit the assumption that the ChatGPT product is being tuned or governed differently from the API.

>The API does not just change without us telling you. The models are static there.

This reads to me as specifically indicating the models are not static elsewhere, i.e., in ChatGPT.

A couple observations from attempting to use both GPT-3.5 and GPT-4 via the web interface for coding tasks:

- The model's ability to respond accurately drops drastically when asked questions of the form "is there a different way to accomplish X, using Y?" or "is there a way to accomplish X that runs in O(log(n)) time instead?" Example: I wanted to upsert an integer value in a SQLite db using "INSERT ... RETURNING ...". ChatGPT repeatedly told me that SQLite doesn't support "RETURNING" (it does, since March 2021). It insisted I would need two DB round trips from my application to accomplish this. When asked "can this be done in one round trip, instead?" it repeatedly wrote code that would return the number of rows modified instead of the integer column value.

- ChatGPT's limited standard library knowledge means that the solutions it produces, even when correct, are often lower-level and less idiomatic. Problems that would be trivially solved with e.g. Java's String.replaceAll or .codePointCount will instead loop over each character, often splitting the string into an intermediate array and implementing special cases for first/last character edge cases. The code winds up being mostly correct, but also (for lack of a better word) weird. No human I've ever worked with would do things the way ChatGPT sometimes does, which means the code will likely be much harder to maintain and debug over time.
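For what it's worth, the RETURNING claim in the first bullet is easy to check. A minimal sketch in Python, assuming the interpreter's bundled SQLite is 3.35 or newer (the `counters` table and `upsert_counter` helper are my own illustration, not from the original exchange):

```python
import sqlite3

def upsert_counter(conn: sqlite3.Connection, name: str, delta: int) -> int:
    # One round trip: upsert the row and read back the updated value.
    # SQLite has supported RETURNING since 3.35 (March 2021).
    row = conn.execute(
        """
        INSERT INTO counters (name, value) VALUES (?, ?)
        ON CONFLICT (name) DO UPDATE SET value = value + excluded.value
        RETURNING value
        """,
        (name, delta),
    ).fetchone()
    return row[0]
```

The RETURNING clause hands back the post-conflict column value directly, which is exactly the single-round-trip behavior ChatGPT insisted was impossible.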

OpenAI turned on the "full featured" GPT4 just long enough to learn what people could use it for. Now they turn off those features and use all of those ideas to spawn new companies on a now-private GPT4 Original api and cripple everyone else. When they say "don't worry about building on our API we won't go up the value-chain" -- no shit, but only because they don't want regulatory scrutiny. So they will just trickle-out API access to companies that they own and control through extensive personal networks. Google, Facebook, Twitter, Amazon... Build on someone else's API and they will jam you 100% of the time.
Posted this in yesterday's thread, but once again I think this is just people feeling the magic wear off. People have poked around a lot more and found the flaws while also trying to get it to do real-world tasks. That wasn't true when it first came out.

It's fine, this tech has never been magic anyways, won't be replacing all our jobs, won't take over the world, etc. It's still awesome for what it is.

ChatGPT was a massive dopamine hit, particularly for people in areas like development. It was a tremendous release that has definitely laid out some new tracks for the future, especially on the web. I've found myself using ChatGPT a lot less recently.

I got the GPT-4 API access and then I realized that I can't really use it for anything super major because I can't afford it. It is ridiculously expensive when you consider that you have to pay for all the failed requests, the wrong information, and the wrong context too. Instead, I have written a bunch of Python scripts that do a select few tasks for me, and I have my terminal open 24/7 anyway.

As for the topic at hand, I have _definitely_ noticed a lot more disclaimers in the UI. I don't get it from the API at all, in 6 months that I have been using the API - I've gotten one disclaimer.

In the ChatGPT UI - I get them a lot. "Remember this", "Remember that", "Always look up the information" and things like this. I mean if it wasn't happening I would know because I have been a power-user pretty much all this time...

The actual statement is "The API does not just change without us telling you. The models are static there." But that isn't ChatGPT.

Here's how I feel ChatGPT answers my coding questions now:

Me: Write a Python script to sum two numbers.

ChatGPT: Python is a programming language that was invented in 1991 and can be used to solve a variety of problems. Here is an example of how to sum two numbers:

    def sum(a, b):
        # note: the actual code has been left out as it depends on the actual specifications of how you want to add*
Note that this code is merely an example and writing a Python script to sum two numbers is a complex problem that requires careful attention to whether the numbers you are trying to sum can be summed. Also, as my knowledge has a cutoff date of 2021, there may be other ways to perform this summation. Please check with the documentation or ask someone who knows how to code.

* note: ChatGPT has actually done this to me

I thought the API was versioned, though that doesn't mean the new versions aren't worse. And he doesn't talk about the model used in ChatGPT. I'm a bit skeptical that this answers exactly what we were worried about. Or maybe we were not all worrying about exactly the same thing.
This doesn’t pass the smell test for me. Something has changed. Maybe the model hasn’t changed, but something somewhere is making output worse.

I’m honestly more concerned if OpenAI doesn’t even realize it. Nothing is more infuriating as a user than convincing the developer your bug actually does exist. It speaks to poor monitoring, testing, and tooling.

Who knows? Maybe this engineer doesn’t know what’s going on behind the scenes. I would think the model's storage is so tightly guarded that only a few would have access to even know if something has changed. The ones that know would be under NDA.
I doubt the March 14 model is any different, since it's versioned and I don't see why OpenAI would want to change it behind the scenes.

Also, companies are evaluating GPT-4 to determine whether they want to pay for it, so OpenAI has a strong incentive not to downgrade at least the API.

I believe the May 5 model is different, at least in the chat interface, because it's fine-tuned to detect jailbreaks and the temperature/other hyper-parameters may have changed. And I can imagine this fine-tuning making the model less creative and worse at solving analytical tasks.

Personally I haven't noticed any change, except in my own awareness. Sometimes GPT4 gets very hard prompts right, and sometimes it gets simple problems wrong. So it's not hard to see how people can form biased opinions from selective attention or just luck.

I just discovered when playing around with translations that there is some hidden filter/killswitch that immediately stops the generation of the opening of some books. It doesn't matter if the invoking prompt is to recite the book opening paragraph, or to translate it from a foreign language to English, or what.

It's not RLHF induced because it works via API and it only triggers in English, but sure enough try to get it to output

>Call me Ishmael. Some years ago—

>"It was the best of times,

I guess this might get you flagged (there is no alert to the user that this filter kicked in, and it will output it in any other language, and it works in the API) so I'm hesitant to play around with it more, but it's very strange - especially as these are long since in the public domain.

ChatGPT Plus shows the version. It was May 12, if I remember correctly, then it changed to May 24. Which probably means there were some changes. If not in GPT-4 itself, then in pre- or post-processing. They should have some safety filters at the end, I think.
It's really fascinating to see all the irrational thinking triggered by ChatGPT. "risks of human race extinction", "soon no more need for developers, doctors or lawyers", "possibility of consciousness emerging", so-called experts in "prompt engineering".

It's a dumb tool, if you're lucky you can get it to spit something useful (but you need other tools to check the correctness of what it returned). There are certainly many useful applications, but the technology is inherently limited.

If they are overwhelmed with usage requests and can't buy GPUs fast enough, the model could be the same but the amount of compute spent on the answer could be ramping down impacting the quality.

It would be really cool to quantify how much compute is spent per interaction.

If this was visible, users could spend an arbitrary amount and modify how much they're willing to pay for 'better' responses. This is probably a better business model.

Many comments miss the mark as they fail to make the crucial distinction between ChatGPT and GPT-4. GPT-4 is the underlying model one can indeed have direct access to on a pay-per-request pricing scheme. ChatGPT is an application built on top of GPT-4 which manages how the 'context' of your interaction is passed in on a per-request basis. I don't doubt the spokesperson for a minute: from my own experience, the underlying GPT-4 models have not changed and I sincerely believe that OpenAI will be careful on this front, given that they are aiming to build a once-in-a-generation company that provides a stable platform for other firms to build products on top of.

The ChatGPT application, on the other hand, and how it manages context etc has certainly changed in the intervening time. That is completely expected as even and perhaps especially OpenAI is figuring out how to build applications on top of LLMs, which means balancing how one can get the best quality results out of the model while making ChatGPT in particular a profitable business.

Stratechery has analyzed this problem for OpenAI in the most detail I've seen. I imagine the company is in something of a bind figuring out how to invest between the APIs themselves and ChatGPT. On the one hand, the latter is incredibly successful as a consumer app with a lead that will be difficult for rivals to catch up with, and it is likely plugins will provide a good revenue basis. On the other hand, there is certainly a greater business opportunity in being the foundation for an entire generation of AI products and taking BPs off of revenues -- if and only if GPT-4 indeed has a significant moat over the open-source alternatives. For the moment, it would seem they will have to hedge both bets as we see how the consumer space and the competition between models heats up.

This is why I unsubscribed from OpenAI.

They seem to be virtue signaling about their lack of progress now. Months later, GPT-4 is still slow, still not multi-modal as they advertised, still significantly limited; you need to sign up for a waitlist for almost every feature, there's no sense of privacy, and no understanding of their plan for improvements. Google is full steam ahead and consistently improving their free LLMs.

They actually had a genius strategy. Put out Bard with a very stupid LLM, so people aren't blown away and it doesn't get the doomsayers on their case. Now they can continue to quietly upgrade Bard. Eventually it will be so obvious that they have surpassed OpenAI.

OpenAI must enjoy watching their subscriber count go down. After all, Sam did say at the congressional hearing multiple times, "We would prefer if people used it less".

There is No War in Ba Sing Se.
This was predictable from the get-go, and many had already pointed out that people would soon start noticing that LLMs aren't as magical as they thought once the initial awe wanes.

I don't think OpenAI is doing anything here; it's not in their interest to reduce the "quality", and there's no objective and repeatable way to measure the quality either.

It's all probabilities all the way down. Who knows what the model will do. I mean, you could dry-run it by hand, but inference is damn slow even on quad-core processors, so imagine doing it by hand.

He said "the API". He didn't say anything about the Chat version, which seems to have more protections that may be on top of the model and not embedded into it.
Are we chatting directly with the model? Maybe the interface has changed. With long term use the likelihood of hitting edge cases is higher and maybe that is a cause as well for what users are seeing. People probably ask more vague questions over time. I might have done that.

I have never experienced the amnesia problem in v3.5, though, that v4 clearly has: just repeating incorrect answers that you ask it not to give. I did not have access to v4 in March so I can't do that comparison.

I've found it's gotten worse and have given up using it for tasks more often than I used to.

For Copilot, I no longer get multi-line completion suggestions, and it's really slow to deliver single-line suggestions, which are more often incorrect. It's definitely degraded further, and I don't know if it's just my environment or a wider issue. I need to dig in and figure it out. Is anyone else experiencing these things?

I can believe that the core GPT-4 model itself has not changed, but clearly they've changed some features. They've added plug-in access, which can change GPT-4's capabilities greatly with some chats.

The overall UI of the web site has changed several times (the dropdown for GPT-3.0/3.5/4.0 turned into a GPT 3.5|GPT 4.0 button, they added the ability to share chats, and I'm sure there are other small details).

Gpt is starting to suck? I'll take the blame for this one. I've been submitting crazy prompts since the beginning in the hopes of confusing it!
Is there a way to use GPT-4 directly? Instead of the muzzled ChatGPT that is supposed to give answers authorities and opinionators deem appropriate?
Unrelated to the model itself but infrastructure: yesterday was unusable for me, getting random "too many requests sorry bout that" errors. I think 1/4 of the requests during a 3 hr period didn't make it through. Impossible to build anything beyond experimental stuff on top of a service so unreliable. I haven't tried it through Azure yet; I wonder if it is any better?
Whatever happens to chat gpt, ai image generation is incredible. It’s so incredibly powerful that I don’t think it’s ever going away
The amount of times per day I have to click "Stop generating" and then say "No!" has definitely increased
Does the author refer to the GPT-4 API or the ChatGPT version of GPT-4? The recent discussions of GPT-4 quality deterioration seem to focus on the ChatGPT version rather than the API. Also, since the ChatGPT version of GPT-4 now supports web browsing and plugins, I would assume it has to have been updated.
Must be in the trough of disillusionment. The tech is truly useful though so we’ll reach the plateau soon enough.
The model didn’t change, but that doesn’t mean the inference didn’t change.

Without going into specifics, it is meaningless for the discussion.

Model has not changed. Prompt transformation code has changed.

So the same prompt you write is delivered to the model with a different "wrapper" prompt, significantly changing the answer the model produces.
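A minimal sketch of what such wrapping might look like, using the chat-completions message format (the actual system prompt OpenAI prepends is not public; the text here is purely hypothetical):

```python
def build_messages(user_prompt: str, system_prompt: str) -> list:
    # The user's text is embedded in a larger conversation that the model
    # actually sees; change the wrapper and the answers change even though
    # the model weights are identical.
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]

# Hypothetical wrapper of the kind ChatGPT might use:
wrapped = build_messages(
    "Write a Python script to sum two numbers.",
    "Keep answers brief. Add disclaimers where appropriate.",
)
```

This is why "the model is static" and "ChatGPT's answers got worse" can both be true at once: the weights stay fixed while the wrapper evolves.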

I agree. I've noticed I need to be quite specific now; it won't notice bugs in the code unless I point them out. A head-to-head comparison of the different versions is needed to validate this.
The model may be the same, but the chatbot is not. They have made the responses shorter to save inference expenses, which are huge in a model the size of GPT-4.
I read somewhere there was a paper they released that as they tuned for alignment, overall quality dropped proportionally. Anyone know the name of it?
He’s talking about the API, not the web client.
To what degree is the pre and post processing of the chat client the source of the confusion rather than GPT-4 itself?
I don’t get why this is being discussed. Isn’t it easy to go check old conversations and try to replicate them today?
I use GPT-4 a lot and something has definitely changed in my recent experience. It's simply worse.
Instead of complaining, why not show the benchmarks?

Like: first it scored 83, now it scores only 42 (or whatever).

What can we do to upheave this organization and their blatantly anticompetitive tactics?
Perhaps the evolution of this story is an interesting example of confirmation bias?
This is basically a tacit admission that ChatGPT4 *has* gotten worse.
I understand why people don't read 10,000 word articles, but this is a 15 word tweet. Would it really kill people to read it before commenting? He is very explicitly talking about the API, which uses a different model than the UI.

The recent discussion was about the degradation in the UI model.

Is it static? If I ask it the same question I asked it a few months ago it gave me a completely different answer. Is that because of some additional context above? Should we be starting fresh chats sooner?
GPT 3.5 is a much better coding assistant.
If ChatGPT has been static since March (of 2023), then why does my version always change? Right now I'm running May 24.

FWIW, I think it has improved.

Yea that is very tough to believe
Company who has repeatedly lied in public: there’s nothing to see here!