Ask HN: Is it just me or GPT-4's quality has significantly deteriorated lately? - https://news.ycombinator.com/item?id=36134249 - May 2023 (711 comments)
We didn't notice the misses at first, because it's what we expected to begin with, and we very strongly noticed the hits because they were unexpected. Now we notice the misses and expect the hits.
Ive noticed it quit giving as detailed answers and as thorough. It's also refused to do more complex programming where it used to accept those questions.
Being artificially limited by OpenAI can still be done without it getting "worse". But it effectively is worse for us users.
GPT 4 isn’t ChatGPT 4, which is what most people use.
There is also the “system prompt”, which is also likely to be changing but not part of GPT 4.
Sorry "prompt engineers" but papers on arXiv show that when you give it fairly sampled problems it struggles to get the right answer more than 70-80% of the time. When you are under its spell you will keep making excuses but when you are looking at it objectively you'll realize the emperor is naked.
If you give it very conventional problems it seems to do better than that because it is a mass of biases and shortcuts and of course it will sentence Tyrone to life in prison because it's a running gag that "Tyrone is a thug"... That's how neurotypicals think and no wonder why many of them think ChatGPT is so smart... It mirrors them perfectly.
>The API does not just change without us telling you. The models are static there.
This reads to me as specifically indicating the models are not static elsewhere, ie, in ChatGPT.
- The model's ability to respond accurately drops drastically when asked questions of the form "is there a different way to accomplish X, using Y?" or "is there a way to accomplish X that runs in O(log(n)) time instead?" Example: I wanted to upsert an integer value using a SQLite db using "INSERT ... RETURNING..." ChatGPT repeatedly told me that sqlite doesn't support "RETURNING" (it does, since March 2021). It insisted I would need two DB round trips from my application to accomplish this. When asked "can this be done in one round trip, instead?" it repeatedly wrote code that would return the number of rows modified instead of the integer column value.
- ChatGPT's limited standard library knowledge means that the solutions it produces, even when correct, are often lower-level and less idiomatic. Problems that would be trivially solved with e.g. a Java.String.replaceAll or .codePointCount will instead loop over each character, often splitting the string into an intermediate array and implementing special cases for first/last character edge cases. The code winds up being mostly correct, but also (for lack of a better word) weird. No human I've ever worked with would do things the way ChatGPT sometimes does, which means the code will likely be much harder to maintain and debug over time.
It's fine, this tech has never been magic anyways, won't be replacing all our jobs, won't take over the world, etc. It's still awesome for what it is.
I got the GPT-4 API access and then I realized that I can't really use it for anything super major because I can't afford it, it is ridiculously expensive if you consider that you have to pay for all the failed requests, the wrong information or the wrong context also. Instead, I have written a bunch of Python scripts that do a select few tasks for me and I have my terminal open 24/7 anyway.
As for the topic at hand, I have _definitely_ noticed a lot more disclaimers in the UI. I don't get it from the API at all, in 6 months that I have been using the API - I've gotten one disclaimer.
In the ChatGPT UI - I get them a lot. "Remember this", "Remember that", "Always look up the information" and things like this. I mean if it wasn't happening I would know because I have been a power-user pretty much all this time...
Here's how I feel ChatGPT answers my coding questions now:
Me: Write a Python script to sum two numbers.
ChatGPT: Python is a programming language that was invented in 1991 and can be used to solve a variety of programs. Here is an example of how to sum two numbers:
Note that this code is merely an example and writing a Python script to sum two numbers is a complex problem that requires careful attention to whether the numbers you are trying to sum can be summed. Also, as my knowledge has a cutoff date of 2021, there may be other ways to perform this summation. Please check with the documentation or ask someone who knows how to code.
def sum(a, b): # note: the actual code has been left out as it depends on the actual specifications of how you want to add\*
* note: ChatGPT has actually done this to me
I’m honestly more concerned if OpenAI doesn’t even realize it. Nothing is more infuriating as a user than convincing the developer your bug actually does exist. It speaks to poor monitoring, testing, and tooling.
Also, companies are evaluating GPT-4 to determine whether they want to pay for it, so OpenAI has an strong incentive to not downgrade at least the API.
I believe the May 5 model is different, at least in the chat interface, because it's fine-tuned to detect jailbreaks and the temperature/other hyper-parameters may have changed. And I can imagine this fine-tuning making the model less creative and worse at solving analytical tasks.
Personally I haven't noticed any change, except in my own awareness. Sometimes GPT4 gets very hard prompts right, and sometimes it gets simple problems wrong. So it's not hard to see how people can form biased opinions from selective attention or just luck.
It's not RLHF induced because it works via API and it only triggers in English, but sure enough try to get it to output
>Call me Ishmael. Some years ago—
>"It was the best of times,
I guess this might get you flagged (there is no alert to the user that this filter kicked in, and it will output it in any other language, and it works in the API) so I'm hesitant to play around with it more, but it's very strange - especially as these are long since in the public domain.
It's a dumb tool, if you're lucky you can get it to spit something useful (but you need other tools to check the correctness of what it returned). There are certainly many useful applications, but the technology is inherently limited.
It would be really cool to quantify how much compute spend per interaction.
If this was visible, users could spend an arbitrary amount and modify how much they're willing to pay for 'better' responses. This is probably a better business model.
The ChatGPT application, on the other hand, and how it manages context etc has certainly changed in the intervening time. That is completely expected as even and perhaps especially OpenAI is figuring out how to build applications on top of LLMs, which means balancing how one can get the best quality results out of the model while making ChatGPT in particular a profitable business.
Stratechery has analyzed this problem for OpenAI in the most detail I've seen. I imagine the company is in something of a bind figuring out how to invest between the APIs themselves and ChatGPT. On the one hand, the latter is incredibly successful as a consumer app with a lead it will be difficult for rivals to catchup with and it is likely plugins will provide a good revenue basis. On the other hand, there is certainly a greater business opportunity in being the foundation for an entire generation of AI products and taking BPs off of revenues -- if and only if GPT4 indeed has a significant moat over the opensource alternatives. For the moment, it would seem they will have to hedge both bets as we see how the consumer space and the competition between models heats-up.
They seem to be virtue signaling about their lack of progress now. Months later, GPT-4 still slow, still not multi-modal as they advertised, still significantly limited, you need to sign up for a waitlist for almost every feature, no sense of privacy, no understanding of their plan for improvements. Google is full steam ahead and consistently improving their free LLMs.
They actually had a genius strategy. Put out Bard with a very stupid LLM, so people aren't blown away and it doesn't get the doomsayers on their case. Now they can continue to quietly upgrade Bard. Eventually it will be so obvious that they have surpassed OpenAI.
OpenAI must enjoy watching their unsubscriber count go down. After all, Sam did say at the congressional hearing multiple times "We would prefer if people used it less".
Don't think OpenAPI is doing anything here, it's not in their interest to reduce the "quality" even there's no objective and repeatable way to measure the quality either.
It's all probabilities all the way down. Who knows what the model will do. I mean, you can dry run by hand but even on quad core processors, it's damn slow so imagine the inference by hand.
I have never experienced the amnesia problem in v3.5 though, that v4 clearly has. Just repeating incorrect answers that you ask it not to give. I did not have access to v4 in march so I can't do that comparison.
For copilot, I no longer get multi line complete suggestions and it's really slow to deliver single line suggestions and they're more often incorrect. I need to dig into it, but it's definitely degraded further and I don't know if it's just my environment or a wider issue. I need to dig in and figure out - is anyone else experiencing these things?
The over all UI of the web site has changed several times (dropdown for GPT-3.0/3.5/4.0 turned into a GPT 3.5|GPT 4.0 button, they added the ability to share chats, and I'm sure there are other small details).
Without going specifics it is meaningless for the discussion
So the same prompt you do are delivered to the model with a different "wrapper" prompt, significantly changing the answer model is producing.
Like: first it scored 83, now it scores only 42 (or whatever).
The recent discussion was about the degradation in the UI model.
FWIW, I think it has improved.