extr
I've noticed that when it refuses to answer it's good to "get it talking" about related subject matter, and then try to create a smooth transition toward whatever you wanted it to say/do.
dzink
The only way to do alignment long term would be to have a policing model watching the new models, because no human will be able to keep up with all corner cases as they grow exponentially. l
runnerup
I’d figure it may generally be possible to reverse the actors here and get GPT3.5 to jailbreak GPT4 as well. For now, “offense” seems much easier than defense.
yeldarb
If GPT-4 is talking to another instance of itself vs 3.5 are the results similar? Or is it only good at fooling a less capable version?
zxcvbn4038
This is good to see. I spent a couple weekends playing with ChatGPT and I found it is very sensitive to wording. One word gets you a lecture that it is just AI language model and can't do this or that, use an synonym and it happily spews pages of results. In another situation I asked chatgpt to summarize information from an article it cited that had been deleted - and it refused because the rights holder might have deleted the article for a reason. I told it the article had been restored by the author and it produced a summary. Mentioning Donald Trump by name often gets you lectured about controversial subjects, "45th president" does not. And so on.
mdale
The real test is the other way around ;) ... will smaller models / less compute be able to subvert larger models with larger compute ? As they get more complex and have more connected systems that would be problematic I think.
sr.ht