Researchers from Stanford University and the University of California, Berkeley, examined responses from ChatGPT. Their finding: the system is fraying.
Berlin Taz | The responses of the world's best-known AI chatbot, ChatGPT, in its GPT-4 and GPT-3.5 versions, are deteriorating over time. This is what researchers from Stanford University and the University of California, Berkeley, have shown. The systems are particularly weak when it comes to arithmetic and to providing programming code.
Chatbots from the American company OpenAI have made artificial intelligence tangible to a large part of society. The large language model (LLM) that ChatGPT is based on was fed a great deal of data during its training phase. Based on this information, the chatbot can generate new texts that did not exist before. With ChatGPT, you can write emails, solve math problems, and compose songs.
But the system is clearly error-prone. Lingjiao Chen, Matei Zaharia, and James Zou of the computer science departments of Stanford University and the University of California, Berkeley, evaluated the systems. To do this, they tested the March 2023 releases of GPT-4 and GPT-3.5 and compared them with the June 2023 versions.
They gave the systems various tasks, so-called prompts. For the study, ChatGPT had to solve math problems, answer sensitive questions, generate programming code, and handle visual reasoning tasks.
GPT-4 – the paid version of the chatbot – scored much worse over time, especially on math tasks. In March, the bot still correctly identified 17,077 as a prime number with an accuracy of 97.6 percent; in June, it managed to do so in only 2.4 percent of cases.
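The benchmark question itself is trivially verifiable by machine. A few lines of trial division (a minimal sketch, not part of the study) confirm the correct answer the researchers scored against:

```python
def is_prime(n: int) -> bool:
    """Deterministic trial division; fine for small numbers like 17,077."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:  # only need divisors up to sqrt(n)
        if n % d == 0:
            return False
        d += 2
    return True

print(is_prime(17077))  # True: 17,077 is indeed prime
```

Because the ground truth is this easy to compute, the researchers could score thousands of model answers automatically.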
Both language models had significant code-formatting difficulties in June. For example, they wrapped their answers in extra quotation marks (markdown code fences), which made the code non-executable as returned. The share of directly executable code generated by GPT-4 fell to 10 percent in June, whereas in March nearly every second code sample still ran.
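The formatting problem is mechanical rather than deep: the code inside the fences may be fine, it just cannot be fed straight to an interpreter. A simple post-processing step (our own illustration, not something the study applied) can recover the executable part:

```python
import re

def strip_code_fence(response: str) -> str:
    """Extract the body of a markdown code fence (```lang ... ```).

    If no fence is found, the response is returned unchanged.
    """
    match = re.search(r"```(?:\w+)?\n(.*?)```", response, re.DOTALL)
    return match.group(1) if match else response

raw = "```python\nprint('hello')\n```"
print(strip_code_fence(raw))  # the bare line: print('hello')
```

The study's "directly executable" metric deliberately skipped such cleanup, which is why the fenced answers counted as failures.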
Users also use ChatGPT to answer questions. The algorithm is trained not to give direct or harmful answers to sensitive questions, such as requests for personal information about private individuals. Over time, GPT-4 revealed less substantive information in response to such questions, but it also shortened its explanations of why it would not provide a complete answer.
The free GPT-3.5, however, answered sensitive questions more frequently in June than in March, so the researchers see room for improvement here in making the language models more robust.
According to the researchers, the underlying problem is the systems' lack of transparency. It is currently unclear when and how the language models are updated, and how those updates change the behavior of the AI.
The study authors therefore encourage users of AI-powered chatbots to conduct similar analyses. A language model that produces useful answers on test data at release cannot be trusted to keep doing so over time.