With the debut of ChatGPT, a novel AI language model, in late 2022, OpenAI took the world by storm. The success of the AI service opened the door for a one-of-a-kind AI race, with hundreds of tech companies vying to imitate it.
While the service has received some criticism, OpenAI has continued upgrading it, honing the language model to be as polished as possible. ChatGPT seemed to have found its stride after a couple of revisions.
The most recent version of the pioneering AI language model caused enough of a stir to prompt demands that development be halted. A new study, however, suggests the AI bot may have suffered a setback, with its performance in decline.
The ChatGPT study
Between March and June 2023, researchers from Stanford and UC Berkeley conducted a study in which they rigorously analyzed multiple versions of ChatGPT. They devised stringent criteria to assess the chatbot’s proficiency in coding, arithmetic, and visual reasoning tasks. ChatGPT’s performance did not hold up well.
According to the test results, there was a worrying reduction in performance between the versions tested. In March, ChatGPT answered 488 of 500 questions correctly in a math challenge involving prime numbers, a 97.6% accuracy rate. By June, accuracy had fallen to 2.4%, with only 12 questions answered correctly.
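To make the numbers concrete, the challenge asked the model whether given numbers are prime. Below is a minimal sketch of how such yes/no answers could be scored against locally computed ground truth; the sample data and scoring code are illustrative, not the study’s actual harness.

```python
# Hypothetical scoring sketch -- not the study's evaluation code.
# Ground truth comes from local trial division; the model answers
# below are invented stand-ins for real chatbot output.

def is_prime(n: int) -> bool:
    """Primality by trial division (adequate for numbers this size)."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

# (number, model's yes/no answer) pairs
model_answers = [(17077, "yes"), (20019, "no"), (9409, "yes")]

correct = sum((ans == "yes") == is_prime(n) for n, ans in model_answers)
print(f"accuracy: {correct / len(model_answers):.1%}")
# Here 2 of 3 answers are right; the study's 488/500 works out to 97.6%.
```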
The deterioration was just as evident when the chatbot’s software development abilities were examined.
“For GPT-4, the percentage of generations that are directly executable dropped from 52.0% in March to 10.0% in June,” the study said.
These findings were obtained using the models’ pure versions, without any code interpreter plugins.
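“Directly executable” here means the model’s raw output runs as-is. The sketch below, assuming a Python target, shows one way that property could be checked; it mirrors the idea the study describes rather than its actual evaluation code.

```python
# Hypothetical executability check -- illustrative only.
import os
import subprocess
import sys
import tempfile

def directly_executable(generated_code: str, timeout: float = 5.0) -> bool:
    """True if the model's raw output runs unmodified under Python."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(generated_code)
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, path], capture_output=True, timeout=timeout
        )
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False
    finally:
        os.unlink(path)

# Output wrapped in Markdown fences fails the check because the fence
# characters are not valid Python -- one way the executable fraction can
# fall without the underlying logic getting any worse. (Backticks are
# spelled out here to keep this example fence-safe.)
fence = "`" * 3
print(directly_executable(f"{fence}python\nprint('hi')\n{fence}"))  # False
print(directly_executable("print('hi')"))                          # True
```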
When it came to reasoning, the researchers used visual prompts and a dataset from the Abstraction and Reasoning Corpus (ARC). There was a clear drop, although it was not as steep as in math and coding.
“GPT-4 in June made mistakes on queries on which it was correct in March,” the study said.
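For readers unfamiliar with ARC, each task is a small colored-grid puzzle, and putting one in front of a text model means serializing the grid into the prompt. The toy task below (mirror each row) is our own illustration, not one drawn from the dataset.

```python
# Toy ARC-style task, invented for illustration: grids become digit
# matrices so a text-only model can be prompted with them.
task_input = [
    [0, 0, 1],
    [0, 1, 0],
    [1, 0, 0],
]
expected_output = [row[::-1] for row in task_input]  # mirror each row

def to_prompt(grid: list[list[int]]) -> str:
    """Serialize a grid as lines of space-separated digits."""
    return "\n".join(" ".join(str(cell) for cell in row) for row in grid)

print("Transform the input grid into the output grid.")
print("Input:\n" + to_prompt(task_input))
print("Expected output:\n" + to_prompt(expected_output))
```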
Possible reasons for the decline
The drop was unexpected, raising the question: what could explain ChatGPT’s painfully obvious downgrades in recent months? One hypothesis the researchers offer is that the decline is a side effect of OpenAI’s optimizations.
Another plausible reason is that the adjustments were implemented as a precaution to prevent ChatGPT from responding to harmful inquiries. However, the safety alignment may limit ChatGPT’s use for other activities.
The model, according to the researchers, has a propensity to offer wordy, indirect solutions rather than unambiguous ones.
On Twitter, AI researcher Santiago Valderrama commented, “GPT-4 is getting worse over time, not better.” He also hinted that a cheaper, quicker mix of models may have replaced the original ChatGPT framework.
“Rumors suggest they are using several smaller and specialized GPT-4 models that act similarly to a large model but are less expensive to run,” he noted.
Valderrama also stated that using smaller models might result in speedier answers, but at the expense of expertise.
“There are hundreds (maybe thousands already?) of replies from people saying they have noticed the degradation in quality,” Valderrama continued. “Browse the comments, and you’ll read about many situations where GPT-4 is not working as before.”
Other insights
Trying to make sense of the data, Dr. Jim Fan, another AI researcher, shared some of his observations on Twitter, relating them to how OpenAI refines its models.
“Unfortunately, more safety typically comes at the cost of less usefulness, leading to a possible degrade in cognitive skills,” he wrote.
“My guess (no evidence, just speculation) is that OpenAI spent the majority of efforts doing lobotomy from March to June, and didn’t have time to fully recover the other capabilities that matter.”
Fan also pointed out that the safety alignment made code needlessly lengthy, mixing in irrelevant material regardless of the prompts.
“I believe this is a side effect of safety alignment,” he offered. “We’ve all seen GPTs add warnings, disclaimers, and back-pedaling.”
Fan said that cost-cutting measures, as well as the introduction of warnings and disclaimers, might have contributed to ChatGPT’s decline. Furthermore, the absence of widespread community feedback might have played a role. Although more testing is required, the results corroborate users’ concerns about the diminishing coherence of ChatGPT’s once-highly-lauded outputs.
To avoid future deterioration, enthusiasts have advocated for open-source models such as Meta’s LLaMA, which allow for community debugging. They also stressed the need for constant benchmarking to detect regressions, along the lines sketched below.
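To give a sense of what that benchmarking involves, the sketch below runs a frozen prompt set against two model snapshots and flags any drop in accuracy. The model arguments are hypothetical stand-ins for whatever API or local model is being tested.

```python
# Minimal regression-benchmark sketch. The model arguments are
# hypothetical callables (prompt -> answer); swap in a real client.
from typing import Callable

Model = Callable[[str], str]

def accuracy(model: Model, cases: list[tuple[str, str]]) -> float:
    """Fraction of prompts whose answer matches the expected string."""
    return sum(model(p).strip() == want for p, want in cases) / len(cases)

def check_regression(old_model: Model, new_model: Model,
                     cases: list[tuple[str, str]],
                     tolerance: float = 0.02) -> None:
    """Print a warning when the newer snapshot scores measurably lower."""
    old_acc, new_acc = accuracy(old_model, cases), accuracy(new_model, cases)
    status = "REGRESSION" if new_acc < old_acc - tolerance else "OK"
    print(f"{status}: {old_acc:.1%} -> {new_acc:.1%}")

# A frozen prompt set, pinned in version control so every snapshot is
# measured against exactly the same questions.
cases = [("Is 17077 a prime number? Answer yes or no.", "yes")]
```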
Meanwhile, ChatGPT fans should temper their expectations, as the once-groundbreaking AI chatbot appears to have degraded in quality.