Ten students and four faculty members at the University of California at Berkeley, drawn from its AI research and computer science departments, formed a group called the Large Model Systems Organization (LMSYS.Org). Here's what they found when they gave people the ability to compare two models simultaneously.
The group created an experiment titled the Chatbot Arena -- a custom website where anyone can anonymously chat with two models at once, such as Microsoft Bing, OpenAI's ChatGPT, Google Bard, Anthropic's Claude, and other AI models.
Once the user forms an opinion about which chatbot's answers they prefer, they are asked to vote for a favorite. After that vote, the person discovers which models they have been communicating with.
The site uses the same large language models (LLMs) that power ChatGPT and others, and repackages the LLMs in a new interface, since companies like OpenAI have made them publicly available. The site also contains smaller models created by individuals. The study was first reported by PCMag.
Hao Zhang, one of the professors at Berkeley leading this effort, told PCMag that the group has steadily added models since April, and around 40,000 people have participated. The idea is to give two AI models the ability to compete for the best response.
The models also have deficiencies, according to the study. The group found that PaLM 2 -- Google’s large language model with multilingual, reasoning and coding capabilities at the heart of the Google Cloud Vertex API -- has several deficiencies when compared with other models evaluated:
Findings from the study suggest that PaLM 2 is more regulated than other models.
In many user conversations, when users ask questions that PaLM 2 is uncertain about or uncomfortable giving an answer to, it is more likely to not give an answer.
Based on a rough estimate, among all the head-to-head matchups PaLM 2 lost in this analysis, 20.9% were lost because it refused to answer. Among its losses to chatbots that do not belong to the top four (GPT-4, Claude-v1, ChatGPT, Claude-instant-v1), 30.8% were due to a refusal to answer.
As of the writing of the lmsys.org blog posted nearly one month ago, the group found weak multilingual abilities in PaLM 2 with the current public API, chat-bison@001, at Google Vertex API.
PaLM 2 tends to not answer non-English questions, including questions written in popular languages such as Spanish, Chinese and Hebrew. Researchers said they were unable to reproduce several multilingual examples demonstrated in the PaLM 2 technical report using the current PaLM 2 versions.
OpenAI’s GPT-4 today tops the list with an Elo rating of 1,225. The Elo rating system is a method for calculating the relative skill levels of players in zero-sum games such as chess. The group used it to rank the chatbots based on users' votes.
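For readers curious how an Elo rating moves after a single vote, here is a minimal sketch of the textbook Elo update. It is not taken from the study: the K-factor of 32, the starting rating of 1,200, and the function names are assumptions chosen for illustration, and the group's actual computation may differ.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return both ratings after one matchup.

    score_a is 1.0 if A won the vote, 0.0 if B won, 0.5 for a tie.
    k (assumed 32 here) controls how far one result moves a rating.
    """
    expected_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - expected_a)
    # Zero-sum: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta


# Two hypothetical chatbots start level; a user votes for the first one.
winner, loser = update_elo(1200.0, 1200.0, score_a=1.0)
print(winner, loser)
```

Because the expected score depends on the rating gap, beating a much higher-rated chatbot moves a rating more than beating a weaker one.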
According to the study, ChatGPT and Microsoft Bing are the most accessible favorites. The AI model behind Google Bard -- PaLM 2 -- ranks sixth.
Zhang also told PCMag he believes two issues need more attention: data privacy and high-quality data to power the models.
If any AI model can generate its own content using what is available on the web, there will not be an incentive for humans to create new and better content.