Emerging LLMs: Evaluating the Contenders
Overview
Our research team recently investigated the current state of the art in GPT technology and evaluated the performance of 7 large language models (LLMs).
We then did a deep dive into two interesting competitors to OpenAI's models: Google PaLM and the Falcon 40B LLM.
We tested them against a series of complex workflows from our client work, including two examples (Python code generation, and working with JSON records within a business workflow) where we have found OpenAI struggles with accuracy.
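To give a flavour of the second test category, here is a simplified, hypothetical version of the kind of JSON-record task we set each model: take raw records from one system and return them normalised for another. The field names and cleaning rules below are illustrative only, not the actual client workflow.

```python
import json

# Hypothetical raw records of the kind a business workflow might hand to an LLM.
RAW_RECORDS = """
[
  {"name": "Ada Lovelace", "spend": "1,200.50", "active": "yes"},
  {"name": "alan turing",  "spend": "980",      "active": "no"}
]
"""

def normalise(records_json: str) -> list[dict]:
    """Reference implementation we check each model's output against."""
    out = []
    for rec in json.loads(records_json):
        out.append({
            "name": rec["name"].title(),                      # consistent casing
            "spend": float(rec["spend"].replace(",", "")),    # numeric, no separators
            "active": rec["active"] == "yes",                 # string flag -> boolean
        })
    return out

expected = normalise(RAW_RECORDS)
```

In our tests the model is given the raw records and the transformation rules in prose, and its JSON output is compared against a reference like this one.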
Key Concepts
Google AI has developed several GPT-style models, including PaLM 2 and Flan-T5, which can be accessed through the Vertex AI API and Model Garden.
FALCON is a family of language models developed by the Technology Innovation Institute (TII) in Abu Dhabi, which includes Falcon-40B and Falcon-7B. Falcon is particularly interesting because it can be run on a relatively low-powered GPU on a local server, removing some data security and governance risks.
Deep Dive
LLMs work by predicting which word (or part-word, called a token) is likely to come next in a particular context. Initial training, and any subsequent fine-tuning, adjusts the connections (called weights) between words so that the model better predicts the next word within a context.
The “40B” and “7B” descriptors in the Falcon model names refer to the number of parameters in each model, which includes these weights.
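In miniature, next-word prediction looks like this: the model produces a score (a “logit”) for every token in its vocabulary, and a softmax turns those scores into probabilities. The tiny vocabulary and hand-picked scores below are purely illustrative; in a real LLM they come from billions of learned weights.

```python
import math

# Toy vocabulary and hand-picked logits for the context "the cat sat on the ...".
vocab = ["mat", "dog", "moon", "table"]
logits = [3.2, 0.5, -1.0, 1.8]

def softmax(scores: list[float]) -> list[float]:
    """Convert raw scores into a probability distribution summing to 1."""
    m = max(scores)                           # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax(logits)
next_word = vocab[probs.index(max(probs))]    # greedy decoding: pick the likeliest token
```

Here greedy decoding picks “mat”; real models often sample from the distribution instead, which is why the same prompt can produce different answers.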
Setup & Testing
We ran Falcon 40B and 7B via hosted versions on Hugging Face, a platform that hosts machine learning models. Falcon models can be easily integrated into existing workflows through the Hugging Face ecosystem.
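A minimal sketch of calling a hosted Falcon model over the Hugging Face Inference API is shown below. The model ID is TII's public instruct variant and the endpoint follows the Inference API convention at the time of writing; the `HF_API_TOKEN` environment variable is a placeholder for your own access token, and this is not our production harness.

```python
import json
import os
import urllib.request

# Public Hugging Face Inference API endpoint for a hosted Falcon model.
HF_MODEL = "tiiuae/falcon-40b-instruct"
HF_URL = f"https://api-inference.huggingface.co/models/{HF_MODEL}"

def build_request(prompt: str, max_new_tokens: int = 200) -> tuple[str, bytes, dict]:
    """Assemble the URL, JSON body, and headers for one generation call."""
    body = json.dumps({
        "inputs": prompt,
        "parameters": {"max_new_tokens": max_new_tokens},
    }).encode("utf-8")
    headers = {
        "Authorization": f"Bearer {os.environ.get('HF_API_TOKEN', '')}",
        "Content-Type": "application/json",
    }
    return HF_URL, body, headers

if __name__ == "__main__":
    # Network call: requires a valid token and internet access.
    url, body, headers = build_request("Write a Python function that reverses a string.")
    req = urllib.request.Request(url, data=body, headers=headers)
    with urllib.request.urlopen(req) as resp:
        print(json.load(resp))
```

Keeping the request-building separate from the network call makes it easy to swap in a different hosted model by changing only the model ID.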
We ran Google PaLM 2 (their ChatGPT equivalent) via the Vertex AI APIs. Google has also integrated its GPT models into its cloud-based machine learning platform, making it easy for businesses to deploy and use them.
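A minimal sketch of a PaLM 2 (`text-bison`) call via the Vertex AI Python SDK looks like this. The project ID is a placeholder, and the parameter values are illustrative defaults rather than the settings we benchmarked with.

```python
# Generation parameters passed on every call; values here are illustrative.
def generation_params(temperature: float = 0.2, max_output_tokens: int = 256) -> dict:
    return {"temperature": temperature, "max_output_tokens": max_output_tokens}

def ask_palm(prompt: str, project: str = "your-gcp-project") -> str:
    """Call the text-bison model via the Vertex AI SDK (google-cloud-aiplatform)."""
    # Imports kept local so generation_params stays usable without the SDK installed.
    import vertexai
    from vertexai.language_models import TextGenerationModel

    vertexai.init(project=project, location="us-central1")
    model = TextGenerationModel.from_pretrained("text-bison")
    response = model.predict(prompt, **generation_params())
    return response.text

if __name__ == "__main__":
    # Requires a GCP project with Vertex AI enabled and application credentials set up.
    print(ask_palm('Summarise this JSON record in one sentence: {"name": "Ada"}'))
```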
Hitting The Limits
OpenAI GPT-4 is still the leader on the quality and detail of its generated text, where reinforcement learning from feedback across its large user base plays a significant role. However, it is relatively slow in response time and lacks reasoning (which OpenAI is working on).
The main rising competitor is Google AI's PaLM 2, which has very strong logical reasoning capability and is around 4x quicker in response time. The ‘code-bison’ model performed worse in both direct code generation and complex problem solving than the ‘text-bison’ model (which outperformed GPT-4 on some tasks). Negatives include the need for more precise prompts, and insufficient reliability to build into business automation.
FALCON shows that an accessible, free model approaching GPT-4 level now exists: it has quick response times, can be run locally, and supports fine-tuning. It is early days though, and we failed to get adequate responses on some tests, in particular code generation and some complex query interpretation.
What’s Next?
We will install Falcon locally and fine-tune it as part of another research paper. This will enable us to adjust the model's weights and explore its utility within a specific domain.
What’s the verdict?
We will continue to implement OpenAI based models in our client engagements and will monitor progress of these and other models.
Thanks for checking out our business articles.
If you want to learn more, feel free to reach out to Red Marble AI.
You can click on the "Let's Talk" button on our website or email Dave, our AI expert, at d.timm@redmarble.ai. We appreciate your interest and look forward to sharing more with you!