Fine-Tuning GPT-3.5 Turbo – AI Research From The Lab
By Katherine Dixey
September 2023
How good is GPT’s Fine-Tuning?
Overview
We recently performed an investigation into the potential benefits of fine- tuning when applied to OpenAl’s GPT-3.5 Turbo. OpenAl released support for fine-tuning GPT-3.5 Turbo in August 2023, with support for GPT-4 expected later in the year.
We fine-tuned several GPT3.5 models to varying degrees with information about a number of general knowledge concepts and recent historical events to see whether higher quality answers about these topics could be achieved with more efficiency.
Key Concepts
Fine-tuning is the process of using a smaller, specific dataset to train a model for a particular task or knowledge domain. For a Large Language Model (LLM) such as GPT3.5 this might look like:
- Training a chatbot to have a particular tone, or providing it with business- specific knowledge that will enable it to help customers
- Ensuring responses are always returned in a specific format – such as returning JSON for API requests
Deep Dive
OpenAl previously offered fine-tuning capabilities for GPT-3, and recently released support for fine-tuning GPT-3.5 Turbo.
OpenAl claim that fine-tuning can elevate the capabilities of GPT-3.5 Turbo to that of more powerful models (i.e. GPT-4), and can have other benefits such as reducing the number of tokens (units of text) required to achieve the desired response.
Setup & Testing
We tested a narrow language use case, simulating scenarios where natural language queries are cross-referenced with source text in a vector database, with matching results converted back to natural language.
For this, we collected a set of 10 source texts – 5 describing recent significant events (post-2021) such as the Russia-Ukraine war and the discovery of the HD1 galaxy.
The other 5 related to topics that we would expect GPT3.5 to have prior knowledge on such as the definition of DNA, and the difference between Python and JavaScript.
Baseline data (control) was obtained by using the base model GPT-3.5 with no fine tuning. Test data was obtained using fine-tuned variants of GPT-3.5.
We analysed several quantitative and qualitative metrics around the training and usage of the models: cost, total input and output token requirements, time taken and subjective ‘quality’ as assessed through a blind anonymous survey.
Hitting The Limits
- Model limitations: Fine-tuning GPT-3.5 does not give it the ability to generalise using the newly learned data. Fine-tuned models answered well when asked training questions verbatim but fabricated incorrect responses when the test questions were varied slightly.
- Study limitations: this study was limited to a single use case – narrow language. Further studies targeting broad language, consistency of output formatting and response tone should be conducted to evaluate possible benefits of fine-tuning.
- Utility limitations: Testers rated responses from the control dataset (non fine tuned model) higher than the fine tuned responses on average, suggesting that for a narrow language use case, fine-tuning does not produce as good a result as ‘vanilla’ GPT-3.5 Turbo.
What’s Next?
We will continue to explore the capabilities of fine-tuning and its possible application to different business cases. Two areas we intend to investigate further are OpenAl’s reliable response formatting and tone claims.
What’s the verdict?
Fine-tuning GPT-3.5 Turbo proved 90%+ reduction of input prompt sizes are possible, but at double the usage costs.
Like GPT-3 before it, applying fine-tuning to GPT-3.5 should have its utility carefully weighed against its costs for narrow language use cases.
Thanks for checking out our business articles. If you want to learn more, feel free to reach out to Red Marble AI. You can click on the "Let's Talk" button on our website or email Katherine Dixey, our AI expert at k.dixey@redmarble.ai.
We appreciate your interest and look forward to sharing more with you!