Battle of the LLMs

Can smaller models take on GPT4?

Written by Jake Bowles, Expert Thinking’s Data & AI Practice Lead.

In this article I will try to compare different LLMs in terms of performance and quality of response.

To conduct the testing I will be using danswer.ai, which allows me to query Ollama with vector search built in. In my eyes, this is a good real-world test of each model. I will run everything locally using Docker Compose.

danswer.ai is effectively a self-hosted ChatGPT equivalent that allows different models to be selected and used. It’s easy to configure and provides an intuitive frontend.

To select the models, I chose the most capable models hosted by Ollama that I could run on both my MacBook Pro (M3 Pro) and my Ubuntu desktop (with an NVIDIA 3070), focusing on those described as “general use” or “conversational”. GPT4-Turbo will be used as the OpenAI benchmark; perhaps an unfair comparison (because it isn’t run locally), but a relevant one when deciding which model to use for RAG and chatbots.

Models

Phi-2 - 1.6GB

Mistral - 4.1GB

Llama2 - 3.8GB

GPT4-Turbo - not locally hosted

Orca Mini (13b) - 7.4GB

Testing

Firstly I indexed the Expert Thinking website to give danswer.ai some context:

I then pulled all the models into the Ollama docker container.
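The pulls were done inside the container; as a rough equivalent, the same step could be scripted with Ollama's Python client. This is a sketch only, and it assumes the container exposes the default API port (11434) on localhost and that these model tags match the models listed above.

# Sketch only: pull the locally hosted models through Ollama's Python client.
# Assumes `pip install ollama` and an Ollama container listening on localhost:11434.
import ollama

# Model tags are my assumed Ollama equivalents of the models listed above.
MODELS = ["phi", "mistral", "llama2", "orca-mini:13b"]

for model in MODELS:
    print(f"Pulling {model}...")
    ollama.pull(model)  # downloads the model into the Ollama instance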

I then prompted all LLMs with the same query: “Tell me about Expert Thinking”.
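The comparison itself was run through danswer.ai's chat frontend, which adds the vector-search context. Purely as a sketch of the comparison loop (without danswer's retrieval layer, so not exactly what was tested), the same prompt could be sent to each local model directly via the Ollama Python client:

# Sketch only: send the same prompt to each pulled model and print the replies.
# This bypasses danswer.ai's retrieval, so responses will lack the indexed context.
import ollama

PROMPT = "Tell me about Expert Thinking"

for model in ["phi", "mistral", "llama2", "orca-mini:13b"]:
    reply = ollama.chat(model=model, messages=[{"role": "user", "content": PROMPT}])
    print(f"--- {model} ---")
    print(reply["message"]["content"])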

Mistral

A poor response; Mistral struggled to correctly reference the documents.

Phi-2

A pretty good response, though it was returned in code format for some reason. Documents were pulled correctly and used as context.

Llama2

This was a tough model to get a relevant response from: it chose not to use the context search and used by far the most compute power to generate responses. The best response I could get was still entirely irrelevant:

I suspect this is a hardware limitation, given how CPU-intensive the model was to run.

Orca-mini

Completely irrelevant; note the random document query:

GPT4

No surprises here, the best response by a mile:

So when would you use smaller models?

Let’s try some code examples.

I have written some inefficient Python code and I am going to ask the LLMs to refine it.
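The original snippet isn’t reproduced here, but to give a flavour of the task, the inefficient code was along these lines (an illustrative stand-in, not the actual snippet from the test):

# Illustrative stand-in for the kind of deliberately inefficient code used:
# an O(n^2) duplicate finder with repeated membership checks.
def find_duplicates(items):
    duplicates = []
    for i in range(len(items)):
        for j in range(len(items)):
            if i != j and items[i] == items[j] and items[i] not in duplicates:
                duplicates.append(items[i])
    return duplicates

A sensible refinement collapses this to a single pass with a set of seen values, which is the sort of improvement each model was asked to make.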

GPT4

A good baseline: it did the job and provided context.

Mistral

It did the task at hand, returned the response quickly and provided context. The code is arguably worse than GPT4’s, but it was still an improvement.

Orca-mini

Rubbish:

Llama2

The best response in the test, in my opinion. It answered the question and provided context:

Phi-2

A very good attempt, better than GPT4 but with less context.

Conclusion

In summary, no model should be considered a catch-all for every use case. Each should be evaluated and tuned for the use case it is suited to, whether that is conversation, coding or a bespoke task.

OpenAI’s GPT4 is currently the overall best generalist, but it won’t stay that way forever; eventually smaller, open-source models will achieve the same results for far less cost. Hopefully this article has demonstrated the value of testing different models, and how doing so can lead to quite substantial cost savings: the only cost of the models (other than GPT4) is the compute power required, and if my MacBook can run them now, it won’t be long before even entry-level hardware can do the same.
