Author: Stu Feeser

Full Disclosure!

Read about The Best LLM on the Planet. In this Alta3 repository, an exceptional software engineer, along with a team of dedicated developers, wrote the definitive work on how to deploy llama.cpp on bare-metal and cloud-based A100 or H100 GPUs. The Ansible playbooks in the repository just work, and they support these NVIDIA powerhouse GPUs.

Now, let’s take a look at Ollama. This project is worth watching as it is positioning itself as the “Docker” of AI models.

Apparently, I’m not the only one who thinks llama.cpp is amazing. Here is jmorgan’s response to joebiden2 on Hacker News:

joebiden2’s comment:

What does this add over llama.cpp? Is it just an “easier” way to set up llama.cpp locally?
If so, I don’t really get it, because setting up llama.cpp locally is quite easy and well documented. And this appears to be a fork. Seems a bit fishy to me when looking at the other “top” comments (with this one having no upvotes, but still #2 right now).
(llama.cpp’s original intention is identical to yours: The main goal of llama.cpp is to run the LLaMA model using 4-bit integer quantization on a MacBook¹)

jmorgan’s response:

“The llama.cpp project is absolutely amazing. Our goal was to build with/extend the project (vs. trying to be an alternative). Ollama was originally inspired by the “server” example. This project builds on llama.cpp in a few ways:

  1. Easy install! Precompiled for Mac (Windows and Linux coming soon)
  2. Run 2+ models: loading and unloading models as users need them, including via a REST API. Lots to do here, but even small models are memory hogs and they take quite a while to load, so the hope is to provide basic “scheduling.”
  3. Packaging: content-addressable packaging that bundles GGML-based weights with prompts, parameters, licenses, and other metadata. Later, the goal is to bundle embeddings and other larger files custom models (for specific use cases, a la PrivateGPT) would need to run.”
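To make the “content-addressable packaging” idea in point 3 concrete, here is a minimal Python sketch. It is not Ollama’s actual on-disk format or API; `BlobStore`, `build_manifest`, and the layer role names are hypothetical, made up purely for illustration. The only idea it demonstrates is that each piece of a model bundle (weights, prompt, parameters) is stored under the hash of its own bytes, so identical content is stored once and can be verified on read.

```python
import hashlib
import json


def blob_digest(data: bytes) -> str:
    """Return the content address of a blob: its SHA-256 digest."""
    return "sha256:" + hashlib.sha256(data).hexdigest()


class BlobStore:
    """In-memory content-addressable store (hypothetical, for illustration only)."""

    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        digest = blob_digest(data)
        # Identical bytes always hash to the same key, so they are stored once.
        self._blobs.setdefault(digest, data)
        return digest

    def get(self, digest: str) -> bytes:
        data = self._blobs[digest]
        # The address doubles as an integrity check when reading back.
        if blob_digest(data) != digest:
            raise ValueError("blob does not match its digest")
        return data

    def __len__(self) -> int:
        return len(self._blobs)


def build_manifest(store: BlobStore, layers: dict) -> dict:
    """Store each layer and return a manifest that refers to layers by digest.

    `layers` maps a role name (an assumption of this sketch, e.g. "weights")
    to the raw bytes for that role.
    """
    return {
        "layers": [
            {"role": role, "digest": store.put(data), "size": len(data)}
            for role, data in layers.items()
        ]
    }


# Two "models" that share the same weights but use different prompts.
store = BlobStore()
weights = b"pretend these bytes are GGML weights"
params = json.dumps({"temperature": 0.7}).encode()

assistant = build_manifest(
    store, {"weights": weights, "prompt": b"You are a helpful assistant.", "params": params}
)
pirate = build_manifest(
    store, {"weights": weights, "prompt": b"You are a pirate."}
)
```

Because both manifests point at the same weights digest, the store holds the weights only once even though two models reference them. That is the property that makes this style of packaging attractive when the weights are many gigabytes and the prompt or parameters are a few bytes.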