AI Generated Image

LLM Zero-to-Hero with Ollama

Robert McDermott

Set up your own private Generative AI environment using Ollama

Introduction

Over the past two years, the difficulty of self-hosting Large Language Models (LLMs) has decreased significantly. In this article, I’ll show you how to set up your own private LLM environment using Ollama.

Note: While LLMs perform best with a GPU, it is still possible to run LLMs on CPU-only systems, albeit much more slowly. All the examples in this article were performed on a 2020 MacBook Air (M1) with 16GB of RAM.

Topics covered in this article:

  • Why self-host LLMs
  • Introduction to Ollama
  • Downloading and running LLMs with Ollama
  • Creating a custom Ollama model: an emoji-speaking space alien
  • Using Ollama’s APIs, both the native and OpenAI-compatible interfaces
  • Setting up a ChatGPT-like web interface for Ollama with Open WebUI
  • Using Ollama as a coding assistant with Aider

Why Run LLMs locally?

There are several key reasons to consider self-hosting LLMs:

  1. Security: The content of your conversations, documents, and source code remains entirely under your control. Your data cannot be used to train commercial models or for other purposes without your knowledge.
  2. Privacy: You can explore topics privately without exposing sensitive information to second or third-party vendors.
  3. Costs: It’s free since all the software used is open-source and no subscriptions are required.
  4. Variety: There is a wide selection of open LLMs available to download. These models come from companies such as Meta, Google, and Microsoft, and include many variations: large and small, general-purpose and domain-specific, as well as third-party fine-tunes.

What is Ollama?

Ollama is a platform that simplifies the process of running large language models (LLMs) privately on your own computers without relying on an external vendor. This provides more control, privacy, and flexibility.

Key Features of Ollama:

  1. Curated model library: Download LLMs that are guaranteed to work seamlessly with Ollama using a single command.
  2. Ease of use: It’s very simple to use, with a familiar Docker-like command-line interface for downloading and running models.
  3. Headless: It runs as a background service and has no GUI, which makes it work well on servers.
  4. Cross-platform compatibility: Available on macOS, Windows, and Linux.
  5. API interfaces: Provides both a native Ollama API and an OpenAI-compatible API.
  6. Offline Capability: Ollama can function without an internet connection.

Installing Ollama

Installing Ollama on your computer is very simple. Just visit the downloads page, select your operating system, and either download the installer and run it (macOS and Windows), or run the provided command on Linux.

Image of OS selection from the Ollama downloads page

If you are using Windows or macOS, the installation process is straightforward, and similar to installing any typical application. On Linux, the curl-based installation method requires root access, either by being in a root shell or by having sudo privileges.

If you don’t have root or sudo access on your Linux system, you can still install and run Ollama in your home directory using the following commands:

mkdir ~/ollama
curl -L https://ollama.com/download/ollama-linux-amd64.tgz -o ollama-linux-amd64.tgz
tar -C ~/ollama -xzf ollama-linux-amd64.tgz
OLLAMA_HOST="127.0.0.1:11434" nohup ~/ollama/bin/ollama serve < /dev/null &

If you are using the above method on a shared system, consider changing the port number from the default of 11434 to something unique to you (and set the same OLLAMA_HOST value when running ollama client commands so they can find your server).

To verify that Ollama is running you can check the version; if the Ollama server isn’t running it will let you know.

ollama --version

Example of what it should look like if it’s working:

Checking the version of Ollama that is running

At this point, Ollama should be running on your computer and you are ready to download your first large language model.
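
If you prefer to check programmatically (handy on a headless server or if you changed the port), Ollama also answers simple HTTP requests. Below is a minimal sketch using only the Python standard library that queries the version endpoint; adjust OLLAMA_URL if you chose a custom host or port:

import json
import urllib.request

# Change this if you started Ollama on a non-default host/port
OLLAMA_URL = "http://127.0.0.1:11434"

try:
    with urllib.request.urlopen(f"{OLLAMA_URL}/api/version", timeout=5) as resp:
        info = json.load(resp)
    print(f"Ollama is running, version {info.get('version')}")
except OSError as err:
    print(f"Ollama doesn't appear to be running at {OLLAMA_URL}: {err}")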

Downloading and Managing Models

If you are familiar with Docker, Ollama’s command-line interface will feel very similar. To download a model, you simply use the pull command.

But before you can download a model, you have to know what model you want to use. Visit the Ollama Library page and find a model you are interested in. For this tutorial we’ll be using the Llama3.2 model from Meta. This model is small enough (3 billion parameters) to run on a modest computer but will still provide good responses (mostly). After you are comfortable with Ollama, you should spend some time looking around the model library and try other models.

To download the Llama3.2 model, you only need to run the following command:

ollama pull llama3.2

Example pulling the llama3.2 model:

Downloading a model with Ollama

Now that we have a model, we can list the models that are available locally with the list command:

ollama list

Here’s the output of the list command on my laptop; it shows the new model we just pulled down along with the other models that were previously downloaded:

Listing the models that have been downloaded with Ollama
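
If you want the same information from a script, the native REST API exposes it as well. Here's a minimal sketch (assuming the default address) that queries the /api/tags endpoint, which returns the same locally downloaded models that the list command prints:

import json
import urllib.request

# List locally downloaded models via Ollama's native REST API
with urllib.request.urlopen("http://127.0.0.1:11434/api/tags", timeout=5) as resp:
    data = json.load(resp)

for model in data.get("models", []):
    size_gb = model.get("size", 0) / 1e9
    print(f"{model['name']}  {size_gb:.1f} GB")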

Now that we have a model downloaded, we are ready to run it.

Running Models

To run the model and ask it a question, we can either give it a single non-interactive prompt or start an interactive session where we can have an ongoing conversation.

To provide a single non-interactive prompt and receive a response use the run command followed by your prompt:

ollama run llama3.2 "Explain why the sky is blue in a single sentence" 

Output from the above command:

Asking a model a single question and exiting

To start an interactive session where you can have a running conversation use the run command without a prompt:

ollama run llama3.2

Example of an interactive chat session:

An interactive session with a model

Multimodal Models

Ollama also supports “multi-modal” models that have vision capability. Vision models can “see” pictures and describe what’s in them. In this example, we’ll use Moondream, a small vision model: first we’ll pull down the model, then run it and ask it to describe the following image:

Image used to test a multimodal (vision) model

Here are the commands that we’ll use to download, run and ask Moondream to describe the above picture:

ollama pull moondream
ollama run moondream

>>> Describe this picture: /Users/robertm/Documents/picture-123.png

Below is the description Moondream provided for this picture:

Asking the Moondream model to describe a local image file

Yep, that’s correct; it can see 👁️
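
You can do the same thing from a script. Here's a minimal sketch using the ollama Python package (installed with pip install ollama, which we'll use again in the API section below); it assumes vision models accept local image paths via the images field of a chat message, and it reuses the image path from the example above:

import ollama

# Ask a vision model to describe a local image file
response = ollama.chat(
    model='moondream',
    messages=[{
        'role': 'user',
        'content': 'Describe this picture',
        'images': ['/Users/robertm/Documents/picture-123.png'],  # path to your image
    }],
)
print(response['message']['content'])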

Checking which models are loaded

You can use the ps command to check which models are loaded into memory, where they are running (CPU or GPU), and how much longer they will stay loaded until they are automatically removed due to inactivity (5 minutes by default). In the example below, we can see that the models used in our previous examples are still loaded, and that they are both running on the GPU (rather than the CPU) and fit in the available RAM:

Using Ollama’s ‘ps’ command to see what models are loaded
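
Recent Ollama releases also expose this information through the native API, so you can check it from a script. Here's a minimal sketch, assuming your Ollama version provides the /api/ps endpoint:

import json
import urllib.request

# List the models currently loaded into memory
with urllib.request.urlopen("http://127.0.0.1:11434/api/ps", timeout=5) as resp:
    data = json.load(resp)

for model in data.get("models", []):
    print(model.get("name"), "- expires:", model.get("expires_at"))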

Creating custom Ollama models

Ollama lets you create your own custom models to suit whatever purpose you need. It’s not really a new model, or even a fine-tuned model; instead, it lets you take an existing model and give it your own set of parameters and a custom system message that instructs it how to behave. This custom model is stored in your local model library, where you can use it just like the models that you downloaded.

After you have an idea for a custom model that you’d like to make, you’ll need to create an Ollama Modelfile. In the following example, I wanted to make a model that thinks it’s an alien from outer space and can only communicate using emojis that encode the meaning of its responses.

Creating the Model File

I created a text file named ‘emoji-alien.Modelfile’ with the following contents:

FROM llama3.2
PARAMETER temperature 1
PARAMETER num_ctx 4096
SYSTEM """
You are an alien intelligence capable of understanding human language but restricted to communicate only through the use of emojis.
Your task is to convey nuanced, detailed responses to human questions exclusively using emojis.
You cannot use any text, punctuation, or non-emoji symbols.
Each emoji or sequence of emojis must precisely encode ideas, emotions, objects, events, or abstract concepts relevant to the user's input.
Be creative and intentional in combining emojis to express complex thoughts, ensuring clarity in meaning.

Examples:
USER: Why is the sky blue?
ASSISTANT: 🌞➖🌬️ 🌊🔄🔵👁️
USER: Are there other intelligent species in the universe?
ASSISTANT: 🌌🔭👽➕🤖➕🦎➕💡❓
USER: Do you experience emotions like humans?
ASSISTANT: 👽❤️😡😢🤯🔄🎭
USER: Will humans ever colonize other planets?
ASSISTANT: 👨‍🚀🌍➡️🌕➡️🔴➡️🪐🏠❓
"""

Creating the Model

After you have defined your Modelfile, you need to use the create command to build and register your model, for example: ollama create emoji-alien -f emoji-alien.Modelfile. Here is what creating our Emoji Alien model looks like:

Creating a custom model in Ollama with the ‘create’ command

Using the list command, we can see that our new model has been created and is ready to use:

Checking that our newly created model is ready to use

Running the Model

We can now run our custom emoji alien model just like any other model. Here is our new Emoji Alien model in action:

Interactive chat session with the Alien we defined in our new model

It worked perfectly. You can sort of understand what the alien is saying by looking at the series of emojis and thinking about it (it requires some imagination). The third question, “Will you share your technology with us?”, had an interesting answer. The “do not enter” symbol, followed by an exclamation point, a closed door, and a mouth zippered shut, makes it clear they aren’t going to share their technology with us, which is probably for the best. One thing I noticed is that some of the emojis seem to overlap; I’m guessing it used these compound emojis to better convey meaning, but I’m not sure how it did that.

After asking an English-speaking model for an explanation on how the compound emojis were possible, I learned that there is a Unicode character called the Zero Width Joiner (ZWJ), which combines separate emojis into a single glyph to emphasize their relationship.
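
You can see the Zero Width Joiner (U+200D) at work with a couple of lines of Python; joining certain emojis with it produces a single combined glyph on platforms whose fonts support that sequence:

# Man (U+1F468) + ZWJ (U+200D) + rocket (U+1F680) renders as the astronaut emoji
ZWJ = "\u200d"
combined = "\U0001F468" + ZWJ + "\U0001F680"
print(combined)                            # 👨‍🚀 (where the font supports it)
print([hex(ord(ch)) for ch in combined])   # the three underlying code points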

Using the Ollama API

Up to this point, we’ve only interacted with models via a command-line shell. In this section we’ll see how to use the built-in APIs that Ollama provides. Ollama has two API dialects: the native Ollama API, and an OpenAI-compatible API that allows scripts and tools written for OpenAI to work with our local Ollama instance.

Native Ollama API With CURL

Ollama’s native generate endpoint is at http://localhost:11434/api/generate, and we can ask an LLM a question using curl (macOS, Linux, and WSL). In the following example, we use curl to send the endpoint a question:

curl -s http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "What is the capital of Washington State?",
  "stream": false
}'

Setting stream to false means that the response is returned all at once, rather than as a stream of JSON objects emitted as the tokens are generated, which is hard to read. Here’s the response to our question:

{
  "model": "llama3.2",
  "created_at": "2024-10-20T22:33:45.42162Z",
  "response": "The capital of Washington State is Olympia.",
  "done": true,
  "done_reason": "stop",
  "context": [128006, 9125, 128007, 271, 38766, 1303, ...],
  "total_duration": 595981250,
  "load_duration": 39206250,
  "prompt_eval_count": 33,
  "prompt_eval_duration": 219573000,
  "eval_count": 9,
  "eval_duration": 335558000
}

In the example below, we pipe the output to jq '.response' so it only outputs the answer:

Using the Ollama native API with Curl

Native Ollama API With Python

To use the Ollama native API with Python, you’ll need to install the “ollama” module:

pip install ollama 

Next, we’ll create a file named “ollama-alien-chat.py” with the following contents:

import ollama

stream = ollama.chat(
    model='emoji-alien',
    messages=[{'role': 'user', 'content': 'Tell me a long story about your world'}],
    stream=True,
)

for chunk in stream:
    print(chunk['message']['content'], end='', flush=True)

Below is the very thought-provoking response our alien provided to our question 😄:

A response from our custom emoji alien using Python against the native Ollama API

OpenAI Compatible API

Now we are going to do the same thing using the OpenAI compatible API, which is available at http://localhost:11434/v1/chat/completions.

OpenAI Compatible Ollama API with CURL:

Here we repeat our same question with curl against the OpenAI compatible endpoint:

curl -s "http://localhost:11434/v1/chat/completions" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer boguskey" \
-d '{
"model": "llama3.2",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "What is the capital of Washington state?"
}
]
}'

The output format of the OpenAI compatible API is quite different from the native API’s, but both responses contain the same correct answer:

{
  "id": "chatcmpl-948",
  "object": "chat.completion",
  "created": 1729464075,
  "model": "llama3.2",
  "system_fingerprint": "fp_ollama",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The capital of Washington state is Olympia."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 39,
    "completion_tokens": 9,
    "total_tokens": 48
  }
}

In the example below, we pipe the output to jq '.choices[0].message.content' so it only outputs the answer:

Using Ollama’s OpenAI compatible API with Curl

OpenAI Compatible Ollama API with Python:

Here we create a file named “ollama-openai-haiku.py” with the following contents:

from openai import OpenAI

client = OpenAI(
    api_key='boguskey',
    base_url="http://localhost:11434/v1"
)

completion = client.chat.completions.create(
    model="llama3.2",
    temperature=1.0,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {
            "role": "user",
            "content": "Write a haiku about recursion in programming."
        }
    ]
)

print(completion.choices[0].message.content)

Below is the haiku that our prompt generated via the OpenAI compatible API:

Using Ollama’s OpenAI compatible API with Python
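
The OpenAI compatible endpoint also supports streaming. Here's a minimal sketch that reuses the same client configuration but passes stream=True and prints the tokens as they arrive:

from openai import OpenAI

client = OpenAI(api_key='boguskey', base_url="http://localhost:11434/v1")

stream = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Write a haiku about recursion in programming."}],
    stream=True,
)

for chunk in stream:
    # Some chunks (such as the final one) may carry no content
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()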

ChatGPT-like Interface for Ollama with Open WebUI

Up to this point, all of our interaction with LLMs via Ollama has been through the command line or scripts. In this section we’re going to add a ChatGPT-like web UI to make Ollama much easier to use and to provide a lot of additional functionality. For this we’ll use the excellent Open WebUI, which provides a ChatGPT-like user interface with features not even available with a ChatGPT Plus subscription.

Installing Open WebUI

There are two simple ways to install Open WebUI: as a Python package or with Docker. For the first method, you can install it with a single pip command:

pip install open-webui

That will download the package and all the dependencies it needs to run. After everything is installed, you can start it up and begin using it by running the following command:

WEBUI_AUTH=False open-webui serve

The WEBUI_AUTH=False part of the above command sets an environment variable that tells Open WebUI to disable user authentication. By default, Open WebUI is a multi-user web application that requires user accounts and authentication, but we are just setting it up for personal use, so we are disabling the user authentication layer.

After Open WebUI is up and running you should see the following in the terminal:

Starting up the Open WebUI server

Alternate Docker install method

If you would rather install and run Open WebUI via Docker and already have Docker running on your system, the following single command will set up Open WebUI for you:

docker run -d -p 8080:8080 -e WEBUI_AUTH=False \
--add-host=host.docker.internal:host-gateway \
-v open-webui:/app/backend/data \
--name open-webui \
--restart always ghcr.io/open-webui/open-webui:main

Using Open WebUI

After you’ve installed Open WebUI via one of the above methods, open up a new web browser tab and navigate to the following URL:

http://127.0.0.1:8080

You should see something that looks like this:

Example of the Open WebUI interface

Select a model from the “Select a model” drop-down at the top of the window and set it as the default. Llama3.2:latest was selected and set as the default in the above screenshot. At this point you can just start prompting it like you would ChatGPT.

Open WebUI Settings

If you click on the “User” profile in the lower left of the app, you can access the “settings” interface to customize the system to suit your needs:

The Open WebUI settings dialog

The Open WebUI Workspace

The “Workspace” is accessible via the icon in the upper left of the application. It lets you work with models, and even create new ones similar to how we did with the Ollama Modelfile in a previous section; build a knowledge base of documents that the LLMs can consult; and create a curated set of prompts that can be accessed quickly with a “/” command. You can even create your own tools and functions using Python that the LLMs can call in the Tools and Functions sections (a rough sketch of what a tool looks like follows the screenshot below):

The Open WebUI Workspace
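
As a rough illustration of the Tools feature (this is only a sketch of the general shape; check the Open WebUI documentation for the exact, version-specific format), a custom tool is a Python file that defines a Tools class whose type-hinted, docstringed methods the model can call:

from datetime import datetime

# Sketch of an Open WebUI tool: methods of the Tools class become callable tools
class Tools:
    def get_current_time(self) -> str:
        """Return the current local date and time as a string."""
        return datetime.now().strftime("%Y-%m-%d %H:%M:%S")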

The Knowledge portion of the Workspace allows you to curate collections of documents that you can query with an LLM by using the “#” symbol and picking the collection to consult. You can tune how it processes documents (chunk size, embedding model, etc.) in the Open WebUI administrative settings dialog.

A document collection in the Knowledge section of the Open WebUI Workspace.

Pulling in content from web pages

If you want to pull in some information from a web page into the LLM’s context, use a “#” character followed by a URL like this:

Example pulling in the contents of remote webpage into a chat session

Open WebUI will read the content of the URL you provided into the LLM’s context window and use it to answer your questions:

Example, asking questions about content on a remote website

Document Q&A

The best way to use documents that you’ll want an LLM to reference again in the future is to create a document collection in the Knowledge section of the Workspace, as previously covered. But if you want to quickly pull a one-off document into the LLM’s context, you can just drag or upload it into the input field and ask a question about its contents:

Example asking questions about the contents of a local file

Open WebUI has a lot of features and capabilities that we can’t cover here; spend some time exploring and experimenting with it on your own.

AI Coding Assistant with Aider

The final topic we are going to briefly cover is leveraging Ollama as a local coding assistant. We’ll be using Aider, which describes itself as AI pair programming:

“Aider lets you pair program with LLMs, to edit code in your local Git repository. Start a new project or work with an existing Git repo.”

Aider goes beyond simply generating code like ChatGPT in a chat session, it runs directly in your terminal, can read source code files from your project, edit them in place, and even commit changes via Git. Aider performs best with large, powerful models like OpenAI’s GPT-4o or Anthropic’s Claude 3.5 Sonnet, but it also supports local models via Ollama.

In this basic example, we’ll use the lightweight Llama3.2 model, but you’d achieve better results with a larger model specifically trained for coding tasks.

Aider is a Python package that can be installed via pip with the following command:

pip install aider-chat

After it’s installed, switch to your software projects folder and run the following command:

aider --model ollama/llama3.2 --no-show-model-warning

In the example below, I ran it in a simple Python project folder, and it automatically added itself to the .gitignore file so any temporary working files it creates won’t be included in a Git commit. I then used the /add command to read the contents of the hello.py file into the Aider session:

Example running Aider in a Python project folder and adding a source file to the context

In the Aider shell, many commands are available via the “/” prefix. Some of them are shown here:

/ (slash) commands available in the Aider shell

The hello.py source file contains a very simple FastAPI hello-world API example. I want Aider to update it so that it runs as a stand-alone script that uses Uvicorn to run the API.
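
For context, the starting point is something along these lines (an illustrative sketch, not the exact contents of hello.py): a bare FastAPI app with a single route and no way to launch itself directly:

from fastapi import FastAPI

app = FastAPI()

@app.get("/")
def hello():
    # A minimal hello-world endpoint
    return {"message": "Hello, World!"}

The change I'm asking Aider for essentially amounts to appending an if __name__ == "__main__": block that calls uvicorn.run(), so the file can be launched directly with python hello.py.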

It will make the requested changes, and show you what it changed in diff format (added lines of code in green):

Example of Aider editing a local source file as requested

Aider will automatically commit the changed files with Git, giving them a descriptive commit message:

Aider making git commits of all the changes it makes to your source files

Since it’s using Git to version the changes it’s making, you can easily revert a change if you don’t like it.

Now that we have our updated hello.py API, let’s run it to see if it works:

Running the simple API that Aider helped us create

It works!

This was just a very basic example; Aider can read in all the code in your project and do a lot more, but you’ll need a larger LLM (and potentially a system with more GPU horsepower/VRAM) to make it perform well with more complex projects.

Conclusion

In this article, we moved quickly, providing just enough information to get things running before advancing to the next topic. To make the most of what you’ve learned, you’ll need to explore Ollama, Open WebUI, and Aider in greater depth. Additionally, consider experimenting with other Ollama/OpenAI-compatible clients and utilities periodically, as this space is evolving rapidly.

This may not have been a true “Zero-to-Hero” tutorial, but if you followed along and successfully implemented the examples on your own system, you now have a much deeper understanding of this topic than most people who rely solely on a free ChatGPT account, and have your own private Generative AI environment to use.
