I've been diving deep into Large Language Models lately, but frankly, the barrier to entry feels…high. Cloud APIs are great for playing around, but I want control. I want to understand what’s happening under the hood. And that means running models locally. Enter Ollama.
Ollama promises a ridiculously easy way to run open-source LLMs on your machine. I was skeptical – “easy” and “LLM” rarely go together – but after spending an afternoon with it, I'm genuinely impressed. Here’s my experience getting set up and running my first model.
Installation: Seriously, That's It?
The installation process is almost offensively simple. Ollama supports macOS, Linux, and Windows. For a Linux machine, it’s literally a single command:
curl -fsSL https://ollama.com/install.sh | sh
That downloads and installs everything you need. Seriously. No wrestling with dependencies or CUDA drivers.
For macOS and Windows, you just run the installer downloaded from the official site.
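Either way, once Ollama is installed and running, its server sits in the background and listens on port 11434. A quick sanity check (a minimal sketch in Python, assuming the default port) is to hit the root endpoint, which simply replies "Ollama is running":

import requests

# Ping the local Ollama server; the root endpoint replies "Ollama is running"
# when the service is up. Assumes the default port (11434).
print(requests.get("http://localhost:11434").text)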
Pulling Your First Model: Llama 3.1 is a Good Start
Once installed, you pull models using the ollama pull command. Ollama has a curated library of models available. I started with llama3.1:
ollama pull llama3.1:8b
This downloads several gigabytes (around 4.9GB for the 8B-parameter Llama 3.1), so be patient and make sure you're on a good internet connection. Ollama displays progress nicely, showing the download speed and estimated time remaining.
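As a quick aside: once the server is up, you can check which models you've pulled and how big they are. The CLI has ollama list for this, and the same information is available over the local HTTP API (more on that API below). Here's a minimal sketch, assuming the default port:

import requests

# List locally downloaded models via the /api/tags endpoint
# (the programmatic equivalent of `ollama list`).
models = requests.get("http://localhost:11434/api/tags").json()["models"]
for m in models:
    print(f"{m['name']}: {m['size'] / 1e9:.1f} GB")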
Running the Model: Chatting with Llama 3.1
After the download completes, running the model is equally straightforward:
ollama run llama3.1:8b
This launches an interactive chat session in your terminal. You can now type prompts and get responses from Llama 3.1! I started with a simple question:
> Where is Singapore located?
Singapore is a small island city-state located at the southern tip of the Malay Peninsula in Southeast Asia. It is situated:
* Off the south coast of Malaysia, separated by the Strait of Malacca
* At the northern end of the Riau Islands archipelago in Indonesia
* Approximately 137 kilometers (85 miles) north of the equator
Singapore shares a border with only one country: Malaysia, to its north.
The island nation is connected to Malaysia via two causeways:
1. The Johor-Singapore Causeway connects Singapore to the Malaysian state of Johor.
2. The Tuas Second Link connects Singapore to the southern part of Johor.
Singapore's strategic location has made it a major hub for trade, finance, and transportation in Southeast Asia. Its unique blend of Asian culture and Western influence has also made it one of the most cosmopolitan cities in the world.
Not groundbreaking, but it works. And it's running entirely locally!
A Quick Test with Code: Python & Ollama
Ollama isn’t just for chat. You can also interact with models programmatically using its REST API, which, as mentioned above, is served on port 11434.
Here’s a quick Python example:
import requests

# Ask the local Ollama server for a single, non-streamed completion.
# Without "stream": False the endpoint returns newline-delimited JSON chunks,
# which response.json() can't parse in one go.
response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.1:8b",
        "prompt": "Write a short poem about coding.",
        "stream": False,
    },
)
print(response.json()["response"])
This sends a prompt to the Llama 3.1 model and prints the generated text. It's incredibly easy to integrate into your existing projects.
By the way, here's the poem the model produced:
Here is a short poem about coding:

Lines of code, a wondrous sight
Zeros and ones, in endless flight
I type and think, with fingers fast
Creating worlds, that will surely last

Variables declared, loops to spin
Functions called, with logic within
Errors corrected, with each new test
The beauty of code, I love the best

In silicon halls, my mind does roam
Where 1s and 0s become a digital home.
It’s sublime if you ask me!
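One detail worth knowing: by default the /api/generate endpoint streams its output as newline-delimited JSON chunks, which is why the example above sets "stream": False. If you'd rather print tokens as they arrive (nicer for long answers), here's a rough sketch of consuming the stream:

import json
import requests

# Stream a completion token-by-token from the local Ollama server.
with requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3.1:8b", "prompt": "Write a short poem about coding."},
    stream=True,
) as response:
    for line in response.iter_lines():
        if not line:
            continue
        chunk = json.loads(line)
        # Each chunk carries a fragment of text; "done" marks the final chunk.
        print(chunk.get("response", ""), end="", flush=True)
        if chunk.get("done"):
            print()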
First Impressions & Caveats
Ollama is remarkably easy to use. It abstracts away a lot of the complexity involved in running LLMs locally, making it accessible even for those without deep machine learning expertise.
However, there are some things to keep in mind:
Hardware Requirements: Running LLMs requires significant resources. I have a reasonably powerful MacBook Pro (M4 Max), and Llama 3.1 at the 8B parameter size runs exceptionally well on it. Larger models will need more RAM and potentially a GPU, so make sure you use an appropriately sized model, or a powerful enough machine! (One way to check how much memory a loaded model actually uses is sketched just after these notes.)
Model Size: The downloaded model files take up considerable disk space.
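On that note, newer versions of Ollama also expose a /api/ps endpoint (the counterpart of the ollama ps command) that reports which models are currently loaded and how much memory they're taking. A minimal sketch, assuming a model has been loaded by a recent request:

import requests

# Show loaded models and their memory footprint via /api/ps.
running = requests.get("http://localhost:11434/api/ps").json()["models"]
for m in running:
    print(f"{m['name']}: {m['size'] / 1e9:.1f} GB loaded ({m['size_vram'] / 1e9:.1f} GB on GPU)")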
What’s Next?
I plan to experiment with different models on Ollama, explore its API further, and start looking at fine-tuning options. This is just the first few steps in my journey to understand LLMs from the ground up. Stay tuned for more updates!
A Note on Performance
The Llama 3.1 8B model that I used gives me roughly 100 tokens/second, which is fast. Of course, you can use lighter quantized versions of the model, which should provide an even faster token rate. Another tool I'm trying is LM Studio, where I can adjust the GPU allocation, which also has a positive impact on the token rate.
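If you want to measure this yourself rather than eyeball it, the final response from /api/generate includes eval_count (tokens generated) and eval_duration (generation time in nanoseconds), so the rate falls out of a quick division. A small sketch:

import requests

# Measure generation speed from the stats Ollama returns with each completion.
data = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3.1:8b", "prompt": "Explain recursion in one paragraph.", "stream": False},
).json()

tokens_per_second = data["eval_count"] / data["eval_duration"] * 1e9  # eval_duration is in nanoseconds
print(f"{tokens_per_second:.1f} tokens/second")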
How long does it take for the LLM to respond per question on your hardware?