For years, my world has revolved around mobile app development – crafting experiences for iOS and Android. But recently, I've been pulled down a fascinating rabbit hole: Large Language Models (LLMs) and the broader field of AI. It’s a new frontier, and today, I hit a significant milestone in my learning journey!
The Goal: Run Meta's Llama 4 Scout (a mixture-of-experts model with 17B active parameters across 16 experts) locally on my machine.
Why? Because understanding these models isn’t just about using APIs; it’s about getting hands-on and seeing what’s possible. Plus, the recent advancements in open-weight models like Llama 4 are incredibly exciting – a real shift towards more accessible AI development. Meta is really pushing this: Llama 4 Scout and Maverick are its first open-weight, natively multimodal models!
My Setup: Apple M4 MacBook Pro (14-inch, 16-core CPU, 40-core GPU, 128GB unified memory).
The Challenge: Fitting it All In
Running Scout isn’t trivial: only 17B of its parameters are active per token, but all 16 experts (roughly 109B parameters in total) have to fit in memory, so memory was my biggest concern. Thankfully, the quantized 4-bit version of Llama 4 Scout (available on Hugging Face) came to the rescue! It brings the memory footprint down to around 62GB, making it feasible for my machine.
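For a rough sanity check on that 62GB figure, here's the back-of-envelope estimate I'd use. The ~109B total parameter count is Meta's published figure for Scout; the 12% overhead factor is my own assumption for quantization scales and the pieces kept in higher precision, so treat the result as a ballpark only.
# Back-of-envelope memory estimate for the 4-bit Scout checkpoint
total_params = 109e9     # Llama 4 Scout: 17B active, ~109B total across 16 experts
bytes_per_param = 0.5    # 4-bit weights = half a byte per parameter
overhead = 1.12          # assumed ~12% for quantization scales, embeddings, etc.
est_gb = total_params * bytes_per_param * overhead / 1e9
print(f"~{est_gb:.0f} GB")   # ~61 GB, in line with the observed ~62GB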
The Process: A Step-by-Step Guide (for fellow engineers)
Here’s how I got it up and running:
Environment Setup:
conda create -n temp python=3.11 anaconda
conda activate temp
conda install git pip
# Installs from main (commit f5b6f4b31323770f37ea51fea4f994a7ae584733 at the time)
pip install git+https://github.com/Blaizzy/mlx-vlm.git@main
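Before moving on, a quick sanity check that the install landed and that MLX sees the GPU. Reading __version__ off the package is an assumption on my part; pip show mlx-vlm gets you the same information if that attribute isn't there.
# Sanity check: can we import mlx-vlm, and is MLX defaulting to the GPU?
import mlx.core as mx
import mlx_vlm

print(getattr(mlx_vlm, "__version__", "unknown"))  # e.g. 0.1.x; __version__ assumed to exist
print(mx.default_device())                          # should report the GPU on Apple silicon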
Model Conversion (already done for me, thankfully!): The mlx-community checkpoint had already been converted to MLX format from meta-llama/Llama-4-Scout-17B-16E-Instruct using mlx-vlm version 0.1.21, which saved me a lot of initial headache.
Running the Model:
mlx_vlm.generate --model mlx-community/Llama-4-Scout-17B-16E-Instruct-4bit --max-tokens 100 --temperature 0.0 --prompt "hello"
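If you want to script a few prompts instead of retyping the command, a minimal wrapper that shells out to the exact same CLI invocation works fine. Nothing here touches mlx-vlm's Python API; it just wraps the command above.
# Minimal wrapper around the same mlx_vlm.generate CLI call
import subprocess

MODEL = "mlx-community/Llama-4-Scout-17B-16E-Instruct-4bit"

def ask(prompt: str, max_tokens: int = 100) -> str:
    cmd = [
        "mlx_vlm.generate",
        "--model", MODEL,
        "--max-tokens", str(max_tokens),
        "--temperature", "0.0",
        "--prompt", prompt,
    ]
    # check=True raises if the CLI exits with an error
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

print(ask("hello"))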
The Results: Surprisingly Fast!
And… it worked! The model responded with: “Hello! It's nice to meet you. Is there something I can help you with, or would you like to chat?”
But the real surprise was the speed. I was getting around 43 tokens/sec for generation – significantly faster than the ~20 tokens/sec I typically see with Gemma 3 27B! Part of that is architectural (Scout only activates 17B parameters per token, while Gemma 3 27B is dense), and part of it is MLX being optimized for Apple’s Metal architecture and unified memory.
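To put those throughput numbers in perspective, here's what they mean in wall-clock time for a 100-token reply:
# What the measured generation speeds mean for a 100-token reply
for name, tok_per_s in [("Llama 4 Scout 4-bit via MLX", 43), ("Gemma 3 27B", 20)]:
    print(f"{name}: ~{100 / tok_per_s:.1f}s")
# Llama 4 Scout 4-bit via MLX: ~2.3s
# Gemma 3 27B: ~5.0s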
A Multilingual Test: Tagalog, My Mother Tongue
Llama 4 is designed to be multilingual, trained on a massive dataset including over 100 languages with over 1 billion tokens each! As a native Tagalog speaker, I had to put it to the test.
I asked it why it speaks Tagalog, and its response (translated below) was remarkably accurate and cohesive:
In Filipino: “Ako ay isang modelo ng wika na dinisenyo upang matuto at mag-adjust sa iba't ibang wika, kasama ang Tagalog….” In English: “I am a language model designed to learn and adapt to different languages, including Tagalog….”
It even acknowledged its limitations as a non-native speaker! I was genuinely impressed.
Key Takeaways & Next Steps:
MLX is Powerful: The performance boost with MLX is undeniable.
Quantization is Key: 4-bit quantization makes running large models on consumer hardware feasible.
Open Source LLMs are Thriving: The Llama ecosystem is incredibly vibrant and accessible.
Today’s goal: achieved. I successfully ran Llama 4 Scout locally and verified its multilingual capabilities. My next experiment? Converting GGUF models to MLX! The journey continues…