For years, my world has revolved around mobile app development – crafting experiences for iOS and Android. But recently, I've been pulled down a fascinating rabbit hole: Large Language Models (LLMs) and the broader field of AI. It’s a new frontier, and today, I hit a significant milestone in my learning journey!
The Goal: Run Meta's Llama 4 Scout (a mixture-of-experts model with 17B active parameters across 16 experts) locally on my machine.
Why? Because understanding these models isn’t just about using APIs; it’s about getting hands-on and seeing what’s possible. Plus, the recent advancements in open-weight models like Llama 4 are incredibly exciting – a real shift towards more accessible AI development. Meta is really pushing this: Llama 4 Scout and Maverick are its first open-weight, natively multimodal models!
My Setup: Apple M4 MacBook Pro (14-inch, 16-core CPU, 40-core GPU, 128GB unified memory).
The Challenge: Fitting it All In
Running Scout isn’t trivial: only 17B of its parameters are active per token, but all 16 experts (roughly 109B parameters in total) have to fit in memory, so memory was my biggest concern. Thankfully, the quantized 4-bit version of Llama 4 Scout (available on Hugging Face) came to the rescue! It brings the memory footprint down to around 62GB, making it feasible for my machine.
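For a rough sanity check on that 62GB figure, here's the back-of-envelope estimate I'd use. The ~109B total parameter count is Meta's published figure for Scout; the 12% overhead factor is my own assumption for quantization scales and the pieces kept in higher precision, so treat the result as a ballpark only.
# Back-of-envelope memory estimate for the 4-bit Scout checkpoint
total_params = 109e9     # Llama 4 Scout: 17B active, ~109B total across 16 experts
bytes_per_param = 0.5    # 4-bit weights = half a byte per parameter
overhead = 1.12          # assumed ~12% for quantization scales, embeddings, etc.
est_gb = total_params * bytes_per_param * overhead / 1e9
print(f"~{est_gb:.0f} GB")   # ~61 GB, in line with the observed ~62GB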
The Process: A Step-by-Step Guide (for fellow engineers)
Here’s how I got it up and running:
Environment Setup:
conda create -n temp python=3.11 anaconda
conda activate temp
conda install git pip
# Installs from main (commit f5b6f4b31323770f37ea51fea4f994a7ae584733 at the time)
pip install git+https://github.com/Blaizzy/mlx-vlm.git@main
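Before moving on, a quick sanity check that the install landed and that MLX sees the GPU. Reading __version__ off the package is an assumption on my part; pip show mlx-vlm gets you the same information if that attribute isn't there.
# Sanity check: can we import mlx-vlm, and is MLX defaulting to the GPU?
import mlx.core as mx
import mlx_vlm

print(getattr(mlx_vlm, "__version__", "unknown"))  # e.g. 0.1.x; __version__ assumed to exist
print(mx.default_device())                          # should report the GPU on Apple silicon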
Model Conversion (already done for me, thankfully!): The mlx-community checkpoint had already been converted to MLX format from meta-llama/Llama-4-Scout-17B-16E-Instruct using mlx-vlm version 0.1.21, which saved me a lot of initial headache.
Running the Model:
mlx_vlm.generate --model mlx-community/Llama-4-Scout-17B-16E-Instruct-4bit --max-tokens 100 --temperature 0.0 --prompt "hello"
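If you want to script a few prompts instead of retyping the command, a minimal wrapper that shells out to the exact same CLI invocation works fine. Nothing here touches mlx-vlm's Python API; it just wraps the command above.
# Minimal wrapper around the same mlx_vlm.generate CLI call
import subprocess

MODEL = "mlx-community/Llama-4-Scout-17B-16E-Instruct-4bit"

def ask(prompt: str, max_tokens: int = 100) -> str:
    cmd = [
        "mlx_vlm.generate",
        "--model", MODEL,
        "--max-tokens", str(max_tokens),
        "--temperature", "0.0",
        "--prompt", prompt,
    ]
    # check=True raises if the CLI exits with an error
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

print(ask("hello"))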
The Results: Surprisingly Fast!
And… it worked! The model responded with: “Hello! It's nice to meet you. Is there something I can help you with, or would you like to chat?”
But the real surprise was the speed. I was getting around 43 tokens/sec for generation – significantly faster than the ~20 tokens/sec I typically see with Gemma 3 27B! Part of that is architectural (Scout only activates 17B parameters per token, while Gemma 3 27B is dense), and part of it is MLX being optimized for Apple’s Metal architecture and unified memory.
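To put those throughput numbers in perspective, here's what they mean in wall-clock time for a 100-token reply:
# What the measured generation speeds mean for a 100-token reply
for name, tok_per_s in [("Llama 4 Scout 4-bit via MLX", 43), ("Gemma 3 27B", 20)]:
    print(f"{name}: ~{100 / tok_per_s:.1f}s")
# Llama 4 Scout 4-bit via MLX: ~2.3s
# Gemma 3 27B: ~5.0s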
A Multilingual Test: Tagalog, My Mother Tongue
Llama 4 is designed to be multilingual, trained on a massive dataset including over 100 languages with over 1 billion tokens each! As a native Tagalog speaker, I had to put it to the test.
I asked it why it speaks Tagalog, and its response (translated below) was remarkably accurate and cohesive:
In Filipino: “Ako ay isang modelo ng wika na dinisenyo upang matuto at mag-adjust sa iba't ibang wika, kasama ang Tagalog….” In English: “I am a language model designed to learn and adapt to different languages, including Tagalog….”
It even acknowledged its limitations as a non-native speaker! I was genuinely impressed.
Key Takeaways & Next Steps:
MLX is Powerful: The performance boost with MLX is undeniable.
Quantization is Key: 4-bit quantization makes running large models on consumer hardware feasible.
Open Source LLMs are Thriving: The Llama ecosystem is incredibly vibrant and accessible.
Today’s goal: achieved. I successfully ran Llama 4 Scout locally and verified its multilingual capabilities. My next experiment? Converting GGUF models to MLX! The journey continues…