Discussion about this post

Jan Ibañez:

The Llama 3.1 8B model I used gives me ~100 tokens/second, which is fast. Of course, you can use lighter quantized versions of the model, which could provide a faster token rate. Another tool I am trying is LM Studio, where I can adjust the GPU allocation, which also has a positive impact on the token rate.
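For anyone wanting to reproduce this kind of measurement, here is a minimal Python sketch that computes tokens/second against a local Ollama server. The model tag and prompt are assumptions for illustration; the `eval_count` and `eval_duration` fields are part of Ollama's `/api/generate` response when streaming is disabled.

```python
# Minimal sketch: measure generation throughput with a local Ollama server.
# Assumes Ollama is running and `llama3.1:8b` has been pulled; the prompt
# is arbitrary and only serves to trigger generation.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.1:8b",
        "prompt": "Explain quantization in one paragraph.",
        "stream": False,
    },
)
data = resp.json()

# Ollama reports eval_count (tokens generated) and eval_duration
# (nanoseconds spent generating), so tokens/second falls out directly.
tps = data["eval_count"] / data["eval_duration"] * 1e9
print(f"{tps:.1f} tokens/second")
```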

Faz:

How long does it take for the LLM to respond per question on your hardware?
