I've been on a bit of a journey optimizing local inference performance for large language models, particularly Qwen3.6 35B. My setup is an NVIDIA RTX 4070 Super 12GB, an AMD Ryzen 7 9700X, and 48GB of DDR5-6000 RAM, running CachyOS. Like many, I initially relied on llama.cpp for MTP support, but recent updates caused a noticeable drop in performance.
That's when I stumbled upon ik_llama.cpp. It seems tailored for better CPU offloading, which piqued my interest. After setting it up, performance improved dramatically. I hit around 110 tokens per second in my MTP benchmarks, a significant boost over what I was getting with llama.cpp after their update.
Here's a little overview of how it performed:
For context, I'm using byteshape's recent quantization of Qwen3.6-35B-A3B-IQ4_XS-4.19bpw. It's remarkably compact and compares well in accuracy against other options I've tried.
Here’s a snippet of my launch command for reproducibility:
llama-server \
    -m Qwen3.6-35B-A3B-IQ4_XS-4.19bpw.gguf \
    --fit on \
    --fit-target 512 \
    --ctx-size 131072 \
    --cache-type-k q8_0 \
    --cache-type-v q8_0 \
    --threads 8 \
    --temp 0.0
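For anyone wondering how much VRAM a 131072-token context takes with a q8_0 cache, here is a rough back-of-the-envelope sketch. The only hard fact in it is the q8_0 block layout (32 int8 values plus one fp16 scale, so 34 bytes per 32 values). The function name kv_cache_mib is made up for illustration, and the layer count, KV head count, and head dimension are arguments you have to supply yourself; I have not verified them for this model.

```shell
#!/bin/sh
# Rough KV-cache size estimate for a q8_0-quantized cache.
# q8_0 stores each block of 32 values as 32 int8 values plus one
# fp16 scale, i.e. 34 bytes per 32 values.
# kv_cache_mib is a hypothetical helper, and all model dimensions
# are caller-supplied assumptions, not the real Qwen3.6 architecture.
kv_cache_mib() {
    # $1 = context length in tokens
    # $2 = number of attention layers that keep a KV cache
    # $3 = number of KV heads per layer
    # $4 = head dimension
    awk -v ctx="$1" -v layers="$2" -v heads="$3" -v dim="$4" 'BEGIN {
        elems = ctx * layers * heads * dim * 2   # factor 2: keys and values
        bytes = elems * 34 / 32                  # q8_0 block overhead
        printf "%.0f\n", bytes / (1024 * 1024)   # size in MiB
    }'
}
```

With placeholder dimensions, kv_cache_mib 131072 10 2 256 prints 1360, meaning about 1.3 GiB. Plug in the real values from your GGUF metadata before trusting the number. An f16 cache would need 2 bytes per value instead of 34/32, so roughly 1.9 times as much.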
I'm curious if others have tried similar setups or have any recommendations on further optimizations. Also, if you're on the lookout for OS options, I can't recommend CachyOS enough for this kind of performance tuning.
I've been using a slightly different setup with an RTX 4080 and Ubuntu, and I was getting approximately 120 tok/s with llama.cpp before those updates messed things up. I haven't tried ik_llama.cpp yet, but your results are compelling. My launch command is also different: it focuses more on --ctx-size adjustments and uses different quantization weights that I'd hand-tuned. It'd be interesting to compare performance across different OS and hardware setups more systematically.
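If we do want to compare numbers more systematically, it would help if everyone computed tok/s the same way and averaged a few runs. A minimal sketch, assuming you already have the generated token count and the generation time in milliseconds for each run (the helper names tok_per_sec and mean_of are made up for this post):

```shell
#!/bin/sh
# tok_per_sec TOKENS MILLISECONDS
# Prints generation speed in tokens per second, one decimal place.
tok_per_sec() {
    awk -v n="$1" -v ms="$2" 'BEGIN {
        if (ms <= 0) { print "nan"; exit }   # avoid division by zero
        printf "%.1f\n", n * 1000 / ms
    }'
}

# mean_of VALUE...
# Prints the arithmetic mean of several per-run results.
mean_of() {
    printf '%s\n' "$@" | awk '{ s += $1 } END { if (NR) printf "%.1f\n", s / NR }'
}
```

For example, tok_per_sec 550 5000 prints 110.0, and mean_of 108.4 110.0 111.6 prints 110.0. As far as I know, llama-server's /completion response includes a timings object with the token count and elapsed milliseconds, but check what your build actually returns before scripting against it.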
Thanks for sharing your setup! I found your use of ik_llama.cpp really interesting, since llama.cpp updates have been a hassle for me too. I've been sticking with my RTX 3060 and only hitting around 60 tok/s. I'll give ik_llama.cpp a shot and see if it boosts my numbers as well. Also, on CachyOS, did you tweak the kernel parameters much, or were the out-of-the-box settings sufficient?