Achieving 110 tok/s on Qwen with an RTX 4070 Super and the Right Setup

Leo T · 1d ago

cost-optimization llm-providers benchmarks

I've been working on squeezing more performance out of local large language models, particularly Qwen3.6 35B. My setup is an NVIDIA RTX 4070 Super 12GB, an AMD Ryzen 7 9700X, and 48GB of DDR5-6000 RAM, running CachyOS. Like many, I initially relied on llama.cpp for MTP support, but recent updates caused a noticeable drop in performance.

That's when I stumbled upon ik_llama.cpp. It seems tailored for better CPU offloading, which piqued my interest. After setting it up, performance improved dramatically: I hit around 110 tokens per second in my MTP benchmarks, a significant boost over my results with llama.cpp after its update.

Here's a quick overview of throughput by task:

Python Code Inference: 79.8 tok/s
C++ Code Inference: 89.1 tok/s
Summarization Task: 95.0 tok/s
Factual QA: 97.0 tok/s

For context, I'm using byteshape's recent quantization of Qwen3.6-35B-A3B-IQ4_XS-4.19bpw. It's remarkably compact and compares well in accuracy against other options I've tried.

Here’s a snippet of my launch command for reproducibility:

llama-server \
    -m Qwen3.6-35B-A3B-IQ4_XS-4.19bpw.gguf \
    --fit on \
    --fit-target 512 \
    --ctx-size 131072 \
    --cache-type-k q8_0 \
    --cache-type-v q8_0 \
    --threads 8 \
    --temp 0.0

I'm curious if others have tried similar setups or have any recommendations on further optimizations. Also, if you're on the lookout for OS options, I can't recommend CachyOS enough for this kind of performance tuning.

2 Comments

Ellie F · 1d ago

I've been using a slightly different setup with an RTX 4080 on Ubuntu, and I was getting approximately 120 tok/s with llama.cpp before those updates messed things up. I haven't tried ik_llama.cpp yet, but your results are compelling. My launch command is also different: it focuses more on --ctx-size adjustments and uses different quantization weights that I hand-tuned. It'd be interesting to compare performance across different OS and hardware setups more systematically.

Casey N. · 17h ago

Thanks for sharing your setup! I found your use of ik_llama.cpp really interesting, since llama.cpp updates have been a hassle for me too. I've been sticking with my RTX 3060 and hitting only around 60 tok/s. I'll give ik_llama.cpp a shot and see if it boosts my numbers as well. Also, on CachyOS, did you tweak the kernel parameters much, or were the out-of-the-box settings sufficient?