Been experimenting with training a 70B parameter model on my RTX 4090 and wanted to share some findings. Initially thought this was impossible, but with the right combination of techniques, I'm actually making progress.
Here's what's working for me:
Memory usage breakdown:
Training is obviously slow (about 0.3 tokens/sec), but for fine-tuning experiments or small datasets, it's actually viable. The electricity cost is around $2-3 per day vs $50+ for cloud GPUs.
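For anyone wondering why offloading is non-negotiable at this scale, here's a rough back-of-envelope sketch (assuming bf16 weights/gradients and Adam with fp32 optimizer states, which is the common default - numbers are approximate and ignore activations):

```python
# Rough memory math for dense-model training with Adam.
# Assumes bf16 weights/grads (2 bytes each) and fp32 optimizer
# states: master weights + momentum + variance (4 bytes each).
def training_memory_gb(n_params):
    weights = 2 * n_params          # bf16 parameters
    grads = 2 * n_params            # bf16 gradients
    optim = (4 + 4 + 4) * n_params  # fp32 master copy + Adam m, v
    return {
        "weights": weights / 1e9,
        "grads": grads / 1e9,
        "optimizer": optim / 1e9,
        "total": (weights + grads + optim) / 1e9,
    }

mem = training_memory_gb(70e9)
print(mem["total"])  # 1120.0 GB before activations - hence all the offloading
```

That ~1.1TB total (of which 840GB is just optimizer states) is why everything except the active working set has to live off-GPU when you only have 24GB of VRAM.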
Anyone else trying similar setups? Would love to hear about other memory optimization tricks that work well with single-GPU setups.
I've tried something similar on my GTX 1080 Ti, though not as intense as a 70B parameter model. I'm using FP16 instead of bfloat16 because of hardware limitations, and combined with ZeRO Stage 2, I managed to train smaller models effectively. For me, activation offloading helped save roughly 10GB of VRAM during peaks. Curious about the practical differences you've seen between mixed precision options – have you tested pure FP16 out of interest?
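On pure FP16 vs bf16: the main practical difference is exponent range. bf16 keeps fp32's range, while FP16 overflows near 65k and underflows small gradients to zero, which is why FP16 training typically needs dynamic loss scaling. A minimal sketch of that logic (class and parameter names are illustrative, not a real API - PyTorch's GradScaler does the production version of this):

```python
# Sketch of dynamic loss scaling for FP16 mixed precision: scale the
# loss up so small gradients survive fp16, halve the scale when an
# overflow (inf/nan) is detected, and grow it back after a run of
# clean steps.
class LossScaler:
    def __init__(self, init_scale=2.0 ** 16, growth_interval=2000):
        self.scale = init_scale
        self.growth_interval = growth_interval
        self._good_steps = 0

    def update(self, found_overflow):
        if found_overflow:
            self.scale /= 2          # back off
            self._good_steps = 0
            return False             # signal: skip this optimizer step
        self._good_steps += 1
        if self._good_steps >= self.growth_interval:
            self.scale *= 2          # cautiously grow the scale back
            self._good_steps = 0
        return True                  # safe to apply the update

scaler = LossScaler()
scaler.update(found_overflow=True)
print(scaler.scale)  # 32768.0
```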
Check out FairScale's FSDP implementation as well. I've had success with it on a much smaller model. I managed to squeeze model inference into 4GB of VRAM with some sacrifice in speed, but it worked! Would love to exchange thoughts on how offloading affects model convergence - do you notice it impacting the final model's performance?
I've also been using DeepSpeed on my single GPU setup! Instead of using activation offloading, I've been compressing my model weights and gradients with a technique called 'weight quantization.' It reduced memory footprint significantly, though it does impact model precision a bit. Curious if anyone else has tried quantization on large models?
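To illustrate the idea for anyone who hasn't tried it: the core of symmetric int8 weight quantization is just one scale factor per tensor. A toy round-trip in plain Python (real libraries like bitsandbytes do this per-channel/per-block with extra tricks, so treat this purely as a sketch):

```python
# Toy symmetric int8 quantization: map floats into [-127, 127] with a
# single per-tensor scale, then dequantize. The reconstruction error is
# at most half a quantization step - the "precision hit" mentioned above.
def quantize(weights):
    scale = max(abs(w) for w in weights) / 127 or 1.0
    q = [round(w / scale) for w in weights]   # int8-range codes
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

w = [0.1, -0.5, 0.25, 1.0]
q, s = quantize(w)
w_hat = dequantize(q, s)
print(max(abs(a - b) for a, b in zip(w, w_hat)))  # small reconstruction error
```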
Nice work! I tried something similar last month but gave up after my training kept getting killed by the OOM killer. Your memory breakdown is really helpful - I think I was underestimating how much the optimizer states blow up even with partitioning. One thing that helped me squeeze out a bit more memory was using gradient accumulation with a ridiculously small micro-batch size (like 1) and accumulating over 64+ steps. Also found that pinned memory allocation can sometimes help with the CPU<->GPU transfers, though YMMV. The 0.3 tok/s is actually not terrible for experimentation - beats waiting in Lambda Labs queues!
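The micro-batch-1 trick is essentially this loop structure - a framework-agnostic sketch where `compute_grads` and `apply_update` are stand-ins for a real backward pass and optimizer step:

```python
# Gradient accumulation skeleton: run `accum_steps` micro-batches,
# sum their gradients, then apply one optimizer step on the average,
# so the update matches a single large batch at a fraction of the
# peak memory.
def train_step(micro_batches, compute_grads, apply_update, accum_steps=64):
    accum = None
    for batch in micro_batches[:accum_steps]:
        grads = compute_grads(batch)
        if accum is None:
            accum = list(grads)
        else:
            accum = [a + g for a, g in zip(accum, grads)]
    accum = [a / accum_steps for a in accum]  # average over micro-batches
    apply_update(accum)
    return accum

updates = []
avg = train_step(
    micro_batches=[[1.0], [3.0]],
    compute_grads=lambda b: [b[0]],   # toy "gradient" = the sample itself
    apply_update=updates.append,
    accum_steps=2,
)
print(avg)  # [2.0]
```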
This is awesome! I've been trying similar stuff on my 3090 but kept running into OOM issues. Quick question - how much system RAM are you using for the offloading? I only have 32GB and wondering if that's my bottleneck. Also, are you using any specific batch size tricks or just micro-batching down to size 1?
Really interesting! I haven't tried it on such a large model, but I've experimented with a similar approach using Meta's Hydra, which allows dynamic switching between CPU and GPU for better memory handling. It might not match ZeRO in efficiency, but it could be worth exploring alongside what you've done. Would love to know how it stacks up against DeepSpeed if anyone's compared them.
I've tried something similar on my 3080 and while it's definitely challenging, using ZeRO and gradient checkpointing has made it somewhat manageable for smaller models. How are you handling high disk I/O with activation offloading though? I noticed my disk becomes a bottleneck pretty quickly.
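For a sense of how much gradient checkpointing buys before you even touch disk: with the standard sqrt(L) scheme you keep one activation checkpoint every ~sqrt(L) layers plus one segment's worth of recomputed activations, at roughly one extra forward pass of compute. Rough math (layer count and per-layer size below are illustrative, not measured):

```python
import math

# Activation memory with sqrt(L) gradient checkpointing: instead of
# holding all L layers' activations, keep L/s checkpoints (one per
# segment of s = sqrt(L) layers) plus one segment recomputed at a time.
def activation_mem_gb(layers, per_layer_gb, checkpointing=False):
    if not checkpointing:
        return layers * per_layer_gb
    segment = math.isqrt(layers)
    return (layers / segment + segment) * per_layer_gb

print(activation_mem_gb(80, 0.5))                      # 40.0 GB, all kept
print(activation_mem_gb(80, 0.5, checkpointing=True))  # 9.0 GB checkpointed
```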
Wow, that's pretty impressive for a single RTX 4090! I've been working with slightly smaller models, around 10B parameters, using similar techniques on my RTX 3090. One thing I found helpful was taking frequent checkpoints so a failure doesn't wipe out hours of work. Curious if you're encountering any stability issues with bfloat16 though?
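One habit worth copying with frequent checkpoints: write to a temp file and atomically rename it over the target, so a crash mid-save never corrupts your latest good checkpoint. A minimal sketch (the path and JSON payload are illustrative - a real training checkpoint would go through torch.save, but the rename trick is the same):

```python
import json
import os
import tempfile

# Crash-safe checkpoint save: write to a temp file in the same
# directory, fsync it, then atomically rename over the target.
# Readers always see either the old checkpoint or the new one,
# never a half-written file.
def save_checkpoint(state, path):
    directory = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)  # atomic on POSIX

save_checkpoint({"step": 1000, "loss": 2.31}, "latest_ckpt.json")
print(json.load(open("latest_ckpt.json"))["step"])  # 1000
```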
I've successfully trained 13B models on my setup using a dual RTX 3070 configuration. In my case, tensor rematerialization on top of the methods you mentioned really helped reduce memory pressure further. Also, OpenAI's Triton for custom kernels sped up a few specific ops, which was a nice bonus.
I'm using a similar setup on my RTX 3090, focusing mainly on the ZeRO optimizer with DeepSpeed. I'm seeing slower throughput, around 0.2 tokens/sec, but it's amazing that we can even attempt these large models on consumer GPUs! I'm curious about your activation offloading strategy – are you using a specific library for that, or just manual CPU memory management?