Hey folks!
I've been knee-deep in evaluating whether to stick with OpenAI's API or pivot towards hosting a model like GPT-J (or even GPT-NeoX). The decision seems to hinge on more than just server costs.
For context, we're running a text generation service with roughly 500k queries per month. Currently, we shell out about $5,000/mo using GPT-4 via API. I've read some promising posts about folks self-hosting similar models on GPUs like A100s, but I'm unsure of the hidden costs; it can't just be the AWS bills!
Anyone who did a full TCO analysis willing to share their insights? My list of considerations includes:
Would love to hear how others handle this calculus, especially if you've flipped from API to self-hosting!
I'm curious if anyone's quantified the cost of downtime or service outages when self-hosting. With APIs, the SLA often guarantees a certain level of uptime, which can be worth the premium. What have folks experienced in terms of outage frequency and recovery time, especially working with models like GPT-NeoX?
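To make the question concrete, here's a rough sketch of how I'd put numbers on it. The 99.9% uptime figure and the value per query are placeholders I made up for illustration, not numbers from any real SLA; the 500k queries/month is the volume from the original post.

```python
HOURS_PER_MONTH = 30 * 24  # assumes a 30-day month


def allowed_downtime_hours(uptime_pct: float, hours: float = HOURS_PER_MONTH) -> float:
    """Hours of downtime per period that still satisfy a given uptime percentage."""
    return hours * (1 - uptime_pct / 100)


def outage_cost(outage_hours: float, queries_per_month: float,
                value_per_query: float, hours: float = HOURS_PER_MONTH) -> float:
    """Rough cost of an outage, assuming traffic is spread evenly over the month."""
    queries_lost = queries_per_month / hours * outage_hours
    return queries_lost * value_per_query


# Hypothetical inputs: a 99.9% uptime SLA and $0.05 of value per query.
print(f"Allowed downtime at 99.9%: {allowed_downtime_hours(99.9):.2f} h/month")
print(f"Cost of a 48 h outage: ${outage_cost(48, 500_000, 0.05):,.2f}")
```

Even traffic is a big simplification; an outage at peak hours would cost more than this suggests.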
In case you're looking for alternatives, we found Google's TPU pricing to be competitive for our workload compared to AWS. Also, using preemptible instances can significantly bring down costs if your application can tolerate some downtime. It's a trade-off but worth considering!
It really comes down to your specific use case. We moved to self-hosting GPT-NeoX a few months ago, and while the initial GPU costs were hefty, we're now spending around $3,500/month on infrastructure compared to $6,000 when we used an API. The hidden costs are indeed in DevOps and model maintenance. We had a 2-day outage once after a bad update went live; the trade-off is having full control and customization.
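If it helps, this is roughly how the payback math works for numbers like ours. It's a minimal sketch: the $30,000 up-front figure is a made-up placeholder (plug in your own), and it ignores staff time and outage costs entirely.

```python
import math


def payback_months(upfront_cost: float, api_monthly: float, selfhost_monthly: float) -> float:
    """Months needed for the monthly saving to cover the up-front spend."""
    monthly_saving = api_monthly - selfhost_monthly
    if monthly_saving <= 0:
        return math.inf  # self-hosting never pays back if it isn't cheaper per month
    return upfront_cost / monthly_saving


# $6,000 (API) vs $3,500 (self-hosted) per month; up-front cost is hypothetical.
print(payback_months(upfront_cost=30_000, api_monthly=6_000, selfhost_monthly=3_500))  # 12.0
```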
We made the switch from an API to self-hosting GPT models a few months ago. The biggest surprise was how much time our DevOps team now spends on system upkeep: easily 20-30 extra hours a month. Sure, there are savings on API costs, but they're offset by the need for skilled personnel who can manage and troubleshoot these systems effectively. If you're considering it, definitely weigh the human resource factor heavily!
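A quick way to sanity-check the staffing side, as a sketch only: the 20-30 hours is what we see, but the hourly rate and the infrastructure bill below are hypothetical placeholders, and the $5,000 API bill is the figure from the original post.

```python
def net_monthly_saving(api_monthly: float, infra_monthly: float,
                       extra_hours: float, hourly_rate: float) -> float:
    """API bill minus self-hosting infrastructure minus extra staff time."""
    return api_monthly - infra_monthly - extra_hours * hourly_rate


def break_even_hourly_rate(api_monthly: float, infra_monthly: float,
                           extra_hours: float) -> float:
    """Hourly staff cost at which self-hosting stops saving money."""
    return (api_monthly - infra_monthly) / extra_hours


# Hypothetical: $3,000/mo infrastructure and a $100/h loaded engineering rate.
for hours in (20, 30):
    saving = net_monthly_saving(5_000, 3_000, hours, 100)
    print(f"{hours} h/month of upkeep -> net saving ${saving:,.0f}")
```

With those placeholder numbers the saving disappears at 20 hours and goes negative at 30, which is why the staffing line matters so much.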
Has anyone benchmarked the exact number of queries per second they're getting on a self-hosted setup vs. the API? Curious about throughput differences, especially around peak usage times. Also, for those who've chosen self-hosting, how do you manage security concerns, especially regarding data privacy on rented hardware?
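In case it's useful for comparing notes, this is the kind of harness I have in mind. It's only a sketch: `fake_generate` is a hypothetical stand-in that sleeps instead of calling a model, so swap in your own API or self-hosted client call, and the concurrency level is arbitrary.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_generate(prompt: str) -> str:
    """Hypothetical stand-in for a real model call (API or self-hosted)."""
    time.sleep(0.01)  # pretend the model takes 10 ms
    return prompt.upper()


def measure_qps(generate, prompts, concurrency: int = 8) -> float:
    """Send all prompts through a thread pool and return completed queries per second."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(generate, prompts))
    elapsed = time.perf_counter() - start
    return len(results) / elapsed


prompts = [f"prompt {i}" for i in range(64)]
print(f"{measure_qps(fake_generate, prompts):.1f} queries/sec")
```

Running the same harness against both backends at the same concurrency, ideally during peak hours, would give comparable numbers.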
Have you factored in energy costs for on-premise hosting? We noticed a 20% overhead in power costs when we ran a similar setup in our own data center. Also, how do you plan to handle model updates? Keeping a model like GPT-NeoX optimal requires continuous fine-tuning, which can be another expense if you don't have the right experts.
We've been self-hosting GPT-J on A100s for a while now, and I'd say maintenance is definitely the hidden cost people don't initially consider. We're spending around 40% of our DevOps time on model tuning and scaling optimizations alone. Not to mention, every time a new model comes out, evaluating and integrating it is a whole project in itself. That said, owning our stack means we can fine-tune and pass model improvements on to clients faster.