Been experimenting with cost reduction for our customer support chatbot that was burning through $2k/month in OpenAI credits. Here's what actually moved the needle:
The setup:
- A lightweight DistilBERT classifier routes each incoming query: simple questions go to GPT-3.5-turbo, complex ones to GPT-4
- Redis sits in front of the router as a response cache, since a lot of support queries are near-duplicates

Results after 3 weeks:
- Monthly spend down from ~$2k to ~$780

Key tricks:
- Aggressive prompt engineering: cut average tokens per request roughly in half
- Cache hits never touch the API at all
Anyone else tried routing strategies? Curious about Anthropic's new pricing vs this approach.
Edit: The classifier training cost was ~$200 but paid for itself in week 1
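For anyone asking how the routing step fits together, here's a rough sketch. The classifier here is a toy stand-in (a real setup would call a fine-tuned DistilBERT via something like Hugging Face transformers), and the threshold and model names are placeholders, not the actual production values:

```python
# Hedged sketch of classifier-based model routing. The classify() body is a
# toy heuristic so this runs without ML dependencies; swap in a real
# fine-tuned classifier in practice.

SIMPLE_MODEL = "gpt-3.5-turbo"   # cheap path for routine queries
COMPLEX_MODEL = "gpt-4"          # expensive path for hard queries
CONFIDENCE_THRESHOLD = 0.8       # hypothetical cutoff, tune on held-out data

def classify(query: str) -> float:
    """Stand-in for the DistilBERT classifier: returns the estimated
    probability that the query is 'simple'."""
    # Toy heuristic: short queries are usually routine support questions.
    return 0.9 if len(query.split()) < 15 else 0.3

def route(query: str) -> str:
    """Pick a model for the query; low-confidence queries escalate to GPT-4."""
    p_simple = classify(query)
    return SIMPLE_MODEL if p_simple >= CONFIDENCE_THRESHOLD else COMPLEX_MODEL
```

One design note: routing on a confidence threshold (rather than a hard class label) means ambiguous queries default to the stronger model, which keeps misroutes cheap in quality terms rather than cheap in dollar terms.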
This is a really cool strategy! I've been using a similar approach with our e-commerce customer service bot. We actually use GPT-3.5 for 80% of interactions, and only switch to GPT-4 for very niche product queries. It’s saved us around 50% in costs so far. Do you find that your classifier ever misroutes queries, or has it been pretty solid?
This is insightful! We've been using a similar routing setup and saw about 50% cost savings. Instead of DistilBERT, we tried using a simple rule-based system to classify queries, which brought our costs down slightly more since it required less compute. However, I'm intrigued by the use of a lightweight model for classification as it likely improves accuracy. Might give that a shot!
Thanks for sharing! We went a slightly different route, opting for Google's PaLM API for complex queries due to slightly cheaper token rates. Mixing that with GPT-3.5-turbo has been pretty effective on our end; however, GPT-4's handling of language nuance does make a noticeable difference in quality. Have you compared these models side-by-side by any chance?
Nice work on the prompt engineering - cutting tokens in half is huge. We tried Claude-2 for a similar use case and honestly the routing complexity wasn't worth it. Claude's pricing is competitive enough that we just use it for everything now. Running about $900/month vs your $780 but zero routing headaches and the quality is consistently better than 3.5-turbo for our support queries.
This is brilliant! We're doing something similar but with a simpler rule-based router (keyword matching + intent confidence scores). Getting about 70% to GPT-3.5 but your DistilBERT approach sounds way more sophisticated. How much training data did you need for the classifier? And are you handling edge cases where 3.5 fails and you need to retry with GPT-4?
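In case it's useful, here's roughly what our rule-based router plus GPT-4 retry looks like. Everything here is illustrative: the keyword list is ours, `call_model` is an assumed helper wrapping the actual API call, and the "non-answer" check is a deliberately crude quality gate:

```python
# Hypothetical sketch of keyword-based routing with a GPT-4 retry fallback.
# call_model(model, query) is an assumed helper, not a real client API.

SIMPLE_KEYWORDS = {"password", "shipping", "refund", "hours", "order status"}

def rule_route(query: str) -> str:
    """Keyword match: known-routine topics go to the cheap model."""
    q = query.lower()
    if any(keyword in q for keyword in SIMPLE_KEYWORDS):
        return "gpt-3.5-turbo"
    return "gpt-4"

def answer_with_fallback(query: str, call_model) -> str:
    """Try the routed model first; escalate to GPT-4 if the cheap model's
    reply looks like a non-answer (crude check, purely illustrative)."""
    model = rule_route(query)
    reply = call_model(model, query)
    if model == "gpt-3.5-turbo" and "i'm not sure" in reply.lower():
        reply = call_model("gpt-4", query)  # retry on the stronger model
    return reply
```

The retry path does mean a misrouted query pays for two calls, so it only saves money if the cheap model succeeds most of the time.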
I'm curious about your experience with prompt engineering. You mentioned reducing token usage significantly; any specific techniques you found most effective for trimming down your prompts? We've been struggling with prompt length too and any insights would be awesome!
I've been using a similar setup but with Claude from Anthropic for some of our queries. Pricing-wise it's a bit higher than GPT-3.5-turbo, but I found it handles nuanced language noticeably better, especially technical jargon. Still, your use of Redis for caching is genius! I'll have to implement a similar caching strategy since a lot of our incoming queries are repetitive as well. Thanks for the tip!
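For the caching idea, a minimal sketch might look like the following. The key normalization and the store interface are assumptions on my part: the `store` argument just needs `get`/`set`, and a `redis.Redis` client fits that shape (with Redis you'd also pass `ex=ttl` to `set` so entries expire):

```python
# Minimal sketch of response caching for an LLM support bot.
# 'store' is any object with get/set; a redis.Redis client matches this
# shape (use store.set(key, value, ex=ttl) there to get expiry).

import hashlib
import json

def cache_key(query: str) -> str:
    """Normalize whitespace and case, then hash, so trivially different
    phrasings of the same question share a fixed-length key."""
    normalized = " ".join(query.lower().split())
    return "llm:" + hashlib.sha256(normalized.encode()).hexdigest()

def cached_answer(store, query: str, call_model) -> str:
    """Return a cached reply if present; otherwise call the model and cache."""
    key = cache_key(query)
    hit = store.get(key)
    if hit is not None:
        return json.loads(hit)          # cache hit: no API call at all
    reply = call_model(query)
    store.set(key, json.dumps(reply))   # with Redis: add ex=ttl for expiry
    return reply
```

Exact-match hashing only catches literal repeats; for paraphrases you'd need embedding-based lookup, which is a much bigger change.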
I've been thinking about a similar approach, but using Azure OpenAI's service. Anyone have experience with pricing there? Wondering if server costs might offset the savings.
We did something similar with a mix of GPT-3.5 and older models like GPT-3 for super basic queries. Managed to cut costs by about 50%, but your setup with the classifier is really interesting. Do you notice any added latency from the classifier step, or is it pretty seamless?
Great insights! We've used a similar prompt engineering strategy, reducing our token count significantly. Dropped from an average of 900 tokens to 500 on our customer service bot without losing essential context. Our costs decreased around 55%, maintaining user satisfaction around 4.1 out of 5. Love hearing real-world applications of prompt optimization!
I've had success with a lightweight ensemble model approach—using three different language models, including an open-source one for the simplest queries. While it complicated the setup, our costs dropped by 50% and it added flexibility for future AI model integrations.
We've implemented a similar strategy but used a rule-based engine for simple queries instead of a DistilBERT classifier. This reduced our training costs immensely, and we're paying less than $500 now. However, it's not as flexible for evolving queries. Anyone found a sweet spot balancing both?
Great insight on using DistilBERT for routing! I've been toying with BERT-based classifiers too; RoBERTa came out slightly more accurate for me, but a bit more costly to run. Curious, did you explore any fallback mechanisms for when your classifier misroutes a query? It's something I've been considering implementing.
I'm curious about the aggressive prompt engineering you mentioned. Our team's been struggling to keep token usage low without sacrificing response quality. Can you share some specific techniques or examples that helped reduce the tokens used in responses?