Been experimenting with multi-model routing to balance cost and quality for our chat application. Currently using a simple rule-based approach where simple queries (<50 tokens, no code) go to GPT-3.5-turbo ($0.001/1K tokens) and complex ones hit GPT-4 ($0.03/1K tokens).
Seeing about 40% cost reduction, but the routing logic feels pretty naive. Sometimes GPT-3.5 fails on what looks like a simple task and we have to fall back to GPT-4 anyway.
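For reference, the rule described above looks roughly like this. It's a minimal sketch, not our exact production check: the whitespace word count standing in for a real tokenizer, the `CODE_PATTERN` regex, and the function names are all assumptions for illustration.

```python
import re

# Assumed heuristic for "contains code": a code fence, a Python def,
# or a line ending in a brace or semicolon. Tune for your own traffic.
CODE_PATTERN = re.compile(r"`{3}|\bdef \w+\(|[{};]\s*$", re.MULTILINE)
TOKEN_LIMIT = 50  # the "<50 tokens" threshold from the rule


def approx_token_count(text: str) -> int:
    """Rough stand-in for a tokenizer: count whitespace-separated words."""
    return len(text.split())


def pick_model(query: str) -> str:
    """Short queries without code go to the cheap model, everything else to GPT-4."""
    is_short = approx_token_count(query) < TOKEN_LIMIT
    has_code = CODE_PATTERN.search(query) is not None
    if is_short and not has_code:
        return "gpt-3.5-turbo"
    return "gpt-4"
```

Word counts undercount real tokens, so a proper tokenizer would shift the threshold a bit.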
Anyone using more sophisticated routing? I've looked at:
The classifier approach seems promising but requires labeled training data. Currently tracking success/failure rates:
gpt35_success_rate = 0.73
gpt4_success_rate = 0.94
avg_cost_per_request = 0.008 # down from 0.013
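In case it helps anyone tracking the same numbers, here is a minimal sketch of how they could be computed from a request log. The `RequestLog` fields and the `summarize` helper are hypothetical names, and it assumes you already have some way to mark a request as succeeded or failed.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class RequestLog:
    model: str        # model that served the request
    succeeded: bool   # assumed to come from your own success check
    cost_usd: float   # cost of this single request


def summarize(logs: List[RequestLog]) -> Tuple[Dict[str, float], float]:
    """Return (per-model success rate, average cost per request)."""
    counts = defaultdict(lambda: [0, 0])  # model -> [successes, requests]
    total_cost = 0.0
    for log in logs:
        counts[log.model][1] += 1
        if log.succeeded:
            counts[log.model][0] += 1
        total_cost += log.cost_usd
    success_rates = {model: s / n for model, (s, n) in counts.items()}
    avg_cost = total_cost / len(logs) if logs else 0.0
    return success_rates, avg_cost
```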
Also considering adding Anthropic's Claude models to the mix since their pricing is competitive. Has anyone built a decision tree or used LangChain's router chains effectively? Really curious about real-world implementations beyond the basic tutorials.
I'd argue against token counting for routing - it's too brittle. We switched to a semantic classifier that analyzes query complexity using embeddings. Takes ~20ms extra latency but routes way more accurately. Train it on your own data where you know which model performed better. Our false positive rate (sending simple queries to expensive models) dropped from 23% to 8% after the switch. The upfront ML work pays off pretty quickly.
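To make the idea concrete, here's a minimal sketch of one way to route on embeddings: a nearest-centroid classifier over labeled queries. This is a simplified stand-in, not the exact classifier described above. `toy_embed` is a placeholder hashed bag-of-words function so the example runs on its own; you'd swap in a real embedding model, and `CentroidRouter` is a hypothetical name.

```python
import hashlib
import math
from typing import Callable, Dict, List, Sequence, Tuple

Vector = List[float]


def toy_embed(text: str, dim: int = 64) -> Vector:
    """Placeholder embedding (hashed bag of words). Replace with a real model."""
    vec = [0.0] * dim
    for word in text.lower().split():
        idx = int(hashlib.md5(word.encode()).hexdigest(), 16) % dim
        vec[idx] += 1.0
    return vec


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class CentroidRouter:
    """Routes a query to the model whose training queries it most resembles."""

    def __init__(self, embed: Callable[[str], Vector]):
        self.embed = embed
        self.centroids: Dict[str, Vector] = {}

    def fit(self, examples: Sequence[Tuple[str, str]]) -> None:
        """examples: (query, model that handled it best) pairs from your own logs."""
        grouped: Dict[str, List[Vector]] = {}
        for query, model in examples:
            grouped.setdefault(model, []).append(self.embed(query))
        for model, vecs in grouped.items():
            # Centroid = element-wise mean of that model's query embeddings.
            self.centroids[model] = [sum(col) / len(vecs) for col in zip(*vecs)]

    def route(self, query: str) -> str:
        qv = self.embed(query)
        return max(self.centroids, key=lambda m: cosine(qv, self.centroids[m]))
```

Usage is `router = CentroidRouter(toy_embed)`, then `router.fit(labeled_pairs)` and `router.route(query)`. A trained classifier on top of the embeddings would likely do better than centroids, but this shows the shape of it.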
We're running a similar setup but with three tiers: GPT-3.5 for basic Q&A, GPT-4 for reasoning/analysis, and Claude for long-form content. Numbers after 2 months: 65% of traffic to 3.5 ($847/month), 30% to GPT-4 ($2,134/month), 5% to Claude ($312/month). Total cost down 52% vs all-GPT-4. Key insight: we added a fallback mechanism - if 3.5 returns low confidence scores, auto-retry with GPT-4. Adds complexity but catches those edge cases you mentioned.
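The fallback is simple in outline. Here's a minimal sketch, assuming each model call is wrapped in a function that returns the answer plus a confidence score in [0, 1]. How you derive that score is up to you; the `ModelCall` type, the 0.6 threshold, and the function names here are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

# Assumed wrapper signature: query -> (answer text, confidence in [0, 1]).
ModelCall = Callable[[str], Tuple[str, float]]


@dataclass
class RoutedAnswer:
    text: str
    model: str
    escalated: bool  # True if the cheap model's answer was discarded


def answer_with_fallback(
    query: str,
    cheap: ModelCall,
    strong: ModelCall,
    threshold: float = 0.6,  # hypothetical cutoff, tune on your own data
    cheap_name: str = "gpt-3.5-turbo",
    strong_name: str = "gpt-4",
) -> RoutedAnswer:
    """Try the cheap model first; retry with the strong one on low confidence."""
    text, confidence = cheap(query)
    if confidence >= threshold:
        return RoutedAnswer(text, cheap_name, escalated=False)
    text, _ = strong(query)
    return RoutedAnswer(text, strong_name, escalated=True)
```

Keep in mind that every escalated request pays for both calls, so it's worth logging the `escalated` flag and watching that rate.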