Running into issues with our GPT-4 integration where response quality seems to drift over time. We're using it for customer support ticket classification and I'm seeing accuracy drop from ~92% to ~85% over the past month.
Currently we're tracking quality through periodic manual reviews, but this feels insufficient. The manual reviews catch obvious failures but miss subtle quality degradation.
Considering implementing more automated monitoring. Has anyone used tools like Weights & Biases or LangSmith for this? Is it worth the setup overhead?
Also curious about statistical approaches - thinking about running chi-square tests on classification distributions weekly to catch drift early.
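For what it's worth, a weekly drift check like that is only a few lines with SciPy. A minimal sketch, with made-up counts standing in for your real per-category ticket volumes:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical weekly classification counts per category
# (billing, technical, account, other) -- replace with real data.
baseline_counts = np.array([420, 310, 150, 120])  # a known-good reference week
current_counts = np.array([380, 290, 210, 120])   # this week's counts

# Chi-square test of homogeneity: did the category distribution shift?
chi2, p, dof, _ = chi2_contingency(np.vstack([baseline_counts, current_counts]))

if p < 0.05:
    print(f"Possible drift: chi2={chi2:.1f}, p={p:.4f}")
```

One caveat: with high ticket volumes this test gets very sensitive, so tiny distribution shifts come back "significant" even when they're operationally harmless. Pairing the p-value with an effect-size threshold helps.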
The tricky part is defining "quality" beyond accuracy. Our customers care about tone and helpfulness too, which is harder to measure programmatically.
What metrics have you found most predictive of real-world performance issues? And how often are you retraining/updating prompts based on production feedback?
That accuracy drop screams prompt injection or edge case drift to me. We had something similar last year - turns out users were submitting tickets with increasingly complex formatting that wasn't in our original training examples. Are you versioning your prompts and tracking what percentage of responses trigger your confidence thresholds? Also, GPT-4's behavior can shift between model updates even on the same version string. I'd implement semantic similarity scoring against known-good responses for each classification category before this gets worse.
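A minimal sketch of that similarity-scoring idea, assuming you already have embeddings for a set of vetted known-good responses per category (random vectors stand in for real embedding-model output here):

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# known_good[category] = embeddings of vetted responses. In production these
# would come from your embedding model; random stand-ins for illustration.
rng = np.random.default_rng(0)
known_good = {
    "billing": rng.normal(size=(5, 16)),
    "technical": rng.normal(size=(5, 16)),
}

def similarity_score(category: str, response_vec: np.ndarray) -> float:
    # Best match against any vetted example in that category.
    return max(cosine_sim(v, response_vec) for v in known_good[category])

# Flag responses whose best match falls below a tuned threshold.
score = similarity_score("billing", rng.normal(size=16))
flagged = score < 0.7  # threshold is a guess; calibrate on labeled data
```

The threshold and the "max over examples" aggregation are both choices to tune; mean-of-top-k is less noisy if your vetted set is large.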
Quick clarification - when you say accuracy dropped to 85%, how exactly are you measuring that? Are you comparing against human-labeled ground truth, or using some automated eval? Also, is this happening uniformly across all ticket categories or are you seeing certain types (billing, technical, etc.) degrading faster than others? The monitoring approach really depends on whether this is a data drift issue vs model behavior change vs prompt brittleness.
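To the per-category question: breaking the labeled sample down by category is cheap and usually the fastest way to see whether the drop is uniform. A small sketch over hypothetical human-labeled review data:

```python
from collections import defaultdict

# Hypothetical (category, predicted, ground_truth) triples from human review.
labeled = [
    ("billing", "billing", "billing"),
    ("billing", "technical", "billing"),
    ("technical", "technical", "technical"),
]

correct = defaultdict(int)
total = defaultdict(int)
for cat, pred, truth in labeled:
    total[cat] += 1
    correct[cat] += int(pred == truth)

per_cat = {c: correct[c] / total[c] for c in total}
# -> {'billing': 0.5, 'technical': 1.0}
```

If one category is dragging the average down, that points at data drift in that ticket type rather than a global model-behavior change.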