Turn production traces into evals, compare prompts and models, and improve quality with every release.
User feedback on Braintrust suggests a strong reputation for innovation in AI-driven solutions, particularly appreciated for its user-friendly interface and robust performance in both customer-facing and internal tools. However, some users express dissatisfaction with the lack of effective feedback loop mechanisms, describing current systems as inadequate and frustrating. The pricing sentiment is generally positive, indicating value for money given its advanced features. Overall, Braintrust is regarded positively, though users hope for improvements in the feedback and evaluation processes.
Mentions (30d): 1
Reviews: 0
Platforms: 2
GitHub Stars: 12 (3 forks)
Industry: information technology & services
Employees: 160
Funding Stage: Series B
Total Funding: $121.1M
GitHub followers: 174
GitHub repos: 85
GitHub stars: 12
npm packages: 2
Pricing found: $0 / month, $4/GB, $2.50/1k, $249 / month, $3/GB
Anyone actually built a real feedback loop for Claude agents in production? Because "run evals and pray" isn't cutting it
So I've been running a multi-agent setup with Claude for a few months now, mostly customer-facing stuff, some internal tooling. And I keep hitting this problem that I think a lot of people here are probably dealing with too, but nobody really talks about.

You ship a prompt change. Or you swap from Sonnet to Opus for one step in the chain. Or you add a new tool. Everything looks fine in your evals. You push it. Then three days later someone on the team notices the agent is subtly doing something wrong. Not catastrophically wrong, just... you can sense something's off. Maybe it stopped including a specific field in its output. Maybe it started being way too verbose in one branch of the logic. Whatever it is, it's not a crash, it's a vibe shift.

And then you're sitting there doing archaeology on your own system. Manually diffing outputs, reading through traces, asking teammates "hey did you notice anything weird last Tuesday." It's miserable.

I've been thinking a lot about what the fastest feedback loop in agent engineering, the one almost nobody is running, actually looks like. Because right now my loop is:

ship change → wait for someone to complain → investigate → fix → hope I didn't break something else

That's... pre-CI/CD era thinking applied to agents. And it's wild that this is where most of us are at.

The thing is, traditional software solved this ages ago. You write tests, you run them in CI, you get red/green before merge. But agents are so much messier. Outputs are non-deterministic, "correct" is fuzzy, and the failure modes are subtle behavioral drift rather than stack traces. So most teams I talk to (including mine, honestly) end up relying on vibes. Does the agent feel like it's working? Cool, ship it.

What I actually want is something that:

- Watches production behavior continuously
- Notices when things drift from expected patterns
- Connects the regression to the specific change that caused it
- Tells me before a customer does
- Ideally feeds that learning back so the same failure doesn't happen again

I have tracing set up (Langfuse). It's good for what it does. But it still feels like it stops at "here's what happened" rather than "here's what went wrong and why." I generate a ton of observability data that nobody looks at until something is already broken. The closed-loop part, where the system actually learns from failures, is what's missing.

I've been looking at a few things. LangSmith, Arize, Braintrust... they all cover pieces of this. Recently stumbled on Bento, which seems to be trying to do the full closed-loop thing: tracing + regression detection + feeding fixes back into the system. Haven't gone deep enough to know if it actually delivers on that promise, but the framing resonates with what I'm trying to build. If anyone's tried it, I'd be curious to hear.

But honestly I'm more interested in hearing what people here have actually built or cobbled together. Like:

- Are you running evals against production traffic or just pre-deploy?
- How do you detect behavioral drift that isn't an outright error?
- When you find a regression, how do you trace it back to which change caused it?
- Has anyone built something where the agent actually gets better from production failures automatically rather than you manually tweaking prompts?

I feel like this is the unsexy infrastructure problem that's going to separate teams who can actually run agents reliably from teams who are perpetually firefighting. But maybe I'm overthinking this and everyone's just vibing their way through production lol

Would love to hear what your setups look like, especially if you're running Claude agents at any kind of scale where you can't just eyeball every interaction.
submitted by /u/Fine-Discipline-818
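The two drift symptoms the post describes (a required output field going missing, and responses getting much more verbose) can be checked mechanically by comparing production traces grouped by the change that produced them. The following is only a minimal sketch under stated assumptions: the trace shape (`version`, `output`, `text` keys), the function names, and the thresholds are hypothetical illustrations, not the API of Braintrust, Langfuse, or any other tool mentioned here.

```python
from statistics import mean


def summarize(traces, required_field):
    """Return (field presence rate, mean output length in characters)."""
    present = [required_field in t["output"] for t in traces]
    lengths = [len(t["text"]) for t in traces]
    return sum(present) / len(traces), mean(lengths)


def detect_drift(traces, baseline_version, candidate_version, required_field,
                 max_rate_drop=0.05, max_length_ratio=1.5):
    """Compare two versions of an agent on simple behavioral signals.

    Each trace is assumed (hypothetically) to be a dict with:
      "version": label of the prompt/model/tool change that produced it
      "output":  the structured output as a dict
      "text":    the raw response text
    Returns a list of human-readable findings; empty means no drift flagged.
    """
    # Group traces by version so any regression is tied to a specific change.
    by_version = {}
    for t in traces:
        by_version.setdefault(t["version"], []).append(t)

    for version in (baseline_version, candidate_version):
        if not by_version.get(version):
            raise ValueError(f"no traces for version {version!r}")

    base_rate, base_len = summarize(by_version[baseline_version], required_field)
    cand_rate, cand_len = summarize(by_version[candidate_version], required_field)

    findings = []
    # Signal 1: the required field shows up less often than it used to.
    if base_rate - cand_rate > max_rate_drop:
        findings.append(
            f"{candidate_version}: '{required_field}' present in "
            f"{cand_rate:.0%} of outputs (baseline {base_rate:.0%})"
        )
    # Signal 2: responses are much longer on average than the baseline.
    if cand_len > base_len * max_length_ratio:
        findings.append(
            f"{candidate_version}: mean output length {cand_len:.0f} chars "
            f"(baseline {base_len:.0f})"
        )
    return findings
```

Tagging every trace with the version of the change that produced it is the design choice that matters here: it is what lets a flagged difference point at a specific prompt, model, or tool change instead of "something last Tuesday". Fixed thresholds are a simplification; with non-deterministic outputs and small samples, a statistical test would be a more careful choice.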
Repository Audit Available
Deep analysis of braintrustdata/braintrust-sdk — architecture, costs, security, dependencies & more
Yes, Braintrust offers a free tier. Pricing found: $0 / month, $4/GB, $2.50/1k, $249 / month, $3/GB
Key features include: Observability, Evals, SOC 2 Type II, SSO / SAML, HIPAA compliant, GDPR compliant, Granular permissions.
Braintrust is commonly used for: Monitoring AI model performance in real-time, Detecting anomalies in production environments, Evaluating system latency and response times, Tracking cost efficiency of AI operations, Ensuring compliance with data regulations, Implementing continuous integration and deployment practices.
Braintrust integrates with: AWS CloudWatch, Google Cloud Operations, Azure Monitor, Datadog, Prometheus, Grafana, Slack, Jira, PagerDuty, New Relic.
Braintrust has a public GitHub repository with 12 stars.
Elad Gil
Investor at Elad Gil
1 mention

Improving Your Prompt #ai #evals #prompt #improvement #tech #braintrust #analysis
Mar 30, 2026