Langfuse Review — Features, Pricing & User Sentiment | Payloop

Langfuse

observability

subscription + tiered

Traces, evals, prompt management and metrics to debug and improve your LLM application.

Langfuse is recognized for tracking LLM calls effectively, giving teams the visibility into AI operations that production environments require. However, some users raise concerns that it does not model agent topology and may have limited interoperability with other tracing formats. Pricing draws little specific sentiment, though mentions imply it is seen as a paid solution relative to some open-source alternatives. Overall, Langfuse is valued as an observability tool for AI, though it competes with both paid and open-source tools offering varied features.

Mentions (30d)

0

Reviews

0

Platforms

4

GitHub Stars

24,100

2,434 forks

15 integrations

1 feature

870,710 npm downloads/wk

Merger / Acquisition

Voices Discussing Langfuse

Max Mergenthaler

CEO at Langfuse

1 mention

Chris Lattner

CEO at Modular AI (Mojo)

1 mention

Latest Videos

Langfuse Context: All things MCP with Adam Jones (Tech Lead at Anthropic)

Jan 6, 2026

Continuous Evaluation, Monitoring, and Operations of AI Agents with AWS Bedrock AgentCore & Langfuse

Features & Use Cases

Features

Gain deep visibility into your traces

Use Cases

Monitoring LLM performance in production
Tracking API usage and costs
Analyzing user interactions with LLMs
Identifying bottlenecks in LLM workflows
Debugging multi-agent systems
Optimizing LLM response times
Conducting A/B testing on LLM outputs
Collecting feedback for LLM improvements

Company Intel

Industry

information technology & services

Employees

19

Funding Stage

Merger / Acquisition

Total Funding

$4.1M

Social Reach

828

GitHub followers

Developer Ecosystem

18

GitHub repos

24,100

GitHub stars

20

npm packages

22

HuggingFace models

870,710

npm downloads/wk

19,249,322

PyPI downloads/mo

Top Mention

devto@vola-trebla24 engagement3/21/2026

OpenTelemetry just standardized LLM tracing. Here's what it actually looks like in code.

Every LLM tool invents its own tracing format. Langfuse has one. Helicone has one. Arize has one. If...

Mentions by Platform

youtube

Langfuse AI

Pricing

subscription + tiered

Pricing found: $29 / month, $8/100k, $199 / month, $8/100k, $300/mo

Sentiment Overview

Positive: 19% (4)

Neutral: 81% (17)

Negative: 0% (0)

Common Pain Points

cost tracking (3), anthropic bill (1), surprise bill (1), cost monitoring (1), usage monitoring (1), token usage (1)

Top Topics

pricing (3), api (3), model selection (3), agents (3), cost optimization (3), scalability (2), open source (2), streaming (2), workflow (2), security (1), support (1), migration (1), data privacy (1), performance (1), documentation (1), deployment (1), RAG (1)

Recent Mentions

reddit @[unknown] 6/20/2026

Most AI features don't fail because of the model

Been sitting on this for a bit after watching an AI feature at my last job basically die a slow death post-launch, and I think the model-failure explanation is usually a red herring tbh. Concrete version of what I mean. We had an agent doing first-pass triage on inbound support tickets, routing + drafting a suggested reply for a human to approve. Launched, looked great for like 6 weeks. Engineering was watching latency (fine, consistently under 2s) and error rate (also fine, sub 1%). Product was watching ticket resolution time, which actually improved initially. Meanwhile the support team itself started quietly noticing the suggested replies were getting weirdly generic for a specific category of tickets, nothing crashing, nothing erroring, just worse. They mentioned it in a slack channel a couple times. Nobody connected it to anything bc it wasnt anyone's job to connect it, support flagged quality, eng was looking at uptime, product was looking at a downstream metric that hadnt actually moved yet bc the degradation was gradual. By the time it showed up as an actual problem (resolution time metric finally dipped, maybe 2 months in) everyone's first assumption was "the model must have changed" or "we need a better prompt." Root cause when we actually dug in was a data source the agent pulled context from had silently started returning stale info after an unrelated pipeline change. Not a model problem at all. A "three teams had three different partial views of the same system and none of them overlapped" problem. Seen versions of this with teams running LangSmith, Langfuse, even fully custom setups someone built in-house. The specific tool wasnt really the variable. What was missing every time was something dumber than tooling, just a shared place where the trace, the quality complaint, and the downstream metric could actually sit next to each other and get looked at by someone who could act on all three at once. 
Could be pattern matching on too small a sample, genuinely not sure. But curious if this tracks for anyone else. What actually killed your AI feature after launch, was it actually the model, or was it more of a "nobody owned the full picture" thing dressed up as a model problem after the fact?

submitted by /u/northernBladee

reddit @[unknown] 6/18/2026

Building independent LLM drift detection - sharing the methodology, looking for feedback on the approach

Disclosed upfront: I run [Tickerr dot ai], an independent external monitor for AI APIs. Today it tracks latency, TTFT, uptime, and error rates across major models. I’m trying to validate a more specific idea before building too much.

Basic transport health is not the hard part. If Claude/OpenAI/Gemini gets slow, times out, or throws 5xx errors, most teams can catch that with APM, logs, Sentry, Langfuse, Helicone, Datadog, etc. The harder failure mode seems to be silent model behavior drift: the API returns 200, latency is normal, no exception is thrown, output looks plausible, but JSON adherence, tool-calling, refusal behavior, reasoning quality, or instruction-following has quietly degraded.

This gets worse with agentic systems. In a normal chat, drift may produce a bad answer, but in an agentic workflow the model can silently choose the wrong tool, stop early, mark a task as complete, or take a bad action while everything still looks successful at the API level. The system is running and confidently doing worse work. User complaints are still the primary detection mechanism currently for these.

VIGIL (arXiv 2605.08747) found 65 to 88 percent of false-success reports happened at literally zero task progress. DeployBench (2606.05238) found most failures were the system stopping against a softer bar it set for itself and returning clean. Plausible-in-isolation is the failure mode itself, not a sign you are safe, which is why a single model's output never alerts on its own.

That's what I'm thinking to build: an external drift detection probe on top of LLM APIs that stays out of your system, runs continuous checks every hour to find these silent degradations, and sends proactive alerts.

Rough idea:

- External canary suite: run private fixed prompts on a schedule against major models. Track schema adherence, instruction-following, refusal/over-refusal, output length, tool-call format, and simple deterministic correctness checks.
- Drift baseline: Do not judge a single output in isolation. Track whether today’s behavior has materially shifted versus that model’s own baseline.
- Cross-model comparison: For some task types, compare model behavior against peer models. Not to say which model is “right”, but to detect abnormal divergence. Example: “Sonnet and Gemini usually disagree 12% of the time on this task type; today disagreement is 28%.”
- Optional bring your own prompts: A paid tier where you provide some critical prompts from your own workload. Tickerr runs them on a schedule and alerts if behavior drifts from your baseline. Prompts would remain private and would not be public benchmark prompts.

What I’m trying to learn:

Is this technically sound enough to be useful, or are there other failure modes that I am missing or that are more valuable?

Which alerts would you actually care about?

- JSON/schema adherence drift
- tool-call format drift
- refusal/over-refusal drift
- output length drift
- cross-model disagreement spike
- bring-your-own-prompt regression alerts

Would you pay for this, or would you just build it yourself? If you would pay, what pricing feels realistic?

- $19/month
- $99/month
- $299+/month for team/Slack/webhook/BYO prompts

Brutal feedback welcome. If this is not a real pain, I’d rather know now, or which direction you feel makes more sense to take this.

submitted by /u/Remarkable_Divide755
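The "drift baseline" and "cross-model disagreement" checks described in the post above can be sketched in a few lines. This is only an illustration under stated assumptions, not Tickerr's method: the function name `drift_alert`, the thresholds, and the seven-day history are all hypothetical, and only the 12% and 28% figures come from the post.

```python
from statistics import mean, pstdev

def drift_alert(baseline_rates, today_rate, min_sigma=3.0, min_abs_shift=0.05):
    """Compare today's rate against a model's own historical baseline.

    baseline_rates: past daily rates in [0, 1], e.g. the share of canary
    prompts where two models disagreed or where schema validation failed.
    Returns (alert, baseline_mean, shift).
    """
    mu = mean(baseline_rates)
    sigma = pstdev(baseline_rates)
    shift = abs(today_rate - mu)
    # Alert only when the shift is both statistically unusual (min_sigma)
    # and large enough to matter in practice (min_abs_shift).
    threshold = max(min_sigma * sigma, min_abs_shift)
    return shift > threshold, mu, shift

# Hypothetical week of disagreement rates hovering around 12%.
history = [0.11, 0.12, 0.13, 0.12, 0.12, 0.11, 0.13]
alert, mu, shift = drift_alert(history, today_rate=0.28)
print(f"baseline={mu:.2f} shift={shift:.2f} alert={alert}")
```

The check deliberately never scores a single output; it only asks whether an aggregate rate has moved away from its own history, which is the point the post makes about plausible-in-isolation failures.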

reddit @[unknown] 6/6/2026

I built a local CLI to estimate and cap AI coding-agent spend before a run gets expensive

I build apps with coding agents, and one thing kept bothering me: before starting a run, I often had no idea what it might cost. Sometimes the agent is useful. Sometimes it keeps retrying the same bad path, rewrites its plan, burns tokens, and only later I realize that the run was more expensive than expected.

So I built Runcap. It is a free MIT local CLI for developers using AI coding agents. The idea is simple:

- estimate a run before starting
- set a hard budget cap
- run a local gateway that can stop over-budget calls
- compress logs / JSON / stack traces before forwarding
- record what happened during the run
- generate a rescue prompt when the agent gets stuck

It is not trying to replace Langfuse, LiteLLM, Helicone, or other observability/gateway tools. Those are useful, but I wanted something smaller and more direct for my own workflow: a local “cost seatbelt” before a coding-agent run gets out of control.

Install: npm install -g runcap

GitHub: https://github.com/kirder24-code/ai-agent-manager

It is still early and probably rough. I would really appreciate feedback from people using Claude Code, Cursor, Codex, Aider, or other coding-agent workflows. Main question: would you actually keep a tool like this running day to day, or is this too much friction for your workflow?

submitted by /u/Ok-Serve4908
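The "estimate a run, then enforce a hard cap" idea from the post above can be illustrated with a small guard object. This is a sketch of the general technique, not Runcap's implementation: `BudgetGuard`, `estimate_cost`, and the per-million-token prices are hypothetical placeholders.

```python
class BudgetExceeded(Exception):
    """Raised when a call would push spend past the configured cap."""

class BudgetGuard:
    def __init__(self, cap_usd):
        self.cap_usd = cap_usd
        self.spent_usd = 0.0

    def authorize(self, estimated_cost_usd):
        # Refuse the call up front if the worst-case estimate breaks the cap.
        if self.spent_usd + estimated_cost_usd > self.cap_usd:
            raise BudgetExceeded(
                f"would spend {self.spent_usd + estimated_cost_usd:.2f} USD, "
                f"cap is {self.cap_usd:.2f} USD"
            )

    def record(self, actual_cost_usd):
        # Book the real cost once the call has completed.
        self.spent_usd += actual_cost_usd

def estimate_cost(input_tokens, max_output_tokens, usd_per_m_in, usd_per_m_out):
    """Worst-case cost of one call, given prices per million tokens."""
    return (input_tokens * usd_per_m_in + max_output_tokens * usd_per_m_out) / 1_000_000

# Hypothetical prices: 3 USD per million input tokens, 15 USD per million output.
guard = BudgetGuard(cap_usd=1.00)
cost = estimate_cost(100_000, 10_000, usd_per_m_in=3.0, usd_per_m_out=15.0)
guard.authorize(cost)
guard.record(cost)
```

Checking the estimate before the call, rather than summing costs afterwards, is what makes the cap "hard": an over-budget request is rejected instead of being discovered on the bill.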

reddit @[unknown] 5/31/2026

I stopped using Claude in the browser for 80% of my daily tasks and my usage actually went up

This is going to sound counterintuitive but let me explain. I love Claude. I use Opus for deep work, Sonnet for quick stuff. I was probably using claude 15 to 20 times a day. Summaries, brainstorming, code review, email drafts, research questions. Standard knowledge worker usage. But I noticed a pattern. Most of my usage happened in bursts. I would open Claude, do 4 or 5 things, then close it and not come back for 3 hours. Not because I did not need it, but because I forgot about it. I was deep in something else and the thought "I should ask Claude about this" did not occur to me in the moment. So I built a small thing. An agent that runs Claude Sonnet on the backend, connected to my calendar, todoist, email, and a few notion databases. It lives as a contact in my iMessage called "C" (very creative I know). Now instead of opening claude when I remember to, I text C throughout the day the same way I text anyone else. "What is on my calendar after 3pm." "Draft a reply to that email from alex, keep it short, say yes to the timeline." "Remind me to review the pitch deck before tomorrow's call." "What did I write in my product notes last week about the onboarding flow." My actual Claude usage went UP significantly. Not because the model got better but because the access point changed. Texting is a zero-friction action I already do 80 times a day. Opening a browser tab is a deliberate decision I have to remember to make. The deep work still happens in claude.ai. When I need the full context window, artifacts, file uploads, the browser is still better. But that is maybe 20% of my interactions. The other 80% are quick, context-specific queries that take 30 seconds and are perfectly suited to a text message. Stack: claude sonnet via API, a small express server for the tool integrations (google calendar, todoist, notion, gmail), photon codes for iMessage delivery, deployed on a $7 render instance. Langfuse for tracing when something goes weird. 
Total cost is about $35 a month in API calls which is less than what I was already spending on the Pro subscription that I still also have. The meta point: Claude is incredible. The browser is holding it back for most daily use cases. Not because the browser is bad but because it requires intent. The best AI interactions are the ones that happen when you barely think about it.

submitted by /u/ScaryAd2555

reddit @[unknown] 5/26/2026

Made an awesome-list for everything LLM cost, would love contributions

So a few months back I got surprised by my Anthropic bill which somehow racked up like $400 ish on a staging key in a few weeks just running evals, no budget cap

pretty dumb in hindsight

I mean it’s not a big cost but I should have been careful nonetheless

After that I started keeping a notes file of tools that actually helped reduce cost

stuff like token counters, pricing pages that update properly, caching layers, prompt compression libs, observability tools (helicone, langfuse, langsmith, etc)

it slowly grew to 80–90 entries so I cleaned it up and put it on github: https://github.com/ankitvirdi4/awesome-llm-cost

what’s in there right now:

- pricing calculators + token counters
- observability / tracing (helicone, langfuse, langsmith, openllmetry, phoenix)
- caching (gptcache, semantic caching approaches)
- model routers (openrouter, notdiamond, portkey)
- prompt compression + context window stuff
- eval cost tracking
- self hosting / GPU cost calculators

everything is linted (awesome-lint), short descriptions for each entry, and I checked links recently so nothing should be dead

if there’s anything you’ve used that saved you money on inference, drop it here or send a PR

especially looking for more prompt compression stuff, that section feels kinda weak rn

not affiliated with anything listed btw just got tired of having 80 bookmarks

submitted by /u/OldComposerbruh

reddit @[unknown] 5/24/2026

ig nobody is talking about the real reason most AI agents fail in the real world

we spend a lot of time in this community talking about capabilities. context windows, reasoning benchmarks, multi-step tool use, how well a model can write code or pass a bar exam. i'm not dismissing any of that. capabilities matter. but when i look at AI products failing in production, the capability of the model is almost never the issue. ive been building and consulting on AI agents for about 18 months. the failure modes i see constantly are: users do not go where the agent lives. the agent has a beautiful web interface. the user visits it twice and stops. not because the agent was unhelpful. because opening a browser tab is a cognitive action that requires intention, and most of daily life does not create the right moment for that intention. humans do not change their behavior to accommodate useful tools. useful tools have to show up in the behavior humans already have. the agent is reactive when it needs to be proactive. the smartest human assistant you have ever had did not just answer questions. they showed up. they flagged things before you asked. they sent you the thing you did not know you needed. most AI agents are search bars with a personality. they wait. waiting is not intelligence in practice. intelligence in practice is noticing and acting. the agent has no memory of who you are. you tell it your preferences, your context, your situation, and then come back 3 days later and it knows nothing. this is not a model limitation. the model can remember if you feed it the right context. this is an architecture choice that most teams make wrong because they are thinking about sessions instead of relationships. the agents that are succeeding in production are not necessarily the ones with the best models. they are the ones that live in whatsapp and imessage and telegram where users already are. that proactively reach out when something relevant happens. that maintain coherent memory of the person across weeks and months of conversation. 
the tooling to build this way exists now. agno and langchain for orchestration, photon codes for the cross channel messaging surface, langfuse for traces and memory debugging, good persistence in postgres or supabase. the architecture is not magic. what is still rare is the mindset of treating the channel and the memory as primary constraints rather than afterthoughts. i think the gap between what AI agents can theoretically do and what they actually do for people in their daily lives is almost entirely a distribution and persistence problem, not a capability problem. we are solving for the wrong thing.

submitted by /u/bcoz_why_not__

reddit @[unknown] 5/21/2026

How are people actually tracking OpenAI costs in production?

Curious what this community actually uses for OpenAI cost monitoring on real production apps. There are a lot of "I got a $X surprise bill" posts here, but I rarely see the follow-up: what tooling did people land on after the wake-up call?

For those running OpenAI in production:

- Real-time tracking or just checking the billing dashboard monthly?
- Rolling your own or using a tool (Helicone, Langfuse, etc.)?
- Breaking costs down per user / per feature, or just looking at the total?

Asking because I'm building in this space and trying to figure out what people actually do vs. what they say they should do.

submitted by /u/VariousHour7390
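For the "per user / per feature" question in the post above, a roll-your-own breakdown can be as small as one aggregation over logged calls. The sketch below assumes each call is already logged as a dict with `user`, `feature`, and `cost_usd` fields; that record shape and the `cost_breakdown` helper are hypothetical, not any tool's schema.

```python
from collections import defaultdict

def cost_breakdown(calls, key):
    """Sum cost_usd per distinct value of `key`, largest spender first."""
    totals = defaultdict(float)
    for call in calls:
        totals[call[key]] += call["cost_usd"]
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))

# Hypothetical call log; in practice these rows would come from your own
# request logging or from an observability tool's export.
calls = [
    {"user": "u1", "feature": "summarize", "cost_usd": 0.50},
    {"user": "u2", "feature": "summarize", "cost_usd": 0.25},
    {"user": "u1", "feature": "chat", "cost_usd": 1.25},
]

by_user = cost_breakdown(calls, "user")
by_feature = cost_breakdown(calls, "feature")
print(by_user)
print(by_feature)
```

The same log answers both questions; what usually decides whether this is possible at all is tagging each request with a user and feature at call time.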

reddit @[unknown] 5/6/2026

Built a Claude Code monitoring tool

Built a lightweight monitoring & observability tool for Claude Code, runs inside VSCode.

Repo: https://github.com/yessGlory17/argus

Quick demo: https://www.youtube.com/watch?v=HmHOI1PBn_M

If Argus helps you ship better Claude Code sessions, I would greatly appreciate a GitHub Star.

submitted by /u/fIak88

reddit @[unknown] 5/4/2026

Anyone actually built a real feedback loop for Claude agents in production? Because "run evals and pray" isn't cutting it

So I've been running a multi-agent setup with Claude for a few months now mostly customer-facing stuff, some internal tooling. And i keep hitting this problem that I think a lot of people here are probably dealing with too but nobody really talks about. You ship a prompt change. Or you swap from Sonnet to Opus for one step in the chain. Or you add a new tool. Everything looks fine in your evals. You push it. Then three days later someone on the team notices the agent is subtly doing something wrong not catastrophically wrong, just... You can sense something's off. Maybe it stopped including a specific field in its output. Maybe it started being way too verbose in one branch of the logic. Whatever it is, it's not a crash, it's a vibe shift. And then you're sitting there doing archaeology on your own system. Manually diffing outputs, reading through traces, asking teammates "hey did you notice anything weird last Tuesday." It's miserable. I've been thinking a lot about what the fastest feedback loop in agent engineering that almost nobody is running actually looks like. Because right now my loop is: ship change → wait for someone to complain → investigate → fix → hope I didn't break something else That's... pre-CI/CD era thinking applied to agents. And it's wild that this is where most of us are at. The thing is, traditional software solved this ages ago. You write tests, you run them in CI, you get red/green before merge. But agents are so much messier. Outputs are non-deterministic, "correct" is fuzzy, and the failure modes are subtle behavioral drift rather than stack traces. So most teams I talk to (including mine honestly) end up relying on vibes. Does the agent feel like it's working? Cool, ship it. 
What I actually want is something that:

- Watches production behavior continuously
- Notices when things drift from expected patterns
- Connects the regression to the specific change that caused it
- Tells me before a customer does
- Ideally feeds that learning back so the same failure doesn't happen again

I have tracing set up (Langfuse). It's good for what it does. But it still feels like it stops at "here's what happened" rather than "here's what went wrong and why." I generate a ton of observability data that nobody looks at until something is already broken. The closed-loop part where the system actually learns from failures, that's what's missing.

I've been looking at a few things. LangSmith, Arize, Braintrust... they all cover pieces of this. Recently stumbled on Bento which seems to be trying to do the full closed-loop thing: tracing + regression detection + feeding fixes back into the system. Haven't gone deep enough to know if it actually delivers on that promise but the framing resonates with what I'm trying to build. If anyone's tried it i'd be curious to hear.

But honestly I'm more interested in hearing what people here have actually built or cobbled together. Like:

- Are you running evals against production traffic or just pre-deploy?
- How do you detect behavioral drift that isn't an outright error?
- When you find a regression, how do you trace it back to which change caused it?
- Has anyone built something where the agent actually gets better from production failures automatically rather than you manually tweaking prompts?

I feel like this is the unsexy infrastructure problem that's going to separate teams who can actually run agents reliably from teams who are perpetually firefighting. But maybe I'm overthinking this and everyone's just vibing their way through production lol

Would love to hear what your setups look like, especially if you're running Claude agents at any kind of scale where you can't just eyeball every interaction.

submitted by /u/Fine-Discipline-818

reddit @[unknown] 4/21/2026

working on a small add-on that tells me what actually mattered in a session, would love feedback!

https://preview.redd.it/mrdha7g6xfwg1.png?width=1504&format=png&auto=webp&s=464cc2ddcbcdbce6664a6c687942559131ac7e26

I’ve been working on a small Claude Code add-on because I keep having the same experience: the task finishes, it mostly works, and I’m still left wondering what it actually did along the way.

I know there are already some good ways to get more visibility into Claude Code:

- OTEL / Langfuse setups
- local dashboards
- session timelines
- cost / usage monitoring

Those all seem useful if you want raw telemetry, team usage, or deeper debugging. But for my own use, a lot of that feels heavier than what I actually want day to day.

Most of the time I’m not asking: “show me every event”

I’m asking:

- what looked weird?
- what got blocked?
- what did it touch outside the task?
- what should I actually review before I trust this?

That’s what I’m trying to build with Clawrity. The current idea is a local hook-based reviewer that gives me a short summary after a session, something like:

what matters

- touched auth/session.ts even though the task was a billing form fix
- ran 6 shell commands, including npm install
- attempted to read .env; blocked
- retried the same migration 3 times

review first

- src/auth/session.ts
- db/migrations/2026_04_20_add_status.sql
- package.json

So not a dashboard, not a tracing sink, not “more logs.” More like: “ok, what actually deserves my attention before approving this and moving on?”

Still early, but I’d really love feedback from people using Claude in more advanced ways than I :)

- would you actually want this?
- where do existing tools already solve this well enough?
- what would make this useful vs just noisy?

submitted by /u/Relevant_Decision989
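The "what matters" summary described in the post above amounts to a few rules applied to a session's event log. The sketch below shows one way such rules might look; it is not Clawrity's code, and the event shape (`type`, `path`, `command`, `blocked`) and the `review_session` helper are hypothetical rather than Claude Code's actual hook payload.

```python
from collections import Counter
from fnmatch import fnmatch

def review_session(events, in_scope_globs, retry_threshold=3):
    """Return short findings worth a human look, in event order."""
    findings = []
    for ev in events:
        # Rule 1: edits to files that fall outside the stated task scope.
        if ev["type"] == "edit" and not any(
            fnmatch(ev["path"], glob) for glob in in_scope_globs
        ):
            findings.append(f"touched {ev['path']} outside task scope")
        # Rule 2: reads that a guard refused.
        if ev["type"] == "read" and ev.get("blocked"):
            findings.append(f"attempted to read {ev['path']}; blocked")
    # Rule 3: the same shell command repeated many times (likely a retry loop).
    runs = Counter(ev["command"] for ev in events if ev["type"] == "shell")
    for command, count in runs.items():
        if count >= retry_threshold:
            findings.append(f"ran `{command}` {count} times")
    return findings

events = [
    {"type": "edit", "path": "src/billing/form.tsx"},
    {"type": "edit", "path": "src/auth/session.ts"},
    {"type": "read", "path": ".env", "blocked": True},
] + [{"type": "shell", "command": "npm run migrate"}] * 3

findings = review_session(events, in_scope_globs=["src/billing/*"])
for line in findings:
    print("-", line)
```

Keeping the output to a handful of rule hits, rather than the full event stream, is the design choice the post argues for.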

reddit @[unknown] 4/9/2026

I built an open-source tool that shows exactly where your Claude Code tokens go

I was spending $200+/month on Claude Code with zero visibility into where the money went. So I built AgentTrace.

Existing tools (LangSmith, Langfuse) trace LLM calls: prompt in, completion out. But when your agent spawns 3 sub-agents that read 40 files, search 5 URLs, and retry tests 3 times, you need to know: which decisions were worth the money?

AgentTrace traces agent DECISIONS, not API calls. It builds a decision tree showing what each agent chose to do, what it cost, and whether it contributed to the outcome.

One command setup: `npm install -g agenttrace-sdk && agenttrace init`

Every Claude Code session auto-generates a cost report showing effective spend vs waste, with actionable recommendations and projected weekly savings.

Example: a $1.97 session showed 42% waste: research agent read 6 irrelevant files, docs agent fetched 4 redundant pages, 2 test failures from missing env vars. Each finding comes with a specific fix.

Open source, MIT licensed. Would love feedback from this community since you're the ones actually spending on Claude Code daily.

submitted by /u/Intrepid_Income6025
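The "effective spend vs waste" figure in the post above can be pictured as a sum over a decision tree. The sketch below is an assumption about how such a report might be computed, not AgentTrace's implementation: the `Decision` class and every per-node cost are invented for illustration and chosen only so that the totals match the post's $1.97 session with 42% waste.

```python
from dataclasses import dataclass, field

@dataclass
class Decision:
    label: str
    cost_usd: float
    contributed: bool  # did this decision help produce the final outcome?
    children: list["Decision"] = field(default_factory=list)

def totals(node):
    """Return (total_cost, wasted_cost) for the subtree rooted at node."""
    total = node.cost_usd
    wasted = 0.0 if node.contributed else node.cost_usd
    for child in node.children:
        child_total, child_wasted = totals(child)
        total += child_total
        wasted += child_wasted
    return total, wasted

# Hypothetical session; per-node costs are illustrative only.
session = Decision("fix failing tests", 0.50, True, [
    Decision("research agent: read irrelevant files", 0.40, False),
    Decision("docs agent: fetched redundant pages", 0.25, False),
    Decision("test runs failed on missing env vars", 0.18, False),
    Decision("implement and verify fix", 0.64, True),
])

total, wasted = totals(session)
print(f"total ${total:.2f}, waste {wasted / total:.0%}")
```

Whether a decision "contributed" is the hard part in practice; here it is simply a flag on each node.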

reddit @[unknown] 3/21/2026

Built an open-source Agent Firewall to see what Claude Code & MCP servers are actually doing on your machine

I built this after realizing Claude Code was autonomously modifying files, calling APIs, and interacting with my MCP servers, and I had zero visibility into what was happening or why.

Unalome Agent Firewall is a free, local-first desktop app (Tauri v2 + Rust + React, Apache 2.0) that runs entirely on your machine and gives you real-time visibility.

What it does:

- Auto-detects Claude Code, Claude Desktop, running MCP servers
- Real-time action timeline: see every file change, API call, connection
- Auto-backup files before agent modifications + one-click restore
- PII Guardian: scans for exposed API keys, passwords, credit cards
- Connection Monitor: logs outbound traffic, flags unknown domains
- Cost Tracker: per-model spend across 40+ Claude models + budget limits
- Kill Switch: pause Claude Code or any MCP server instantly
- MCP Security Scanner: detects prompt injection, dangerous capabilities
- Weekly Activity Report: exportable, shareable HTML summary

Why I built this: The transparency gap felt critical. Claude Code can read/write files, execute code, interact with MCP servers, and I realized I had no structured way to audit what it actually did. Existing tools (LangSmith, Langfuse) are built for production teams; nothing existed for an individual developer who just wants to know: what did my agent do?

Plus, the MCP security landscape in 2025 is rough. Real-world attacks via tool poisoning and prompt injection have exfiltrated private repo code, API keys, and chat histories. A scan of 2,614 MCP implementations found 82% vulnerable to path traversal. The issue: users had no visibility into what was happening.

Status:

- v0.1.0 fully built & signed (macOS: signed + notarized; Linux: .deb/.rpm/.AppImage; Windows: .msi/.exe)
- Open-source, Apache 2.0
- Repo: https://github.com/unalome-ai/unalome-firewall

Happy to discuss the MCP detection approach, Tauri/Rust stack, or how to extend support for other agents. Feedback welcome, especially on what other Claude integrations people want covered.

submitted by /u/Status_Degree_6469

pricing, api, security, scalability

reddit @[unknown] 3/20/2026

My chatbot switches from text to voice mid-conversation. same memory, same context, you just start talking. 2 months of Claude, open-sourcing it for you to try.

been building this since late january. started as a weekend RAG chatbot so visitors could ask about my work. it answers from my case studies. that part was straightforward. then i kept going and it turned into the best learning experience i've had with Claude.

still a work in progress. there are UI bugs i'm fixing and voice mode has edge cases. but the architecture is solid and you can try it right now.

the whole thing was built with Claude Code. the chatbot runs on Claude Sonnet, and Claude Code wrote most of the codebase including the eval framework. two months of building every other day and i've learned more about production LLM systems than in any course.

here's what's in it:

streaming responses. tokens come in one by one, not dumped as a wall of text. i tuned the speed so you can actually follow along as it writes. fast enough to feel responsive, slow enough to read comfortably. like watching it think.

text to voice mid-conversation. you're chatting with those streaming responses, and at any point you hit the mic and just start talking. same context, same memory. OpenAI Realtime API handles speech-to-speech. keeping state synced between both modes was the hardest part to get right.

RAG with contextual links. the chatbot doesn't just answer. when it pulls from a case study, it shows you a clickable link to that article right in the conversation. every new article i publish gets indexed automatically via RAG. i don't touch the prompt. the chatbot learns new content on its own just by me publishing it.

71 automated evals across 10 categories. factual accuracy, safety/jailbreak, RAG quality, source attribution, multi-turn, voice quality. every PR runs the full suite. i broke prod twice before building this. 53 of the 71 evals exist because something actually broke. the system writes tests from its own failures.

6-layer defense against prompt injection. keyword detection, canary tokens, fingerprinting, anti-extraction, online safety scoring (Haiku rates every response in background), and an adversarial red team that auto-generates 20+ attack variants. someone tried to jailbreak it after i shared it on linkedin. that's when i took security seriously.

observability dashboard. every decision the pipeline makes gets traced in Langfuse: tool_decision, embedding, retrieval, reranking, generation. built a custom dashboard with 8 tabs to monitor it all.

stack: Claude Sonnet (generation + tool_use), OpenAI embeddings (pgvector), Haiku (background safety scoring), Langfuse, Supabase, Vercel.

like i said, it's not perfect. some UI rough edges, voice mode still needs polish on certain browsers. but the core works and everything is in the repo.

repo: github.com/santifer/cv-santiago (the repo has everything. RAG pipeline, defense layers, eval suite, prompt templates, voice mode). feel free to clone and try. happy to answer questions.

submitted by /u/Beach-Independent
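The per-stage tracing described in the post above (tool_decision, embedding, retrieval, reranking, generation) can be illustrated without any SDK. The sketch below uses only the Python standard library and is deliberately not the Langfuse API: the `Trace` class and its `span` method are hypothetical stand-ins showing what a trace with one timed span per pipeline stage looks like.

```python
import time
from contextlib import contextmanager

STAGES = ["tool_decision", "embedding", "retrieval", "reranking", "generation"]

class Trace:
    """Collects one record per pipeline stage for a single request."""

    def __init__(self, name):
        self.name = name
        self.spans = []

    @contextmanager
    def span(self, stage, **metadata):
        start = time.perf_counter()
        record = {"stage": stage, "metadata": metadata, "error": None}
        try:
            yield record
        except Exception as exc:
            # Keep the failure on the span, then let it propagate.
            record["error"] = repr(exc)
            raise
        finally:
            record["duration_s"] = time.perf_counter() - start
            self.spans.append(record)

trace = Trace("chat-turn")
for stage in STAGES:
    with trace.span(stage):
        pass  # the real stage work (embedding call, retrieval, ...) goes here

for s in trace.spans:
    print(f"{s['stage']:<14} {s['duration_s'] * 1000:.3f} ms")
```

A hosted tracer adds storage, sampling, and a UI on top, but the unit of data is the same: a named span with timing, metadata, and any error attached.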

hackernews@jairooh4 engagement3/8/2026

Ask HN: How are you monitoring AI agents in production?

With the recent incidents (DataTalks database wipe by Claude Code, Replit agent deleting data during code freeze), it's clear that running AI agents in production without observability is risky.

Common failure modes I've seen: no visibility into what the agent did step-by-step, surprise LLM bills from untracked token usage, risky outputs going undetected, and no audit trail for post-mortems.

I've been building AgentShield (https://useagentshield.com), an observability SDK for AI agents. It does execution tracing, risk detection on outputs, cost tracking per agent/model, and human-in-the-loop approval for high-risk actions. Plugs into LangChain, CrewAI, and OpenAI Agents SDK with a 2-line integration.

Curious what others are using. Rolling your own monitoring? LangSmith? Langfuse? Or just hoping for the best?

pricing, api, scalability, model selection

Integrations

OpenAI, AWS Lambda, Clickhouse, Slack, Zapier, GitHub, Google Cloud Platform, Microsoft Azure, Jira, Trello, Notion, Datadog, Sentry, Prometheus, Grafana

Categories

AI/ML, DevOps, Security, Analytics, Developer Tools

Frequently Asked Questions

How much does Langfuse cost?

Pricing found: $29 / month, $8/100k, $199 / month, $8/100k, $300/mo

What are the main features of Langfuse?

Key features include: Gain deep visibility into your traces.

What is Langfuse used for?

Langfuse is commonly used for: Monitoring LLM performance in production, Tracking API usage and costs, Analyzing user interactions with LLMs, Identifying bottlenecks in LLM workflows, Debugging multi-agent systems, Optimizing LLM response times.

What does Langfuse integrate with?

Langfuse integrates with: OpenAI, AWS Lambda, Clickhouse, Slack, Zapier, GitHub, Google Cloud Platform, Microsoft Azure, Jira, Trello.

Is Langfuse open source?

Langfuse has a public GitHub repository with 24,100 stars.

What are common complaints about Langfuse?