Traces, evals, prompt management and metrics to debug and improve your LLM application.
Users credit Langfuse with reliable tracking of LLM calls, which gives production teams visibility into what their AI systems are doing. The main criticisms raised in social mentions are that it traces individual LLM calls without modeling agent topology (tool calls, handoffs, decision trees) and that it uses its own tracing format, which may limit interoperability with other tools. Pricing draws little direct comment, although some mentions contrast it with open-source alternatives. Overall, Langfuse is regarded as a useful observability tool for AI applications that competes with both paid and open-source products.
Mentions (30d): 0
Reviews: 0
Platforms: 4
GitHub Stars: 24,100 (2,434 forks)
Industry: information technology & services
Employees: 19
Funding Stage: Merger / Acquisition
Total Funding: $4.1M
GitHub followers: 828
GitHub repos: 18
GitHub stars: 24,100
npm packages: 20
HuggingFace models: 22
npm downloads/wk: 870,710
PyPI downloads/mo: 19,249,322
OpenTelemetry just standardized LLM tracing. Here's what it actually looks like in code.
Every LLM tool invents its own tracing format. Langfuse has one. Helicone has one. Arize has one. If...
How are people actually tracking OpenAI costs in production?
Curious what this community actually uses for OpenAI cost monitoring on real production apps. There are a lot of "I got a $X surprise bill" posts here, but I rarely see the follow-up: what tooling did people land on after the wake-up call?

For those running OpenAI in production:
- Real-time tracking or just checking the billing dashboard monthly?
- Rolling your own or using a tool (Helicone, Langfuse, etc.)?
- Breaking costs down per user / per feature, or just looking at the total?

Asking because I'm building in this space and trying to figure out what people actually do vs. what they say they should do.
submitted by /u/VariousHour7390
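The per-user / per-feature breakdown asked about above can be sketched in a few lines of Python. This is a minimal illustration, not any tool's actual implementation: the `Usage` record, the `example-model` name, and the prices in `PRICE_PER_MTOK` are all hypothetical, and real rates should come from your provider's current price list.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical USD prices per 1M tokens; replace with your provider's real rates.
PRICE_PER_MTOK = {"example-model": {"input": 1.00, "output": 4.00}}


@dataclass(frozen=True)
class Usage:
    """One LLM call, tagged with who made it and for which feature (assumed schema)."""
    user_id: str
    feature: str
    model: str
    input_tokens: int
    output_tokens: int


def call_cost(u: Usage) -> float:
    """Cost of a single call in USD, from token counts and the price table."""
    price = PRICE_PER_MTOK[u.model]
    return (u.input_tokens * price["input"] + u.output_tokens * price["output"]) / 1_000_000


def cost_breakdown(records):
    """Sum cost per (user_id, feature) pair."""
    totals = defaultdict(float)
    for u in records:
        totals[(u.user_id, u.feature)] += call_cost(u)
    return dict(totals)


records = [
    Usage("alice", "chat", "example-model", 1_000_000, 500_000),
    Usage("alice", "search", "example-model", 200_000, 0),
    Usage("bob", "chat", "example-model", 0, 250_000),
]
breakdown = cost_breakdown(records)
for (user, feature), usd in sorted(breakdown.items()):
    print(f"{user:<6} {feature:<7} ${usd:.2f}")
```

Tagging each call with a user and feature at logging time is the design choice that matters here; the aggregation itself is trivial once those tags exist.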
Built a Claude Code monitoring tool
Built a lightweight monitoring & observability tool for Claude Code, runs inside VSCode.
Repo: https://github.com/yessGlory17/argus
Quick demo: https://www.youtube.com/watch?v=HmHOI1PBn_M
If Argus helps you ship better Claude Code sessions, I would greatly appreciate a GitHub Star.
submitted by /u/fIak88
Anyone actually built a real feedback loop for Claude agents in production? Because "run evals and pray" isn't cutting it
So I've been running a multi-agent setup with Claude for a few months now, mostly customer-facing stuff, some internal tooling. And I keep hitting this problem that I think a lot of people here are probably dealing with too but nobody really talks about.

You ship a prompt change. Or you swap from Sonnet to Opus for one step in the chain. Or you add a new tool. Everything looks fine in your evals. You push it. Then three days later someone on the team notices the agent is subtly doing something wrong. Not catastrophically wrong, just... you can sense something's off. Maybe it stopped including a specific field in its output. Maybe it started being way too verbose in one branch of the logic. Whatever it is, it's not a crash, it's a vibe shift.

And then you're sitting there doing archaeology on your own system. Manually diffing outputs, reading through traces, asking teammates "hey did you notice anything weird last Tuesday." It's miserable.

I've been thinking a lot about what the fastest feedback loop in agent engineering that almost nobody is running actually looks like. Because right now my loop is:

ship change → wait for someone to complain → investigate → fix → hope I didn't break something else

That's... pre-CI/CD era thinking applied to agents. And it's wild that this is where most of us are at.

The thing is, traditional software solved this ages ago. You write tests, you run them in CI, you get red/green before merge. But agents are so much messier. Outputs are non-deterministic, "correct" is fuzzy, and the failure modes are subtle behavioral drift rather than stack traces. So most teams I talk to (including mine honestly) end up relying on vibes. Does the agent feel like it's working? Cool, ship it.

What I actually want is something that:
- Watches production behavior continuously
- Notices when things drift from expected patterns
- Connects the regression to the specific change that caused it
- Tells me before a customer does
- Ideally feeds that learning back so the same failure doesn't happen again

I have tracing set up (Langfuse). It's good for what it does. But it still feels like it stops at "here's what happened" rather than "here's what went wrong and why." I generate a ton of observability data that nobody looks at until something is already broken. The closed-loop part, where the system actually learns from failures, is what's missing.

I've been looking at a few things. LangSmith, Arize, Braintrust... they all cover pieces of this. Recently stumbled on Bento, which seems to be trying to do the full closed-loop thing: tracing + regression detection + feeding fixes back into the system. Haven't gone deep enough to know if it actually delivers on that promise but the framing resonates with what I'm trying to build. If anyone's tried it I'd be curious to hear.

But honestly I'm more interested in hearing what people here have actually built or cobbled together. Like:
- Are you running evals against production traffic or just pre-deploy?
- How do you detect behavioral drift that isn't an outright error?
- When you find a regression, how do you trace it back to which change caused it?
- Has anyone built something where the agent actually gets better from production failures automatically rather than you manually tweaking prompts?

I feel like this is the unsexy infrastructure problem that's going to separate teams who can actually run agents reliably from teams who are perpetually firefighting. But maybe I'm overthinking this and everyone's just vibing their way through production lol

Would love to hear what your setups look like, especially if you're running Claude agents at any kind of scale where you can't just eyeball every interaction.
submitted by /u/Fine-Discipline-818
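The two drift symptoms named in the post above (a field that stops appearing, outputs that become much longer) can be checked with a simple window comparison. The sketch below is only an illustration under stated assumptions: outputs are JSON strings, `drift_alerts` and `field_rate` are hypothetical helper names, and the thresholds are arbitrary defaults rather than recommended values.

```python
import json
from statistics import mean


def field_rate(outputs, field):
    """Fraction of outputs that parse as a JSON object containing `field`."""
    if not outputs:
        return 0.0
    hits = 0
    for raw in outputs:
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            continue  # an unparseable output counts as missing the field
        if isinstance(parsed, dict) and field in parsed:
            hits += 1
    return hits / len(outputs)


def drift_alerts(baseline, current, field, max_rate_drop=0.2, max_length_ratio=1.5):
    """Compare two windows of agent outputs and describe any drift found."""
    if not baseline or not current:
        return []
    alerts = []
    base_rate = field_rate(baseline, field)
    cur_rate = field_rate(current, field)
    if base_rate - cur_rate > max_rate_drop:
        alerts.append(f"'{field}' present in {cur_rate:.0%} of outputs, was {base_rate:.0%}")
    base_len = mean(len(o) for o in baseline)
    cur_len = mean(len(o) for o in current)
    if cur_len > base_len * max_length_ratio:
        alerts.append(f"mean output length {cur_len:.0f} chars, was {base_len:.0f}")
    return alerts


# Baseline window (before a change) vs. current window (after it).
baseline = ['{"order_id": 1, "reply": "ok"}'] * 4
current = ['{"reply": "ok"}'] * 4
alerts = drift_alerts(baseline, current, "order_id")
print(alerts)
```

Splitting the windows at each deploy or prompt change is one way to tie an alert back to the change that caused it, which is the attribution step the post says is missing.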
working on a small add-on that tells me what actually mattered in a session, would love feedback!
https://preview.redd.it/mrdha7g6xfwg1.png?width=1504&format=png&auto=webp&s=464cc2ddcbcdbce6664a6c687942559131ac7e26

I’ve been working on a small Claude Code add-on because I keep having the same experience: the task finishes, it mostly works, and I’m still left wondering what it actually did along the way.

I know there are already some good ways to get more visibility into Claude Code:
- OTEL / Langfuse setups
- local dashboards
- session timelines
- cost / usage monitoring

Those all seem useful if you want raw telemetry, team usage, or deeper debugging. But for my own use, a lot of that feels heavier than what I actually want day to day.

Most of the time I’m not asking: “show me every event”

I’m asking:
- what looked weird?
- what got blocked?
- what did it touch outside the task?
- what should I actually review before I trust this?

That’s what I’m trying to build with Clawrity. The current idea is a local hook-based reviewer that gives me a short summary after a session, something like:

what matters
- touched auth/session.ts even though the task was a billing form fix
- ran 6 shell commands, including npm install
- attempted to read .env; blocked
- retried the same migration 3 times

review first
- src/auth/session.ts
- db/migrations/2026_04_20_add_status.sql
- package.json

So not a dashboard, not a tracing sink, not “more logs.” More like: “ok, what actually deserves my attention before approving this and moving on?”

Still early, but I’d really love feedback from people using Claude in more advanced ways than I :)
- would you actually want this?
- where do existing tools already solve this well enough?
- what would make this useful vs just noisy?
submitted by /u/Relevant_Decision989
I built an open-source tool that shows exactly where your Claude Code tokens go
I was spending $200+/month on Claude Code with zero visibility into where the money went. So I built AgentTrace.

Existing tools (LangSmith, Langfuse) trace LLM calls: prompt in, completion out. But when your agent spawns 3 sub-agents that read 40 files, search 5 URLs, and retry tests 3 times, you need to know: which decisions were worth the money?

AgentTrace traces agent DECISIONS, not API calls. It builds a decision tree showing what each agent chose to do, what it cost, and whether it contributed to the outcome.

One command setup: `npm install -g agenttrace-sdk && agenttrace init`

Every Claude Code session auto-generates a cost report showing effective spend vs waste, with actionable recommendations and projected weekly savings. Example: a $1.97 session showed 42% waste (research agent read 6 irrelevant files, docs agent fetched 4 redundant pages, 2 test failures from missing env vars). Each finding comes with a specific fix.

Open source, MIT licensed. Would love feedback from this community since you're the ones actually spending on Claude Code daily.
submitted by /u/Intrepid_Income6025
Built an open-source Agent Firewall to see what Claude Code & MCP servers are actually doing on your machine
I built this after realizing Claude Code was autonomously modifying files, calling APIs, and interacting with my MCP servers, and I had zero visibility into what was happening or why.

Unalome Agent Firewall is a free, local-first desktop app (Tauri v2 + Rust + React, Apache 2.0) that runs entirely on your machine and gives you real-time visibility into agent activity.

What it does:
- Auto-detects Claude Code, Claude Desktop, running MCP servers
- Real-time action timeline: see every file change, API call, connection
- Auto-backup files before agent modifications + one-click restore
- PII Guardian: scans for exposed API keys, passwords, credit cards
- Connection Monitor: logs outbound traffic, flags unknown domains
- Cost Tracker: per-model spend across 40+ Claude models + budget limits
- Kill Switch: pause Claude Code or any MCP server instantly
- MCP Security Scanner: detects prompt injection, dangerous capabilities
- Weekly Activity Report: exportable, shareable HTML summary

Why I built this: The transparency gap felt critical. Claude Code can read/write files, execute code, interact with MCP servers, and I realized I had no structured way to audit what it actually did. Existing tools (LangSmith, Langfuse) are built for production teams; nothing existed for an individual developer who just wants to know: what did my agent do?

Plus, the MCP security landscape in 2025 is rough. Real-world attacks via tool poisoning and prompt injection have exfiltrated private repo code, API keys, and chat histories. A scan of 2,614 MCP implementations found 82% vulnerable to path traversal. The issue: users had no visibility into what was happening.

Status:
- v0.1.0 fully built & signed (macOS: signed + notarized; Linux: .deb/.rpm/.AppImage; Windows: .msi/.exe)
- Open-source, Apache 2.0
- Repo: https://github.com/unalome-ai/unalome-firewall

Happy to discuss the MCP detection approach, Tauri/Rust stack, or how to extend support for other agents. Feedback welcome, especially on what other Claude integrations people want covered.
submitted by /u/Status_Degree_6469
My chatbot switches from text to voice mid-conversation. same memory, same context, you just start talking. 2 months of Claude, open-sourcing it for you to try.
been building this since late january. started as a weekend RAG chatbot so visitors could ask about my work. it answers from my case studies. that part was straightforward. then i kept going and it turned into the best learning experience i've had with Claude.

still a work in progress. there are UI bugs i'm fixing and voice mode has edge cases. but the architecture is solid and you can try it right now.

the whole thing was built with Claude Code. the chatbot runs on Claude Sonnet, and Claude Code wrote most of the codebase including the eval framework. two months of building every other day and i've learned more about production LLM systems than in any course.

here's what's in it:

streaming responses. tokens come in one by one, not dumped as a wall of text. i tuned the speed so you can actually follow along as it writes. fast enough to feel responsive, slow enough to read comfortably. like watching it think.

text to voice mid-conversation. you're chatting with those streaming responses, and at any point you hit the mic and just start talking. same context, same memory. OpenAI Realtime API handles speech-to-speech. keeping state synced between both modes was the hardest part to get right.

RAG with contextual links. the chatbot doesn't just answer. when it pulls from a case study, it shows you a clickable link to that article right in the conversation. every new article i publish gets indexed automatically via RAG. i don't touch the prompt. the chatbot learns new content on its own just by me publishing it.

71 automated evals across 10 categories. factual accuracy, safety/jailbreak, RAG quality, source attribution, multi-turn, voice quality. every PR runs the full suite. i broke prod twice before building this. 53 of the 71 evals exist because something actually broke. the system writes tests from its own failures.

6-layer defense against prompt injection. keyword detection, canary tokens, fingerprinting, anti-extraction, online safety scoring (Haiku rates every response in background), and an adversarial red team that auto-generates 20+ attack variants. someone tried to jailbreak it after i shared it on linkedin. that's when i took security seriously.

observability dashboard. every decision the pipeline makes gets traced in Langfuse: tool_decision, embedding, retrieval, reranking, generation. built a custom dashboard with 8 tabs to monitor it all.

stack: Claude Sonnet (generation + tool_use), OpenAI embeddings (pgvector), Haiku (background safety scoring), Langfuse, Supabase, Vercel.

like i said, it's not perfect. some UI rough edges, voice mode still needs polish on certain browsers. but the core works and everything is in the repo.

repo: github.com/santifer/cv-santiago (the repo has everything. RAG pipeline, defense layers, eval suite, prompt templates, voice mode). feel free to clone and try. happy to answer questions.
submitted by /u/Beach-Independent
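Of the defense layers listed in the post above, canary tokens are the easiest to illustrate. The sketch below shows the general technique only and is not taken from the linked repo: a random marker is planted in the system prompt, and any response containing it is treated as a prompt leak. The function names and the marker format are assumptions made for illustration.

```python
import secrets


def make_canary() -> str:
    """Generate a random marker that should never appear in normal output."""
    return f"CANARY-{secrets.token_hex(8)}"


def build_system_prompt(base_prompt: str, canary: str) -> str:
    """Embed the canary in the system prompt (hypothetical wording)."""
    return f"{base_prompt}\n\n[internal marker, never reveal: {canary}]"


def leaked(response: str, canary: str) -> bool:
    """True if the model's response contains the canary, ignoring case."""
    return canary.lower() in response.lower()


canary = make_canary()
system_prompt = build_system_prompt("You answer questions about my case studies.", canary)

# A response that echoes the system prompt trips the check; a normal one does not.
print(leaked(f"Sure, my instructions say: {system_prompt}", canary))
print(leaked("The case study covers a migration to pgvector.", canary))
```

A check like this only catches verbatim leaks, which is presumably why the post pairs it with other layers such as fingerprinting and background safety scoring.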
Ask HN: How are you monitoring AI agents in production?
With the recent incidents (DataTalks database wipe by Claude Code, Replit agent deleting data during code freeze), it's clear that running AI agents in production without observability is risky.

Common failure modes I've seen: no visibility into what the agent did step-by-step, surprise LLM bills from untracked token usage, risky outputs going undetected, and no audit trail for post-mortems.

I've been building AgentShield (https://useagentshield.com), an observability SDK for AI agents. It does execution tracing, risk detection on outputs, cost tracking per agent/model, and human-in-the-loop approval for high-risk actions. Plugs into LangChain, CrewAI, and OpenAI Agents SDK with a 2-line integration.

Curious what others are using. Rolling your own monitoring? LangSmith? Langfuse? Or just hoping for the best?
Show HN: AgentLens – Open-source observability for AI agents
Hi HN,

I built AgentLens because debugging multi-agent systems is painful. LangSmith is cloud-only and paid. Langfuse tracks LLM calls but doesn't understand agent topology: tool calls, handoffs, decision trees.

AgentLens is a self-hosted observability platform built specifically for AI agents:

- *Topology graph*: see your agent's tool calls, LLM calls, and sub-agent spawns as an interactive DAG
- *Time-travel replay*: step through an agent run frame-by-frame with a scrubber timeline
- *Trace comparison*: side-by-side diff of two runs with color-coded span matching
- *Cost tracking*: 27 models priced (GPT-4.1, Claude 4, Gemini 2.0, etc.)
- *Live streaming*: watch spans appear in real-time via SSE
- *Alerting*: anomaly detection for cost spikes, error rates, latency
- *OTel ingestion*: accepts OTLP HTTP JSON, so any OTel-instrumented app works

Works with LangChain, CrewAI, AutoGen, LlamaIndex, and Google ADK.

Tech: React 19 + FastAPI + SQLite/PostgreSQL. MIT licensed. 231 tests, 100% coverage.

  docker run -p 3000:3000 tranhoangtu/agentlens-observe:0.6.0
  pip install agentlens-observe

Demo GIF and screenshots in the README.

GitHub: https://github.com/tranhoangtu-it/agentlens-observe
Docs: https://agentlens-observe.pages.dev

I'd love feedback on the trace visualization approach and what features matter most for your agent debugging workflow.
Repository Audit Available
Deep analysis of langfuse/langfuse — architecture, costs, security, dependencies & more
Pricing found: $29 / month, $8/100k, $199 / month, $8/100k, $300/mo
Key features include: Gain deep visibility into your traces.
Langfuse is commonly used for: Monitoring LLM performance in production, Tracking API usage and costs, Analyzing user interactions with LLMs, Identifying bottlenecks in LLM workflows, Debugging multi-agent systems, Optimizing LLM response times.
Langfuse integrates with: OpenAI, AWS Lambda, Clickhouse, Slack, Zapier, GitHub, Google Cloud Platform, Microsoft Azure, Jira, Trello.
Langfuse has a public GitHub repository with 24,100 stars.
Nov 25, 2025
Based on user reviews and social mentions, the most common pain points are: cost tracking, surprise bill, cost monitoring, usage monitoring.
Based on 15 social mentions analyzed, 27% of sentiment is positive, 73% neutral, and 0% negative.