View in LangSmith
LangSmith is recognized for providing observability for AI agents, which matters because running agents in production without it is risky. The main complaint in these mentions is that LangSmith is cloud-only and paid, which puts off users who prefer open-source or self-hosted alternatives. Sentiment on pricing is somewhat negative for the same reason. Overall, LangSmith has a solid reputation for its functionality but draws criticism for its availability and cost structure.
Mentions (30d)
0
Reviews
0
Platforms
3
Sentiment
22%
2 positive
Industry
information technology & services
Employees
98
Funding Stage
Series B
Total Funding
$260.0M
Ask HN: How are you monitoring AI agents in production?
With the recent incidents (DataTalks database wipe by Claude Code, Replit agent deleting data during code freeze), it's clear that running AI agents in production without observability is risky.

Common failure modes I've seen: no visibility into what the agent did step-by-step, surprise LLM bills from untracked token usage, risky outputs going undetected, and no audit trail for post-mortems.

I've been building AgentShield (https://useagentshield.com), an observability SDK for AI agents. It does execution tracing, risk detection on outputs, cost tracking per agent/model, and human-in-the-loop approval for high-risk actions. Plugs into LangChain, CrewAI, and OpenAI Agents SDK with a 2-line integration.

Curious what others are using. Rolling your own monitoring? LangSmith? Langfuse? Or just hoping for the best?
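As a rough illustration of the failure modes listed in this post (no step-by-step visibility, untracked token usage, no audit trail), here is a minimal sketch of a hand-rolled trace recorder. It is not AgentShield's or LangSmith's API. The `AgentTrace` class, the `example-model` name, and the prices in `PRICE_PER_1K_TOKENS` are all hypothetical and only stand in for whatever your provider actually charges.

```python
import json
import time
from dataclasses import asdict, dataclass, field

# Hypothetical per-1K-token prices in USD; real prices vary by provider and model.
PRICE_PER_1K_TOKENS = {"example-model": {"input": 0.001, "output": 0.002}}


@dataclass
class Step:
    name: str
    model: str
    input_tokens: int
    output_tokens: int
    started_at: float
    cost_usd: float


@dataclass
class AgentTrace:
    agent: str
    steps: list = field(default_factory=list)

    def record(self, name, model, input_tokens, output_tokens):
        """Append one step and price it from the (assumed) price table."""
        price = PRICE_PER_1K_TOKENS[model]
        cost = (input_tokens * price["input"] + output_tokens * price["output"]) / 1000
        self.steps.append(Step(name, model, input_tokens, output_tokens, time.time(), cost))

    def total_cost(self):
        """Sum the cost of every recorded step."""
        return sum(step.cost_usd for step in self.steps)

    def to_audit_log(self):
        """Serialize the run as JSON lines, one per step, for post-mortems."""
        return "\n".join(json.dumps({"agent": self.agent, **asdict(s)}) for s in self.steps)


trace = AgentTrace(agent="support-bot")
trace.record("plan", "example-model", input_tokens=1000, output_tokens=500)
trace.record("call_tool", "example-model", input_tokens=2000, output_tokens=0)
print(f"total cost: ${trace.total_cost():.4f}")
print(trace.to_audit_log())
```

Keeping one record per step is what makes both the cost roll-up and the audit log cheap to produce; a real setup would likely ship these records to a tracing backend rather than print them.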
Anyone actually built a real feedback loop for Claude agents in production? Because "run evals and pray" isn't cutting it
So I've been running a multi-agent setup with Claude for a few months now, mostly customer-facing stuff, some internal tooling. And I keep hitting this problem that I think a lot of people here are probably dealing with too but nobody really talks about.

You ship a prompt change. Or you swap from Sonnet to Opus for one step in the chain. Or you add a new tool. Everything looks fine in your evals. You push it. Then three days later someone on the team notices the agent is subtly doing something wrong. Not catastrophically wrong, just... you can sense something's off. Maybe it stopped including a specific field in its output. Maybe it started being way too verbose in one branch of the logic. Whatever it is, it's not a crash, it's a vibe shift.

And then you're sitting there doing archaeology on your own system. Manually diffing outputs, reading through traces, asking teammates "hey did you notice anything weird last Tuesday." It's miserable.

I've been thinking a lot about what the fastest feedback loop in agent engineering that almost nobody is running actually looks like. Because right now my loop is:

ship change → wait for someone to complain → investigate → fix → hope I didn't break something else

That's... pre-CI/CD era thinking applied to agents. And it's wild that this is where most of us are at.

The thing is, traditional software solved this ages ago. You write tests, you run them in CI, you get red/green before merge. But agents are so much messier. Outputs are non-deterministic, "correct" is fuzzy, and the failure modes are subtle behavioral drift rather than stack traces. So most teams I talk to (including mine honestly) end up relying on vibes. Does the agent feel like it's working? Cool, ship it.

What I actually want is something that:

- Watches production behavior continuously
- Notices when things drift from expected patterns
- Connects the regression to the specific change that caused it
- Tells me before a customer does
- Ideally feeds that learning back so the same failure doesn't happen again

I have tracing set up (Langfuse). It's good for what it does. But it still feels like it stops at "here's what happened" rather than "here's what went wrong and why." I generate a ton of observability data that nobody looks at until something is already broken. The closed-loop part, where the system actually learns from failures, is what's missing.

I've been looking at a few things. LangSmith, Arize, Braintrust... they all cover pieces of this. Recently stumbled on Bento, which seems to be trying to do the full closed-loop thing: tracing + regression detection + feeding fixes back into the system. Haven't gone deep enough to know if it actually delivers on that promise but the framing resonates with what I'm trying to build. If anyone's tried it I'd be curious to hear.

But honestly I'm more interested in hearing what people here have actually built or cobbled together. Like:

- Are you running evals against production traffic or just pre-deploy?
- How do you detect behavioral drift that isn't an outright error?
- When you find a regression, how do you trace it back to which change caused it?
- Has anyone built something where the agent actually gets better from production failures automatically rather than you manually tweaking prompts?

I feel like this is the unsexy infrastructure problem that's going to separate teams who can actually run agents reliably from teams who are perpetually firefighting. But maybe I'm overthinking this and everyone's just vibing their way through production lol

Would love to hear what your setups look like, especially if you're running Claude agents at any kind of scale where you can't just eyeball every interaction.
submitted by /u/Fine-Discipline-818
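The two drift examples in this post (a field that quietly disappears from the output, and one branch becoming much more verbose) can be checked with very little machinery. The sketch below compares a baseline window of outputs against a current window. It assumes outputs are already parsed into dicts and that both windows are non-empty; the function names, the `answer` and `order_id` fields, and the thresholds are illustrative choices, not anything a specific tool provides.

```python
from statistics import mean


def field_presence_rate(outputs, field_name):
    """Fraction of outputs (dicts) that contain a non-empty value for field_name."""
    if not outputs:
        return 0.0
    return sum(1 for o in outputs if o.get(field_name)) / len(outputs)


def detect_drift(baseline, current, required_fields, text_field="answer",
                 max_presence_drop=0.10, max_length_ratio=1.5):
    """Compare two non-empty windows of agent outputs and return readable findings."""
    findings = []
    # Check 1: did a required field become noticeably rarer?
    for name in required_fields:
        before = field_presence_rate(baseline, name)
        after = field_presence_rate(current, name)
        if before - after > max_presence_drop:
            findings.append(f"field '{name}' presence dropped from {before:.0%} to {after:.0%}")
    # Check 2: did the free-text field get much longer on average?
    base_len = mean(len(o.get(text_field, "")) for o in baseline)
    cur_len = mean(len(o.get(text_field, "")) for o in current)
    if base_len > 0 and cur_len / base_len > max_length_ratio:
        findings.append(f"mean '{text_field}' length grew {cur_len / base_len:.1f}x")
    return findings


baseline = [{"answer": "Refund issued.", "order_id": "A1"},
            {"answer": "Order shipped.", "order_id": "A2"}]
current = [{"answer": "Refund issued. " * 4},
           {"answer": "Order shipped. " * 4, "order_id": "A4"}]
findings = detect_drift(baseline, current, required_fields=["order_id"])
for finding in findings:
    print(finding)
```

Running a check like this per deploy (baseline = traffic before the change, current = traffic after) is one simple way to tie a finding back to the change that caused it, though it only catches drift you thought to measure.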
Show HN: AgentLens – Open-source observability for AI agents
Hi HN,

I built AgentLens because debugging multi-agent systems is painful. LangSmith is cloud-only and paid. Langfuse tracks LLM calls but doesn't understand agent topology: tool calls, handoffs, decision trees.

AgentLens is a self-hosted observability platform built specifically for AI agents:

- *Topology graph*: see your agent's tool calls, LLM calls, and sub-agent spawns as an interactive DAG
- *Time-travel replay*: step through an agent run frame-by-frame with a scrubber timeline
- *Trace comparison*: side-by-side diff of two runs with color-coded span matching
- *Cost tracking*: 27 models priced (GPT-4.1, Claude 4, Gemini 2.0, etc.)
- *Live streaming*: watch spans appear in real-time via SSE
- *Alerting*: anomaly detection for cost spikes, error rates, latency
- *OTel ingestion*: accepts OTLP HTTP JSON, so any OTel-instrumented app works

Works with LangChain, CrewAI, AutoGen, LlamaIndex, and Google ADK.

Tech: React 19 + FastAPI + SQLite/PostgreSQL. MIT licensed. 231 tests, 100% coverage.

    docker run -p 3000:3000 tranhoangtu/agentlens-observe:0.6.0
    pip install agentlens-observe

Demo GIF and screenshots in the README.

GitHub: https://github.com/tranhoangtu-it/agentlens-observe
Docs: https://agentlens-observe.pages.dev

I'd love feedback on the trace visualization approach and what features matter most for your agent debugging workflow.
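To make the "OTLP HTTP JSON" point concrete, here is a sketch of what a single-span trace export body looks like in the OTLP JSON encoding, built with the standard library only. The post does not say which URL AgentLens listens on, so nothing is sent here; OTLP/HTTP receivers conventionally accept traces at a `/v1/traces` path, but treat the real endpoint as something to confirm in the project's docs. The helper name, the `manual-example` scope name, and the `tool.name` attribute are illustrative.

```python
import json
import secrets
import time


def otlp_span_payload(service_name, span_name, duration_ms, attributes):
    """Build an OTLP/JSON trace export body containing a single span."""
    end_ns = time.time_ns()
    start_ns = end_ns - duration_ms * 1_000_000
    return {
        "resourceSpans": [{
            "resource": {"attributes": [
                {"key": "service.name", "value": {"stringValue": service_name}},
            ]},
            "scopeSpans": [{
                "scope": {"name": "manual-example"},  # illustrative scope name
                "spans": [{
                    "traceId": secrets.token_hex(16),  # 16 random bytes as 32 hex chars
                    "spanId": secrets.token_hex(8),    # 8 random bytes as 16 hex chars
                    "name": span_name,
                    "kind": 1,  # SPAN_KIND_INTERNAL
                    # 64-bit integers are written as decimal strings in the JSON encoding
                    "startTimeUnixNano": str(start_ns),
                    "endTimeUnixNano": str(end_ns),
                    "attributes": [
                        {"key": k, "value": {"stringValue": str(v)}}
                        for k, v in attributes.items()
                    ],
                }],
            }],
        }]
    }


payload = otlp_span_payload("demo-agent", "tool_call", 120, {"tool.name": "search"})
body = json.dumps(payload)  # this string would be POSTed with Content-Type: application/json
print(body[:80])
```

In practice an OpenTelemetry SDK exporter would produce this for you; writing it by hand is mainly useful for checking what a collector receives.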
Engineering the Autonomous Local Enterprise: A Technical Blueprint for Agentic RAG and Sovereign AI Infrastructure
# Engineering the Autonomous Local Enterprise: A Technical Blueprint for Agentic RAG and Sovereign AI Infrastructure

The transition from reactive large language model applications to autonomous agentic workflows represents a fundamental paradigm shift in enterprise computing. In the 2025–2026 technological landscape, the industry has moved beyond simple chat interfaces toward systems capable of planning, executing, and refining multi-step workflows over extended temporal horizons. This evolution is underpinned by the convergence of high-performance local inference, sophisticated document understanding, and multi-agent orchestration frameworks that operate within a "sovereign stack", an infrastructure entirely controlled by the organization to ensure data privacy, security, and operational resilience. The architecture of such a system requires a nuanced understanding of hardware constraints, the mathematical implications of model quantization, and the systemic challenges of retrieving context from high-volume, complex document sets.

# Executive Summary: The Rise of Sovereign Intelligence

The contemporary AI landscape is increasingly bifurcated between centralized cloud-based services and a burgeoning movement toward decentralized, sovereign intelligence. For organizations managing sensitive intellectual property, legal documents, or healthcare data, the reliance on third-party APIs introduces unacceptable risks regarding data residency, privacy, and long-term cost volatility. The primary mission of this report is to define the architecture for a fully local, production-ready system that leverages the most advanced open-source components from GitHub and Hugging Face. The proposed system integrates high-fidelity document ingestion, a multi-stage RAG pipeline, and an agentic orchestration layer capable of long-horizon reasoning.
By utilizing reasoning models such as DeepSeek-R1 and Llama 3.3, and optimizing them through advanced quantization, the enterprise can achieve performance levels previously reserved for high-cost cloud providers. This architecture is further enhanced by comprehensive observability through the OpenTelemetry standard, ensuring that every reasoning step and retrieval operation is transparent and verifiable.

# Phase 1: The Local Discovery Engine

Identifying the optimal components for a local sovereign stack requires a rigorous evaluation of active maintenance, documentation quality, and community health. The following repositories and transformers represent the current state-of-the-art for local LLM deployment with agentic RAG.

# Top GitHub Repositories for Local Agentic RAG

|**Repository**|**Stars**|**Last Updated**|**Primary Language**|**Key Strength**|**Critical Limitation**|
|:-|:-|:-|:-|:-|:-|
|**langchain-ai/langchain**|125,000|2026-01|Python/TS|700+ integrations; modular agentic workflows.|High abstraction complexity; steep learning curve.|
|**langgenius/dify**|114,000|2026-01|Python/TS|Visual drag-and-drop workflow builder; built-in RAG.|Less flexibility for custom low-level Python hacks.|
|**infiniflow/ragflow**|70,000|2025-12|Python|Deep document understanding; visual chunk inspection.|Resource-heavy; requires robust GPU for layout parsing.|
|**run-llama/llama\_index**|46,500|2025-12|Python/TS|Superior data indexing; 150+ data connectors.|Transition from ServiceContext to Settings can be confusing.|
|**zylon-ai/private-gpt**|52,000|2025-11|Python|Production-ready; 100% offline; OpenAI API compatible.|Gradio UI is basic; designed primarily for document Q&A.|
|**Mintplex-Labs/anything-llm**|25,000|2026-01|Node.js|All-in-one desktop/Docker app; multi-user support.|Workspace-based isolation can limit cross-context queries.|
|**DSProject/Docling**|12,000|2026-01|Python|Industry-leading table extraction (97.9% accuracy).|Speed scales linearly with page count (slower than LlamaParse).|

# Top Hugging Face Transformers for Reasoning and RAG

|**Model**|**Downloads**|**Task**|**Base Model**|**Params**|**Hardware (4-bit)**|**Fine-tuning**|
|:-|:-|:-|:-|:-|:-|:-|
|**DeepSeek-R1-Distill-Qwen-32B**|2.1M|Reasoning|Qwen 2.5|32.7B|24GB VRAM (RTX 4090).|Yes (LoRA).|
|**DeepSeek-R1-Distill-Llama-70B**|1.8M|Reasoning|Llama 3.3|70.6B|48GB VRAM (2x 4090).|Yes (LoRA).|
|**Llama-3.3-70B-Instruct**|5.5M|General/RAG|Llama 3.3|70B|48GB VRAM (2x 4090).|Yes.|
|**Qwen 2.5-72B-Instruct**|3.2M|Coding/RAG|Qwen 2.5|72B|48GB VRAM.|Yes.|
|**Ministral-8B-Instruct**|800K|Edge RAG|Mistral|8B|8GB VRAM (RTX 3060).|Yes.|

# Phase 2: Hardware Topographies and Inference Optimization

The viability of local intelligence is strictly dictated by the memory bandwidth and VRAM capacity of the deployment target. In 2025, the release of the NVIDIA RTX 5090 introduced a significant leap in local capability, featuring 32GB of GDDR7 memory and a bandwidth of approximately 1,792 GB/s, representing a 77% improvement over its predecessor.

# The Physics of Inference: Bandwidth vs. Compute

A detailed 2025 NVIDIA research pap
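The VRAM figures in the model table and the bandwidth comparison in Phase 2 can be sanity-checked with back-of-the-envelope arithmetic. The sketch below makes two simplifying assumptions that are not from the post: weights take exactly `bits / 8` bytes per parameter (ignoring KV cache, activations, and quantization overhead), and single-stream decoding reads every weight once per token, so bandwidth divided by weight size gives a rough upper bound on tokens per second. The 1,008 GB/s figure for the RTX 4090 is also not in the post; it is that card's published memory bandwidth.

```python
def weight_memory_gb(params_billion, bits_per_weight=4):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9


def max_decode_tokens_per_s(bandwidth_gb_s, weights_gb):
    """Rough upper bound on decode speed if each token reads all weights once."""
    return bandwidth_gb_s / weights_gb


RTX_5090_BANDWIDTH_GB_S = 1792  # from the post
RTX_4090_BANDWIDTH_GB_S = 1008  # published spec, not from the post

qwen_32b_gb = weight_memory_gb(32.7)    # DeepSeek-R1-Distill-Qwen-32B
llama_70b_gb = weight_memory_gb(70.6)   # DeepSeek-R1-Distill-Llama-70B
uplift = RTX_5090_BANDWIDTH_GB_S / RTX_4090_BANDWIDTH_GB_S - 1

print(f"32.7B at 4-bit: {qwen_32b_gb:.2f} GB of weights")
print(f"70.6B at 4-bit: {llama_70b_gb:.2f} GB of weights")
print(f"bandwidth uplift: {uplift:.1%}")
print(f"decode bound, 32.7B on 5090: "
      f"{max_decode_tokens_per_s(RTX_5090_BANDWIDTH_GB_S, qwen_32b_gb):.0f} tokens/s")
```

Under these assumptions the 32.7B model needs about 16 GB for weights, which leaves headroom within the 24GB listed in the table, and the 70.6B model needs about 35 GB, consistent with the 48GB row. Real throughput will be lower than the bound.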
Key features include: Agent debugging tools, Performance monitoring dashboards, Real-time observability metrics, Error tracking and reporting, Agent performance evaluation, Deployment management for AI agents, Customizable alerting system, Integration with CI/CD pipelines.
LangSmith is commonly used for: Monitoring AI agent performance in production, Debugging issues in multi-agent systems, Evaluating the effectiveness of AI agents, Preventing data loss in AI applications, Managing deployment of AI agents, Integrating observability into CI/CD workflows.
LangSmith integrates with: OpenAI, AWS Lambda, Google Cloud Platform, Microsoft Azure, Slack, Jira, GitHub, CircleCI, Docker, Kubernetes.
Based on user reviews and social mentions, the most common pain points are: cost tracking, token usage.