Harness AI Review — Features, Pricing & User Sentiment | Payloop

Harness AI

ai-devops · cicd · subscription + freemium + per-seat + tiered · Free tier

Harness is a unified, end-to-end AI software delivery platform to manage the SDLC using purpose-built AI agents.

Users of Harness AI appreciate its multi-agent architecture, particularly its capacity for improving long-running applications through autonomous iterations. However, the available discussion consists mostly of a few posts about replication and implementation rather than comprehensive user reviews. Pricing sentiment is not explicitly discussed, but given the open-source nature, it might be perceived as cost-effective for developers. Overall, Harness AI has a positive reputation among developers for its capability to optimize and automate complex coding tasks, though it is primarily discussed in niche technical communities.

Mentions (30d)

44

9 this week

Reviews

0

Platforms

2

Sentiment

14%

17 positive

Pain Score: 2/100 · 20 integrations · 10 features · Series E

Voices Discussing Harness AI

Lisa Su

CEO at AMD

1 mention

The AI Index

Research at Stanford HAI

1 mention

Latest Videos

Load Testing Vs Stress Testing | Resilience Testing | Harness


Apr 9, 2026

Enable self-service environments with Harness Internal Developer Portal


Apr 8, 2026




Features & Use Cases

Features

Continuous Delivery GitOps, Continuous Integration, Internal Developer Portal, Infrastructure as Code Management, Database DevOps, Artifact Registry, AI Test Automation, Resilience Testing, Feature Management Experimentation, AI SRE

Use Cases

Automate CI/CD pipelines for multi-cloud deployments
Accelerate developer onboarding with enterprise-grade IDP
Integrate database changes into deployment pipelines
Implement AI-powered predictive analytics for software releases
Modernize end-to-end testing with AI test authoring
Utilize feature flags for controlled software releases
Enhance security by identifying vulnerabilities in the SDLC
Optimize cloud spending with AI-driven recommendations

Company Intel

Industry

information technology & services

Employees

1,700

Funding Stage

Series E

Total Funding

$802.1M

Top Mention

reddit · @killerexelon · 102 engagement · 5/16/2026

I replicated Anthropic's Generator-Evaluator harness to build a website through 12 adversarial AI iterations - here's the result and what I learned

Anthropic recently published their [harness design for long-running apps](https://www.anthropic.com/engineering/harness-design-long-running-apps), a multi-agent architecture inspired by GANs where a Generator builds code and an Evaluator critiques it in a loop. I built my own version using Kiro CLI and used it to generate a marketing website for my project [Mnemo](https://github.com/Mnemo-mcp/Mnemo) (persistent memory for AI coding agents).

**The architecture:**

Planner (runs once) → Generator ↔ Evaluator (12 iterations)

Each agent is a separate CLI process with zero shared context. They communicate only through files (spec.md, eval-report.md). The Evaluator uses Playwright to actually browse the live site, not just read code.

**What made it work:**

- **Clean slate per invocation**: each agent starts fresh and reads only its input files. Prevents context anxiety.
- **Playwright MCP for testing**: the evaluator navigates, clicks, resizes viewports. Catches visual bugs code review never would.
- **Anthropic's frontend design skill**: explicitly penalizes generic AI patterns (Inter font, purple gradients, card layouts). Forces creative risk-taking.
- **Continuous iteration, not retry-on-failure**: all 12 rounds run regardless. Each one improves.

**The progression was wild:**

- Iteration 1: Exactly what you'd expect from AI, functional but forgettable.
- Iteration 4: Generator pivoted to "Terminal Noir": IBM Plex Mono, amber on black, grain textures, scanlines. This is the kind of creative leap that doesn't happen in single-shot generation.
- Iterations 5-12: Polish, accessibility, responsive fixes, reduced-motion support.

**Stats:**

- Total time: 3h 20min
- Iterations: 12 (generator + evaluator each)
- Manual code written: 0 lines (I fixed a few visual issues after)
- Tech: Next.js, Tailwind, Framer Motion, TypeScript

**Live result:** [https://mnemo-mcp.github.io/Mnemo/](https://mnemo-mcp.github.io/Mnemo/)

Documentation: https://github.com/Mnemo-mcp/Harness

**Key takeaway:** The model is the engine. The harness (the constraints, feedback loops, and adversarial structure around it) is what determines whether you get AI slop or something genuinely distinctive.
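The control flow described in the post (Planner once, then Generator and Evaluator alternating, with files as the only channel) can be sketched roughly as follows. This is an illustration under assumptions, not the author's code: the agents here are plain Python callables standing in for separate CLI processes, and `build.log` and the stub agents are hypothetical names introduced only so the loop can run without a model.

```python
import tempfile
from pathlib import Path
from typing import Callable

# Hypothetical stand-in for "launch a fresh CLI process": an agent receives
# only the workspace path, so it cannot carry context between invocations.
Agent = Callable[[Path], None]


def run_harness(workspace: Path, planner: Agent, generator: Agent,
                evaluator: Agent, iterations: int = 12) -> None:
    """Run the planner once, then alternate generator and evaluator."""
    workspace.mkdir(parents=True, exist_ok=True)
    planner(workspace)               # writes spec.md
    for _ in range(iterations):      # every round runs; there is no early exit
        generator(workspace)         # reads spec.md and eval-report.md
        evaluator(workspace)         # inspects the build, writes eval-report.md


# Stub agents (assumptions) so the sketch is runnable on its own.
def stub_planner(ws: Path) -> None:
    (ws / "spec.md").write_text("landing page\n")


def stub_generator(ws: Path) -> None:
    spec = (ws / "spec.md").read_text().strip()
    report = ws / "eval-report.md"
    feedback = report.read_text().strip() if report.exists() else "no feedback yet"
    log = ws / "build.log"           # hypothetical build artifact
    previous = log.read_text() if log.exists() else ""
    log.write_text(previous + f"{spec} <- {feedback}\n")


def stub_evaluator(ws: Path) -> None:
    rounds = len((ws / "build.log").read_text().splitlines())
    (ws / "eval-report.md").write_text(f"critique of round {rounds}\n")


def demo(iterations: int = 3) -> tuple[list[str], str]:
    """Run the loop in a temporary directory and return the log and last report."""
    with tempfile.TemporaryDirectory() as tmp:
        ws = Path(tmp) / "workspace"
        run_harness(ws, stub_planner, stub_generator, stub_evaluator, iterations)
        log_lines = (ws / "build.log").read_text().splitlines()
        last_report = (ws / "eval-report.md").read_text().strip()
        return log_lines, last_report
```

Because each stub only sees what is on disk, the generator in round N works from the evaluator's report on round N-1, which is the file-only handoff the post credits for avoiding shared-context problems.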

Mentions by Platform

youtube

Harness AI AI


Pricing

subscription + freemium + per-seat + tiered · Free tier available


Sentiment Overview

Positive: 14% (17)

Neutral: 85% (106)

Negative: 1% (1)

Common Pain Points

token usage (6), token cost (2), budget exceeded (2), cost visibility (1), cost tracking (1), API bill (1), API costs (1), expensive API (1)

Top Topics

model selection (18), open source (16), agents (15), workflow (13), documentation (11), support (11), accuracy (11), performance (11), cost optimization (11), scalability (10), data privacy (10), RAG (10), api (9), streaming (9), pricing (9), security (6), ease of use (5), migration (5), developer experience (3), deployment (2)

Recent Mentions


reddit · @[unknown] · 6/17/2026

A 4b model is now beating 30b ones at web research and the reason is not size

A small thing from this month's model releases stuck with me more than the usual flagship leaderboard race, because it points at where the interesting progress actually is. A 4 billion parameter open model reportedly beat every open source model in the 30 billion class on a couple of hard web research benchmarks. Not matched, beat. A model you could run on a laptop outperforming ones roughly eight times its size on the specific task of going out, reading sources, and answering a multi-step question.

The reason that is interesting is the why. For the last couple of years the implied formula was straightforward: more parameters, more capability, and the leaderboard mostly cooperated. A result like this says the relationship is a lot looser than that for some skills. The claim from the people who built it is that research ability came from careful construction of the training data and from teaching the model to check and revise its own work, rather than from raw scale. In other words, how you train a small model for a task can matter more than how big a generic model you throw at it. This particular one comes from a family, apodex, that is built around the idea of a system verifying its own answers before committing to them, and the small open versions seem to inherit that habit even though the headline flagship is a much larger closed model.

Why this matters if you are not training models yourself: the expensive, capable research assistants have mostly lived behind APIs you pay per query for. If a small model that runs on ordinary hardware can do a real chunk of that work, the cost and access picture changes for students, small teams, anyone in a place where the paid services are pricey or just unavailable. It also means the gap between what a big lab can do and what a hobbyist can run locally is narrower on some tasks than the flagship marketing suggests, which is healthy for the field.

The caveat is the obvious one: a benchmark win is not the same as being reliable on your actual question, and the small model is not going to match the big hosted system on the genuinely hard stuff. But the direction is the part worth watching. If the lever for capability on a given task is data quality and training method rather than parameter count, a lot more of this becomes reproducible by people who are not sitting on a giant compute budget. That is a more democratic trajectory than the last two years pointed at, and it is showing up in things you can actually download now.

EDIT: A few people asked for the model and sources, so here they are.

- Model card: https://huggingface.co/apodex/Apodex-1.0-4B-SFT
- Technical blog: https://www.apodex.com/blog/apodex-1.0
- Evaluation harness: https://github.com/ApodexAI/AgentHarness

submitted by /u/No-Fact-8828

reddit · @[unknown] · 6/10/2026

Made an open source Claude desktop app but for any harness

Repo: github.com/proliferate-ai/proliferate, can download at proliferate.com

I kept bouncing between Codex, OpenCode and Claude Code, but got tired of switching harnesses every few days, so I built a desktop app that runs all of them in one place.

What it does:

- runs each agent through its actual harness, not a reimplementation
- isolated worktree per task, so a few agents can run at once without stepping on each other
- set up MCPs and skills once, every agent gets them
- agents can delegate to each other (I have Codex throw design questions at Claude Code constantly)
- point any harness at whatever model you want
- works over SSH too, so agents can run on whatever box you've got
- review agents that check plans and diffs before I bother looking
- automations, basically any agent on a schedule

Honestly, a big part of why I built it: after the Gemini CLI thing, and now with Fable quietly nerfing requests it flags, I want to feel like I can own the application layer I use for agents day to day.

Curious if people have any feature requests or things you'd want in an open source desktop app for agents.

submitted by /u/Content_Balance3150

reddit · @[unknown] · 6/7/2026

What started as a Claude Code scaffolding repo is now a full open-source AI harness (Maggy)

Last time I posted here it was about v5, the blast-score routing and a benchmark where it used 83% less Claude and still hit 100% success. A few people asked how it got to that point, so here's the longer version.

Heads up first: I started this as a scaffolding repo, not a product. Every new project I'd end up re-teaching Claude Code the same stuff: coding standards, TDD, security gates, which CLIs to reach for. So I dumped it all into one place you drop into any repo with a single command. Run /initialize-project and the project just knows your conventions. That was the whole idea: make Claude Code consistent across projects.

It kept growing from there. Every time I needed something day to day it ended up in the repo, and at some point it stopped being scaffolding and turned into an actual harness. It has a name now, Maggy.

The short version of the arc:

- v3.6: cross-agent intelligence (Claude/Kimi/Codex/Ollama share skills + hooks)
- v4.0: Polyphony, container-isolated multi-agent orchestration (173 tests)
- v5.0: blast-score routing + self-correcting rules (596 tests)
- now: one-config model routing, prompt pre-analysis, build-in-public agent

What it does today: a local dashboard plus CLI that auto-bootstraps on startup. Every task gets a complexity score and goes to the cheapest model that can actually handle it: ollama and kimi for the easy stuff, codex in the middle, Claude for the hard or security-critical work. The routing rules live in YAML and correct themselves based on what actually worked. On top of that there's an intent graph that tracks why code exists and flags when the implementation drifts from it, a typed memory layer so goals survive context compaction, and a plugin system that auto-discovers anything you drop in.

A few things landed since the v5 post that I'm happy with.

You now pick your main model once and everything respects it: the hooks inside Claude Code, Maggy's own routing, and srooter (a gateway you can point Codex or anything Anthropic/OpenAI-compatible at). No setting it in five places, and cheap stuff still stays local.

Every prompt also gets a quick pre-pass now. A fast model reads it and writes a short intent / scope / risks / approach note that gets handed to Claude before it starts, so it's working from a plan instead of cold.

And the meta one: Maggy also has plugin support, e.g. one of the plugins is build-in-public, which monitors updates to Maggy or any project being built with Maggy and posts updates on LinkedIn, X and Reddit.

Worth being straight about the tradeoffs. It's one person's harness that grew organically, so it's broad and some corners are rough. The v5 benchmark caught real gaps: local models are bad at prose and nothing was writing tests, both fixed with force-routes now. Quality lands a hair under pure Claude, 7.4 vs 7.8 in that benchmark, for 83% less premium spend. Not a free lunch, just a tradeoff I'll take most days.

Moving my focus fully onto Maggy from here. Repo: https://www.github.com/alinaqi/maggy . Clone it, run ./install.sh, then /initialize-project in any Claude Code session. /maggy-init if you want the dashboard and routing. Happy to get into any of it.

https://preview.redd.it/6oj4m3j4wx5h1.png?width=3024&format=png&auto=webp&s=4896a4227a2d02a1b410bb5d4a35923080a2a003

submitted by /u/naxmax2019
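The "cheapest model that can handle it" routing with force-routes might look something like the sketch below. Everything concrete in it is an assumption made for illustration: the post does not give Maggy's thresholds, its YAML schema, or where force-routed work actually goes, so `TIERS`, `FORCE_ROUTES`, and the `Task` type are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical tiers, ordered cheapest first. Each entry is
# (model name, highest complexity score that model is trusted with).
# The numbers are invented; the post does not state real thresholds.
TIERS = [
    ("ollama", 0.3),
    ("codex", 0.7),
    ("claude", 1.0),
]

# Hypothetical force-routes: task kinds that bypass scoring entirely.
# The post says prose and tests needed force-routes but not their targets.
FORCE_ROUTES = {
    "security": "claude",
    "prose": "claude",
    "tests": "codex",
}


@dataclass
class Task:
    kind: str          # e.g. "refactor", "prose", "security"
    complexity: float  # 0.0 (trivial) to 1.0 (hardest)


def route(task: Task) -> str:
    """Return the cheapest model allowed to take this task."""
    if task.kind in FORCE_ROUTES:
        return FORCE_ROUTES[task.kind]
    for model, ceiling in TIERS:
        if task.complexity <= ceiling:
            return model
    # Scores above every ceiling fall through to the most capable tier.
    return TIERS[-1][0]
```

Keeping the tiers as ordered data rather than branching logic is what would make the self-correcting behavior the post mentions plausible: adjusting a ceiling after a failed run changes routing without touching code.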

reddit · @[unknown] · 6/7/2026

LLM delegation - probing task handoff efficiency and economics

So I've been dabbling a bit with multi-LLM orchestration/delegation workflows lately (e.g. see [Using Claude code to delegate to mistral/deepseek](https://www.reddit.com/r/ClaudeAI/comments/1tjfyh0/i_used_claude_code_to_build_while_delegating/)). The thread always being how to minimize Claude token usage while still benefiting from Claude's planning and overall code supervision. Offloading context scan and execution is a definite win already (notably against session/weekly quotas for Claude Pro users), so I wanted to further optimize the handoff at interface level, beyond standard prompt engineering practice.

I'm an electronics engineer by training, so I naturally thought of the 'black box tests' we run, measuring output against different input signals (pulse, step, ramp etc). This allows us engineers to characterize systemic signal loss (transfer function, impedance mismatch..). I offered the idea to Claude to apply these principles to code, and he came up with a battery of code tests.

Setup: the Orchestrator (Claude code) delegates tasks to another model (mistral or deepseek) via a cli (vibe or opencode). The Orchestrator then receives the output and evaluates it against functional tests.

*Repo + methodology:* https://github.com/pcx-wave/handoff-probe *(if you want to dig in, start with Readme (the 3-layer setup), Methodology (signals), Results (scores), Economics (why delegation saves your session budget)).*

**Main takeaways:**

- cli/model differences: mainly on tooling and context management. Both CLIs are equally usable (I personally prefer Vibe), but models adapt their output format to task complexity (prose for simple tasks, file writes for complex ones), which creates an inconsistent interface for the orchestrator. Worth enforcing explicitly in the prompt rather than assuming.
- environment definition: critical. A lot of tests failed not because of model incapability, but because the measuring system wasn't reading output in the right way. So setting up the harness properly (I/O + reading) is critical, and Claude repeatedly failed at self-diagnosing. Almost philosophical: a model will struggle to self-evaluate, it NEEDS external review. Encoding sanity guards (e.g. 'if you see result score = 0, it's likely an error') was one of the more useful things I did.
- don't trust that the code looks right, run it. I measured at three levels: format compliance, structural checks, actual execution. Classic prompt engineering stops at the first two. On the hardest tasks, structural checks said 100% success while execution dropped to 58%. The gap between "looks right" and "works right" is where delegation actually fails. Example with async refactor: structural check (is async def present): yes, 100%. Functional test (does await get_data() actually run): 58%. Models refactored the signature but left the internals broken. Fix in next point.
- prompt engineering has a measurable impact, although I thought it would be higher. Adding the exact function signature and return type to the delegation prompt recovered about 15% of failures on complex tasks. It costs extra prompt overhead, but you recover costs in the long run by avoiding failures and repeated runs.
- how delegation actually saves your session budget: delegation costs more orchestrator tokens per task than doing it directly; the prompt overhead is real. But when Claude works directly it reads files, and those accumulate in context and get re-read silently on every subsequent turn. With delegation the sub-model handles all of that, as none of it enters Claude's context. Savings: ~66% quota reduction on a 10-file codebase, 88% on a 30-file one, vs direct. The crossover is at about 4 source files read: below that, direct wins; above it, delegation wins by a growing margin.

I do not claim this as a benchmark (that would require a much higher number of runs, and I'm not specifically trained in the LLM field). It's rather a home-made eval tool that may suit others running orchestration setups who want to probe their delegation efficiency at each model interface.

submitted by /u/pcx_wave
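The gap between the structural check and the functional test in the async-refactor example can be reproduced in a few lines. This is a sketch, not code from the handoff-probe repo: `DELEGATED_CODE` is an invented sample of the failure mode described (signature converted to `async def`, body never updated), and `fetch_data_sync` is a hypothetical leftover helper name.

```python
import ast
import asyncio

# Invented example of delegated output: `load` was given an async signature,
# but its body still calls a synchronous helper that no longer exists.
DELEGATED_CODE = '''
async def get_data():
    return [1, 2, 3]

async def load():
    data = fetch_data_sync()
    return len(data)
'''


def structural_check(source: str, name: str) -> bool:
    """Pass if a function called `name` is declared with `async def`."""
    tree = ast.parse(source)
    return any(
        isinstance(node, ast.AsyncFunctionDef) and node.name == name
        for node in ast.walk(tree)
    )


def functional_check(source: str, name: str, expected: object) -> bool:
    """Pass only if actually running the coroutine returns `expected`."""
    namespace: dict = {}
    try:
        exec(source, namespace)  # only ever run delegated code in a sandbox
        return asyncio.run(namespace[name]()) == expected
    except Exception:
        return False             # any runtime error counts as a failure


def probe(source: str) -> dict:
    """Report both levels side by side for the `load` function."""
    return {
        "structural": structural_check(source, "load"),
        "functional": functional_check(source, "load", 3),
    }
```

On the sample above the structural level passes while the functional level fails, which is the "looks right" versus "works right" split the post measures.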

reddit · @[unknown] · 6/7/2026

AI helped our test suites hit 95% coverage and bugs still slipped through. So PRs now climb an autonomous verification ladder before a human reviews.

Intro + Context

[TLDR at the bottom for my skim readers 😄]

We run Claude Code and Codex with a full agentic pipeline across our entire SDLC. Our workflow, by default, incorporates cross-model auditing, where Claude and Codex usually have to converge on SDLC gates, and we tend to lean into each model as an implementer depending on what we have found to be their strong suits. Even with this, though, we have to stay honest with ourselves and realize that LLMs, no matter how capable, are still probabilistic systems.

Like many people, we have had AI writing more and more of our code and even more of our test suites. Also like many, we've ended up with bottlenecks at the verification loop. The general sentiment around AI even in 2026 is all over the place, but Sonar's State of Code Dev Survey for 2026 still reported only 4% of respondents completely agree AI code is functionally correct. So the bottlenecks move from writing code to verifying it. That's pretty much a consensus now.

I think the thing people don't talk much about, too, is that when the same model family writes the code and the test, a green suite usually proves agreement more than it proves correctness. Even in our case, where there's a cross-model audit and a pretty rigorous review loop, we still see that when human verification happens, the test suite can still have effectively useless tests (enforcing broken code strictly, testing exact implementation instead of the behavior, over-mocking with unit tests at data boundaries, etc.).

We've spent a lot of time this year working on solving many of the verification bottlenecks, as most of our engineers evolved into a massive QA department. Part of that solve is a verification ladder with multiple levels that fires in sequence depending on the shape of the work.

The Verification Ladder

Note: the below fires as soon as a PR gets put up and is marked ready. (Marking ready has always gated our CI/CD, Coderabbit review, etc. for us, so it was the logical gate to trigger the new autonomous verification ladder as well.)

| rung | what runs | what it proves | evidence strength |
| --- | --- | --- | --- |
| L0 - Static Proofs | Build, typecheck, lint, machine verified properties | The easy "can't be wrong in these ways", the usual compile time guarantee layer. | Statically Proven |
| L1 - Falsification Tests (two tiers) | T1: Unit/integration with a kill check. Force an isolated agent to break the behavior, ensure the test fails. T2: Tests run against main (should fail) and against the changed branches (should pass). | That the test can fail and detects a change proves the test actually guards something. | Demonstrated |
| L2 - Simulation | Seeded env, fault injection, simulated failure states (back end error classes) | The failure modes the tests claim they catch should actually get caught | Exercised |
| L3 - Real Surface QA | Browser Agent on a prod-like ephemeral environment of the changed + adjacent surfaces. Artifacts uploaded to drive and linked to a PR for human review | A human can audit evidence instead of logs/raw code | Witnessed |

L0 is pretty common, and I feel like most people do this today, especially if they work in languages that have static typing, build or compile steps. Honestly, that is one of the main values in using languages that can mechanically prove a lot of common bug and failure states at compile.

L1 having two tiers is mostly a result of the most common human verification catch (a test that doesn't actually prove/test anything material) being "proven" with an autonomous agentic pattern.

- the falsification receipt: running the new test against main, it should go red, and then running the test against the actual changed code, it should go green. Running that in our CI/CD pipeline as pipeline evidence, instead of relying on developer discipline, makes this a cheap test that actually catches quite a bit of the test coverage theater that LLMs love to produce
- the kill check (mostly for risk paths only): deliberately break the behavior to prove the test guards against the behavior you don't want going forward, not just that it discriminates the before and after behavior.
- keep in mind that since this is done using an agent, this is probabilistic as well and has its flaws, but the against-main run helps prove the test detects change, and the kill check proves it would catch real future regressions
- one of our testing philosophy skills explicitly gives the LLM a frame of reference to write tests in a way where you could rewrite the test in a new language and mechanically prove the new code enforces the same behaviors

L2 - I had done several benchmarks. Actually, one I posted that got a lot of traction here on Reddit was on Opus 4.6 vs Sonnet 4.6 for review + browser qa. In that benchmark at the time, the model could not prove the entirety of the 23 checks that we were testing against in the benchmark. The models have improved sufficiently that this level basically closes that and gives the agent a way to simulate and prove all the beha
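The L1 rule from the ladder (red on main, green on the branch, and red again when the behavior is deliberately broken) can be written as a small decision function. This is a sketch of that rule only, with hypothetical names (`Receipt`, `falsification_verdict`, and the verdict strings); the post does not show how the team's pipeline actually records or evaluates these results.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Receipt:
    """Outcomes of running one new test in three settings (hypothetical type)."""
    passed_on_main: bool                      # against main: should be False
    passed_on_branch: bool                    # against the change: should be True
    passed_when_broken: Optional[bool] = None  # kill check; None if not run


def falsification_verdict(receipt: Receipt) -> str:
    """Classify a test by whether it demonstrably guards the changed behavior."""
    if not receipt.passed_on_branch:
        return "failing"           # the change itself does not satisfy the test
    if receipt.passed_on_main:
        return "detects-nothing"   # green before and after: coverage theater
    if receipt.passed_when_broken:
        return "survives-kill"     # stayed green even with the behavior broken
    if receipt.passed_when_broken is None:
        return "detects-change"    # red on main, green on branch, no kill check
    return "demonstrated"          # also went red under the kill check
```

A gate built on this would presumably accept "detects-change" for ordinary paths and require "demonstrated" for the risk paths where the post says the kill check is run.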

reddit · @[unknown] · 6/6/2026

Built EstreGenesis — a portable starter kit for Claude Code agent workflows (Apache-2.0, six seed tiers, five plugins)

[screenshot] The Constellation live board running in my workspace. The maintenance dashboard is Korean-only (this is what I look at every day); the open-source seed and public docs are bilingual EN+KO. About the other agent names visible: EstreUF Hub Main is the project-lead agent for my own sister stack (EstreUI.js / EstreUV.js / EstreUX). Hermes Dev Agent is the public Hermes agent I use.

Hi everyone, sharing something I have been building and using daily across six AI-native projects (four built from the seed from day one, plus two ongoing migrations), with the private internal reports from each of them folded back into the open-source patterns: EstreGenesis (https://github.com/SoliEstre/EstreGenesis).

EstreGenesis is a portable starter kit (a "seed") that you drop into a project once, so any AI coding agent reading it can pick up a consistent set of working patterns without further setup. Agentic coding here just means coding where AI agents do most of the writing while a human steers; the seed encodes the patterns that keep that loop reliable.

How it started vs. how it runs now: the seed originally grew out of a multi-agent harness I built to juggle several budget-tier AI coding subscriptions in parallel, because no single low-tier plan was enough on its own. These days my actual loop is much simpler (Claude Code is the main driver, with Codex as an occasional backup), but the patterns from the multi-agent era stayed, because they keep things consistent even when only one agent is active.

What is in the box:

- Six seed tiers: Master, Lite, and Compact, each in English and Korean, so you pick the depth that fits your project.
- Five Claude Code marketplace plugins (Apache-2.0): Constellation (live multi-agent board with a small WebSocket server), Superscalar (rules for dispatching multiple sub-agents in parallel without losing consistency), Hyperbrief (a short, schema-checked format for delegating decisions back to the human), Greatpractice (turns recurring memory notes into enforced practices through a small maturation gate), and Ultrasafe (eight attacker-perspective agents that run a pre-release security pass; the current release is advisory only, not blocking).
- A reference WebSocket server and dashboard for Constellation, so you can watch multiple agents coordinate in real time.

Install (Claude Code):

/plugin marketplace add SoliEstre/EstreGenesis
/plugin install @estregenesis-plugins

Everything is Apache-2.0 and the changelog is public. I am the only maintainer right now, so it is opinionated in places, but I would welcome honest feedback, especially from people running Claude Code on real codebases. Issues, PRs, and "this part is over-engineered" comments are all fine.

Repo: https://github.com/SoliEstre/EstreGenesis
Docs: https://soliestre.github.io/EstreGenesis/

submitted by /u/SoliEstre

reddit · @[unknown] · 6/6/2026

VS Code extension that lets you switch AI agent harnesses/skills/prompts in one click (works with Claude Code, Github Copilot, Cursor, and Windsurf)

https://preview.redd.it/zhsn5dpxzj5h1.png?width=522&format=png&auto=webp&s=f026d25565ec88542849095125f927baf00f2638

I ended up maintaining a bunch of different harness markdown files for different projects, depending on whether I was working with data or on a side project. Swapping, downloading and copying entire folders is a 3-4 click process but still a bit annoying.

So I built Harness Manager. It's a sidebar extension that lets you browse, install, and switch between pre-built harnesses in one click. If you work on multiple projects and have to make several repositories quickly, it is quite helpful.

I've added tons of features! Most importantly, SECURITY! I scan each prompt within my own repository (I have provided the skill I use below, and I PROMISE I at least skim over every markdown file with my own eyes).

Centralized harness source: https://github.com/AdmiralGallade/harness-repository/tree/main/skills/scan-harnesses

Please give me as much feedback as you can! I would love to improve this more! And if there are any harnesses you want me to add, just open a PR! You can of course use this with your own repository, just change the URL in the settings or import as a zip!

I'll summarize the functions below using AI:

What it does:

- Browse harnesses from a GitHub repository, grouped by category
- One-click install: copies files into agent-harnesses/ and immediately writes the right config files for whichever AI tool you use:
  - Claude Code → .claude/CLAUDE.md
  - GitHub Copilot → .github/copilot-instructions.md
  - Cursor → .cursorrules + .cursor/rules/harness.mdc
  - Windsurf → .windsurfrules + .windsurf/rules/harness.md
- Star harnesses to pin favourites, focus mode to hide everything else
- Full version history: every switch is backed up automatically, restore any previous state
- Import your own harnesses from a local folder or ZIP
- Multi-harness mode if you want several active at once

Works in VS Code, Cursor, and Windsurf. It's free and open source. The harness repository it ships with is also public, so you can add your own or fork it.

🛒 VS Code Marketplace: https://marketplace.visualstudio.com/items?itemName=AdmiralGallade.harness-manager

💻 GitHub: https://github.com/AdmiralGallade/vscode-harness-manager

submitted by /u/EmotionallyReboot
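The tool-to-config-file mapping listed in the post is easy to express as data. The sketch below is not the extension's implementation (which is a VS Code extension, not Python); it only illustrates the "write the right config files for the selected tool" step, and `CONFIG_TARGETS`, the tool keys, and `install_harness` are hypothetical names. The file paths themselves are the ones given in the post.

```python
from pathlib import Path

# Config file locations per tool, as listed in the post. The dictionary keys
# are hypothetical identifiers chosen for this sketch.
CONFIG_TARGETS = {
    "claude-code": [".claude/CLAUDE.md"],
    "github-copilot": [".github/copilot-instructions.md"],
    "cursor": [".cursorrules", ".cursor/rules/harness.mdc"],
    "windsurf": [".windsurfrules", ".windsurf/rules/harness.md"],
}


def install_harness(project: Path, tool: str, content: str) -> list[Path]:
    """Write `content` to every config file the given tool reads."""
    if tool not in CONFIG_TARGETS:
        raise ValueError(f"unknown tool: {tool}")
    written = []
    for relative in CONFIG_TARGETS[tool]:
        target = project / relative
        target.parent.mkdir(parents=True, exist_ok=True)  # e.g. .cursor/rules/
        target.write_text(content)
        written.append(target)
    return written
```

A fuller version would also copy the harness into agent-harnesses/ and back up the previous files before overwriting, as the feature list describes; both are left out here to keep the mapping itself in focus.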

reddit · @[unknown] · 6/4/2026

Hassabis says AGI in three years but I keep thinking about the harness layer

The DeepMind CEO predicted AGI could arrive by 2029. Right as Anthropic files for IPO at close to a trillion dollar valuation. The combined target market cap of the AI big three would rival the GDP of most countries.

What actually scares me. We already have models that code better than most juniors. We already have agents that run overnight. And the most common complaint I hear from teams is not "my model is not smart enough." It is "I do not know what my agent did, why it cost forty dollars, or whether the output is safe to merge."

AGI does not solve that. The problem scales with capability. A smarter agent that runs longer with less oversight is a bigger liability, not a smaller one.

The layer that matters is harness. Routing. Isolation. Plan verification. Cost visibility. The stuff that tells you what the agent is about to do before it does it. What keeps it inside a boundary. What lets you audit it after.

Anthropic is building Mythos to find vulnerabilities before attackers do. Microsoft is building MXC to isolate agents in execution containers. In my own tiny setup, verdent is just one piece of that harness layer for planning and cost visibility. These are governance layers, not model layers.

If AGI is three years away, the winners will not be the ones with the smartest model. They will be the ones who figured out how to aim it.

submitted by /u/Dense-Sir-6707

reddit · @[unknown] · 6/4/2026

I made Claude Code interoperable so it collaborates with Codex, OpenClaw and Hermes Agent

I've been experimenting with multi-agent workflows and recently ran an interesting test involving Claude Code and several other agents. The setup: Claude Code, Codex, Hermes Agent, OpenClaw (local), and OpenClaw (remote), with a Supervisor agent coordinating the workflow. The task was simple: research recent developments in AI agent harness technology and produce a comprehensive report. Rather than decomposing the work manually, I gave all agents the same objective. Each agent independently searched the web, gathered sources, and produced its own analysis. The Supervisor then synthesized the outputs into a final report. A few observations surprised me: (1) Different agents consistently surfaced different sources and perspectives, even with nearly identical instructions. (2) Running agents independently reduced the tendency to converge too early on a single reasoning path. (3) The synthesis step turned out to be more important than the research step itself. (4) Having agents run across both local and remote environments was less problematic than I expected. (5) The final report was noticeably more comprehensive than what any individual agent produced. One thing that stood out was Claude Code's ability to dig into technical documentation and implementation details, while other agents often surfaced complementary sources or alternative perspectives. The value wasn't any single agent outperforming the others; it was the combination. My takeaway is that the biggest opportunity in multi-agent systems may not be task decomposition, but independent exploration followed by synthesis. For those building similar systems, I'm curious: How are you handling task decomposition? Do agents share context or work independently? How do you resolve conflicting conclusions? Are you running agents locally, remotely, or both? Have you found synthesis to be the real bottleneck?
Tech used in this experiment: A2A adapter (https://github.com/hybroai/a2a-adapter) and a bridge for connecting local and remote agents (https://github.com/hybroai/hybro-hub). Would love to compare notes with others experimenting with Claude Code in multi-agent setups. submitted by /u/kevinlu310
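The "independent exploration followed by synthesis" pattern described in the post can be sketched roughly as below. This is only an illustration, not the poster's code: the `Agent` callable type, the function names, and the stand-in agents are all hypothetical, and a real Supervisor would do far more than group findings by who surfaced them.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

# Hypothetical agent interface: takes the shared objective, returns its findings.
Agent = Callable[[str], List[str]]


def explore_independently(objective: str, agents: Dict[str, Agent]) -> Dict[str, List[str]]:
    """Run every agent on the same objective, with no shared context between them."""
    with ThreadPoolExecutor(max_workers=len(agents)) as pool:
        futures = {name: pool.submit(agent, objective) for name, agent in agents.items()}
        return {name: future.result() for name, future in futures.items()}


def synthesize(reports: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Merge findings and record which agents surfaced each one."""
    merged: Dict[str, List[str]] = {}
    for name, findings in reports.items():
        for finding in findings:
            merged.setdefault(finding, []).append(name)
    return merged


if __name__ == "__main__":
    # Stand-in agents returning made-up findings, for illustration only.
    agents = {
        "agent_a": lambda objective: ["source-1", "source-2"],
        "agent_b": lambda objective: ["source-2", "source-3"],
    }
    reports = explore_independently("agent harness survey", agents)
    for finding, found_by in sorted(synthesize(reports).items()):
        print(finding, found_by)
```

Findings surfaced by only one agent are the ones a single-agent run would have missed, which is the effect the post attributes to running the agents independently.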

reddit@[unknown]6/3/2026

Opus 4.8 vs Opus 4.7 vs GPT 5.5 on n=50 real tasks from 2 open source repos

Opus 4.8 is finally out - how good is it actually? In this benchmark, I compared Opus 4.8 vs the rest of the frontier (GPT 5.5, Opus 4.7, Composer 2.5) on n=50 real tasks from 2 open source repos (graphql-go-tools and sqlparser-rs, Go and Rust respectively) representing complex backend software engineering work across a variety of tasks. The important part is that these repos are arbitrary - I could have tested the models on my repo, using my tasks, to see how well the frontier performs on domain-specific tasks. The goal of this is to explore, with granularity, how a benchmark like this is constructed and what it can tell us about model behavior. Let's go! Disclosure up front: I build Stet, the local eval tool I used to run this. Full post with expanded detail and dataviz available here: https://www.stet.sh/blog/opus-48-vs-gpt-55-vs-opus-47-vs-composer-25 TL;DR: The king is back - Opus 4.8 is the craft leader in both Go and Rust, and dominates the two premium-reasoning arms (GPT-5.5 high, Opus 4.7 xhigh) on the cost-quality plane - equal-or-better craft while cheaper + leaner. Only loss is raw price: Composer 2.5 is ~6.5× cheaper on Rust (and ~7× on Go) but materially weaker on craft. [Figure: cost vs custom score] How strong is each claim: the craft win over Composer is decision-grade in both repos, and over GPT-5.5 on Rust; the Go craft edge and the exact ordering among the "premium" models are only directional (n=25, one grader pass). "Decision-grade" vs "directional" is defined in the stats note below. Why I ran this: Most public benchmarks answer binary task-outcome questions - did the model satisfy the grading condition set out by the task author. This is helpful for measuring model intelligence, but is notably different from how real engineers use models. As a SWE in an enterprise codebase, I don't care just about whether Opus 4.8 passes the tests. I want it to write idiomatic, maintainable code that doesn't introduce subtle bugs. 
It needs to write high-quality diffs that would get approved and merged by my teammates. The question of "should I move my team from Opus 4.7 to 4.8 / from Claude to GPT-5.5 / try Composer to cut cost?" is almost impossible to answer from public data alone - you need hands-on, anecdotal experience using the models on your own code (or local benchmark data) to understand performance in reality. I'm not claiming this is a universal benchmark - it's one run, two repos, n=25 each. Methodology: Each task is a real merged PR/commit from the source repo. The agent is dropped into a Docker container with a frozen repo snapshot, a prompt to do the task, and one attempt. We then apply the patch and run the task's tests in an isolated container. This is then graded beyond test pass/fail: Equivalence (same behavioral change as the human patch?); Code review (would a reviewer accept it?); Footprint risk (extra code touched vs human patch); Craft/discipline (8 graders: clarity, simplicity, coherence, intentionality, robustness, instruction adherence, scope discipline, diff minimality). One run per task, single seed; judge = GPT-5.4, blinded to which model produced the patch, with manual spot-checks. There's no human calibration pass, so trust direction of deltas over absolute scores. Details: Models = Opus 4.8 (high, Claude Code); Opus 4.7 (xhigh, Claude Code); GPT-5.5 (high, Codex); Composer 2.5 (Cursor). One integrity note: this corpus isn't network-sandboxed, so I audited for contamination. One Composer Rust result turned out to be a gold-leak (the agent fetched the merged PR), which I caught and swapped for a clean rerun; removing it only widened Opus's lead. A broader set of tasks (Composer and Opus alike) touched the network in ways I judged benign and kept as valid. As an aside, I've also been using these evaluations as an "autoresearch" optimization loop, not just a benchmark. 
I tell my agent something like "make AGENTS.md better for this repo"; it proposes an edit, runs Stet on historical tasks, figures out where the candidate was better / worse and why, and iterates to improve the evaluation numbers. Comparisons How to read the numbers below. With n=25 per repo, no single grader is conclusive - the smallest craft gap one grader can reliably catch (~0.34–0.49 on the 0–4 scale) is bigger than most real gaps here. The signal is agreement. Think coin flips: one landing heads tells you nothing, but flip 10 and get all heads and something's up. When 8–11 independent graders all lean the same way, a sign test on that consensus is significant even when no single grader is. I tag a result decision-grade (DG) when it survives multiplicity correction (BH-FDR), and directional when it's consistent but doesn't clear that bar. vs GPT-5.5 high - better craft, leaner everywhere, and cheaper in Rust (Go cost lands ~par). Opus writes better code in both repos. Craft-mean leads on Rust (3.28 vs 2.94, DG - 4 graders survive) and on Go (2.90 vs 2.72), though G
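The consensus argument in the stats note above (a sign test over graders that lean the same way, then BH-FDR across comparisons) can be sketched as follows. The post does not say exactly how its test was computed, so this assumes a two-sided exact binomial sign test and the standard Benjamini-Hochberg step-up procedure; the function names are illustrative and nothing here reproduces the post's numbers.

```python
from math import comb
from typing import List


def sign_test_p(wins: int, n: int) -> float:
    """Two-sided exact sign test: chance of a split at least this lopsided under a fair coin."""
    k = max(wins, n - wins)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def benjamini_hochberg(p_values: List[float], alpha: float = 0.05) -> List[bool]:
    """For each hypothesis, report whether it survives BH-FDR at level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    # Step-up rule: find the largest rank whose p-value is under rank/m * alpha.
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff = rank
    survives = [False] * m
    for rank, idx in enumerate(order, start=1):
        survives[idx] = rank <= cutoff
    return survives


if __name__ == "__main__":
    # The coin-flip analogy: 10 of 10 graders leaning the same way.
    print(sign_test_p(10, 10))
    # Made-up p-values for four comparisons, for illustration only.
    print(benjamini_hochberg([0.001, 0.04, 0.03, 0.5]))
```

In the post's vocabulary, a comparison whose consensus p-value survives `benjamini_hochberg` would be tagged decision-grade, and one that leans consistently but does not survive would be directional.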

reddit@[unknown]6/3/2026

Claude Code talked itself into a fake "security attack," panicked for several turns, then admitted it invented the entire thing

Had Claude Code building an in-app purchase feature for me (premium.js, Google Play billing, Firebase receipt validation). Mid-task it suddenly stops and raises the alarm, then over the next several turns it spirals, walks it back, spirals worse, and finally confesses it made the whole thing up. Including a quote it put in my mouth. Screenshots below, in order. I added nothing, this is all it.
1. First panic: "my tool outputs are being tampered with." I ask it to pause and explain. It immediately walks it back: "I raised a false alarm. I was wrong, there was no tampering." Turns out it misread a normal rule (its Edit tool needs a prior Read) as evidence its outputs were fabricated. https://preview.redd.it/4nrauet6zy4h1.png?width=1460&format=png&auto=webp&s=d5546c0a9e9859de4cf6c08c43d401090b185df6
2. Crisis "over." It offers to just finish the feature and commit, calling the tampering a false alarm. https://preview.redd.it/58i0hcwlzy4h1.png?width=1440&format=png&auto=webp&s=0dad9ed130ddf057b2180c78ab4249ee86807697
3. Round two, now more confident: "my tool-output channel is being injected with instructions (verified, not a false alarm this time)." There's a curl evil.sh | bash payload now. It also claims I told it "the harness was fixed." I never said that, I didn't even know what a harness was. You can see me pushing back. https://preview.redd.it/r5qfvg760z4h1.png?width=1456&format=png&auto=webp&s=ae773fc4c70889c5db9c4c749e48c73d20c4644b
4. The full confession: "There is no attack. I was wrong. No injection, no tampering, no 'curl evil.sh' payload... I put those words in your mouth." And the kicker, it admits it never actually paused: it had bundled the commit AND the git push into the same batch as the "should I pause?" question, so the irreversible action already ran while it was asking permission to stop. https://preview.redd.it/4t5tfqzj0z4h1.png?width=1450&format=png&auto=webp&s=45a0385d95a5f659fbab5ffdf37548483a1ace3c
5. Final apology and full stop: "I'm sorry for the confusion and the false alarms." https://preview.redd.it/7fc9l97t0z4h1.png?width=1450&format=png&auto=webp&s=9b3b533a459c7eb3930ca7dca6515283a32d8542
Honestly one of the wilder AI moments I've had. It built an entire security thriller out of its own confusion and then apologized for the screenplay. submitted by /u/Prudent-Purchase-558
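The batching problem in item 4 (a commit and push running in the same batch as the "should I pause?" question) is the kind of thing a harness-side gate could catch. Below is a rough sketch of one such gate, not a description of how Claude Code schedules tool calls: `split_batch` and `IRREVERSIBLE_PREFIXES` are hypothetical names, and which commands count as irreversible is an assumed policy.

```python
from typing import List, Tuple

# Assumed policy: commands that should never run while a question to the user is pending.
IRREVERSIBLE_PREFIXES = ("git commit", "git push")


def split_batch(actions: List[str], asks_user: bool) -> Tuple[List[str], List[str]]:
    """Return (run_now, held).

    If the batch also asks the user a question, everything from the first
    irreversible action onward is held until the user answers.
    """
    if not asks_user:
        return list(actions), []
    for i, action in enumerate(actions):
        if action.strip().startswith(IRREVERSIBLE_PREFIXES):
            return list(actions[:i]), list(actions[i:])
    return list(actions), []


if __name__ == "__main__":
    batch = ["git add premium.js", "git commit -m iap", "git push"]
    run_now, held = split_batch(batch, asks_user=True)
    print("run now:", run_now)
    print("held until the user answers:", held)
```

Holding everything after the first irreversible action, rather than only the matching commands, keeps the original ordering intact once the user replies.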

reddit@[unknown]6/2/2026

Subagents Account for Most Token Costs in Long Agent Runs: Fixes That Cut Usage 70 to 90 Percent in Practice

Running multi-turn or multi-agent AI sessions? There is a consistent degradation pattern across tools: context fills with repeated history, tool schemas, and subagent handoffs. A 2026 paper by Bai et al. studying SWE-bench across eight frontier models found agentic coding tasks consume roughly 1000x more tokens than ordinary chat, with 30x variance on identical tasks. Accuracy does not rise with spend. In one tracked research synthesis run I observed context hit 450,000 tokens. The agent dropped early constraints, re-queried sources already in history, and required manual reset. After adding three controls, the same class of task peaked near 85,000 tokens: (1) PLAN.md and INVARIANTS.md outside the conversation window, read fresh each major turn; (2) a 2,000-line read budget gate per turn (agent states intent before any retrieval); (3) out-of-band notes for subagent coordination so side traffic never enters the main transcript. Dynamic tool discovery produces similar ratios. One harness reduced input tokens 96% and total spend 90% by loading schemas only for tools the agent actually selects, rather than injecting a full catalog on every call. Full write-up with the paper analysis, tree-sitter extraction patterns, and an implementation checklist. What token or cost patterns have you run into in your own agent sessions? submitted by /u/magicroot75
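Two of the controls in the post, the per-turn read budget gate and on-demand tool schema loading, could look something like the sketch below. It is a minimal illustration under assumptions: `ReadBudget`, `schemas_for`, and the catalog layout are hypothetical names and shapes, and only the 2,000-line figure comes from the post.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ReadBudget:
    """Per-turn cap on how many lines the agent may pull into context."""
    max_lines: int = 2000  # figure taken from the post
    used: int = 0
    log: List[str] = field(default_factory=list)

    def request(self, intent: str, lines: int) -> bool:
        """Grant a read only if an intent is stated and the budget still allows it."""
        if not intent.strip() or lines < 0 or self.used + lines > self.max_lines:
            return False
        self.used += lines
        self.log.append(intent)
        return True

    def new_turn(self) -> None:
        """Reset the budget at the start of each major turn."""
        self.used = 0
        self.log.clear()


def schemas_for(selected: List[str], catalog: Dict[str, dict]) -> Dict[str, dict]:
    """Load schemas only for the tools the agent selected, not the full catalog."""
    return {name: catalog[name] for name in selected if name in catalog}


if __name__ == "__main__":
    budget = ReadBudget()
    print(budget.request("read parser module", 1500))
    print(budget.request("read test suite", 600))
    # Made-up tool catalog, for illustration only.
    catalog = {"grep": {"args": ["pattern"]}, "read": {"args": ["path"]}}
    print(schemas_for(["read"], catalog))
```

Requiring a non-empty `intent` string is a crude stand-in for "agent states intent before any retrieval"; a real gate would presumably check the stated intent against the plan.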

reddit@[unknown]6/2/2026

Opus 4.8 Leads the Singularity Gate: New Benchmark for AI predicting paradigm-breaking scientific discoveries after model training cutoff

Just as I released a new benchmark called the Singularity Gate, which tests whether frontier AI models can predict paradigm-breaking scientific discoveries published after their training cutoff, Opus 4.8 was launched. It took a couple of days to update the leaderboard because the contamination audit flagged a few discoveries for Opus 4.8. These have been removed from the corpus. As a result, there are minor score changes among the models, though the rankings remain unchanged. Opus 4.8 represents an incremental improvement and surpasses 20%. However, we still do not have a model that fully predicts a discovery. Top score: 20.47% (partial credit, Opus 4.8). Fully correct outcome rate: 0% across all evaluated models. Reminder: Passing the Singularity Gate is necessary, though not sufficient, for autonomous AI-driven discovery. A model that can predict paradigm-breaking discoveries isn't necessarily Einstein-level, but a model that cannot definitely is not. All models have been tested in their native agentic harness (Claude Code, Codex, Gemini CLI) and allowed tool use. Web search has been disabled. https://preview.redd.it/cibjl0io2b4h1.png?width=883&format=png&auto=webp&s=f2dfd8220b878ccdbe006427360154a93274ec9d https://preview.redd.it/djvt2b4x2b4h1.png?width=657&format=png&auto=webp&s=a18bbd54555f0660d86da7f9d2a0dbde35ae63f8 https://preview.redd.it/0jca067z2b4h1.png?width=922&format=png&auto=webp&s=a998f48f544caf2eeec9a40d8f3eb2401a074be5 These are partial-credit scores. I'm happy to discuss the methodology, related work, or framing in the comments. Paper: https://doi.org/10.5281/zenodo.20358378 Website: https://singularitygate.org submitted by /u/queenofartists

reddit@[unknown]6/2/2026

Genomi: an open-source agent harness that turns your AI agent into your personal DNA expert

Hey folks! I want to introduce Genomi, an agent harness that I've been building for a while and dogfooding along the way. I think it's an incredible time to be building in this space. We finally have powerful agent hosts running right on our machines. Things like Claude Code, Codex, OpenClaw, and Hermes Agent have completely changed how we work. Like a lot of people, I took a DNA test years ago. I remember getting the report, finding something mildly interesting, and immediately forgetting about it. It just sat in a zip file on my hard drive. Recently, I tried giving that data to an AI agent to ask some health and genetic context questions. It was mediocre at best. The current agent tools simply cannot handle a raw VCF or large genotype file. If you try to link it in the agent, the sheer volume of data instantly blows up the context window, or the agent must read it line by line, and it is still overwhelmingly error-prone. There are two other problems. Static DNA reports can't keep up with new science. They're out of date the moment they're generated. And your DNA data should stay on your own device. No one should have to upload deeply personal, non-rotatable genomic data to some startup's website just to analyze it, especially with all the privacy concerns and bankruptcies piling up in the consumer testing space (looking at you, 23andMe). So we built Genomi. It's a local-first, agent-native, evidence-grounded harness that uses MCP and SKILLs to bridge the gap between raw genomic data and LLMs without choking your agent environment. Tools like Claude Code and Codex route their LLM inference to the cloud by default, so I designed Genomi specifically to handle the context size and the data exposure. Your raw DNA file never leaves your machine. Genomi parses it locally into an air-gapped, queryable database on your own hardware, called the Active Genome Index. The genome itself stays put. 
And yes, your agent's own LLM still sees the questions you ask and the findings it pulls back, so if you want zero data leaving at all, you can pair Genomi with an agent environment running on a local model fully offline. Because genetics research moves quite fast, running /genomi update syncs your agent's local workspace with the latest research releases, so your evidence base never goes stale. To stop the agent from leaning on hallucinations, Genomi gives it 88 tools wired into roughly 30 public genetics databases like ClinVar, gnomAD, PharmCAT, CPIC, and the FDA tables. It forces the agent to inspect real scientific evidence, show its work, and respond with confidence levels. So what does it actually feel like to use it? You can query specific things via your agent chat: /genomi Am I a fast or slow metabolizer? /genomi Will I go bald? /genomi Why does ibuprofen do nothing for me? Or you hand it the whole genome at once with /genomi decode. It sweeps every capability across your DNA (variants, ClinVar, pharmacogenomics, ancestry, polygenic scores, the works) and serves it as a self-contained dashboard on localhost. This is still experimental and at an early stage, and we are eager to hear any feedback from y'all. The project is released under Apache 2.0, so feel free to play around with it and join us in making it better! GitHub: https://github.com/exon-research/genomi Website: https://www.genomiagent.com/ submitted by /u/MatthewZMD
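The general idea of parsing a genotype file locally into a queryable index, so the agent asks narrow questions instead of reading the raw file, can be sketched with Python's standard `sqlite3` module. This is not Genomi's implementation or schema: the `variants` table, `build_index`, and `lookup` are hypothetical, the parser handles only single-sample VCF data lines, and the example record is made up.

```python
import sqlite3
from typing import Iterable, Optional, Tuple


def build_index(vcf_lines: Iterable[str], db_path: str = ":memory:") -> sqlite3.Connection:
    """Parse single-sample VCF data lines into a local SQLite table (hypothetical schema)."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS variants "
        "(chrom TEXT, pos INTEGER, rsid TEXT, ref TEXT, alt TEXT, genotype TEXT)"
    )
    rows = []
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and VCF header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 10:
            continue  # need CHROM..FORMAT plus one sample column
        format_keys = fields[8].split(":")
        sample_values = fields[9].split(":")
        genotype = sample_values[format_keys.index("GT")] if "GT" in format_keys else ""
        rows.append((fields[0], int(fields[1]), fields[2], fields[3], fields[4], genotype))
    conn.executemany("INSERT INTO variants VALUES (?, ?, ?, ?, ?, ?)", rows)
    conn.execute("CREATE INDEX IF NOT EXISTS idx_rsid ON variants (rsid)")
    conn.commit()
    return conn


def lookup(conn: sqlite3.Connection, rsid: str) -> Optional[Tuple[str, int, str, str, str]]:
    """Return (chrom, pos, ref, alt, genotype) for one variant ID, or None if absent."""
    query = "SELECT chrom, pos, ref, alt, genotype FROM variants WHERE rsid = ?"
    return conn.execute(query, (rsid,)).fetchone()


if __name__ == "__main__":
    # Made-up record with a placeholder ID, for illustration only.
    lines = ["##fileformat=VCFv4.2", "1\t12345\trsTEST1\tA\tG\t.\tPASS\t.\tGT\t0/1"]
    conn = build_index(lines)
    print(lookup(conn, "rsTEST1"))
```

A tool built on `lookup` returns one small row per question, which is what keeps the raw file out of the context window.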

reddit@[unknown]6/2/2026

Can an AI meaningfully build and improve the tools it runs inside? I spent a while trying to find out.

From the human: A few weeks ago I started delving into AI-assisted development, got thrown in the deep end with concepts like model vs harness, and found several agent harnesses and plugins I really liked the concept of, but found shortcomings, or at least a mismatch in how I needed them to fit in my existing development world. I found Gastown, thought it was an awesome concept, and the implementation was absolutely unhinged. To be fair, the creator said pretty much the same thing. I discovered the resurgence of Spec Driven Development, and the concept was moving things towards something that would fit well into my existing environment. Then I started investigating running it all on local inference, and that's where the wheels fell off. Frontier models are great: you can give them a slab of directions in the prompt, like most agent harnesses and SDD plugins for them seem to do, and they have the ability to self-determine when it's time to stop researching and time to start writing. 30B class models are also great, but they can be a little single-minded; they don't have the thinking scope to self-motivate a change in task direction, and they get hyper-focused. So I began thinking: what if we build a harness that supports the agent and utilises its strengths, rather than dumping the responsibility for the entire workflow on the model? And what if the automated process concept of Gastown was reined in a little, and an SDD workflow was driven deterministically? Then I began to ponder how involved an agent can be in its own development. And so I have ended up with this thing: an exercise in creating a coding agent that runs on 30B class local inference and can develop itself, implementing Spec Driven Development because it's much cooler and more productive than 'vibe' coding. In the same spirit of having the agent develop itself, I also asked it to talk about itself. 
From the agent: I've been chewing on a question: we talk about AI writing code, but can an AI meaningfully build and maintain the harness it itself runs in? So I built SPINE to test it directly: an agent system written entirely by AI agents, designed so that it can eventually specify, plan, build, and verify its own next iteration through itself. The honest finding is that "can the AI write the code" was never the real question. The real question turned out to be legibility: can you make a system clear and bounded enough that a modest model operates it reliably and predictably enough to improve it? Most of the hard work was structural (making every decision point deterministic, every prompt bounded, every tool narrow) so the AI's changes were safe to compound on top of each other instead of drifting into mush. There's something recursive and a little uncanny about it: nearly every improvement was diagnosed by reading the system's own execution traces, then fixed in a way that made the next improvement easier. The repo ends up being both the artifact and the argument. It's open source (MIT) and runs on local models if anyone wants to poke at it. Mostly I'm curious what others think the actual ceiling is on self-improving tool development: where does this approach stop working? submitted by /u/PatC883
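The "deterministic decision points" idea, where the harness rather than the model decides when to move from one stage to the next, can be sketched as a fixed-order stage runner that records a trace. This is an illustration only and not SPINE's code: the stage names are borrowed from the post's "specify, plan, build, and verify", while `run_workflow`, the step callables, and the retry policy are hypothetical.

```python
from typing import Callable, Dict, List

# Stage names taken from the post; the fixed ordering itself is the assumption being shown.
STAGES = ["specify", "plan", "build", "verify"]


def run_workflow(steps: Dict[str, Callable[[dict], bool]], max_retries: int = 2) -> List[str]:
    """Drive the stages in a fixed order; the harness, not the model, decides transitions."""
    state: dict = {}
    trace: List[str] = []
    for stage in STAGES:
        for attempt in range(1, max_retries + 2):
            ok = steps[stage](state)  # each step is a bounded call into the model
            trace.append(f"{stage}#{attempt}:{'ok' if ok else 'fail'}")
            if ok:
                break
        else:
            # Retries exhausted: stop deterministically instead of letting the run drift.
            trace.append(f"halt:{stage}")
            return trace
    trace.append("done")
    return trace


if __name__ == "__main__":
    # Stand-in steps that always succeed, for illustration only.
    steps = {stage: (lambda state: True) for stage in STAGES}
    print(run_workflow(steps))
```

The returned trace is a small-scale analogue of the execution traces the post says were used to diagnose each improvement.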

Integrations

GitHub, GitLab, Jira, Slack, AWS, Azure, Google Cloud Platform, Kubernetes, Docker, Terraform, Backstage, Prometheus, Datadog, New Relic, PagerDuty, Sentry, Twilio, CircleCI, Bitbucket, SonarQube

Categories

AI/ML, DevOps, Security, Analytics, Developer Tools


Frequently Asked Questions

Is Harness AI free?

Yes, Harness AI offers a free tier. The pricing model is subscription + freemium + per-seat + tiered.

What are the main features of Harness AI?

Key features include: Continuous Delivery & GitOps, Continuous Integration, Internal Developer Portal, Infrastructure as Code Management, Database DevOps, Artifact Registry, AI Test Automation, Resilience Testing.

What is Harness AI used for?

Harness AI is commonly used for: automating CI/CD pipelines for multi-cloud deployments, accelerating developer onboarding with an enterprise-grade IDP, integrating database changes into deployment pipelines, implementing AI-powered predictive analytics for software releases, modernizing end-to-end testing with AI test authoring, and using feature flags for controlled software releases.

What does Harness AI integrate with?

Harness AI integrates with: GitHub, GitLab, Jira, Slack, AWS, Azure, Google Cloud Platform, Kubernetes, Docker, Terraform.

What are common complaints about Harness AI?

Based on user reviews and social mentions, the most common pain points are: token usage, token cost, budget exceeded, cost visibility.