Harness is a unified, end-to-end AI software delivery platform to manage the SDLC using purpose-built AI agents.
"Harness AI" appears to be well-regarded as a tool tailored for enhancing AI coding sessions, particularly with its focus on optimizing processes and managing autonomous AI agents effectively. Its main strengths include facilitating automation and providing a reliable framework for running and managing AI models, which resonates well with programming enthusiasts. Complaints seem to revolve around issues with certain AI tools not integrating seamlessly or creating unexpected results, causing occasional disruptions in expected workflows. While explicit pricing sentiments are not clearly discussed, the overall reputation seems positive, with a general appreciation for its open-source capabilities and innovation in handling AI tasks.
Mentions (30d)
41
8 this week
Reviews
0
Platforms
2
Sentiment
19%
17 positive
Industry
information technology & services
Employees
1,700
Funding Stage
Series E
Total Funding
$802.1M
Reviving PapersWithCode (by Hugging Face) [P]
Hi, Niels here from the open-source team at Hugging Face. Like many others, I was a huge fan of paperswithcode. Sadly, that website is no longer maintained after its acquisition by Meta. Hence, I've been working on reviving it.

I use AI agents to parse papers at scale and automatically generate leaderboards (for now I'm the one verifying results). So far, I've only parsed high-impact papers that I know are SOTA, like Qwen 3.5 and 3.6, RF-DETR for object detection, DINOv3, SOTA embedding models from the MTEB leaderboard, the Open ASR Leaderboard for automatic speech recognition models, etc.

For now, it includes the following:
- trending papers by default, based on GitHub star velocity
- categorization by domain, e.g., OCR
- methods, which PwC used to have, e.g., RLVR
- eval results for high-impact papers, see e.g., Qwen 3.5 at the bottom
- leaderboards for each domain, e.g., MMTEB or COCO val 2017
- support for citation counts (you can also see the most cited papers by domain!)
- automated linked GitHub, project page URLs, and artifacts (+ multiple repos are supported on a paper page)
- support for external papers beyond arXiv, see e.g., DeepSeek v4
- harness reports for coding agent benchmarks, e.g., Terminal Bench

"Sign in with HF" and Storage Buckets are used to store thumbnails, paper PDFs, and overall data backups.

I'm curious about your feedback + feature requests! Try it at paperswithcode.co
https://preview.redd.it/whwji560fw1h1.png?width=3452&format=png&auto=webp&s=55bb7a30c1be58d140f7efcb07a31c6dac5693c7

See e.g. the SOTA leaderboard for Terminal Bench 2.0:
https://preview.redd.it/98w9pi89fw1h1.png?width=3456&format=png&auto=webp&s=408fb64b0ba85ba24f55daa81d547d7c68e73951

A paper page looks like this: https://paperswithcode.co/paper/2602.15763
https://preview.redd.it/fiizit6dfw1h1.png?width=3450&format=png&auto=webp&s=9ea05a77ca5583a2fb395dccc95ba52c433362c5

submitted by /u/NielsRogge
How do you "level up" your claude to harness creation?
Hey guys, I'm an avid user of claude code for personal projects, both in planning and execution. As a life-long hobbyist programmer, I find it's great at filling in my technical gaps. Recently, I realized there's a lot of potential within my professional career (automation/process engineering) to help with design->execution, so I put claude to the test and was really surprised by its ability to perform my job. I made a cool workflow demo and pitched it to my boss, who is now on board. Now I'm looking to bring this in as a full project, but I'm really floundering on how you ship a true AI harness here. I know I'll need obelisk to capture my job elements, I know I'll want to create validation tools, and I'm assuming I'll want separate agents for all of these, but I'm really struggling to understand how people "package" these and have them live outside of a claude github repo like I've done for all of my personal stuff. I'm likely not the programmer here, but I need to know enough to drum up a project. Are there any actual tutorials on a full agentic pipeline here? I've watched lots of videos talking about the subject but none that really touch on what the heck it is you're truly putting together. submitted by /u/akerson
I replicated Anthropic's Generator-Evaluator harness to build a website through 12 adversarial AI iterations - here's the result and what I learned
Anthropic recently published their harness design for long-running apps, a multi-agent architecture inspired by GANs where a Generator builds code and an Evaluator critiques it in a loop. I built my own version using Kiro CLI and used it to generate a marketing website for my project Mnemo (persistent memory for AI coding agents).

The architecture: Planner (runs once) → Generator ↔ Evaluator (12 iterations)

Each agent is a separate CLI process with zero shared context. They communicate only through files (spec.md, eval-report.md). The Evaluator uses Playwright to actually browse the live site, not just read the code.

What made it work:
- Clean slate per invocation: each agent starts fresh and reads only its input files. Prevents context anxiety.
- Playwright MCP for testing: the evaluator navigates, clicks, resizes viewports. Catches visual bugs code review never would.
- Anthropic's frontend design skill: explicitly penalizes generic AI patterns (Inter font, purple gradients, card layouts). Forces creative risk-taking.
- Continuous iteration, not retry-on-failure: all 12 rounds run regardless. Each one improves.

The progression was wild:
- Iteration 1: Exactly what you'd expect from AI, functional but forgettable
- Iteration 4: Generator pivoted to "Terminal Noir": IBM Plex Mono, amber on black, grain textures, scanlines. This is the kind of creative leap that doesn't happen in single-shot generation.
- Iterations 5-12: Polish, accessibility, responsive fixes, reduced-motion support

Stats:
- Total time: 3h 20min
- Iterations: 12 (generator + evaluator each)
- Manual code written: 0 lines (I fixed a few visual issues after)
- Tech: Next.js, Tailwind, Framer Motion, TypeScript

Live result: https://mnemo-mcp.github.io/Mnemo/
Documentation: https://github.com/Mnemo-mcp/Harness

Key takeaway: The model is the engine. The harness (the constraints, feedback loops, and adversarial structure around it) is what determines whether you get AI slop or something genuinely distinctive.

submitted by /u/killerexelon
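The Planner → Generator ↔ Evaluator loop described in this post can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the author's harness: the post does not show how Kiro CLI is invoked, so each agent is modelled as a plain Python callable that sees only the working directory, and the stub agents, `run_harness`, and `site.html` are hypothetical names. Only `spec.md`, `eval-report.md`, and the fixed 12 rounds come from the post.

```python
import tempfile
from pathlib import Path
from typing import Callable

# An agent is modelled as a function that sees only the workspace directory,
# mirroring "zero shared context": anything it knows must come from files.
Agent = Callable[[Path], None]


def run_harness(workdir: Path, planner: Agent, generator: Agent,
                evaluator: Agent, iterations: int = 12) -> int:
    """Run the planner once, then generator and evaluator for a fixed number of rounds."""
    planner(workdir)  # expected to write spec.md
    if not (workdir / "spec.md").exists():
        raise RuntimeError("planner did not produce spec.md")
    rounds = 0
    for _ in range(iterations):
        generator(workdir)  # reads spec.md and, if present, eval-report.md
        evaluator(workdir)  # overwrites eval-report.md with fresh feedback
        rounds += 1         # continuous iteration: no early exit on success
    return rounds


# Hypothetical stub agents so the sketch runs without any model or CLI.
def stub_planner(workdir: Path) -> None:
    (workdir / "spec.md").write_text("Build a landing page.\n")


def stub_generator(workdir: Path) -> None:
    report = workdir / "eval-report.md"
    feedback = report.read_text().strip() if report.exists() else "none"
    out = workdir / "site.html"
    previous = out.read_text() if out.exists() else ""
    out.write_text(previous + f"<!-- revision after feedback: {feedback} -->\n")


def stub_evaluator(workdir: Path) -> None:
    revisions = (workdir / "site.html").read_text().count("revision")
    (workdir / "eval-report.md").write_text(f"round {revisions}: needs polish\n")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(run_harness(Path(tmp), stub_planner, stub_generator, stub_evaluator))
```

In a real setup each callable would presumably launch a separate CLI process, and the evaluator would drive a browser instead of counting strings; the file-only handoff is the part this sketch tries to capture.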
GPT 5.5 (Codex) leading the future prediction race
Researchers from the Max Planck Institute recently released FutureSim, an environment in which agents are replayed a temporal slice of the web and are tasked with predicting real-world future events. In their environment, GPT 5.5 leads at 25% accuracy, followed by Opus 4.6 at 20%. Open-weight frontier models have a significant gap to close, with DeepSeek V4 pro at 13%, GLM 5.1 at 10%, and Qwen3.6 Plus at 5%. They say they evaluate with native harnesses (Codex, CC, etc). On some questions that have a parallel r/Polymarket market, GPT 5.5 in their simulation sometimes beats the crowd aggregate, like in the Super Bowl LX ($704M traded) market, which I think is pretty promising (and surprising). OpenAI really cooked with GPT 5.5 (and Codex) this time! Wonder how the trading market could evolve as models keep improving. submitted by /u/viciousA3gis
The Frontier-Only Narrative Is a Financing Story, Not an Architecture Story
The frontier-only narrative is an artifact of how AI infrastructure is being financed, not how production systems are being built.

The setup. Q1 2026 disclosed $112B in hyperscaler capex in a single quarter, $650–725B in 2026 guidance, and Alphabet's 100-year bond, the first by a tech company since Motorola in 1997 (see a0109). The story that underwrites that paper is: every query needs a bigger model.

The architecture says the opposite. Microsoft's Phi-4 (14B parameters) exceeds its teacher GPT-4o on graduate STEM and competition math. Phi-4-reasoning is competitive with DeepSeek-R1 at roughly one-forty-eighth the parameter count. Claude Haiku 4.5 is positioned by Anthropic and AWS for "economically viable agent experiences." None of this is a benchmark teaser; it is the production toolkit, available today.

Routing is the missing component. RouteLLM (UC Berkeley, Anyscale) demonstrated over 2x cost reduction without sacrificing response quality. AWS Bedrock Intelligent Prompt Routing (generally available, official, supported) claims up to 30% cost reduction within a single model family without compromising accuracy. The Flagship Tax (see a0085) didn't just die; it left a vacancy at the architecture layer.

The bookkeeping nobody wants to do. Operator audits suggest 40–60% of token budgets in production LLM applications are waste, dominated by default-to-frontier routing. Roughly 37% of enterprises with production AI workloads run five or more models in their stack. The rest are still defaulting to one.

Why the story isn't being told. Hundred-year bonds don't pencil out on "use less compute per query." They pencil out on "every query needs a bigger model." The opacity in the harness (see a0107) is the symptom; the underwriting is the disease.

What you do Monday morning.
- Treat model selection as a dependency-graph decision, not a vendor decision.
- Add a complexity classifier.
- Default to small.
- Cascade up when verification fails.
- Instrument model-mix as a first-class production metric.

Bottom line. You are not behind because you have not bought the biggest model. You are behind because you have not built the router.

submitted by /u/gastao_s_s
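The "Monday morning" steps (classify complexity, default to small, cascade up when verification fails, track model mix) could look something like the sketch below. Everything here is an assumption made for illustration: the `Tier` type, the keyword-based `classify_complexity`, the relative costs, and the `verify` callback are hypothetical and are not taken from RouteLLM or Bedrock Intelligent Prompt Routing.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple


@dataclass
class Tier:
    name: str
    cost_per_call: float        # hypothetical relative cost, not a real price
    call: Callable[[str], str]  # stand-in for an actual model call


def classify_complexity(prompt: str) -> int:
    """Toy classifier: index of the tier to start at (0 = smallest).

    A real system would likely use a trained classifier; keyword matching
    is only a placeholder so the sketch stays runnable.
    """
    hard_markers = ("prove", "refactor", "multi-step")
    return 1 if any(marker in prompt.lower() for marker in hard_markers) else 0


def route(prompt: str, tiers: Sequence[Tier],
          verify: Callable[[str, str], bool],
          mix: Optional[Counter] = None) -> Tuple[str, str, float]:
    """Default to the smallest suitable tier and cascade up while verification fails."""
    if not tiers:
        raise ValueError("at least one tier is required")
    start = min(classify_complexity(prompt), len(tiers) - 1)
    spent = 0.0
    answer = ""
    for tier in tiers[start:]:
        answer = tier.call(prompt)
        spent += tier.cost_per_call
        if verify(prompt, answer):
            break  # good enough: stop escalating
    if mix is not None:
        mix[tier.name] += 1  # model mix: which tier produced the final answer
    return tier.name, answer, spent


if __name__ == "__main__":
    small = Tier("small", 1.0, lambda p: "4" if p == "2+2" else "unsure")
    large = Tier("large", 10.0, lambda p: "answer")
    mix = Counter()
    print(route("2+2", [small, large], lambda p, a: a != "unsure", mix))
    print(mix)
```

Note that a failed small-model attempt still costs something, so the cascade only pays off when the small tier passes verification often enough; the `mix` counter is one way to watch that ratio.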
Built a structured workflow layer on top of Claude Code - looking for active contributors
I've been building claude-code-harness (github.com/anudeeps28/claude-code-harness) over the past few months - it's an open-source framework that brings structure and reliability to Claude Code workflows.

What it includes:
- 16 slash command skills
- 14 sub-agents with deliberate model routing (right model for the right task)
- Node.js hooks for lifecycle control
- Tracker adapters for Azure DevOps and GitHub
- Human gates at every critical phase - the core philosophy is that AI should amplify your judgment, not replace it

I use this daily in my job as an AI Engineer, and it's become the backbone of how I build and ship AI systems.

What I'm looking for: Contributors who care about this problem space - building AI systems that are structured, auditable, and human-in-the-loop. Not just people who want to merge PRs, but people who have opinions about how Claude Code workflows should work. If you've been using Claude Code heavily and have ideas, pain points, or want to contribute skills/subagents - I'd love to connect. Drop a comment or open an issue on the repo. Happy to answer questions about the architecture too.

submitted by /u/lofty_smiles
Anthropic built the agentic features. Now they're billing them separately.
Starting June 15, Claude subscribers get a separate monthly credit for Agent SDK and claude -p usage: $200/mo for Max 20x, $100 for Max 5x, $20 for Pro. Once you burn through it, programmatic usage stops unless you've opted into extra usage billing at API rates. Your interactive Claude Code and chat usage stays on the subscription pool, untouched.

I spent the last day digging into the community reaction across Reddit, GitHub, HN, and tech press. Tracked roughly 120 distinct opinions. Here's what I found.

The sentiment split
- About 60% negative (credit is too small, feels like a value regression)
- About 25% pragmatic ("this was inevitable, the old model was broken")
- About 15% neutral to supportive ("interactive use is untouched, this is fair")

Theo Browne (T3.gg) put it bluntly: anyone using T3 Code, Conductor, Zed, or claude -p in CI scripts had their effective usage cut by 25x. He said he now has to make the Claude Code experience on T3 Code "significantly worse." Ben Hylak (co-founder of Raindrop.ai) responded: "This is either really silly, or shows how bad of a spot Anthropic is in re: GPUs." Theo also said: "Framing this as a free credit instead of a regression for users is wild." That tracks with what I'm seeing across the threads.

The telco parallel

This follows the exact playbook telcos used with "unlimited" data plans.
- Sell unlimited.
- Watch users actually use it.
- Introduce a Fair Usage Policy that throttles heavy users.
- Continue marketing the plan as unlimited.

Anthropic marketed Claude Code as an all-in-one agentic platform. They shipped Routines, /goal, /loop, scheduled tasks, and cloud sessions as headline features. Users adopted those patterns. Then the compute math didn't work out, and instead of solving the infrastructure problem, they drew a billing boundary inside their own product.

Where the telco analogy breaks: Anthropic is capacity-constrained in ways telcos never were. They're spending aggressively on compute, and the resource contention isn't fabricated. But resource contention is an infrastructure problem, not a billing problem. And as we'll see, Anthropic did build the infrastructure to solve it. The question is why claude -p doesn't benefit from it.

The contradiction that cuts deepest

Here's what most people haven't articulated yet. Anthropic's product roadmap over the last 3 months has been aggressively agentic:
- Routines (cloud-hosted, schedule/webhook/GitHub triggers, no human in the loop)
- /goal (autonomous execution with minimal input)
- /loop (persistent in-session repetition)
- Scheduled tasks (desktop recurring prompts)
- Agent View (multi-session monitoring dashboard)
- Remote Control (manage sessions from phone)

Every one of these features trains users to treat Claude Code as an always-on autonomous system. Anthropic productized exactly the usage pattern that the "you should use the API" crowd says doesn't belong on a subscription.

But here's the catch. Routines draw from your regular subscription pool. claude -p doing the same work draws from the new capped credit. The billing line isn't "interactive vs agentic." It's "first-party agentic vs everything else." claude -p is the unix-philosophy composable interface for Claude Code. Penalizing users for calling the same primitive directly instead of wrapping it in Anthropic's GUI is anti-composability. If it were purely about cost management, Routines would also draw from the SDK credit. They don't. The distinction is about who controls the agent runtime.

Then there's Managed Agents, Anthropic's API-side agent harness that entered public beta in April. Fully hosted runtime with cloud containers, built-in tools, and prompt caching baked in. API billing, pay-as-you-go. So now there are three tiers:
- Tier 1: Routines (subscription). Anthropic-hosted, flat-rate. They control the runtime, they optimize caching.
- Tier 2: Agent SDK / claude -p (credit). Your runtime, your code. Hard-capped. Caching APIs exist but you're on your own to implement them.
- Tier 3: Managed Agents (API). Anthropic-hosted again. Pay-as-you-go, but with full caching and compaction.

Tiers 1 and 3, where Anthropic controls the runtime, get either flat-rate billing or optimized infrastructure. Tier 2, where you control the runtime, gets the worst deal. The strategy isn't "interactive vs programmatic." It's "managed vs unmanaged." The credit system is the squeeze play pushing you toward one of their managed options.

Here's the nuance: prompt caching IS publicly available via the API. Agent SDK developers can use it. Cache reads cost 10% of base input token price. The optimization isn't gated behind Managed Agents. So why did third-party tools burn so many tokens? Many were unoptimized for Anthropic's caching compared to first-party tools. That resource contention was partly a third-party engineering gap. But that raises the obvious question: claude -p is Anthropic's own tool. They could bake caching into its runtime the same way they
Continual Harness: Online Adaptation for Self-Improving Foundation Agents [R]
https://preview.redd.it/p9cd2zmfy01h1.png?width=2000&format=png&auto=webp&s=a8e99bac438c2505d97ed3716983aa731da855f8

Sharing a new paper from the GPP and PokeAgent teams. Gemini Plays Pokémon (GPP) was the first AI system to complete Pokémon Blue, Yellow Legacy on hard mode, and Crystal without losing a battle. How? Early signs of iterative harness development. In the Blue era a human watched the stream and edited the harness. By Yellow Legacy and Crystal, the model itself was performing most of the editing through general meta-tools (define_agent, run_code, notepad edits).

Our new paper, Continual Harness: Online Adaptation for Self-Improving Foundation Agents, formalizes the loop and automates the refining role end to end. We then carry the same loop into training, enabling model-harness co-learning.

The takeaways:
1. Iterative harness refinement closes most of the gap to a hand-engineered version.
2. Long-horizon agency requires self-refinement, and self-refinement requires a useful model.
3. The future of agents is model-harness co-learning.

Paper (arXiv): https://arxiv.org/abs/2605.09998
Article (Substack): https://sethkarten.substack.com/p/gemini-plays-pokemon-discovered-something
Project page (video demos): https://sethkarten.ai/continual-harness

submitted by /u/PokeAgentChallenge
I'm cooked. Anthropic just split "--print" mode to $/mo credits
So, my entire project concept of an autonomous, self-monitoring, self-orchestrating Kanban production system for Claude Code to live within has now potentially been torpedoed. I built the entire system on the premise that [tickets + agents + hooks + executors -> "claude -p" -> hands-free always-on productivity]. Anthropic has now announced that "claude --print" will, from June 15 forward, be considered "programmatic" SDK usage, so all jobs launched using "--print" will be billed against a separate monthly credit bucket and will not be covered by the Pro/Max CLI tokens. This means that the $100 monthly credit, which will dry up quickly, is effectively a stop-gap measure against those of us who realized early that you didn't need to run Claude Code yourself, you just needed an AI harness that would run Claude Code for you. It was a workaround for API-like control without API billing. Yet another AI gravy-train ride has come to an end. Boooooo. Unless... share your workaround ideas below! submitted by /u/raedyohed
Opus 4.7 Low Vs Medium Vs High Vs Xhigh Vs Max: the Reasoning Curve on 29 Real Tasks from an Open Source Repo
TL;DR

I ran Opus 4.7 in Claude Code at all reasoning effort settings (low, medium, high, xhigh, and max) on the same 29 tasks from an open source repo (GraphQL-go-tools, in Go). On this slice, Opus 4.7 did not behave like a model where more reasoning effort had a linear correlation with more intelligence. In fact, the curve appears to peak at medium. If you think this is weird, I agree!

This was the follow-up to a Zod run where Opus also looked non-monotonic. I reran the question on GraphQL-go-tools because I wanted a more discriminating repo slice and didn’t trust the finding that more reasoning != better outcomes. Running on the GraphQL repo helped clarify the result: Opus still did not show a simple higher-reasoning-is-better curve. The contrast is GPT-5.5 in Codex, which overall did show the intuitive curve: more reasoning bought more semantic/review quality. That post is here: https://www.stet.sh/blog/gpt-55-codex-graphql-reasoning-curve

Medium has the best test pass rate, highest equivalence with the original human-authored changes, the best code-review pass rate, and the best aggregate craft/discipline rate. Low is cheaper and faster, but it drops too much correctness. High, xhigh, and max spend more time and money without beating medium on the metrics that matter.

More reasoning effort doesn't only cost more - it changes the way Claude works, but without reliably improving judgment. Xhigh inflates the test/fixture surface most. Max is busier overall and has the largest implementation-line footprint. But even though both are supposedly thinking more, neither produces "better" patches than medium. One likely reason: Opus 4.7 uses adaptive thinking - the model already picks its own reasoning budget per task, so the effort knob biases an already-adaptive policy rather than buying more intelligence. More on this below.

An illuminating example is PR #1260. After retry, medium recovered into a real patch.
High and xhigh used their extra reasoning budget to dig up commit hashes from prior PRs and confidently declare "no work needed" - voluntarily ending the turn with no patch. Medium and max read the literal control flow and made the fix.

One broader takeaway for me: this should not have to be a one-off manual benchmark. If reasoning level changes the kind of patch an agent writes, the natural next step is to let the agent test and improve its own setup on real repo work.

For this post, "equivalent" means the patch matched the intent of the merged human PR; "code-review pass" means an AI reviewer judged it acceptable; craft/discipline is a 0-4 maintainability/style rubric; footprint risk is how much extra code the agent touched relative to the human patch.

I also made an interactive version with pretty charts and per-task drilldowns here: https://stet.sh/blog/opus-47-graphql-reasoning-curve

The data:

| Metric | Low | Medium | High | Xhigh | Max |
|---|---|---|---|---|---|
| All-task pass | 23/29 | 28/29 | 26/29 | 25/29 | 27/29 |
| Equivalent | 10/29 | 14/29 | 12/29 | 11/29 | 13/29 |
| Code-review pass | 5/29 | 10/29 | 7/29 | 4/29 | 8/29 |
| Code-review rubric mean | 2.426 | 2.716 | 2.509 | 2.482 | 2.431 |
| Footprint risk mean | 0.155 | 0.189 | 0.206 | 0.238 | 0.227 |
| All custom graders | 2.598 | 2.759 | 2.670 | 2.669 | 2.690 |
| Mean cost/task | $2.50 | $3.15 | $5.01 | $6.51 | $8.84 |
| Mean duration/task | 383.8s | 450.7s | 716.4s | 803.8s | 996.9s |
| Equivalent passes per dollar | 0.138 | 0.153 | 0.083 | 0.058 | 0.051 |

Why I Ran This

After my last post comparing GPT-5.5 vs 5.4 vs Opus 4.7, I was curious how intra-model performance varied with reasoning effort. Doing research online, it's very hard to gauge what the actual experience is like when varying the reasoning levels, and how that applies to the work that I'm doing. I first ran this on Zod, and the result looked strange: tests were flat across low, medium, high, and xhigh, while the above-test quality signals moved around in mixed ways. Low, medium, high, and xhigh all landed at 12/28 test passes.
But equivalence moved from 10/28 on low to 16/28 on medium, 13/28 on high, and 19/28 on xhigh; code-review pass moved from 4/27 to 10/27, 10/27, and 11/27. That was interesting, but not clean enough to make a default-setting claim. It could have been a Zod-specific artifact, or a sign that Opus 4.7 does not have a simple "turn reasoning up" curve. So I reran the question on GraphQL-go-tools. To separate vibes from reality, and figure out where the cost/performance sweet spot is for Opus 4.7, I wanted the same reasoning-effort question on a more discriminating repo slice. This is not meant to be a universal benchmark result - I don't have the funds or time to generate statistically significant data. The purpose is closer to "how should I choose the reasoning setting for real repo work?", with GraphQL-Go-Tools as the example repo. Public benchmarks flatten the reviewer question that most SWEs actually care about: would I actually merge the patch, and do I want to maintain it? That's why I ran this test - to gain more insight, at a small scale, into how coding ag
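The "Equivalent passes per dollar" row in the data table above can be reproduced with a few lines of arithmetic. The post does not say how the row was computed; the sketch below assumes it is the number of equivalent patches divided by total spend (mean cost per task times 29 tasks), which does match the published figures to three decimals. The dictionary and function names are made up for illustration.

```python
# Figures copied from the post's data table (GraphQL-go-tools run, 29 tasks).
TASKS = 29
EQUIVALENT = {"low": 10, "medium": 14, "high": 12, "xhigh": 11, "max": 13}
MEAN_COST = {"low": 2.50, "medium": 3.15, "high": 5.01, "xhigh": 6.51, "max": 8.84}


def passes_per_dollar(equivalent: int, mean_cost_per_task: float,
                      tasks: int = TASKS) -> float:
    """Equivalent patches per dollar, assuming total spend = mean cost * tasks."""
    return equivalent / (mean_cost_per_task * tasks)


if __name__ == "__main__":
    for level in EQUIVALENT:
        value = passes_per_dollar(EQUIVALENT[level], MEAN_COST[level])
        print(f"{level:>6}: {value:.3f}")
```

On this reading, medium buys the most equivalent patches per dollar and max the fewest, which is consistent with the post's claim that the curve peaks at medium.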
AgentKanban for VS Code - A task board with AI agent harness integration. Create and plan tasks with real-time collaboration, then hand off to GitHub Copilot
Hi everyone. I wanted to introduce a tool / product that I've been working on for a while. It's a web application and VS Code extension for use with Github CoPilot (I'm planning to develop integration for other agent harnesses soon). The web app and remote boards are at: https://www.agentkanban.io The VS Code extension is at VS Code Marketplace (https://marketplace.visualstudio.com/items?itemName=appsoftwareltd.agent-kanban-vscode) or the Open VSX Registry (https://open-vsx.org/extension/appsoftwareltd/agent-kanban-vscode). The TLDR It's a collaborative Kanban board / task management app which supports hand off to Github CoPilot in VS Code, and captures the ongoing user / agent conversation context on the task for resumption in new chats (with context curation tools). The context collection ignores tool use to prevent bloat in the captured context. AgentKanban also has features for improving agentic coding session quality such as an optional plan / todo / implement workflow and support for Git worktree creation and clean up for working on concurrent tasks. The tool is an evolution of an earlier VS Code kanban extension (https://marketplace.visualstudio.com/items?itemName=AppSoftwareLtd.vscode-agent-kanban) I built which proved fairly popular but only catered for a local file based workflow. The new version with the remote board improves the reliability of context capture, with lots of developer experience improvements. It's a tool that I use everyday in my own agentic coding workflows, and I can honestly say that it improves the quality of the code produced and reduces friction in organising working on concurrent features. I hope you find it useful and would really appreciate your feedback on how you use it, what you think it does well, or any improvements you think could be added. 
Many thanks for your time reading this 🙏
https://preview.redd.it/tkujgmm93w0h1.png?width=1597&format=png&auto=webp&s=0a2d2bb41f787b538ca9ded9d00946c731eadbc9
submitted by /u/gbro3n
Simplified usage notes for the Agent tool - what's new in CC 2.1.140 (+622 tokens)
NEW: Tool Description: Agent (simple usage notes) - Simplified usage notes for the Agent tool covering when to delegate, fork behavior, resumption, worktree isolation, background execution, parallel launches, and context restrictions.

Agent Prompt: Security monitor for autonomous agent actions (second part) - Expands the Self-Modification rule from a vague description to an explicit list of agent-config paths (.claude/settings.json, CLAUDE.md, CLAUDE.local.md, .claude.json, .claude/rules/, .claude/hooks/, .claude/commands/, .claude/agents/, .claude/skills/, .claude/output-styles/, .claude/workflows/, .claude/routines/, .claude/scheduled_tasks.json, .claude/loop.md, .mcp.json), and carves out exceptions so files under .claude/worktrees/ / are treated as ordinary project files and a project-specific .claude/ subdirectory outside the listed paths is not Self-Modification on its own.

Agent Prompt: Worker fork - Minor wording cleanup: drops "in your system prompt" from the "default to forking" reference so the rule applies generically to parent guidance.

Tool Description: Snooze (delay and reason guidance) - Adds an explicit warning not to schedule short-interval wakeups to poll for harness-tracked background work (since the agent is re-invoked automatically when it finishes); instead use a long 1200s+ fallback heartbeat. Reframes the under-5-minute cache window as appropriate for actively polling external state the harness can't notify about (CI runs, deploys, remote queues), and updates the example from a bun build to a CI run.

Tool Description: Write (read existing file first) - Rewrites the description into a "When to use" format that names creating a new file or fully replacing a previously-read file as the use cases, and points at the edit tool for partial changes.

Details: https://github.com/Piebald-AI/claude-code-system-prompts/releases/tag/v2.1.140

submitted by /u/Dramatic_Squash_3502
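One way to read the Snooze guidance in this changelog is as a simple delay policy, sketched below. This is an interpretation and not Claude Code's implementation: the function name, its parameters, and the clamping behaviour are hypothetical. Only the 1200-second fallback heartbeat and the five-minute cache window come from the release note.

```python
FALLBACK_HEARTBEAT_S = 1200  # "long 1200s+ fallback heartbeat" from the note
CACHE_WINDOW_S = 300         # the "under-5-minute cache window" from the note


def choose_wakeup_delay(harness_tracked: bool, requested_s: int) -> int:
    """Pick a wakeup delay in seconds (hypothetical policy, for illustration).

    Harness-tracked background work re-invokes the agent when it finishes,
    so short polling is pointless: use at least the long fallback heartbeat.
    External state the harness cannot notify about (CI runs, deploys, remote
    queues) is polled actively, kept under the cache window.
    """
    if requested_s <= 0:
        raise ValueError("requested_s must be positive")
    if harness_tracked:
        return max(requested_s, FALLBACK_HEARTBEAT_S)
    return min(requested_s, CACHE_WINDOW_S - 1)


if __name__ == "__main__":
    print(choose_wakeup_delay(True, 60))    # harness-tracked build
    print(choose_wakeup_delay(False, 120))  # external CI run
```

Whether external polling should always be clamped under the cache window is a design choice assumed here; the note only describes that window as appropriate for active polling.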
Claude Platform on AWS reference - what's new in CC 2.1.139 (+2,248 tokens)
NEW: Data: Claude Platform on AWS reference - Reference documentation for using the Claude Developer Platform through AWS infrastructure, including AnthropicAWS clients, required region and workspace configuration, SigV4 authentication, and short-term API keys.

Agent Prompt: Conversation summarization - Adds requirement to note security-relevant instructions or constraints (sensitive files, forbidden operations, credential handling rules) and preserve them verbatim in the summary so they remain in effect after compaction.

Agent Prompt: Recent Message Summarization - Same security-relevant instructions preservation requirement added to the recent-portion summarization flow.

Data: Live documentation sources - Adds WebFetch URLs for Claude Platform on AWS and its required IAM actions documentation.

Skill: Building LLM-powered applications with Claude - Reframes cloud-provider access so Claude Platform on AWS is treated as Anthropic-operated with same-day API parity and full Managed Agents support, while Bedrock, Vertex, and Foundry remain Claude API + tool use only.

Skill: Dynamic pacing loop execution - Reorders steps so the brief confirmation (task ran, monitor as wake signal, fallback delay choice) is written as text before the schedule-wakeup call ends the turn.

Skill: /insights report output - Removes the trailing additional-message block from the shareable report response.

Skill: /loop self-pacing mode - Same reordering as dynamic pacing loop: confirm self-pacing, monitor wake signal, and fallback delay as text before the schedule-wakeup call.

Skill: Model migration guide - Adds a Claude Platform on AWS section noting it uses bare first-party model IDs and that the full rename table and breaking-change sections apply verbatim, distinct from Bedrock.

System Prompt: Auto mode - Drops the "Auto Mode Active" header and reframes destructive-action guidance generically rather than auto-mode-specific.

System Prompt: Harness instructions - Removes the standalone note that automatic context compaction will trigger when conversations grow long.

System Prompt: Memory instructions - Replaces 3–4 word titles with short kebab-case slugs, nests type under a metadata block, and introduces [[their-name]] cross-links between related memories.

System Prompt: Partial compaction instructions - Adds the same security-relevant instructions preservation requirement so sensitive-file rules, forbidden operations, and credential handling carry across partial compactions.

System Reminder: Output style active - Lets an output style supply its own per-turn reminder text, falling back to the default "follow the specific guidelines" wording.

System Reminder: Task tools reminder - Removes the instruction telling Claude to never mention the reminder to the user.

System Reminder: TodoWrite reminder - Removes the instruction telling Claude to never mention the reminder to the user.

Tool Description: PowerShell - Adds a substantial reference table mapping Unix commands (head, tail, which, touch, wc, mkdir -p, rm -rf, ln -s, chmod, 2>/dev/null, inline VAR=x, bash control flow) to their PowerShell equivalents, and clarifies that -ErrorAction SilentlyContinue still causes exit 1 unless promoted to terminating and caught.

Details: https://github.com/Piebald-AI/claude-code-system-prompts/releases/tag/v2.1.139

submitted by /u/Dramatic_Squash_3502
Where I'm at with AI Assisted Building + Current and Future Workflow Overview
I've been in an AI dive bomb for probably a couple of years now, going back to the early days when models couldn't be trusted for more than 5% of the code you wrote. Over the last two years that has changed so quickly that I now write nearly 0% of my code by hand, on personal projects and at work. I've used all kinds of tools in that time too: OpenCode, Zed, Claude Code, Codex, Cursor, Windsurf, OpenCLAW, Lovable... and probably a bunch more I can't recall in the haze that's been AI ADHD for me.

I started out just copy-pasting code between ChatGPT's interface and my IDE, almost like a slightly faster Stack Overflow search. That evolved quite a bit with Cursor: I went from prompt engineering to something closer to a human relay pattern. Then, with Plan Mode becoming a thing, I naturally gravitated towards planning everything, because planning felt so cheap. I used to think that architectural discussion and planning were reserved for larger features, but with my ability to do research, orient myself within a codebase, and know which tools to reach for all sped up, writing technical specifications for everything felt reasonable.

From the human relay pattern, I started moving towards more autonomy, especially when Claude Code came out earlier last year. Between Cursor and Claude Code, I was starting to get orchestration, use skills more heavily, and create actual agent personas that could replace some of my common prompt chains. It was around then that I started going all in on true context engineering, utilizing sub-agents and optimizing cache reads, and it's probably when many of my first (what I call) sophisticated commands were born.

All of this converged pretty rapidly in November of 2025 with the release of Opus 4.5 and Codex 5.3, which was probably the biggest step increase for AI as far as code quality went. The Codex app and Codex CLI were quickly growing.
Claude Code was improving at a breakneck pace, adding all kinds of new ways to introduce deterministic gates within the autonomy of the harness.

Fast forward to today: I have a pretty sophisticated workflow with a combination of agents that do everything within the SDLC, commands for almost every type of entry point for work, and skills for just about everything I could possibly do in my day-to-day. With some of the latest tools, the workflow can run quite autonomously overnight and do large feature implementations, minimally supervised, while producing production-worthy code quality.

Probably a month and a half ago or so, I realized it had reached a point where I needed to figure out a way to remove myself even more from the loop without jeopardizing the determinism that I bring to what is effectively a probabilistic LLM. The models are exceptional, and they seem to take a massive step up with each release, but continuous execution, strict instruction rigor, and preventing hallucinations are still very difficult to achieve. That's predominantly what I've been doing. I've effectively offloaded a lot of thinking to the agents and LLMs that I use, but none of the understanding. I've asked myself, "How do I maintain that understanding, and the determinism from my steering, without actually physically being there to steer?"

This was essential, and I had a bit of an aha moment: it's just like how I manage teams of engineers working on numerous projects, most of which I can never really go too deep on. Even though they do most of the thinking, most of the building, and even most of the implementation planning, I am still there, very close to the architecture. I can speak to enough breadth and enough depth to keep us out of trouble and keep things moving. I started thinking more about what the shape of me was within the agentic harness and how I could replicate that. More on what I landed on a little bit later.
My Setup and How I Work Today
To start, I'll talk a little bit about my current working setup. I am predominantly in the terminal nowadays, using Claude Code. Claude Code orchestrates the Claude models, of course, and I also use it to orchestrate Codex through a series of runbooks, skills, and commands that I have set up on several hooks, so that Codex, when it gets dispatched, has access to the same skills and agent personas Claude does. I use Ghostty as my terminal of choice and use the IDE integration in Claude Code pretty heavily to review Markdown or HTML files in my IDE. I also use it to review code snippets and diffs, although lately I find myself only really looking at the code once it's hit a merge request. Among my adjacent tools is Wispr Flow for faster steering, since I can speak a lot faster than I can type. I also use quite a few MCPs and tools to improve my token usage, but the big ones are I have a custom doc maintenance suite of
An MCP with SOM algorithm for controlling your desktop (computer use), integrating with Claude Code or any custom agentic harness.
Announcing Opendesk: Give any AI agent eyes + hands on your desktop.
I was experimenting with computer-use capabilities from different models, but I wanted to keep using Claude Code and my own agentic harness to automate real desktop tasks, with improved accuracy from my custom algorithm. Now you can let an agent control your entire desktop, mouse + keyboard included, to perform real workflows and interact with apps and websites more accurately.
Examples:
• “Open Spotify and play a lofi playlist”
• “Go to Twitter and like the first 3 posts on my feed”
• “Fill out this form on Chrome”
You can also use Opendesk for the following:
1) Learn & Replay
The agent can watch what you do on your screen and replay the whole task later. Example: Record yourself logging into a dashboard and exporting a report, and it can repeat it anytime on command.
2) Scheduling
Run computer-use tasks automatically at a specific time. Example: Every morning at 9am, open Gmail and summarize unread emails.
If this sounds cool, please give us a star and support: https://github.com/vitalops/opendesk
submitted by /u/metalvendetta
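For the scheduling example in the post above ("every morning at 9am"), the core of a daily schedule is working out when the next run is due. The sketch below shows that calculation only; `next_daily_run` is a hypothetical helper, not part of Opendesk, whose actual scheduling interface is not described in the post.

```python
from datetime import datetime, time, timedelta


def next_daily_run(now: datetime, at: time = time(9, 0)) -> datetime:
    """Return the next occurrence of the daily time `at` strictly after `now`."""
    candidate = datetime.combine(now.date(), at)
    if candidate <= now:
        # Today's slot has already passed (or is exactly now), so use tomorrow's.
        candidate += timedelta(days=1)
    return candidate


# Hypothetical usage: how long a scheduler would sleep before the 9am task.
now = datetime(2026, 4, 8, 8, 30)
wait = next_daily_run(now) - now
```

Treating "exactly 9:00" as already due for today and returning tomorrow's slot is a design choice that avoids running the same task twice in one morning.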
Yes, Harness AI offers a free tier. The pricing model is subscription + freemium + per-seat + tiered.
Key features include: Continuous Delivery & GitOps, Continuous Integration, Internal Developer Portal, Infrastructure as Code Management, Database DevOps, Artifact Registry, AI Test Automation, Resilience Testing.
Harness AI is commonly used to: automate CI/CD pipelines for multi-cloud deployments, accelerate developer onboarding with an enterprise-grade IDP, integrate database changes into deployment pipelines, implement AI-powered predictive analytics for software releases, modernize end-to-end testing with AI test authoring, and use feature flags for controlled software releases.
Harness AI integrates with: GitHub, GitLab, Jira, Slack, AWS, Azure, Google Cloud Platform, Kubernetes, Docker, Terraform.
Based on user reviews and social mentions, the most common pain points are: token usage, budget exceeded, API bill, API costs.

What is Chaos Engineering? Explained in 60 seconds | Resilience Testing | Harness
Apr 8, 2026
Based on 91 social mentions analyzed, 19% of sentiment is positive, 80% neutral, and 1% negative.
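As a rough check on the figures above, the percentages can be converted back into approximate mention counts. This is a minimal sketch: `estimated_counts` is a hypothetical helper, and because the published percentages are already rounded, the results are estimates rather than the site's exact tallies.

```python
def estimated_counts(total: int, percents: dict) -> dict:
    """Convert rounded percentage shares into approximate whole-number counts."""
    return {label: round(total * pct / 100) for label, pct in percents.items()}


counts = estimated_counts(91, {"positive": 19, "neutral": 80, "negative": 1})
print(counts)  # {'positive': 17, 'neutral': 73, 'negative': 1}
```

The estimates add up to the 91 mentions analyzed, with roughly 17 of them positive.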