Why context quality determines code quality
Users generally appreciate Augment Code for its ability to streamline and enhance coding tasks, with specific mentions of improved automation and access to cutting-edge AI and coding techniques. Key complaints revolve around occasional bugs and inefficiencies, particularly in handling complex codebases and token usage, which some users find to be resource-intensive. The pricing sentiment is mixed, with users acknowledging the value it offers but noting that the costs can add up, especially with heavy use. Overall, Augment Code maintains a favorable reputation for its innovation in AI-assisted coding, although users suggest room for further refinement and cost optimization.
Mentions (30d): 5
Reviews: 0
Platforms: 2
Sentiment: 23% (7 positive)
Industry: information technology & services
Employees: 75
Funding Stage: Venture (Round not Specified)
Total Funding: $252.0M
Pricing found: $20/month, $60/month, $200/month
Where I'm at with AI Assisted Building + Current and Future Workflow Overview
I've been in an AI dive bomb for probably a couple of years now. In the early days, models couldn't be trusted for more than 5% of the code you wrote. Over the last 2 years that's evolved so quickly that I now write nearly 0% of my code by hand, on personal projects and at work. I've used all kinds of tools in that time too: OpenCode, Zed, Claude Code, Codex, Cursor, Windsurf, OpenCLAW, Lovable... and probably a bunch more I can't recall in the haze that's been AI ADHD for me.

I started by just copy-pasting code between ChatGPT's interface and my IDE, almost like a slightly faster Stack Overflow search. That evolved quite a bit with Cursor: I went from prompt engineering to something closer to a human relay pattern. Then, with Plan Mode becoming a thing, I naturally gravitated towards planning everything, because planning felt so cheap. I used to think that architectural discussion and planning were reserved for larger features, but with my ability to do research, orient myself within a codebase, and know what tools I have to reach for all expedited, doing technical specifications for everything felt reasonable.

From the human relay pattern, I started evolving into more autonomy, especially when Claude Code came out earlier last year. Between Cursor and Claude Code, I was starting to get orchestration, starting to use skills more heavily, and starting to create actual agent personas that could replace some of my common prompt chains. It was around then that I started going all in on true context engineering, utilizing sub-agents and optimizing cache reads, and it's probably when many of my first (as I call them) sophisticated commands were born.

All of this converged pretty rapidly in November of 2025 with what was probably the biggest step increase in AI code quality: the release of Opus 4.5 and Codex 5.3. The Codex app and Codex CLI were quickly growing.
Claude Code was improving at a breakneck pace, introducing all kinds of new ways to add deterministic gates within the autonomy of the harness.

Fast forward to today: I have a pretty sophisticated workflow with a combination of agents that do everything within the SDLC, commands for almost every type of entry point for work, and skills for just about everything I could possibly do in my day-to-day. With some of the latest tools, the workflow is able to run quite autonomously overnight and do large feature implementations, minimally supervised, while producing production-worthy code quality.

Probably a month and a half ago or so, I realized it had reached a point where I needed to figure out a way to remove myself even more from the loop without jeopardizing the determinism that I bring to what is effectively a probabilistic LLM. The models are exceptional, and they seem to have a massive step increase each release, but continuous execution, strict instruction rigor, and preventing hallucinations are still very difficult to achieve. That's predominantly what I've been working on. I've effectively offloaded a lot of thinking to the agents and LLMs that I use, but none of the understanding. I've asked myself, "How do I maintain that understanding, and maintain the determinism from my steering, without actually physically being there to steer?"

This was essential, and I had a bit of an aha moment: it's just like how I manage teams of engineers working on numerous projects, most of which I can never really go too deep on. Even though they do most of the thinking, most of the building, and even most of the implementation planning, I was still there, very close to the architecture. I could speak to enough breadth and enough depth to keep us out of trouble and keep things moving. I started thinking more about what the shape of me was within the agentic harness and how I could replicate that. More on what I landed on a little bit later.
My Setup and How I Work Today
To start, I'll talk a little bit about my current working setup. I am predominantly in the terminal nowadays, using Claude Code. Claude Code orchestrates the Claude models, of course, and I also use it to orchestrate Codex through a series of runbooks, skills, and commands that I have set up on several hooks, so that Codex, when it gets dispatched, has access to the same skills and agent personas Claude does. I use Ghostty as my terminal of choice and use the IDE integration in Claude Code pretty heavily to review Markdown or HTML files in my IDE. I also use it to review code snippets and diffs, although lately I find myself only really looking at the code once it's hit a merge request. Some of my adjacent tools are Wispr Flow for faster steering, since I can speak a lot faster than I can type, and quite a few MCPs and tools to improve my token usage, but the big ones are I have a custom doc maintenance suite of
Arkon: turning Claude from a personal chatbot into a managed organizational resource
Sharing a project I've been building. Not asking for anything in particular - just thought the problem and approach might be interesting to some folks here.

The problem
Most companies adopting LLMs hit the same wall: every employee uses ChatGPT or Claude individually, copy-pastes confidential docs into random chats, and the org has zero visibility or control. The "AI rollout" is really just a license purchase plus a prayer. On the other end, the heavy enterprise solutions (custom RAG platforms, Glean-style tools) are expensive, complex, and overkill for most mid-sized teams. There's a missing middle: small-to-medium organizations that want their employees to use Claude productively, but with proper access control, shared knowledge, and no manual context-pasting every single time.

The approach
Arkon sits between the org and Claude. Admins manage knowledge centrally. Employees connect to Arkon via MCP (Model Context Protocol) and automatically get the right context for who they are, without configuring anything. Two realms:
- Global Knowledge - org-wide docs and wiki, scoped by department. A finance person sees finance docs, an engineer sees engineering docs. Admins decide who sees what.
- Workspaces - smaller scopes for projects, teams, or cross-functional initiatives. Membership-gated. Your global role doesn't bleed into workspaces - you only see workspaces you're a member of.
The MCP integration means employees keep using Claude the way they already do (Claude Desktop, Claude Code, whatever client they prefer). They don't learn a new tool. They just suddenly have org context available when they need it.

How wiki generation actually works
This is the part I think is interesting and slightly different from typical RAG setups. Arkon isn't a retrieval-augmented chatbot. It's an LLM-generated wiki layer. When you upload a document - say a 300-page handbook - Arkon uses an LLM to analyze the structure and produce a hierarchical wiki. If the source has clear headings, the wiki follows them. If not, the LLM clusters content by topic semantically. The output is a browsable, organized internal reference, not a linear summary. I'm honest with users about the tradeoff: LLM-generated content has no guarantee of accuracy, especially for deep domain material. So there's a human-in-the-loop layer in the roadmap - employees can flag, annotate, and edit wiki content. The LLM does the organizational heavy lifting; humans own final correctness.

Permissioning lessons learned
The biggest design pivot so far: I initially had roles carry both what you can do and what you can do it on in one bag. This led to a classic bug - give a user "read documents" and suddenly they could read every document in the org, ignoring department scope. Fixed it by splitting cleanly:
- Permissions are scoped strings: doc:read:own_dept vs doc:read:all
- Workspaces are pure membership checks - global roles cannot grant workspace access, ever
- Two realms, fully independent
If anyone is building org-level permission systems, that separation is worth getting right early. Retrofitting it is painful.

Repo: github.com/nduckmink/arkon
Happy to answer questions about architecture, MCP integration, or the permission model. Feedback and criticism welcome - especially from anyone who has built or used internal knowledge systems and seen what works and what doesn't.
submitted by /u/Glass-Statistician97
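The permission split described in the Arkon post can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Arkon's actual code: the `User`, `Document`, and `Workspace` types and both check functions are hypothetical names, and only the two scoped strings `doc:read:own_dept` and `doc:read:all` come from the post.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class User:
    # Hypothetical user record: a department plus scoped permission strings.
    name: str
    department: str
    permissions: frozenset = frozenset()


@dataclass(frozen=True)
class Document:
    title: str
    department: str


@dataclass
class Workspace:
    name: str
    members: set = field(default_factory=set)


def can_read_document(user: User, doc: Document) -> bool:
    """Global realm: the scope is part of the permission string itself."""
    if "doc:read:all" in user.permissions:
        return True
    if "doc:read:own_dept" in user.permissions:
        return user.department == doc.department
    return False


def can_access_workspace(user: User, workspace: Workspace) -> bool:
    """Workspace realm: a pure membership check that never reads permissions."""
    return user.name in workspace.members


# Illustrative data only.
alice = User("alice", "finance", frozenset({"doc:read:own_dept"}))
admin = User("root", "it", frozenset({"doc:read:all"}))
budget = Document("Q3 budget", "finance")
runbook = Document("Deploy runbook", "engineering")
launch = Workspace("launch", members={"alice"})
```

Because `can_access_workspace` never looks at `permissions`, even a user holding `doc:read:all` stays out of a workspace they have not been added to, which is the "two realms, fully independent" property the post describes.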
What's new in CC 2.1.124 (+166 tokens) and CC 2.1.126 (-87 tokens)
NEW: System Reminder: File modification detected (budget exceeded) - Tells the agent when a user or linter changed a file but the diff was omitted because other modified files already exceeded the snippet budget, and directs it to read the file if current content is needed.
System Prompt: Harness instructions - Replaces the core-identity function call with explicit introductory-line and security-note insertion points before the shared harness instructions.
System Prompt: REPL tool usage and scripting conventions - Clarifies that thenable shorthand results are auto-awaited only at return time, so inline uses such as concatenation, templates, or arguments to another call must be awaited first.
Details: https://github.com/Piebald-AI/claude-code-system-prompts/releases/tag/v2.1.124
REMOVED: System Reminder: Malware analysis after Read tool call - Removed the reminder that asked agents to consider whether each file read is malware and to analyze malware without improving or augmenting it.
Details: https://github.com/Piebald-AI/claude-code-system-prompts/releases/tag/v2.1.126
submitted by /u/Dramatic_Squash_3502
Releasing the Data Analyst Augmentation Framework (DAAF) version 2.1.0 today -- still fully free and open source! In my very biased opinion: DAAF is now finally the best, safest, AND easiest way to get started using Claude Code for responsible and rigorous data analysis
https://preview.redd.it/o74lppqd86zg1.png?width=1456&format=png&auto=webp&s=3a904bae42b8130e2c6382be55debe8f6ef4d6ca

When I launched the Data Analyst Augmentation Framework v2.0.0 six weeks ago, I wrote that the major update was about going “from usable to useful” -- rebuilding the orchestrator system for maximum flexibility and efficiency, adding a variety of more responsive engagement modes, and deepening the roster of methodological knowledge that DAAF could pull upon as needed for causal inference, geospatial analysis, science communication and data visualization, supervised and unsupervised machine learning, and much, much more. But while DAAF continued to get more capable and more useful for those actually using it… Well, it was still extremely annoying to use, generally obtuse, and hard to get started with, which means a lot of people who were interested were simply bouncing off of it. That all changes with the v2.1.0 update, which I’m cheekily calling the Frictionless Update for three key reasons:

1. Installation happens in one line now
From a fresh computer to talking with a DAAF-empowered Claude Code in no more than ten minutes on a decent internet connection. This is really it:
https://preview.redd.it/tiglwl3f86zg1.png?width=1038&format=png&auto=webp&s=3ec92cf797af5e0b91a2d46ef8cfb2976cbff802
Which means it’s easier than ever to get started with Claude Code and DAAF in a highly curated, secure environment. To that point, you still need Docker Desktop installed (I’ll talk about that more in a sec), but no more faffing about with a bunch of ZIP file downloads and commands in the terminal. The simplicity of this is even crazier, given that…

2. DAAF now comes bundled with everything you need to make it your main AI-empowered research environment
No more messing around with external programs, installations, extensions, etc. It just works from the get-go, with everything you need to thrive in your new AI-empowered research workflows with Claude from the moment you run the install line.
https://preview.redd.it/q3pdj36g86zg1.png?width=1456&format=png&auto=webp&s=56ed822da68e773a9b7253ce6aa5a95abc057788
Thanks to code-server, DAAF automatically installs a fully-featured version of VSCode in the container, accessible in your favorite browser: file editing, version control management, file uploads and downloads, markdown document previews, smart code editing and formatting, the works. Reviewing and editing whatever you work on with DAAF has never been easier.

DAAF also now comes with an in-depth and interactive session log browser that tracks everything Claude Code does every step of the way. See its thinking, what files it loads and references, which subagents it runs, and look through any code it’s written, read, or edited across any project/session/etc. Full auditability and transparency are absolutely mission-critical when using AI for any research work, so you can truly verify everything it’s doing on your behalf and form a much more refined and critical intuition for how it works (and how/when/why it fails!). One of the most important failure modes I’ve discovered with AI assistants (DAAF included) is that they simply don’t load the proper reference materials or follow workflow instructions; this is the single most important diagnostic tool to identify and fight said issues, which I frankly think everyone should be doing in any context with LLM assistants. This took a lot of elbow-grease, but I think it’s the single most important thing I could do to help people actually understand what the heck Claude Code gets up to and review its work more thoroughly.
https://preview.redd.it/jkocy45h86zg1.png?width=1456&format=png&auto=webp&s=6848b5a01ef958fa051a3246a1e6b13beef91e80
These two big new bundled features are in addition to installing Claude Code, the entire DAAF orchestration system, bespoke references to facilitate Claude’s rigorous application of pretty much every major statistical methodology you’ll need, deep-dive data documentation for 40+ datasets from the Urban Institute Education Data Portal, curated Claude permissioning systems and security defenses, automatic context and memory management protocols designed for reproducible research workflows, and a high-performance and fully reproducible Python data science/analysis environment that just works -- no need to worry about dependencies, system version conflicts, or package management hell.
https://preview.redd.it/wzaotr5i86zg1.png?width=1456&format=png&auto=webp&s=91390402dfe3666a90472f6e878364ddcd1fb740
With the magic of Docker, everything above happens instantly and with zero effort in one line of code from your terminal. And perhaps most importantly (and why I will keep dying on the hill of trying to get people to use Docker): setting up DAAF and Claude Code in this Docker environment offers critical guardrails (like firewalling off its file access to only those things you explicitly allow) and security (like creating a convenient sy
half-deployed AI projects haunt my github
Got 47 repos that start with 'just playing with Claude' or 'testing Llama 4 on'. Every single one dead after three commits. Like you get this spark, right? Midnight scrolling leads to some random implementation of retrieval-augmented generation for your personal notes. Brain goes full steam. You're already planning the deployment pipeline while pip installing transformers. Then day two hits. The model's hallucinating your grocery lists into poetry (weirdly beautiful but useless). Your GPU's crying. And suddenly you remember you have actual work that pays actual money. But here's the thing that gets me. These aren't just abandoned experiments, they're digital ghosts of pure optimism. Each one represents that exact moment when everything seemed possible, when you thought you'd crack the code this time, when the future felt close enough to touch. Now I scroll past them looking for that one functional script I actually need. Graveyard of good intentions, all named some variation of 'ai-helper-v2-final-actually-final'. Anyone else got a git log that reads like a museum of broken dreams? submitted by /u/NefariousnessLow9273
Recommended Plugins/Tooling/Tips for managing Ansible (Code Base Hygiene/Documentation Management/Workflow) via Claude?
I'm a Linux Sysadmin rather than a Dev, and I have recently discovered how much Claude has levelled up, and can see many different ways it can not just augment code writing and debugging but also help with workflow optimisation and admin toil. I work mainly in Ansible for automation, and have one primary git repo for my codebase at work; we're a relatively small team/environment. I work in quite a toil-heavy, reactive environment and have had a creeping documentation backlog for the last few months, but basically how I'm planning to use Claude is to:

1. Analyse my code base, track down inconsistencies and errors, and flag potential security risks.
2. Also hook into my AWX server's API and other APIs to gather information on the setup there. (Both of the above will then form the basis of a scripted weekly Team code hygiene report.)
3. Read my existing documentation to get an idea of document template structure, formatting and my writing style.
4. Whilst it is doing all the above, maintain ongoing tracking and recording of pertinent reference information on coding style and standards, in-use conventions and code structures, cross-referenced with information in the Docs to build a cohesive technical understanding of my code base.
5. Leverage this to draft process documents, fed back into Claude to further clarify and improve its understanding (for values of LLM) of
6. As I am working with it on new projects and actively discussing design choices, this context can be further used in fresh documentation, with any changes in process or standard config then backported to other common areas of code and documentation to ensure I have a coherent whole at both technical and documentation level.
7. Further branch out my documentation into Standards and Processes, training materials, reference guides for Dev Teams and other stakeholders, quick reference materials, you name it.

It's light years ahead of Copilot/ChatGPT in terms of depth of technical comprehension for troubleshooting and debugging in and out of code (again, for values of LLM), but I'm actually even more excited about its potential as a workflow optimisation tool. This is not only going to help dig me out of my current toil backlog but fill in the hole and concrete over it afterwards.

I've been optimising my setup to be token efficient already and have created a number of dynamically loading custom skills, such as a coding-mode that loads all my technical conventions, coding best practices and structure templates, a doc-mode that loads comprehension within the scope of documentation writing, other skills for updating files containing Claude's tracking of any changes, and another for triggering consistency checks across multiple documents.

I am however relatively unfamiliar with the wealth of 3rd party plugins and other tooling to augment Claude, so my question is: can anybody recommend any extra tooling or features out there that I might use to further leverage or optimise what I'm trying to achieve here, or otherwise offer any useful tips or suggestions I may not be aware of, before I go reinventing any wheels too much? Thanks in advance!
submitted by /u/motorleagueuk-prod
How to give Claude Code 'Cursor AI' goggles
Recently used Cursor AI (free tier for 3 free queries a month) to resolve an issue in 10 mins that Claude Code Opus could not resolve in 2 hours. Simple reason was that Cursor quickly got a grasp on meaningful end-to-end parity relationships across my entire codebase and quickly hunted down the culprit. I was impressed and then I had questions. Cursor charges almost the SAME sub cost $ as Claude Code yet it is NOT an LLM. It's a bunch of powerful proprietary toolsets designed to make your LLM "see" your code correctly. Cursor is a "holistic" augmented IDE that uses real-time indexing and background linting to assist your active coding flow, blah blah blah. Claude Code on the other hand is a top-down autonomous agent that plans and executes sequentially. They both do the same 'sort' of thing but try to get to similar results very differently. Disclaimer - by the way, CC is way more useful and powerful overall, let's not kid ourselves.

Being the 'resourceful' person I like to pretend I always am, I tried to approximate this type of capability in Claude Code. Here's what I got below. PS I used AI to format this table and content below so don't drag me over the coals.

| MCP Server | Functional Benefit | Cursor AI Equivalent |
|---|---|---|
| mcp-code-search | Semantic Index: Maps the "meaning" of your code so you can search for concepts (e.g., "how we handle phase") rather than just exact text. | @Codebase / Semantic Search |
| lsp (via clangd) | Symbolic Map: Understands the "laws" of C++. It traces ripples, finds every reference of a function, and jumps to definitions with 100% precision. | "Go to Definition" / Symbol Indexing |
| mcp-memory | Persistent Brain: Remembers architectural decisions and project rules across different days and sessions so I don't have to "re-learn" your project. | (Cursor lacks persistent memory) |
| filesystem | Direct Access: Gives me high-speed read/write access to your local project folders without me having to "ask" for file contents repeatedly. | Integrated Explorer |
| sequential-thinking | Logic Scratchpad: Allows me to break down complex bugs (like your IPC state-machine issues) into steps before I touch a single line of code. | "Advanced Reasoning" mode |

I used Opus to run some comparison tests and apparently I am at something like 70-80% functional parity with Cursor AI, although that's hard to actually quantify. I also ask it stuff at the conclusion of my conversation like 'how much longer would this have taken you without the so-and-so MCPs / Cursor AI powers you've now got?' and get mostly very positive 'reviews' from Claude Code and comparative proof (which are really just estimations, I know!)

Few more notes
-------------------
- Use Claude Code itself to install/configure these MCPs. You'll save yourself a lot of stuffing around, TRUST ME!
- Use a Post-Edit Re-index Hook to keep your data fresh (avoids having to remember to reindex your codebase manually every new session)
- Update your CLAUDE.md file to prioritise your nav tools so that it can take advantage of your newly added search tools (example only text below)
Navigation: LSP first, then MCP (`juce-docs`, `memory`, `code-search`), then Grep/Glob as fallback.

What I have personally noticed in 4 weeks of use?
--------------------------------------------
Let me preface by saying I know my codebase and I've got a good grasp on what is considered implementation 'success' for MY project and what baseline methods I used to help CC get me there as accurately and fast as possible for the last 6 months. What have I noticed now? Snappier, more contextual processing / graph-based searching of my codebase (no blind grepping; it actually 'walks the graph' rather than doing a keyword search, and jumps to relevant files rather than scanning my whole repo every time), better ripple edits (less guessing + quickly detects cross-file impact), better total hit rates, more tailored, targeted responses, + just peace of mind that I've got that 'extended' type of capability when and if helpful.
I'm sure at least some of this is placebo, but if I trust Opus to help me write entire applications then I should technically also be taking it at face value when it's outright telling me that these tools have proven measurably useful in getting faster, more accurate results at the end of the session. Anyway, thought to post here in case someone else was interested in giving it a go and seeing what mileage they may get out of it. Peace..... submitted by /u/ThesisWarrior
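For the "Post-Edit Re-index Hook" note in the post above, here is a rough sketch of what such a hook entry might look like, generated from Python so it can be inspected before being merged into a settings file by hand. It assumes the hook layout used by Claude Code settings (a `PostToolUse` entry with a `matcher` and a `command` hook); check the current hooks documentation before relying on it. The `reindex-codebase` command is purely a placeholder, not a real tool.

```python
import json

# Assumption: placeholder for whatever command refreshes your code-search index.
REINDEX_COMMAND = "reindex-codebase --changed-only"


def build_hook_settings(command: str) -> dict:
    """Build a settings fragment that runs `command` after file-editing tools."""
    return {
        "hooks": {
            "PostToolUse": [
                {
                    # Assumed matcher: fire after the Edit or Write tools run.
                    "matcher": "Edit|Write",
                    "hooks": [{"type": "command", "command": command}],
                }
            ]
        }
    }


settings_fragment = build_hook_settings(REINDEX_COMMAND)
print(json.dumps(settings_fragment, indent=2))
```

Printing the fragment rather than writing it to disk keeps the sketch side-effect free; where the fragment belongs depends on your own setup.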
Codebase-scale retrieval using AST-derived graphs + BM25: reducing LLM context from 100K to 5K tokens [D]
Wanted to share an approach I've been using for retrieval-augmented generation over large codebases and get feedback from people thinking about similar problems.

The problem
Naive codebase RAG typically works by chunking files into text segments and embedding them for similarity search. This breaks down on code because semantic similarity at the chunk level doesn't capture structural relationships: a function in file A calling a type defined in file C won't surface that dependency through embedding proximity alone.

The approach: AST-derived typed graphs
Instead of chunking, I parse every file using Tree-sitter into its AST, then extract a typed node/edge graph:
- Nodes: functions, classes, interfaces, types, modules
- Edges: imports, exports, call relationships, inheritance, composition
This gets stored in SQLite as a persistent graph. Parse cost is one-time per project.

Retrieval: BM25 over graph nodes
At query time, instead of embedding similarity, I run BM25 scoring over node metadata (names, signatures, docstrings, file paths). Top-scoring nodes get passed to the LLM. The graph structure means a retrieved function automatically pulls in its direct dependencies via edge traversal. Empirically this lands at ~5K tokens per query on medium-large codebases that would otherwise require ~100K tokens with naive full-context approaches.

Hierarchical fallback for complex queries
For multi-file reasoning tasks:
- A Mermaid diagram of the full graph serves as a persistent architectural map, always in context
- BM25 node retrieval handles targeted lookup
- At 70% context capacity, a fast model compresses least-relevant nodes before passing to the primary model

Why BM25 over embeddings here
Code identifiers (function names, type names, module paths) are highly distinctive lexically. BM25 outperforms embedding similarity on exact and near-exact identifier matching, which is the dominant retrieval pattern in code queries. Embeddings would likely help more for natural language docstring queries; I haven't benchmarked that comparison rigorously yet.

Open questions I'm still thinking about:
- Better edge-weighting strategies for the graph (currently all edges are unweighted)
- Whether re-ranking with a cross-encoder would meaningfully improve precision over BM25 alone
- Handling dynamic languages where call graphs can't be fully resolved statically

Has anyone tackled codebase-scale RAG differently? Particularly curious if anyone's compared AST-graph approaches against embedding-based chunk retrieval on real codebases with quantitative benchmarks.
submitted by /u/Altruistic_Night_327
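The retrieval step in the post above (BM25 over node metadata, then pulling in direct dependencies by edge traversal) can be sketched without any dependencies. This is a toy illustration under stated assumptions, not the author's implementation: the `NODES` and `EDGES` data are made up, the graph lives in plain dictionaries rather than SQLite, Tree-sitter parsing is skipped entirely, and the scoring uses a common BM25 variant with conventional `k1` and `b` defaults.

```python
import math
import re
from collections import Counter

# Hypothetical node metadata (name -> signature/docstring text) and call/import edges.
NODES = {
    "auth.login": "def login(user, password) authenticate a user and create a session",
    "auth.hash_password": "def hash_password(password) derive a salted password hash",
    "db.Session": "class Session database session handle",
    "billing.charge": "def charge(account, amount) charge an account via the payment gateway",
}
EDGES = {
    "auth.login": ["auth.hash_password", "db.Session"],
    "billing.charge": ["db.Session"],
}


def tokenize(text):
    """Lowercase and split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every node against the query with BM25 over name + metadata tokens."""
    tokenized = {name: tokenize(name + " " + text) for name, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized.values()) / n_docs
    doc_freq = Counter()
    for toks in tokenized.values():
        doc_freq.update(set(toks))
    query_terms = tokenize(query)
    scores = {}
    for name, toks in tokenized.items():
        term_freq = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in term_freq:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = term_freq[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / norm
        scores[name] = score
    return scores


def retrieve(query, top_k=1):
    """Return the top-k matching nodes followed by their direct dependencies."""
    scores = bm25_scores(query, NODES)
    hits = sorted((n for n in scores if scores[n] > 0), key=lambda n: -scores[n])[:top_k]
    selected = list(hits)
    for name in hits:
        for dep in EDGES.get(name, []):
            if dep not in selected:
                selected.append(dep)
    return selected
```

Calling `retrieve("login user")` matches only `auth.login` lexically, and the edge traversal then adds `auth.hash_password` and `db.Session`, so the context sent to the model contains the function plus what it depends on and nothing else.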
Taught my 60-year-old dad (zero coding exp) Claude and Git in Feb. Today he built a RAG solution. I finally get "vibe coding."
My father teaches geology and has literally zero coding expertise. Back in February, I introduced him to Claude and taught him the absolute basics of how Git works. Fast forward to today: he actually implemented a functional RAG (Retrieval-Augmented Generation) solution for analyzing and querying his mineral documents. Seeing this happen made me finally understand why "vibe coding" has become such a thing. Don't get me wrong, I know a proper end-to-end solution engineer or architect is still leagues ahead of someone just prompting an AI. But it is surprisingly impressive how Claude Code can take a 60-year-old with absolutely zero experience and elevate him to the level of an average developer. submitted by /u/Longjumping-Host-617
Open-source 9-task benchmark for coding-agent retrieval augmentation. Per-task deltas +0.010 to +0.320, all evals reproducible [P]
Sharing an open-source benchmark suite (paper-lantern-challenges) that measures coding-agent performance with vs without retrieval-augmented technique selection across 9 everyday software tasks. Disclosure: I'm the author of the retrieval system under test (paperlantern.ai/code); the artifact being shared here is the benchmark suite itself, not the product. Every prompt, agent code path, and prediction file is in the repo and reproducible.

Setup. Same coding agent (Claude Opus 4.6 as the planner, Gemini Flash 3 as the task model), same input data, same evaluation scripts across all 9 tasks: test generation (mutation score), text-to-SQL (execution accuracy), PDF extraction, contract extraction, PR review, text classification, few-shot prompt selection, LLM routing, summarization evaluation. Independent variable: whether the agent could call a retrieval tool over CS literature before writing its solution. One pass per task, no retries, no manual filtering of outputs.

Task selection. Tasks were chosen to span the everyday-engineering surface a coding agent actually faces, not specialized ML scenarios. Selection criteria: (1) unambiguous quantitative metric, (2) baseline performance well below ceiling, (3) standard datasets where they exist, (4) eval reproducible on a free Gemini API key in roughly 10 minutes per task.

Eval methodology. Each task uses its task-standard quantitative metric (mutation score for test_generation, execution accuracy for text_to_sql, F1 on labeled spans for the extraction tasks, weighted F1 for classification, etc.). Full per-task scripts and dataset choices are in the repo - one directory per task, evaluate.py as the entry point, README.md per task documenting methodology and dataset.

Retrieval setup. The "with retrieval" agent has access to three tool calls: explore_approaches(problem) returns ranked candidate techniques from the literature, deep_dive(technique) returns implementation steps and known failure modes for a chosen technique, and compare_approaches(candidates) gives a side-by-side comparison when multiple options look viable. The agent decides when and how often to call them. Latency is roughly 20s per call; results cache across sessions. The baseline agent has none of these tools but otherwise identical scaffolding.

Comparability. Both agents share the same task-specific user prompt; the only system-prompt difference is the retrieval agent's tool-call grammar. Predictions and per-task prompts are diffable in the repo (baseline/ and with_pl/ subdirectories per task).

Results.

Task                  Baseline  With retrieval  Delta
extraction_contracts  0.444     0.764           +0.320
extraction_schemas    0.318     0.572           +0.254
test_generation       0.625     0.870           +0.245
classification        0.505     0.666           +0.161
few_shot              0.193     0.324           +0.131
code_review           0.351     0.395           +0.044
text_to_sql           0.650     0.690           +0.040
routing               0.744     0.761           +0.017
summeval              0.623     0.633           +0.010

The test-generation delta came from the agent discovering mutation-aware prompting - the techniques are MuTAP and MUTGEN - which enumerate every AST-level mutation of the target and require one test per mutation. Baseline wrote generic tests from pretrain priors. The contract extraction delta came from BEAVER (section-level relevance scoring) and PAVE (post-extraction validation), both 2026 techniques that post-date the agent's training. 10 of the 15 most-cited sources across the experiments were published in 2025 or later, which is the conservative argument for why retrieval matters: the agent could not have reached these techniques from parametric memory.

Failure modes. Self-refinement hurt text-to-SQL (the agent second-guessed correct queries after reading work on SQL ambiguity). Two suggested techniques (DyT, SeeDNorm) were architecture-incompatible in the autoresearch experiment and got discarded. Retrieval surfaces better options, not guaranteed wins.

Reproducibility. Every prompt, every line of agent code, every prediction file, every eval script is in the repo. Each task directory has a README documenting methodology and an approach.md showing exactly what the retrieval surfaced and which technique the agent chose.

Repo: https://github.com/paperlantern-ai/paper-lantern-challenges
Writeup with detailed per-task discussion: https://www.paperlantern.ai/blog/coding-agent-benchmarks

Happy to share additional design choices in comments.

submitted by /u/kalpitdixit
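To sanity-check the per-task deltas reported in the benchmark post above, here is a minimal Python sketch that recomputes them from the published baseline and with-retrieval scores. The function names (deltas, summarize) are illustrative and not part of the benchmark repo, and the unweighted mean is only a rough summary because the nine tasks use different metrics.

```python
# Scores copied from the results in the post: (baseline, with retrieval).
RESULTS = {
    "extraction_contracts": (0.444, 0.764),
    "extraction_schemas": (0.318, 0.572),
    "test_generation": (0.625, 0.870),
    "classification": (0.505, 0.666),
    "few_shot": (0.193, 0.324),
    "code_review": (0.351, 0.395),
    "text_to_sql": (0.650, 0.690),
    "routing": (0.744, 0.761),
    "summeval": (0.623, 0.633),
}


def deltas(results):
    """Return {task: with_retrieval - baseline}, rounded to 3 decimals."""
    return {task: round(w - b, 3) for task, (b, w) in results.items()}


def summarize(results):
    """Rank tasks by delta (largest first) and compute the unweighted mean delta."""
    d = deltas(results)
    ranked = sorted(d.items(), key=lambda kv: kv[1], reverse=True)
    mean_delta = round(sum(d.values()) / len(d), 3)
    return ranked, mean_delta


if __name__ == "__main__":
    ranked, mean_delta = summarize(RESULTS)
    for task, delta in ranked:
        print(f"{task:<22}{delta:+.3f}")
    print(f"mean delta (unweighted): {mean_delta:+.3f}")
```

Every recomputed delta matches the value listed in the post, and the range runs from +0.010 (summeval) to +0.320 (extraction_contracts), as the title states.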
Opus 4.7 doesn't want to make the change?
I keep running into Claude blocking my prompts for game dev. I found this one funny because the naming for this skill (self-destruct) probably triggers some red flag for malware. Anyone else running into this? submitted by /u/KiriHair
The MCP Coding Toolkit Your Agent Desires!
A little over a year ago we released the first version of Serena. What followed was 13 months of hard human work which recently culminated in the first stable release. Today, we present the first evaluation of Serena's impact on coding agents.

Evaluation approach

Rather than reporting numbers on synthetic benchmarks, we had the agents evaluate the added value of Serena's tools themselves. We designed the methodology to be unbiased and representative, and we've published it in full so you can run an eval on your own projects with your preferred harness. The methodology is described here.

Selected results

Opus 4.6 (high effort) in Claude Code, large Python codebase: "Serena's IDE-backed semantic tools are the single most impactful addition to my toolkit - cross-file renames, moves, and reference lookups that would cost me 8–12 careful, error-prone steps collapse into one atomic call, and I would absolutely ask any developer I work with to set them up."

GPT 5.4 (high) in Codex CLI, Java codebase: "As a coding AI agent, I would ask my owner to add Serena because it gives me the missing IDE-level understanding of symbols, references, and refactorings, turning fragile text surgery into calmer, faster, more confident code changes where semantics matter."

What's changed since earlier versions

This release of Serena gives coding agents true IDE-level code intelligence - symbol lookup, cross-file reference resolution, and semantic refactorings (including rename, move, inline and propagating deletions). The practical effect is that complex operations that would otherwise require many careful text-based tool calls become single atomic operations, with higher accuracy and lower token usage. Serena's symbolic edit tools are an augmentation of built-in edits that will save tokens on almost every write. No other toolkit or harness currently on the market offers such features. Think of it this way: any serious programmer prefers using an IDE over a text editor, and Serena is the equivalent for your coding agents.

If you tried Serena before and were not convinced, we encourage you to give it another look. The most common issues have been addressed, and performance and UX have been overhauled. A frequent complaint was that agents didn't remember to use Serena's tools - we've added hooks to solve this. Documentation has been significantly expanded, and setup has been simplified. Join us on Discord.

Beyond Raw LSP

Many clients offer some level of LSP support, but Serena's LSP integration goes well beyond raw LSP calls. Serena adds substantial logic on top, which is why it took a year to build and why the results differ meaningfully from LSP integrations in other tools.

Availability and Pricing

The LSP backend is free and fully open-source. The JetBrains backend requires a paid plugin at $5/month - this is our only source of revenue from the project.

Background

What Serena is not: It is not slopware, a hype project that will die in a few months, a toy or a proof of concept. It's also not backed by a big company, investors or sponsors. This project represents over a year of focused work from my co-developer and me. The many community contributions allowed us to support over 40 programming languages. We have tens of thousands of active users and 23k GitHub stars, but we think Serena is still underknown relative to what it offers. If you work with coding agents, we'd encourage you to try it out!

submitted by /u/Left-Orange2267
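The contrast the post draws between "text surgery" and symbol-level refactoring can be illustrated with a small Python sketch. This is not Serena's implementation or API. It is a toy built on the standard-library ast module (Python 3.9+ for ast.unparse), and the names RenameFunction and rename_symbol are hypothetical. It also ignores scoping, so an unrelated local variable with the same name would be renamed too, which a real language-server-backed rename would avoid.

```python
import ast

SOURCE = '''
def load(path):
    return open(path).read()

print("load the file:", load("notes.txt"))
'''


class RenameFunction(ast.NodeTransformer):
    """Rename a function definition and every identifier that refers to it."""

    def __init__(self, old, new):
        self.old, self.new = old, new

    def visit_FunctionDef(self, node):
        self.generic_visit(node)  # also rewrite names inside the function body
        if node.name == self.old:
            node.name = self.new
        return node

    def visit_Name(self, node):
        # Only identifier nodes are touched; string literals are ast.Constant.
        if node.id == self.old:
            node.id = self.new
        return node


def rename_symbol(source, old, new):
    """Return `source` with the symbol `old` renamed to `new` (scope-unaware toy)."""
    tree = RenameFunction(old, new).visit(ast.parse(source))
    return ast.unparse(tree)


if __name__ == "__main__":
    print(rename_symbol(SOURCE, "load", "read_text"))
    # A plain text replace also rewrites the string literal, which is the
    # kind of collateral damage a symbol-aware rename avoids.
    print(SOURCE.replace("load", "read_text"))
```

The syntax-tree rename changes the definition and the call but leaves the string "load the file:" alone, whereas the plain text replace rewrites all three occurrences.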
I heard Meta is hosting "Claudonomics" internally, a leaderboard for token usage for coding...
It is a leaderboard and gamification system introduced by Meta to track and encourage the use of AI coding tools among its employees, specifically using Claude models. It is a clear signal that the biggest companies in the world are encouraging AI-augmented coding to boost their productivity. Here is the question, though: does AI-augmented coding really boost productivity? If it does, is that the case for all devs or just Meta-grade super devs? To answer this question and truly test and rank the productivity of different devs across the globe using Claude models, particularly Claude Haiku 4.5, I built ClankerRank (https://clankerrank.xyz). It too maintains a leaderboard, not of the users who use Claude the most, but of those who use Claude correctly enough to solve real-world, production-level problems. The idea is to see whether users make any difference when everyone is using Claude. submitted by /u/Equivalent-Device769
A Claude memory retrieval system that actually works (easily) and doesn't burn all my tokens
TL;DR: By talking to Claude and explaining my problem, I built a very powerful local "memory management" system for Claude Desktop that indexes project documents and lets Claude automatically retrieve relevant passages buried inside those documents during Co-Work sessions. For me it solves the "document memory" problem, where tools like NotebookLM, Notion, Obsidian, and Google Drive can't be queried programmatically. Claude did all of it; I didn't really have to do anything. The description below includes plenty of things that I don't completely understand myself. The key thing is just to explain to Claude what the problem is (which I describe below) and what your intention is, and Claude will help you figure it out. It was very easy to set this up, and I think it's better than what I've seen any YouTuber recommend.

The details: I have a really nice solution to the Claude external memory/external brain problem that lots of people are trying to address. Although my system is designed for one guy using his laptop, not a large company with terabytes of data, the general approach I use could be scaled up just by substituting different tools. I wanted to create a Claude external memory system that is connected to Claude Co-Work in the desktop app. What I really wanted was for Claude to proactively draw from my entire base of knowledge for each project, not just from the documents I dropped into my project folder in Claude Desktop. Basically, I want Claude to have awareness of everything I have stored on my computer, in the most efficient way possible (Claude can use lots of tokens if you don't manage the "memory" efficiently).

I've played with Notion and Google Drive as an external brain. I've tried NotebookLM. And I was just beginning to research Obsidian when I read this article, which I liked very much and highly recommend: https://limitededitionjonathan.substack.com/p/stop-calling-it-memory-the-problem

That got my attention, so I asked Claude to read the document and give me his feedback based on his understanding of the projects I was trying to work on. Claude recommended using SQLite to connect to structured facts, an optional graph to show some relationships, and .md files for instructions to Claude. But... I pointed out that almost all of the context information I would want to be retrievable from memory is text in documents, not structured data.

Claude's response was very helpful. He understood that although SQLite is good at single-point facts, document memory is a different challenge. For documents, the challenge isn't storing them; it's retrieving the right passage when it's relevant without reading everything (which consumes tokens). SQLite can store text, but storing a document in a database row doesn't solve the retrieval problem. You still need to know which row to pull.

I asked if NotebookLM from Google might be a better tool for indexing those documents and making them searchable. Claude explained that what I was describing is a Retrieval-Augmented Generation (RAG) problem. The standard approach:

1. Documents get chunked into passages (e.g., 500 words each).
2. Each chunk gets converted to an embedding, a vector that captures its meaning.
3. When Claude needs context, it converts the query to the same vector format and finds the semantically closest chunks.
4. Those chunks get injected into the conversation as context.

This is what NotebookLM is doing under the hood. It's essentially a hosted, polished RAG system. NotebookLM is genuinely good at what it does, but it has a fundamental problem for my case: It's a UI, not infrastructure. You use it; Claude can't. There's no API, no MCP tool, no way to have Claude programmatically query it during a Co-Work session. It's a parallel system, not an integrated one. So NotebookLM answers "how do I search my documents as a human?", not "how does Claude retrieve the right document context automatically?"

After a little back and forth, here's what we decided to do. For me, a solo operator with only a laptop's worth of documents that need to be searched, Claude proposed a RAG pipeline that looks like this:

My documents (DOCX, PDF, XLSX, CSV)
↓
Text extraction (python-docx, pymupdf, openpyxl)
↓
Chunking (split into ~500 word passages, keep metadata: file, folder, date)
↓
Embedding (convert each chunk to a vector representing its meaning)
↓
A local vector database + vector extension (store chunks + vectors locally, single file)
↓
MCP server (exposes a search_knowledge tool to Claude)
↓
Claude Desktop (queries the index when working on my business topics)

With that setup, when you're talking to Claude and mention an idea like "did I pay the overdue invoice" or "which projects did Joe Schmoe help with," Claude searches the index, gets the 3-5 most relevant passages back, and uses them in its answer without you doing anything. We decided to develop a search system like that, specific to each of my discrete projects. Th
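The chunk, embed, and search stages of the pipeline described in the post above can be sketched in a few lines of Python. This is a minimal illustration under loud assumptions, not the poster's actual setup: embed() is a toy bag-of-words stand-in for a real embedding model, the index is an in-memory list rather than a local vector database, and search_knowledge here is a plain function that only borrows the name of the MCP tool mentioned in the post. The helper names chunk_words and build_index are hypothetical.

```python
import math
import re
from collections import Counter


def chunk_words(text, source, size=500):
    """Split text into passages of about `size` words, keeping source metadata."""
    words = text.split()
    return [
        {"source": source, "start_word": i, "text": " ".join(words[i:i + size])}
        for i in range(0, len(words), size)
    ]


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words term-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(docs, size=500):
    """Chunk every document and store (vector, chunk) pairs in memory."""
    index = []
    for source, text in docs.items():
        for chunk in chunk_words(text, source, size):
            index.append((embed(chunk["text"]), chunk))
    return index


def search_knowledge(index, query, k=3):
    """Return the k chunks whose vectors are closest to the query vector."""
    q = embed(query)
    scored = sorted(index, key=lambda item: cosine(q, item[0]), reverse=True)
    return [chunk for _, chunk in scored[:k]]


if __name__ == "__main__":
    docs = {
        "invoices.txt": "The overdue invoice from Acme was paid on March 3",
        "projects.txt": "Joe Schmoe helped with the website redesign project",
    }
    index = build_index(docs, size=5)
    for hit in search_knowledge(index, "did I pay the overdue invoice", k=1):
        print(hit["source"], "->", hit["text"])
```

Swapping embed() for a real embedding model and the list for a vector store changes the quality of the matches but not the shape of the flow: only the top-k passages, not whole documents, are handed back as context, which is what keeps token usage down.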
Key features include: Implement; Review; Plan, then execute; Remember what matters; Prompts, enhanced; Commit history; Codebase patterns; External sources.
Augment Code is commonly used for: automating code reviews to ensure functional correctness and style adherence; enhancing collaborative coding sessions by providing contextual suggestions; maintaining a comprehensive understanding of legacy code for easier refactoring; integrating with CI/CD pipelines to streamline deployment processes; facilitating knowledge transfer among team members through documented tribal knowledge; and identifying and suggesting improvements based on codebase patterns.
Augment Code integrates with: GitHub, GitLab, Bitbucket, JIRA, Slack, Trello, Visual Studio Code, JetBrains IDEs, CircleCI, Travis CI.
Based on user reviews and social mentions, the most common pain points are: token usage, budget exceeded, token cost, API costs.

Introducing SwipeReview™ by Augment Code
Apr 1, 2026
Based on 31 social mentions analyzed, 23% of sentiment is positive, 77% neutral, and 0% negative.