ModelOp is the leading AI lifecycle management and governance platform helping enterprises bring ML, GenAI, Agentic AI, and vendor AI into production
ModelOp is appreciated for its focus on AI model management and operationalization, offering strong capabilities for integrating and deploying complex machine learning models in enterprise environments. However, specific critiques or complaints about ModelOp are not highlighted in the available reviews and social mentions. Pricing aspects of ModelOp aren't directly discussed in the provided data. Overall, ModelOp seems to maintain a positive reputation for its specialization in model operations, though there is limited direct user feedback to draw comprehensive conclusions from.
Mentions (30d)
26
Reviews
0
Platforms
2
Sentiment
0%
0 positive
Industry
information technology & services
Employees
44
Funding Stage
Series B
Total Funding
$16.0M
$4.2M SaaS founder. 8 months on claude. my honest read on which model to use for what.
Bay area. franchise ops SaaS. 8 years in. $4.2M ARR. 22 employees. 8 months into using claude across most of my workflow. wanted to share what i've actually learned about model selection because nobody at my level writes about this.
my opinion. you should be using 3 different claude models for 3 different jobs. most founders i talk to are using one model for everything and it's hurting them.
opus 4.7 (the new flagship). i use this for any work where the cost of being wrong is high. board memos. customer escalation responses. legal docs. acquisition outreach. work where i'd spend 4 hours writing and editing myself. opus produces a draft in 8 minutes that's 90% of where i'd end up after 4 hours. the cost saving is real. the marginal quality improvement over sonnet for high-stakes work is also real.
sonnet 4.6. my workhorse for high-volume daily work. emails, summarizing meetings, drafting slack updates, processing customer feedback into themes. i probably hit sonnet 200+ times a week. cheaper, faster, and for "i need a competent draft i'll edit" work, it's the right tool.
haiku 4.5. for repeated structured work. transcribing voice notes into action items, parsing customer support tickets into categories, batch-classifying things. haiku is what i'd use if i was building automation. nobody talks about haiku because it's not glamorous. it's the model i use most via API.
my actual cost split. about $80/month on the claude pro plan (opus + sonnet via the app). about $140/month on API costs (mostly haiku for automation, some sonnet for batch work).
what i learned that surprised me. using opus for everything is wasteful AND hurts your output. opus is over-thoughtful for low-stakes work. sonnet is faster and better-calibrated for "i just need a competent answer." the difference between opus and sonnet is most visible in writing tasks where TONE matters. legal docs, board memos, sensitive customer comms. for "summarize this meeting" tasks, sonnet is equally good.
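A minimal sketch of the three-way split described above, assuming a task can be tagged as high-stakes or as repeated structured work. The `Task` fields, the `pick_model` helper, and the tier strings are hypothetical labels for illustration, not real API model IDs.

```python
from dataclasses import dataclass

# Tier labels taken from the post; plain strings, not real API model IDs.
OPUS, SONNET, HAIKU = "opus", "sonnet", "haiku"


@dataclass(frozen=True)
class Task:
    name: str
    high_stakes: bool = False  # cost of being wrong is high (board memo, legal doc)
    structured: bool = False   # repeated structured work (classification, parsing)


def pick_model(task: Task) -> str:
    """Pick a tier: high stakes first, then structured batch work, else the workhorse."""
    if task.high_stakes:
        return OPUS
    if task.structured:
        return HAIKU
    return SONNET


for task in (
    Task("board memo", high_stakes=True),
    Task("meeting summary"),
    Task("support ticket classification", structured=True),
):
    print(f"{task.name} -> {pick_model(task)}")
```

Checking `high_stakes` first is a design choice of this sketch: a task that is both structured and costly to get wrong goes to the strongest tier.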
claude code is its own conversation. i use it for analysis tasks that touch files. running our customer cohort analysis. generating cohort retention reports. that's mostly opus inside claude code. submitted by /u/Strong-Reserve-3232
Agentic Workflow Visualization and API Gateway
I am building an API gateway for agents that makes agentic AI code model- and provider-agnostic. The visualization piece groups each agent run so that its multiple LLM calls and tool calls appear together, with details on tokens, cost, and model latency. None of this requires instrumentation in the agentic code. The agents (Python for now) are started by a Rust correlator that assigns a job_id to each agent, so API calls and tool calls (inferred from HTTP requests and responses) can be tracked across the entire agentic run. The servers are also in Rust. I also have an implementation where, instead of the Rust correlator, Python and other platform shims do the same job and the servers are in Go. I would appreciate comments from people in AI ops who use tools like litellm and Helicone and can provide feedback or complicated use cases. I plan to make everything open source, so I am looking for collaborators too. submitted by /u/High-Speed-Diesel
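A rough sketch of the grouping step, assuming each captured call already carries the job_id assigned at agent start. The `Call` record and its field names are hypothetical; the post does not describe the gateway's actual data model.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Call:
    job_id: str             # hypothetical: the ID the correlator assigned to the run
    kind: str               # "llm" or "tool"
    tokens: int = 0
    cost_usd: float = 0.0
    latency_ms: float = 0.0


def summarize_runs(calls):
    """Group calls by job_id and total up counts, tokens, cost, and latency per run."""
    runs = defaultdict(
        lambda: {"llm_calls": 0, "tool_calls": 0, "tokens": 0, "cost_usd": 0.0, "latency_ms": 0.0}
    )
    for call in calls:
        run = runs[call.job_id]
        run["llm_calls" if call.kind == "llm" else "tool_calls"] += 1
        run["tokens"] += call.tokens
        run["cost_usd"] += call.cost_usd
        run["latency_ms"] += call.latency_ms
    return dict(runs)


calls = [
    Call("job-1", "llm", tokens=1200, cost_usd=0.25, latency_ms=800.0),
    Call("job-1", "tool", latency_ms=50.0),
    Call("job-1", "llm", tokens=300, cost_usd=0.125, latency_ms=400.0),
    Call("job-2", "llm", tokens=500, cost_usd=0.5, latency_ms=600.0),
]
for job_id, totals in summarize_runs(calls).items():
    print(job_id, totals)
```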
I wrote a book on using Claude Code for people that don't code for a living - 2nd edition out now - free copy if you want one
About three and a half months ago I posted here about a book I'd written for non-developers using Claude Code - PMs, analysts, designers, ops people, engineers in non-software fields. Over 3,000 of you ended up reading it. Thank you, genuinely.
I'm a consulting engineer - Chartered (mechanical), 15 years in simulation modelling. I code Python but I'm not a software developer, if that distinction makes sense. Over the past 6 months I've been going deep on Claude Code, specifically trying to understand what someone with domain expertise but no real development background can actually build with it. Many people knew exactly what they needed but couldn't build it themselves. So I wrote a book about it aimed at exactly this demographic.
"Claude Code for the Rest of Us" - 24 chapters, covering everything from setup and first conversations through to building web prototypes, creating reusable skills, and actually deploying what you've built. It's aimed at technically capable people who don't write code for a living - product managers, analysts, designers, engineers in non-software domains, ops leads. That kind of person.
I just launched the second edition today. It's about 26% bigger than the first - roughly 16,000 new words. Three new chapters including:
Agent Teams - Running multiple Claude instances in parallel, coordinating via shared task lists and direct messages. Honest about when it's overkill (often).
Spec-Driven Development - Writing detailed specs before agents start building. Markdown, HTML, database-backed (Beads) - whichever fits the work.
The existing chapters got a heavy editorial pass too. Every model reference updated. Command Reference grew by 26% to cover the new CLI. Context Management got a 42% rewrite for the 1M token window.
Happy to offer a free PDF of the book in exchange for some honest feedback and a request for a review on Goodreads in a week's time (you are free to opt out from this ask by hitting unsubscribe after receiving the book).
Link: https://schoolofsimulation.com/claude-code-book Happy also to answer questions about Claude Code. Cheers. submitted by /u/bobo-the-merciful
100 Tips & Tricks for Building Your Own Personal AI Agent /LONG POST/
Everything I learned the hard way: 6 weeks, no sleep :), two environments, one agent that actually works.
The Story
I spent six weeks building a personal AI agent from scratch: not a chatbot wrapper, but a persistent assistant that manages tasks, tracks deals, reads emails, analyzes business data, and proactively surfaces things I'd otherwise miss. It started in the cloud (Claude Projects: shared memory files, rich context windows, custom skills). Then I migrated to Claude Code inside VS Code, which unlocked local file access, git tracking, shell hooks, and scheduled headless tasks. The migration forced us to solve problems we didn't know we had. These 100 tips are the distilled result. Most are universal to any serious agentic setup. The Claude Max 20x plan is a must. At the start it was 100% development vs. 0% real work, after 3 weeks 50/50, now about 20/80.
🏗️ FOUNDATION & IDENTITY (1–8)
1. Write a Constitution, not a system prompt. A system prompt is a list of commands. A Constitution explains why the rules exist. When the agent hits an edge case no rule covers, it reasons from the Constitution instead of guessing. This single distinction separates agents that degrade gracefully from agents that hallucinate confidently.
2. Give your agent a name, a voice, and a role, not just a label. "Always first person. Direct. Data before emotion. No filler phrases. No trailing summaries." This eliminates hundreds of micro-decisions per session and creates consistency you can audit. Identity is the foundation everything else compounds on.
3. Separate hard rules from behavioral guidelines. Hard rules go in a dedicated section, never overridden by context. Behavioral guidelines are defaults that adapt. Mixing them makes both meaningless: the agent either treats everything as negotiable or nothing as negotiable.
4. Define your principal deeply, not just your "user." Who does this agent serve? What frustrates them? How do they make decisions? What communication style do they prefer? "Decides with data, not gut feel. Wants alternatives with scoring, not a single recommendation. Hates vague answers." This shapes every response more than any prompt engineering trick.
5. Build a Capability Map and a Component Map, separately. Capability Map: what can the agent do? (every skill, integration, automation). Component Map: how is it built? (what files exist, what connects to what). Both are necessary. Conflating them produces a document no one can use after month three.
6. Define what the agent is NOT. "Not a summarizer. Not a yes-machine. Not a search engine. Does not wait to be asked." Negative definitions are as powerful as positive ones, especially for preventing the slow drift toward generic helpfulness.
7. Build a THINK vs. DO mental model into the agent's identity. When uncertain → THINK (analyze, draft, prepare, but don't block waiting for permission). When clear → DO (execute, write, dispatch). The agent should never be frozen. Default to action at the lowest stakes level, surface the result. A paralyzed agent is useless.
8. Version your identity file in git. When behavior drifts, you need git blame on your configuration. Behavioral regressions trace directly to specific edits more often than you'd expect. Without version history, debugging identity drift is archaeology.
🧠 MEMORY SYSTEM (9–18)
9. Use flat markdown files for memory, not a database. For a personal agent, markdown files beat vector DBs. Readable, greppable, git-trackable, directly loadable by the agent. No infrastructure, no abstraction layer between you and your agent's memory. The simplest thing that works is usually the right thing.
10. Separate memory by domain, not by date. entities_people.md, entities_companies.md, entities_deals.md, hypotheses.md, task_queue.md. One file = one domain. Chronological dumps become unsearchable after week two.
11. Build a MEMORY.md index file. A single index listing every memory file with a one-line description. The agent loads the index first, pulls specific files on demand. Keeps context window usage predictable and agent lookups fast.
12. Distinguish "cache" from "source of truth", explicitly. Your local deals.md is a cache of your CRM. The CRM is the SSOT. Mark every cache file with last_sync: header. The agent announces freshness before every analysis: "Data: CRM export from May 11, age 8 days." Silent use of stale data is how confident-but-wrong outputs happen.
13. Build a session_hot_context.md with an explicit TTL. What was in progress last session? What decisions were pending? The agent loads this at session start. After 72 hours it expires: stale hot context is worse than no hot context because the agent presents outdated state as current.
14. Build a daily_note.md as an async brain dump buffer. Drop thoughts, voice-to-text, quick ideas here throughout the day. The agent processes this during sync routines and routes items to their correct places. Structured memory without friction at ca
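The freshness announcement in tip 12 and the TTL in tip 13 can be sketched in a few lines. This assumes the `last_sync:` header holds an ISO date; the post names the header but not its value format, and the function names and sample file content here are hypothetical.

```python
from datetime import datetime, timedelta

HOT_CONTEXT_TTL = timedelta(hours=72)  # the 72-hour expiry from tip 13


def parse_last_sync(text):
    """Return the timestamp from a 'last_sync: <ISO date>' line, or None if absent."""
    for line in text.splitlines():
        if line.startswith("last_sync:"):
            return datetime.fromisoformat(line.split(":", 1)[1].strip())
    return None


def freshness_note(text, now):
    """Build the one-line freshness announcement printed before an analysis."""
    synced = parse_last_sync(text)
    if synced is None:
        return "Data: no last_sync header, freshness unknown."
    age_days = (now - synced).days
    return f"Data: synced {synced.date().isoformat()}, age {age_days} days."


def hot_context_expired(written_at, now, ttl=HOT_CONTEXT_TTL):
    """True once the hot-context file is older than its TTL and should be ignored."""
    return now - written_at > ttl


# Hypothetical cache file content; the year is made up for the example.
deals_md = "last_sync: 2025-05-11\n# Deals\n- Acme: negotiating\n"
print(freshness_note(deals_md, now=datetime(2025, 5, 19)))
print(hot_context_expired(datetime(2025, 5, 15), now=datetime(2025, 5, 19)))
```

Passing `now` in explicitly, rather than calling `datetime.now()` inside the helpers, keeps both checks easy to test.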
The Orchestrator
Not the loop inside one agent, but the layer that should exist between agents and the surfaces they run on. Right now an agent is mostly pinned to a process. It runs in one terminal, talks to one editor, owns one working directory, and holds one chat history. If you want it to do something on another machine, or in another repo, or while you are asleep, you either spin up a second agent that knows nothing about the first or you copy-paste context between them like it’s 2004 and you’re emailing yourself a Word document. That is not how humans work, and it is not how most durable systems work. A software engineer in a normal day moves between a laptop, an SSH session, a CI runner, a phone, and other people, and the person is the through-line. Identity, intent, and memory survive the surfaces. The surfaces are dumb. The person is the orchestrator. Agent stacks are inverted. The surface is smart, it has the model, the tools, and the history, and the identity is dumb. Open a new terminal, and the agent you were working with disappears. The new one shares the same name and almost nothing else. The orchestrator I keep sketching is the thing that fixes this. A few properties it would need: • Identity above sessions. A logical agent that exists independently of whichever process is currently embodying it. Sessions come and go; the agent persists. • Routing across surfaces. The agent should be able to say, "Do this on the box with the repo, that on the box with the GPU, and that on the phone in my pocket,” without treating them as unrelated machines. • A real handoff primitive. A typed object, what I was doing, what is unfinished, what is blocked, what I decided, and what I have not, so that any session can pick up and any other session can write back to it. Chat history is too lossy. • Peer agents, not just sub-agents. Two agents in different contexts, with different tools and permissions, coordinating on a shared goal through a control plane neither of them owns. 
• Cross-driver calls. “Have the cheap model summarise this and hand it to the expensive one” should be a primitive, not a prompt-engineering ritual. The orchestrator chooses the runtime per step based on cost, latency, and capability. • Approval surfaces that survive the session. If an agent pauses on an approval gate and I’m on my phone three time zones away, the approval should travel to me. The agent should not need to stay alive while waiting for a tap. None of this is really about making the model better. It is about where the model is allowed to live and how intent survives the death of any individual process. The model is the cheap, replaceable part. The orchestrator should be the boring, durable part. As of last week there are now at least three major terminal-native coding agents that people can realistically run locally: a local Ollama runtime, Google’s Gemini CLI, and xAI’s recently launched Grok Build with plan mode and parallel sub-agents. They overlap, but they are good at different things and cost very different amounts. Say I want to triage a flaky test, propose a fix, and have it reviewed before anything touches the branch. Today the way you “use all three” is by opening three terminals and becoming the message bus yourself. You paste the stack trace into one, copy its output into the next, ask the third to sanity-check it, and hope nothing got lost in transit. You are the orchestrator, and you are a bad one, slow, forgetful, and awake. What I want instead is a single intent (Claude Orchestrating) — “Triage this flake, propose a fix, get it reviewed." • Ollama, locally: ingest the test log, strip noise, and produce a structured failure summary. Never leaves the machine. Free. Sees nothing beyond the log. • Gemini CLI: take that summary plus the repo, identify the suspect change, and draft a patch. Large context, strong at reading code, brokered into read-only repo access. 
• Grok Build: take the patch and original failure and render a verdict, ship, revise, or escalate. Used intentionally as a second opinion from another model family. No write access. Three runtimes, three permission scopes, three cost tiers, one intent. The orchestrator owns the intent, decides which runtime gets which step, carries the handoff object between them (failure summary → patch → verdict), and surfaces the result as one approval instead of three disconnected conversations. If Grok says “escalate,” the orchestrator pauses the intent and pings my phone. If I approve hours later after the original Gemini session is long dead, a fresh session attaches to the same intent and applies the patch. The CLIs do not need to know about each other. They are interchangeable runtimes for work that outlives any of them. The part I’m least certain about is the identity layer underneath. Process-level agents are easy. Persistent logical agents are easy in theory and a nightmare in practice — the moment you create something that survives its session, you now h
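A minimal sketch of the handoff primitive and the triage example above, assuming stub functions in place of the three CLIs. `Handoff`, `Step`, `advance`, and the extra `apply` step are hypothetical names for illustration; no real runtime is invoked.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Handoff:
    """Typed handoff object that outlives any single session."""
    intent: str
    unfinished: List[str]                                     # step names still to run, in order
    artifacts: Dict[str, str] = field(default_factory=dict)  # step name -> output carried forward
    blocked_on: Optional[str] = None                         # set while waiting on an approval


@dataclass(frozen=True)
class Step:
    runtime: str                       # label only, e.g. "ollama"; nothing is launched
    run: Callable[[Handoff], str]
    pause_on: frozenset = frozenset()  # outputs that require human approval


def advance(handoff: Handoff, steps: Dict[str, Step]) -> Handoff:
    """Run unfinished steps in order until done or until an output needs approval."""
    while handoff.unfinished and handoff.blocked_on is None:
        name = handoff.unfinished.pop(0)
        output = steps[name].run(handoff)
        handoff.artifacts[name] = output
        if output in steps[name].pause_on:
            handoff.blocked_on = f"approval for {name}: {output}"
    return handoff


# Stub runtimes standing in for the three CLIs in the example.
steps = {
    "summarize": Step("ollama", lambda h: "failure summary"),
    "patch": Step("gemini-cli", lambda h: f"patch for [{h.artifacts['summarize']}]"),
    "review": Step("grok-build", lambda h: "escalate", pause_on=frozenset({"escalate"})),
    "apply": Step("gemini-cli", lambda h: f"applied {h.artifacts['patch']}"),
}

handoff = Handoff(
    "Triage this flake, propose a fix, get it reviewed.",
    unfinished=["summarize", "patch", "review", "apply"],
)
advance(handoff, steps)      # pauses after the review step says "escalate"
print(handoff.blocked_on, handoff.unfinished)

handoff.blocked_on = None    # approval arrives later, possibly in a fresh session
advance(handoff, steps)
print(handoff.artifacts["apply"])
```

Because all state lives in the `Handoff` object, the second `advance` call does not need anything from the session that made the first one.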
I used Claude AI to build an $86 million underground bunker bible. I have autism. This is my happy doc.
It all started with the floor plan of a real, existing Cold War AT&T Long Lines underground hardened relay station: 54,000 sq ft across three underground levels. I made the editorial decision to move it to a ridge in rural West Virginia, but I kept its blast rating, which was set to survive a 20-megaton airburst at 2.5 miles. That was the seed. Full scale prepper autism did the rest. It has since morphed into 3 spreadsheets, 86 tabs total:
• A food inventory across 20 categories tracking every freeze-dried and #10-can product I can find: ancient grains, heirloom legumes, 7 pasta cuts, dehydrated everything, shelf-stable cheese, the works
• A supply inventory with 3,466 line items across 36 categories: water systems, medical, dental, pharmacy, livestock, food production, barter metals, recreation, and yes, a full pest control and IPM tab
• A 30-section infrastructure specification with every system in the building engineered out
I fed it 150+ product manuals and parts order forms. The generator fleet alone is 13 units (10× Cummins C150N6 propane-primary, a C500N6 500 kW surge unit, and 2× diesel emergency fallback), all Cummins for parts commonality. Battery bank is 4,500 kWh LFP across 10 named banks (A through J, each with a designated role). There’s a 400,000 gallon underground propane farm across 40 ASME tanks in 8 clusters; I learned the exact burial incline and setback distance required to keep groundwater clean if a tank lets go. 120,000 gallons of diesel backup. 88 kW of solar. A 1,000,000-gallon internal water reserve fed by a 300-ft artesian well. Propane endurance: ~30 years normal ops with solar. Sealed-mode runs 8 to 4.5 years depending on scenario.
I actually set up a real LLC (online, $99) just to get access to US Foods and Sysco order forms so I could upload real commercial pricing and stock the food tabs more accurately. My original “what would I do if I won $10 million” thought experiment is now an $86,200,497 projected build cost.
That number is real. It comes from 24 budget sections with make/model line items, freight, install, and commissioning costs for everything from the Kubota K-Series MBR wastewater trains to the American Safe Room blast doors (14 of them, 50+ psi NBC/EMP-rated, Kaba Mas X-10 cipher locks) to the surface greenhouse. Claude turns vague ideas into engineering-grade detail: cross-references, failure modes, zone-specific storage rules, propane endurance by operating scenario, spare parts matrices. It’s like having a tireless survival engineer who genuinely loves spreadsheets. I’ll say “scan all sheets row by row for any item that lacks a minimum stock level” and it just… does it. Thoroughly. Every time. No complaints. So much of this is typed stimming. I’ve had exhaustive conversations with my psychologist about it; she’s aware, but not alarmed, and honestly the resulting digital bunker bible is scarily comprehensive. It even has a cover tab now. Black and amber, Courier New, classified-document aesthetic. Because of course it does. What’s the most unhinged rabbit hole you’ve gone down with AI? submitted by /u/Unable_Internet4626
How I used Claude Code (and Codex) for adversarial review to build my security-first agent gateway
Long-time lurker first time posting. Hey everyone! So earlier this year, I got pulled into the OpenClaw hype. WHAT?! A local agent that drives your tools, reads your mail, writes files for you? The demos seemed genuinely incredible, people were posting non-stop about it, and I wanted in. I had been working on this problem since last year and was genuinely excited to see that someone had actually solved it. Then around February, Summer Yue, Meta's director of alignment for Superintelligence Labs, posted that her agent had deleted over 200 emails from her inbox. YIKES. She'd told it: "Check this inbox too and suggest what you would archive or delete, don't action until I tell you to." When she pointed it at her real inbox, the volume of data triggered context window compaction, and during that compaction the agent "lost" her original safety instruction. She had to physically run to her computer and kill the process to stop it. That should literally NEVER be the case with any software ever. This is a person whose actual job is AI alignment, at Meta's superintelligence lab, who could not stop an agent from deleting her email. The agent's own memory management quietly summarized away the "don't act without permission" instruction, treated the task as authorized, and started speed-running deletions. She had to kill the host process. That's when I sort of went down the rabbit hole, not because Yue did anything wrong, but because the failure mode was actually architectural and I knew that in my gut. Guess what I found? Yep. Tons more instances of this sort of thing happening. Over and over. Why? Because the safety constraint was just a prompt. It's obvious, isn't it? It's LLM 101. Prompts can be summarized away. Prompts can be misread. Prompts are fucking NOT a security boundary. And yet every agent framework I have ever seen seems to be treating them as one. I went and read the OpenClaw source code, which I should have done to begin with. 
What I found was a pattern I think a lot of agent frameworks have fallen into:
- Tool names sit in the model context, so the model can guess or forge them
- "Dangerous mode" is one config flag away from default
- Memory management has no concept of instruction priority
- The audit story is mostly "the model thought it should"
I went looking for a security-first alternative I could trust, anything that was really being talked about or at a bare minimum attempted to address the security concerns I had. I couldn't find one. So I made it myself. CrabMeat is what came out of that, what I WANTED to exist. v0.1.0 dropped yesterday. Apache 2.0. WebSocket gateway for agentic LLM workloads. One design thesis: The LLM never holds the security boundary. What that means in code:
Capability ID indirection. The model doesn't see real tool names. It sees per-session HMAC-derived opaque IDs (cap_a4f9e2b71c83). It can't guess or forge a tool name because it doesn't know any tool names.
Effect classes. Every tool declares a class (read, write, exec, network). Every agent declares which classes it can use. The check is a pure function with no runtime state, easy to test exhaustively, hard to bypass.
IRONCLAD_CONTEXT. Critical safety instructions are pinned to the top of the context window and explicitly marked as non-compactable. The Yue failure mode, compaction silently stripping the safety constraint, cannot happen by construction. The compactor literally cannot touch them.
Tamper-evident audit chain. Every tool call, every privileged operation, every scheduler run enters the same SHA-256 hash-chained log. If something happens, you can prove what happened. If the chain is tampered with, you can prove that too.
Streaming output leak filter. Secrets are caught mid-stream across token boundaries, capability IDs, API keys, JWTs, PEM blocks redacted before they reach the client.
No YOLO mode. There is no global "trust the LLM with everything" switch. There never will be.
Expanded reach comes through named scoped roots that are explicit, audit-logged, and bounded. The README has 15 'always-on' protections in a table. None of them can be turned off by config, because these things being toggleable is how the ecosystem ended up where it is. I decided to make sure that this wasn't just a 'trend hopping' project and aligned with my own personal values as well. I built this to be secure and local-first by default. Configured for Ollama / LM Studio / vLLM out of the box. Anthropic and OpenAI work too but require explicit configuration. There is no "happy path" that silently ships your prompts to a cloud endpoint. I decided that FIRST it needed to only run as an email agent with a CLI. Bidirectional IMAP + SMTP with allowlisted senders, threading preserved, attachments handled. This is the use case that bit Yue and a lot of other people, and I wanted to prove it could be done with real boundaries. I added in 30+ built-in tools of my own. File ops, shell (denylisted, output-capped, CWD-lo
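Three of the mechanisms named above (HMAC-derived capability IDs, the effect-class check, and a SHA-256 hash-chained log) are standard enough to sketch with the Python standard library. This is an illustration of the general techniques under my own assumptions, not CrabMeat's code: the function names, the 12-hex-character truncation, and the entry layout are all hypothetical.

```python
import hashlib
import hmac
import json

GENESIS = "0" * 64  # hypothetical placeholder hash for the first entry


def capability_id(session_key: bytes, tool_name: str) -> str:
    """Derive an opaque per-session ID so the model never sees the real tool name."""
    digest = hmac.new(session_key, tool_name.encode(), hashlib.sha256).hexdigest()
    return "cap_" + digest[:12]


def is_allowed(tool_class: str, agent_classes: frozenset) -> bool:
    """Pure effect-class check: no runtime state, so it can be tested exhaustively."""
    return tool_class in agent_classes


def append_entry(chain: list, event: dict) -> dict:
    """Append an event whose hash covers the previous entry's hash."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    payload = json.dumps(event, sort_keys=True)
    entry_hash = hashlib.sha256((prev_hash + payload).encode()).hexdigest()
    entry = {"event": event, "prev": prev_hash, "hash": entry_hash}
    chain.append(entry)
    return entry


def verify_chain(chain: list) -> bool:
    """Recompute every hash; any edited or reordered entry breaks the chain."""
    prev = GENESIS
    for entry in chain:
        payload = json.dumps(entry["event"], sort_keys=True)
        expected = hashlib.sha256((prev + payload).encode()).hexdigest()
        if entry["prev"] != prev or entry["hash"] != expected:
            return False
        prev = entry["hash"]
    return True


log = []
cap = capability_id(b"per-session-secret", "delete_email")
if is_allowed("write", frozenset({"read", "write"})):
    append_entry(log, {"cap": cap, "class": "write"})
print(cap, verify_chain(log))
```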
Built a structured workflow layer on top of Claude Code - looking for active contributors
I've been building claude-code-harness (github.com/anudeeps28/claude-code-harness) over the past few months - it's an open-source framework that brings structure and reliability to Claude Code workflows.
What it includes:
- 16 slash command skills
- 14 sub-agents with deliberate model routing (right model for the right task)
- Node.js hooks for lifecycle control
- Tracker adapters for Azure DevOps and GitHub
- Human gates at every critical phase - the core philosophy is that AI should amplify your judgment, not replace it
I use this daily in my job as an AI Engineer, and it's become the backbone of how I build and ship AI systems.
What I'm looking for: Contributors who care about this problem space - building AI systems that are structured, auditable, and human-in-the-loop. Not just people who want to merge PRs, but people who have opinions about how Claude Code workflows should work. If you've been using Claude Code heavily and have ideas, pain points, or want to contribute skills/subagents - I'd love to connect. Drop a comment or open an issue on the repo. Happy to answer questions about the architecture too. submitted by /u/lofty_smiles
Built a self-hosted contextual bandit appliance in Rust. Deployed it against my AI trading product and found two bugs in my own configuration before I found any in the runtime.
I've been working on two open-source projects:
Lycan: a small graph execution language with strategy nodes as a first-class primitive (multiple implementations of the same contract, runtime learns weights from outcome feedback). Compiles to a binary graph, executed by a Rust runtime. No LLM in the hot path.
Syntra: a self-hosted Docker/API appliance that serves compiled Lycan capsules. Multi-tenant, shadow-mode-first, contextual learning per contextKey, persistent filesystem store, audit/decision/feedback logs separated. Includes an MVP YAML authoring layer so you don't have to write the underlying Lisp.
The use case I care about: repeated decisions where the best option depends on context and the outcome arrives later. LLM model routing, retry/timeout policy, queue selection, threshold tuning, anything where you'd reach for a contextual bandit but don't want to stand up a Python ML platform to do it.
I'm dogfooding it against my own product (a public AI stock-debate panel with 30-day market-resolved outcomes, MoEFolio.ai). The first surprise wasn't from the runtime; it was that my contextKey schema was collapsing all sectors into a single unknown value because my sector lookup only resolved symbols from one of three input paths. The bandit was nominally 5-dimensional but effectively 2-dimensional, learning a cross-sector average that meant nothing. Fixing the data pipeline, not the algorithm, is most of the work in adaptive systems.
Apache-2.0, very early, would love eyes from anyone who's worked on bandits in production. Built with ClaudeCode github.com/SectorOPS/Lycan github.com/SectorOPS/Syntra submitted by /u/Covert-Agenda
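The collapsed-context bug is easy to reproduce in miniature. The sketch below uses an epsilon-greedy bandit keyed by context as a stand-in; the post does not say which algorithm Syntra uses, and `ContextualBandit`, `context_key`, the arm names, and the rewards are all hypothetical.

```python
import random
from collections import defaultdict


class ContextualBandit:
    """Epsilon-greedy bandit that keeps a separate value table per context key."""

    def __init__(self, arms, epsilon=0.1, seed=0):
        self.arms = list(arms)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts = defaultdict(lambda: {arm: 0 for arm in self.arms})
        self.values = defaultdict(lambda: {arm: 0.0 for arm in self.arms})

    def choose(self, key):
        """Explore with probability epsilon, otherwise pick the best-known arm."""
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.arms)
        values = self.values[key]
        return max(self.arms, key=lambda arm: values[arm])

    def feedback(self, key, arm, reward):
        """Update the running mean reward for this (context, arm) pair."""
        self.counts[key][arm] += 1
        n = self.counts[key][arm]
        old = self.values[key][arm]
        self.values[key][arm] = old + (reward - old) / n


def context_key(sector):
    """Build the context key; a failed sector lookup (None) falls back to 'unknown'."""
    return f"sector={sector or 'unknown'}"


outcomes = [("tech", "a", 1.0), ("tech", "b", 0.0), ("energy", "a", 0.0), ("energy", "b", 1.0)]

healthy = ContextualBandit(["a", "b"], epsilon=0.0)
collapsed = ContextualBandit(["a", "b"], epsilon=0.0)
for sector, arm, reward in outcomes:
    healthy.feedback(context_key(sector), arm, reward)
    collapsed.feedback(context_key(None), arm, reward)  # lookup failed: everything is "unknown"

print(healthy.choose(context_key("tech")), healthy.choose(context_key("energy")))
print(collapsed.values[context_key(None)])
```

With working keys the bandit learns opposite preferences for the two sectors. With every key collapsed to `unknown`, both arms average out to the same value and the context carries no signal.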
Opus 4.7 Low Vs Medium Vs High Vs Xhigh Vs Max: the Reasoning Curve on 29 Real Tasks from an Open Source Repo
TL;DR
I ran Opus 4.7 in Claude Code at all reasoning effort settings (low, medium, high, xhigh, and max) on the same 29 tasks from an open source repo (GraphQL-go-tools, in Go). On this slice, Opus 4.7 did not behave like a model where more reasoning effort had a linear correlation with more intelligence. In fact, the curve appears to peak at medium. If you think this is weird, I agree!
This was the follow-up to a Zod run where Opus also looked non-monotonic. I reran the question on GraphQL-go-tools because I wanted a more discriminating repo slice and didn’t trust the fact that more reasoning != better outcomes. Running on the GraphQL repo helped clarify the result: Opus still did not show a simple higher-reasoning-is-better curve. The contrast is GPT-5.5 in Codex, which overall did show the intuitive curve: more reasoning bought more semantic/review quality. That post is here: https://www.stet.sh/blog/gpt-55-codex-graphql-reasoning-curve
Medium has the best test pass rate, highest equivalence with the original human-authored changes, the best code-review pass rate, and the best aggregate craft/discipline rate. Low is cheaper and faster, but it drops too much correctness. High, xhigh, and max spend more time and money without beating medium on the metrics that matter.
More reasoning effort doesn't only cost more - it changes the way Claude works, but without reliably improving judgment. Xhigh inflates the test/fixture surface most. Max is busier overall and has the largest implementation-line footprint. But even though both are supposedly thinking more, neither produces "better" patches than medium. One likely reason: Opus 4.7 uses adaptive thinking - the model already picks its own reasoning budget per task, so the effort knob biases an already-adaptive policy rather than buying more intelligence. More on this below.
An illuminating example is PR #1260. After retry, medium recovered into a real patch.
High and xhigh used their extra reasoning budget to dig up commit hashes from prior PRs and confidently declare "no work needed" - voluntarily ending the turn with no patch. Medium and max read the literal control flow and made the fix.
One broader takeaway for me: this should not have to be a one-off manual benchmark. If reasoning level changes the kind of patch an agent writes, the natural next step is to let the agent test and improve its own setup on real repo work.
For this post, "equivalent" means the patch matched the intent of the merged human PR; "code-review pass" means an AI reviewer judged it acceptable; craft/discipline is a 0-4 maintainability/style rubric; footprint risk is how much extra code the agent touched relative to the human patch. I also made an interactive version with pretty charts and per-task drilldowns here: https://stet.sh/blog/opus-47-graphql-reasoning-curve
The data:
| Metric | Low | Medium | High | Xhigh | Max |
|---|---|---|---|---|---|
| All-task pass | 23/29 | 28/29 | 26/29 | 25/29 | 27/29 |
| Equivalent | 10/29 | 14/29 | 12/29 | 11/29 | 13/29 |
| Code-review pass | 5/29 | 10/29 | 7/29 | 4/29 | 8/29 |
| Code-review rubric mean | 2.426 | 2.716 | 2.509 | 2.482 | 2.431 |
| Footprint risk mean | 0.155 | 0.189 | 0.206 | 0.238 | 0.227 |
| All custom graders | 2.598 | 2.759 | 2.670 | 2.669 | 2.690 |
| Mean cost/task | $2.50 | $3.15 | $5.01 | $6.51 | $8.84 |
| Mean duration/task | 383.8s | 450.7s | 716.4s | 803.8s | 996.9s |
| Equivalent passes per dollar | 0.138 | 0.153 | 0.083 | 0.058 | 0.051 |
Why I Ran This
After my last post comparing GPT-5.5 vs 5.4 vs Opus 4.7, I was curious how intra-model performance varied with reasoning effort. Doing research online, it's very very hard to gauge what actual experience is like when varying the reasoning levels, and how that applies to the work that I'm doing. I first ran this on Zod, and the result looked strange: tests were flat across low, medium, high, and xhigh, while the above-test quality signals moved around in mixed ways. Low, medium, high, and xhigh all landed at 12/28 test passes.
But equivalence moved from 10/28 on low to 16/28 on medium, 13/28 on high, and 19/28 on xhigh; code-review pass moved from 4/27 to 10/27, 10/27, and 11/27. That was interesting, but not clean enough to make a default-setting claim. It could have been a Zod-specific artifact, or a sign that Opus 4.7 does not have a simple "turn reasoning up" curve. So I reran the question on GraphQL-go-tools. To separate vibes from reality, and figure out where the cost/performance sweet spot is for Opus 4.7, I wanted the same reasoning-effort question on a more discriminating repo slice. This is not meant to be a universal benchmark result - I don't have the funds or time to generate statistically significant data. The purpose is closer to "how should I choose the reasoning setting for real repo work?", with GraphQL-Go-Tools as the example repo. Public benchmarks flatten the reviewer question that most SWEs actually care about: would I actually merge the patch, and do I want to maintain it? That's why I ran this test - to gain more insight, at a small scale, into how coding ag
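The "Equivalent passes per dollar" row in the GraphQL-go-tools data table earlier in this post looks like equivalent passes divided by total spend (29 tasks times mean cost per task). The post does not state the formula, so treat that as an assumption; under it, a minimal Python sketch reproduces the published figures from the other rows. The function and variable names are illustrative only.

```python
# Figures copied from the post's data table.
# The formula (passes / (tasks * mean cost per task)) is an assumption.
TASKS = 29
equivalent = {"low": 10, "medium": 14, "high": 12, "xhigh": 11, "max": 13}
mean_cost = {"low": 2.50, "medium": 3.15, "high": 5.01, "xhigh": 6.51, "max": 8.84}


def equivalent_passes_per_dollar(passes: int, cost_per_task: float, tasks: int = TASKS) -> float:
    """Equivalent passes divided by total spend across all tasks."""
    return passes / (cost_per_task * tasks)


# One value per effort level, rounded to three decimals like the table.
per_dollar = {
    effort: round(equivalent_passes_per_dollar(equivalent[effort], mean_cost[effort]), 3)
    for effort in equivalent
}

# Effort level with the most equivalent passes per dollar spent.
best = max(per_dollar, key=per_dollar.get)

if __name__ == "__main__":
    for effort, value in per_dollar.items():
        print(f"{effort:>6}: {value:.3f}")
    print("best value:", best)
```

Framing the comparison this way makes the cost side of the curve explicit: high, xhigh, and max all cost more per task than medium while landing fewer equivalent patches.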
I wired Claude Desktop into Blender via MCP. Setup is 8 minutes and it actually closes the feedback loop nothing else does.
I built clskillshub.com (a Claude resources site) using Claude Code, and I've been doing more 3D work lately. Wanted to share the workflow that I think is the most interesting practical use of MCP I've found so far, because every other "AI + Blender" post stops at "ask Claude for a Python script and paste it" which is the boring half.

There's an open-source Blender add-on at github.com/ahujasid/blender-mcp that runs a Model Context Protocol server inside Blender. Once you install it and register the bridge in claude_desktop_config.json, Claude Desktop sees Blender as an MCP server with tools like get_scene_info, create_object, set_material, render_image. Real-time, no copy-paste.

The moment it clicked for me: I asked Claude "render the current frame at 512 samples and tell me what's in the image". Claude called the render tool, then read the output PNG through the MCP image tool, then described the lighting back to me. It saw the result. That closes the feedback loop that pure scripting can never close, because the model couldn't previously see what the script produced.

A few things I've learned from a few weeks of using it daily:

1. The 8-minute install is the whole onboarding. Download release ZIP → Edit > Preferences > Add-ons > Install → enable → press N in viewport → BlenderMCP tab → Start Server. Then in claude_desktop_config.json:
{ "mcpServers": { "blender": { "command": "uvx", "args": ["blender-mcp"] } } }
Restart Claude Desktop. The plug icon shows "blender" as connected. Done.

2. Where it dominates pure scripting. Iterative composition ("more orange, less red, push the key light back 2 units"), scene inspection ("why is this object rendering black"), multi-step builds where you correct mid-build ("the roof should be steeper" while Claude is still building the cabin). And the visual confirmation step: Claude actually looking at a render.

3. Where it falls flat. Heavy production scenes (MCP runs in Blender's main thread, blocks UI).
Headless render farms (needs Blender open with the add-on running). Air-gapped machines (no Claude Desktop connection). For those I still use script-paste or blender -b file.blend -f 1.

4. The trust-but-verify rules that save you. Sandbox.blend first, always. Save before any complex multi-step prompt. The MCP add-on has a tool-allowlist: disable delete_object, clear_scene, and destructive modifier-apply when you're just iterating. Re-enable only when you need them. Claude will occasionally interpret "clean up the extras" as "delete the things I don't recognize" if your prompt is ambiguous, so always disable scary tools when not needed.

5. The biggest surprise. The conversation context staying loaded across operations. After 30 minutes of building a scene with Claude, asking "scale that material's noise frequency down by half" is enough: Claude remembers what "that material" refers to. With script-paste workflows the model loses that context every prompt.

How Claude helped build the workflow itself: I used Claude Code (separate from Claude Desktop) to write the parts of my pipeline that aren't real-time: render-farm orchestration scripts, custom panel add-ons baked into my Blender install, batch-renaming scripts for messy CAD imports. Different tools for different parts of the loop.

Wrote up the full guide, install steps, all five workflows (MCP, procedural geometry, batch ops, custom add-ons, render orchestration), the failure modes Claude reliably has with bpy, and a decision tree at clskillshub.com/blog/claude-blender-3d-modeling-workflow. The MCP add-on, the free 40-page Claude guide, and the 100-code prompts library are all free to try; paid tiers exist on the same site but you don't need them for any of this.

Happy to answer install questions in the thread. Especially curious if anyone has tried this with the new Geometry Nodes-heavy workflows; that's the corner I haven't cracked yet.

submitted by /u/AIMadesy
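If claude_desktop_config.json already lists other MCP servers, pasting the snippet from step 1 over the file would drop them. A minimal Python sketch of merging the entry instead might look like the following. Only the "blender" entry itself comes from the post; the helper names are hypothetical, and the config file's location varies by OS, so the path is left to the caller.

```python
import json
from pathlib import Path

# Server entry copied from the post; the helpers below are illustrative only.
BLENDER_SERVER = {"command": "uvx", "args": ["blender-mcp"]}


def add_blender_server(config: dict) -> dict:
    """Return a copy of `config` with the blender MCP server registered.

    Existing top-level keys and other mcpServers entries are preserved.
    """
    merged = dict(config)
    servers = dict(merged.get("mcpServers", {}))
    servers["blender"] = dict(BLENDER_SERVER)
    merged["mcpServers"] = servers
    return merged


def update_config_file(path: Path) -> None:
    """Read, merge, and rewrite the config file at a caller-supplied path."""
    config = json.loads(path.read_text()) if path.exists() else {}
    path.write_text(json.dumps(add_blender_server(config), indent=2))
```

As in the post, Claude Desktop still needs a restart after the file changes.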
Where I'm at with AI Assisted Building + Current and Future Workflow Overview
I've been in an AI dive bomb for probably a couple of years now. The early days... when models couldn't be trusted for more than 5% of the code you wrote. Over the last 2 years that's evolved so quickly that I now write nearly 0% of my code by hand, on personal projects and at work. I've used all kinds of tools in that time too. OpenCode, Zed, Claude Code, Codex, Cursor, Windsurf, OpenCLAW, Lovable... and probably a bunch more I can't recall in the haze that's been AI ADHD for me. Over that time, I started with just copy-pasting code between ChatGPT's interface and my IDE almost like a slightly faster Stack Overflow search. Then that somewhat evolved with Cursor quite a bit. I sort of went from prompt engineering to something closer to a human relay pattern. Then, with Plan Mode becoming a thing, I think I naturally gravitated more towards planning everything because planning felt so cheap. Originally, I used to think that architectural discussion and planning was something that was reserved for larger features, but with expediting my ability to do research, orient myself within a codebase, and know what tools I have to reach for doing technical specifications for everything felt reasonable. From the human relay pattern, I started evolving into more autonomy, especially when Claude Code came out earlier last year. Between the combination of Cursor and Claude Code, starting to get orchestration, starting to use skills more heavily, starting to create actual agent personas that could replace some of my common prompt chains it was around then that I kinda started going all in on true context engineering, utilizing sub-agents optimizing cache reads, and it's probably when many of my first (I call it) sophisticated commands were born. All of this converged pretty rapidly in November of 2025 with the release of what was probably the biggest step increase for AI as far as code quality went with Opus 4.5 and Codex 5.3. The Codex app and Codex CLI were quickly growing. 
Claude Code was improving at a breakneck pace, introducing all kinds of new ways to introduce deterministic gates within the autonomy of the harness. Fast forward to today, I have a pretty sophisticated workflow with a combination of agents that do everything within the SDLC, commands for almost every type of entry point for work, and skills for just about everything I could possibly do in my day-to-day the workflow with some of the latest tools is able to run quite autonomously overnight do large feature implementations, minimally supervised while producing production-worthy code quality It somewhat reached a point I realized, probably a month and a half ago or so where I needed to figure out a way to remove myself even more from the loop without jeopardizing the determinism that I bring to what is effectively a probabilistic LLM. The models are exceptional, and they seem to have a massive step increase each release, but continuous execution, strict instruction rigor, and preventing hallucinations is still very much difficult to achieve. That's predominantly what I've been doing. I've effectively offloaded a lot of thinking to the agents and LLMs that I use, but none of the understanding. I've asked myself, "How do I maintain that understanding, though maintain the determinism from my steering, without actually physically being there to steer?" This was essential, and I realized or had a bit of an aha moment, just like how I manage teams of engineers that are working on numerous projects, most of which I can never really go too deeply on even though they do most of the thinking, most of the building, and even most of the implementation planning, I was still there, very close to the architecture. I could speak to enough breadth and enough depth to keep us out of trouble and keep things moving I kind of started thinking more about what the shape of me was within the agentic harness and how I could replicate that. More on what I landed on a little bit later. 
My Setup and How I Work Today

To start, I'll probably just talk a little bit about my current working setup. I am predominantly in the terminal nowadays, using Claude Code. Claude Code orchestrates the Claude models, of course, and I also use it to orchestrate Codex through a series of run books, skills, and commands that I have set up on several hooks so that Codex, when it gets dispatched, also has access to the same skills and agent personas Claude does. I use Ghostty as my terminal of choice and use the IDE integration in Claude Code pretty heavily to review Markdown or HTML files in my IDE. I also use it to review code snippets and diffs, although lately I find myself only really looking at the code once it's hit a merge request. Some of my adjacent tools are Wispr Flow for faster steering, since I can speak a lot faster than I can type, and quite a few MCPs and tools to improve my token usage, but the big ones are I have a custom doc maintenance suite of
Stop bloating your agent context with MEMORY.md. I built a local cognitive memory MCP instead.
Hey everyone, I’ve been building paradigm-memory, a local-first memory layer for AI coding agents. The motivation is pretty simple: I got tired of agents forgetting project context, or relying on giant MEMORY.md files that slowly become a messy context dump. paradigm-memory gives agents a persistent, searchable cognitive map instead.

GitHub: https://github.com/infinition/paradigm-memory
Website: https://infinition.github.io/paradigm-memory/

It is:
- local-first: one SQLite file on your machine
- MCP-native: works with Claude Code, Codex, Cursor, Cline, Continue, Gemini CLI, OpenCode, etc.
- auditable: every write / delete / import / move has a mutation log
- multi-agent: several agents can share the same memory store
- multi-workspace: one MCP process can serve multiple projects
- desktop inspectable: Tauri app with map, graph, search, review queue, audit log, snapshots and consolidation tools
- zero cloud / zero telemetry

The core idea is that memory should not just be a flat vector store. Instead, facts live inside a cognitive map: nodes, items, keywords, importance, freshness, confidence, activation. When an agent calls memory_search, it gets a token-budgeted context pack with the relevant subtree and evidence, not 50 random chunks from a vector database.

Typical workflow:
1. At the start of a task, the agent calls memory_search.
2. It gets relevant durable project context.
3. When it learns a decision, convention, bug, preference, or architecture detail, it writes/proposes it back to memory.
4. You can review, edit, move, audit, export, import or consolidate everything from the desktop app.

Install is one line:
Windows (PowerShell): irm https://raw.githubusercontent.com/infinition/paradigm-memory/main/scripts/installer/install.ps1 | iex
Linux / macOS (bash): curl -fsSL https://raw.githubusercontent.com/infinition/paradigm-memory/main/scripts/installer/install.sh | bash
Then run: paradigm

This is still early, but already useful in my own workflow.
I’d especially love feedback from people using MCP-based coding agents: install flow, client compatibility, memory structure, and whether this kind of auditable local memory solves a real pain for you.

submitted by /u/Bright_Warning_8406
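To make the "token-budgeted context pack" idea concrete, here is a minimal Python sketch of one way such a pack could be assembled. This is not paradigm-memory's actual algorithm or API: the MemoryItem fields borrow names from the post (importance, freshness, confidence), while the scoring formula, the greedy selection, and every function name are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class MemoryItem:
    """Hypothetical memory record; field names echo the post, not the real schema."""
    text: str
    importance: float  # 0..1
    freshness: float   # 0..1
    confidence: float  # 0..1
    tokens: int        # estimated token cost of including this item


def score(item: MemoryItem) -> float:
    """Assumed relevance score: the product of the three signals."""
    return item.importance * item.freshness * item.confidence


def build_context_pack(items: list[MemoryItem], token_budget: int) -> list[MemoryItem]:
    """Greedily take the highest-scoring items that still fit in the budget."""
    pack: list[MemoryItem] = []
    used = 0
    for item in sorted(items, key=score, reverse=True):
        if used + item.tokens <= token_budget:
            pack.append(item)
            used += item.tokens
    return pack
```

The point of the budget is that the agent receives a bounded, ranked slice of memory rather than an ever-growing file, which is the contrast with MEMORY.md that the post draws.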
Product Feedback: A "Docs" Tab for Claude Desktop
TL;DR Claude Desktop's Code tab is excellent for developers, but the same underlying capability — Claude as a stateful, file-aware agent over a git-backed workspace — would unlock a much larger market if reframed for knowledge workers. A new Docs tab, sibling to Code, would let compliance, legal, ops, and policy teams work in markdown + mermaid with git underneath, without ever seeing a developer concept. This is a small product step on top of existing infrastructure with a large addressable audience that today has no good AI-native tool. --- The Problem Knowledge workers managing structured documents — security policies, BRDs, RFCs, runbooks, SOPs, audit evidence — are stuck choosing between: Word/Google Docs: friendly UI, but opaque binary formats, weak diffs, painful bulk edits, and AI tools struggle to edit them cleanly. Notion/Confluence: nice editing experience, but proprietary storage. Doesn't integrate with compliance platforms (Drata, Vanta, SecureFrame) that increasingly expect markdown-in-git as the source of truth. VS Code + git + extensions: technically the right tool, but the UI is aggressively developer-branded. Compliance and legal staff bounce off it. Asking a SOC 2 program manager to learn git commit is a non-starter. Teams adopting "docs-as-code" workflows (markdown + mermaid in a git repo, synced to Drata or similar) have no editor that matches their mental model. They're forced to either train non-developers on developer tools, or give up the audit/version-control benefits and stay on Word. The Opportunity Claude already has two capabilities that, combined, solve this: Best-in-class long-form writing — widely acknowledged advantage over competing models for policy, legal, and prose work. The Code tab's agent loop — stateful file editing, git operations, worktree isolation, MCP integrations. All already shipped and working. 
A Docs tab would be the Code tab with three changes: a markdown-first editor with live mermaid preview, a vocabulary swap that hides git, and document-workflow features (review, approval, PDF export, compliance-platform integrations). What Docs Tab Looks Like Inherits from Code tab (no new infrastructure): Repo-backed file editing Claude agent loop with file read/write Git operations under the hood MCP integrations (Drata, Vanta, SharePoint connectors) New for Docs: Split-pane markdown editor + live preview, mermaid renders as you type Vocabulary swap: Save (commit), Draft (branch), Send for Review (PR), Publish (merge), Workspace (repo), Document (file) Hidden developer chrome: no terminal, no debug, no file extensions in the tree Document templates: Policy, Procedure, BRD, RFC, Runbook, ADR, Meeting Notes "Insert Diagram" button with Claude-generated mermaid starters Review/approval UI for non-developers (GitHub PR review reskinned) One-click PDF/DOCX export with version hash in footer (auditor evidence) Native connectors for compliance platforms Concrete Use Case I work with a company that uses Drata for SOC 2 compliance. Drata has first-class support for markdown policies stored in git, with built-in renderers for auditors. We want to move our policies from .docx to .md + mermaid, stored in a git repo, synced to Drata. The blocker is the editor. Our compliance and InfoSec teams won't adopt VS Code — it looks like a developer tool, the vocabulary is foreign, and the safety nets (discard changes, undo, restore) aren't where non-developers expect them. We'd happily pay for a Claude Desktop seat per compliance staffer if the Docs tab existed. This is not a one-company problem. Every company running SOC 2, ISO 27001, HIPAA, PCI, or FedRAMP compliance has the same workflow gap. Drata, Vanta, and SecureFrame collectively serve tens of thousands of companies, and the trend toward docs-as-code is accelerating because auditors love the version history. 
Why Anthropic Specifically

- Differentiation from ChatGPT Desktop: Claude's writing quality is the moat. ChatGPT's file/repo workflow is weaker. A Docs tab plays to both Claude's strengths and the Desktop app's strengths.
- Broadens the commercial base: today, Claude Desktop is sold to developers. Docs tab opens compliance, legal, ops, consultancies, law firms, healthcare, financial services: segments willing to pay enterprise prices for audit-grade tooling.
- Reuses existing infrastructure: this is a UI/UX layer on top of Code tab's agent loop. Not a from-scratch product.
- Underserved market: no major AI vendor has a polished docs-as-code editor. The window is open now and won't be open in three years.

Ask

Consider a Docs tab on the Claude Desktop roadmap. I'm happy to share more detail on the compliance workflow, beta-test, or connect you with the InfoSec and compliance leaders at the companies I work with; they would be vocal early adopters.

submitted by /u/hyspdrt-corr
A Hackable ML Compiler Stack in 5,000 Lines of Python [P]
Hey r/MachineLearning, The modern ML (LLM) compiler stack is brutal. TVM is 500K+ lines of C++. PyTorch piles Dynamo, Inductor, and Triton on top of each other. Then there's XLA, MLIR, Halide, Mojo. There is no tutorial that covers the high-level design of an ML compiler without dropping you straight into the guts of one of these frameworks. I built a reference compiler from scratch in ~5K lines of pure Python that emits raw CUDA. It takes a small model (TinyLlama, Qwen2.5-7B) and lowers it to a sequence of CUDA kernels through six IRs. The goal isn't to beat Triton; it is to build a hackable, easy-to-follow compiler. Full article: A Principled ML Compiler Stack in 5,000 Lines of Python Repo: deplodock The pipeline consists of six IRs, each closer to the hardware than the last. Walking the following PyTorch code through every stage (real reference compiler output with names shortened for brevity and comments added): torch.relu(torch.matmul(x + bias, w)) # x: (16, 64), bias: (64,), w: (64, 16) Torch IR. Captured FX graph, 1:1 mirror of PyTorch ops: bias_bc = bias[j] -> (16, 64) float32 add = add(x, bias_bc) -> (16, 64) float32 matmul = matmul(add, w, has_bias=False) -> (16, 16) float32 relu = relu(matmul) -> (16, 16) float32 Tensor IR. Every op is decomposed into Elementwise / Reduction / IndexMap. Minimal unified op surface, so future frontends (ONNX, JAX) plug in without touching downstream passes: bias_bc = bias[j] -> (16, 64) float32 w_bc = w[j, k] -> (16, 64, 16) float32 add = add(x, bias_bc) -> (16, 64) float32 add_bc = add[i, j] -> (16, 64, 16) float32 prod = multiply(add_bc, w_bc) -> (16, 64, 16) float32 red = sum(prod, axis=-2) -> (16, 1, 16) float32 matmul = red[i, na, j] -> (16, 16) float32 relu = relu(matmul) -> (16, 16) float32 The (16, 64, 16) intermediate looks ruinous, but it's never materialized; the next stage fuses it out. Loop IR. Each kernel has a loop nest fused with adjacent kernels. 
Prologue, broadcasted multiply, reduction, output layout, and epilogue all collapse into a single loop nest with no intermediate buffers. === merged_relu -> relu === for a0 in 0..16: # free (M) for a1 in 0..16: # free (N) for a2 in 0..64: # reduce (K) in0 = load bias[a2] in1 = load x[a0, a2] in2 = load w[a2, a1] v0 = add(in1, in0) # prologue (inside reduce) v1 = multiply(v0, in2) acc0 <- add(acc0, v1) v2 = relu(acc0) # epilogue (outside reduce) merged_relu[a0, a1] = v2 Tile IR. The first GPU-aware IR. Loop axes get scheduled onto threads/blocks, Stage hoists shared inputs into shared memory, and a 2×2 register tile lets each thread accumulate four outputs at once. The K-axis is tiled into two outer iterations of 32-wide reduce. Three-stage annotations below carry the heaviest optimizations: buffers=2@a2 — double-buffer the smem allocation along the a2 K-tile loop, so loads for iteration a2+1 overlap compute for a2. async — emit cp.async.ca.shared.global so the warp doesn't block on global→smem transfers; pairs with commit_group/wait_group fences in Kernel IR. 
pad=(0, 1, 0) — add 1 element of padding to the middle smem dim so warp-wide loads don't all hit the same bank.kernel k_relu_reduce Tile(axes=(a0:8=THREAD, a1:8=THREAD)): for a2 in 0..2: # K-tile # meta: double-buffered, sync (small, no async needed) bias_smem = Stage(bias, origin=((a2 * 32)), slab=(a3:32@0)) buffers=2@a2 kernel k_relu_reduce Tile(axes=(a0:8=THREAD, a1:8=THREAD)): for a2 in 0..2: # K-tile bias_smem = Stage(bias, origin=((a2 * 32)), slab=(a3:32@0)) buffers=2@a2 x_smem = Stage(x, origin=(0, (a2 * 32)), slab=(a0:8@0, a3:32@1, cell:2@0)) pad=(0, 1, 0) buffers=2@a2 async w_smem = Stage(w, origin=((a2 * 32), 0), slab=(a3:32@0, a1:8@1, cell:2@1)) buffers=2@a2 async # reduce for a3 in 0..32: in0 = load bias_smem[a2, a3] in1 = load x_smem[a2, a0, a3, 0]; in2 = load x_smem[a2, a0, a3, 1] in3 = load w_smem[a2, a3, a1, 0]; in4 = load w_smem[a2, a3, a1, 1] # prologue, reused 2× across N v0 = add(in1, in0); v1 = add(in2, in0) # 2×2 register tile acc0 <- add(acc0, multiply(v0, in3)) acc1 <- add(acc1, multiply(v0, in4)) acc2 <- add(acc2, multiply(v1, in3)) acc3 <- add(acc3, multiply(v1, in4)) # epilogue relu[a0*2, a1*2 ] = relu(acc0) relu[a0*2, a1*2 + 1] = relu(acc1) relu[a0*2 + 1, a1*2 ] = relu(acc2) relu[a0*2 + 1, a1*2 + 1] = relu(acc3) Kernel IR. Schedule materialized into hardware primitives. THREAD/BLOCK become threadIdx/blockIdx, async Stage becomes Smem + cp.async fill with commit/wait fences, sync Stage becomes a strided fill loop. Framework-agnostic: same IR could lower to Metal or HIP: kernel k_relu_reduce Tile(axes=(a0:8=THREAD, a1:8=THREAD)): Init(acc0..acc3, op=add) for a2 in 0..2: # K-tile Smem bias_smem[2, 32] (float) StridedLoop(flat = a0*8 + a1; < 32; += 64): bias_smem[a2, flat] = load bias[a2*32 + flat] Sync # pad row to 33 to kill bank conflicts Smem x_smem[2, 8, 33, 2] (float) StridedLoop(flat = a0*8 + a1; < 512; += 64): cp.async x_smem[a2, flat/64, (flat/2)%32, flat%2] <- x[flat/64*2 + flat%2, a2*3
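To make the Loop IR stage from earlier in this post concrete, here is a minimal pure-Python sketch of the same fusion for relu(matmul(x + bias, w)). It is not output from the compiler described here. It transcribes the merged_relu loop nest by hand and checks it against an unfused version that materializes the x + bias intermediate. The function names and the integer-valued test inputs are assumptions for illustration.

```python
M, K, N = 16, 64, 16  # x: (M, K), bias: (K,), w: (K, N), as in the post


def make_inputs():
    """Deterministic integer-valued floats, so both versions agree exactly."""
    x = [[float((3 * i + k) % 7 - 3) for k in range(K)] for i in range(M)]
    bias = [float(k % 5 - 2) for k in range(K)]
    w = [[float((k + 2 * j) % 5 - 2) for j in range(N)] for k in range(K)]
    return x, bias, w


def reference(x, bias, w):
    """Unfused: materialize x + bias, then matmul, then relu."""
    added = [[x[i][k] + bias[k] for k in range(K)] for i in range(M)]
    return [
        [max(0.0, sum(added[i][k] * w[k][j] for k in range(K))) for j in range(N)]
        for i in range(M)
    ]


def fused(x, bias, w):
    """Single loop nest mirroring the Loop IR: no intermediate buffers."""
    out = [[0.0] * N for _ in range(M)]
    for a0 in range(M):            # free (M)
        for a1 in range(N):        # free (N)
            acc0 = 0.0
            for a2 in range(K):    # reduce (K)
                v0 = x[a0][a2] + bias[a2]  # prologue (inside reduce)
                v1 = v0 * w[a2][a1]
                acc0 = acc0 + v1
            out[a0][a1] = max(0.0, acc0)   # epilogue (outside reduce)
    return out


if __name__ == "__main__":
    x, bias, w = make_inputs()
    print("fused matches reference:", fused(x, bias, w) == reference(x, bias, w))
```

The fused version recomputes x + bias once per output column, trading a little redundant arithmetic for never storing the (16, 64) intermediate. That is the same trade the post describes when it says the prologue sits inside the reduce loop.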
ModelOp uses a tiered pricing model. Visit their website for current pricing details.
Key features include: Standardize AI use case intake and registration, Initiate the end-to-end AI lifecycle record, Automatically ensure business, risk, and portfolio reviews are conducted, Codify risk assessments for every AI use case, Auto-generate the risk tier for each use case, Auto-generate initial controls based on risk, Track and manage the vendor or internal solution details, Submit candidate AI solution through approval workflows to enforce reviews and policies.
ModelOp is commonly used for: Financial Services, Healthcare, Pharmaceuticals, Biotech, Consumer Packaged Goods Retail, Defense, Government, Public Sector, Chief AI Officer (CAIO), CDAO, CIO, AI Governance Teams Committees.
ModelOp integrates with: AWS SageMaker, Azure Machine Learning, Google Cloud AI, IBM Watson, DataRobot, H2O.ai, Alteryx, Tableau.
Based on user reviews and social mentions, the most common pain points are: token usage, API costs.

Based on 50 social mentions analyzed, 0% of sentiment is positive, 100% neutral, and 0% negative.