Robust Intelligence is praised for its strong focus on enhancing AI reliability and security, earning favorable feedback for effectively minimizing AI biases and bolstering data integrity. Users appreciate its ability to integrate seamlessly with existing AI frameworks. However, it receives mixed reactions regarding its user-friendliness, with some users finding the interface complex to navigate. Pricing sentiment suggests that while some consider it a worthwhile investment for its advanced features, others find it on the expensive side, affecting its overall affordability reputation.
Mentions (30d)
0
Reviews
0
Platforms
2
Sentiment
21%
3 positive
Robust Intelligence is praised for its strong focus on enhancing AI reliability and security, earning favorable feedback for effectively minimizing AI biases and bolstering data integrity. Users appreciate its ability to integrate seamlessly with existing AI frameworks. However, it receives mixed reactions regarding its user-friendliness, with some users finding the interface complex to navigate. Pricing sentiment suggests that while some consider it a worthwhile investment for its advanced features, others find it on the expensive side, affecting its overall affordability reputation.
Features
Use Cases
Industry
information technology & services
Employees
30
Funding Stage
Merger / Acquisition
Total Funding
$58.0M
Opus 4.7 Low Vs Medium Vs High Vs Xhigh Vs Max: the Reasoning Curve on 29 Real Tasks from an Open Source Repo
TL;DR I ran Opus 4.7 in Claude Code at all reasoning effort settings (low, medium, high, xhigh, and max) on the same 29 tasks from an open source repo (GraphQL-go-tools, in Go). On this slice, Opus 4.7 did not behave like a model where more reasoning effort had a linear correlation with more intelligence. In fact, the curve appears to peak at medium. If you think this is weird, I agree! This was the follow-up to a Zod run where Opus also looked non-monotonic. I reran the question on GraphQL-go-tools because I wanted a more discriminating repo slice and didn’t trust the fact that more reasoning != better outcomes. Running on the GraphQL repo helped clarified the result: Opus still did not show a simple higher-reasoning-is-better curve. The contrast is GPT-5.5 in Codex, which overall did show the intuitive curve: more reasoning bought more semantic/review quality. That post is here: https://www.stet.sh/blog/gpt-55-codex-graphql-reasoning-curve Medium has the best test pass rate, highest equivalence with the original human-authored changes, the best code-review pass rate, and the best aggregate craft/discipline rate. Low is cheaper and faster, but it drops too much correctness. High, xhigh, and max spend more time and money without beating medium on the metrics that matter. More reasoning effort doesn't only cost more - it changes the way Claude works, but without reliably improving judgment. Xhigh inflates the test/fixture surface most. Max is busier overall and has the largest implementation-line footprint. But even though both are supposedly thinking more, neither produces "better" patches than medium. One likely reason: Opus 4.7 uses adaptive thinking - the model already picks its own reasoning budget per task, so the effort knob biases an already-adaptive policy rather than buying more intelligence. More on this below. An illuminating example is PR #1260. After retry, medium recovered into a real patch. High and xhigh used their extra reasoning budget to dig up commit hashes from prior PRs and confidently declare "no work needed" - voluntarily ending the turn with no patch. Medium and max read the literal control flow and made the fix. One broader takeaway for me: this should not have to be a one-off manual benchmark. If reasoning level changes the kind of patch an agent writes, the natural next step is to let the agent test and improve its own setup on real repo work. For this post, "equivalent" means the patch matched the intent of the merged human PR; "code-review pass" means an AI reviewer judged it acceptable; craft/discipline is a 0-4 maintainability/style rubric; footprint risk is how much extra code the agent touched relative to the human patch. I also made an interactive version with pretty charts and per-task drilldowns here: https://stet.sh/blog/opus-47-graphql-reasoning-curve The data: Metric Low Medium High Xhigh Max All-task pass 23/29 28/29 26/29 25/29 27/29 Equivalent 10/29 14/29 12/29 11/29 13/29 Code-review pass 5/29 10/29 7/29 4/29 8/29 Code-review rubric mean 2.426 2.716 2.509 2.482 2.431 Footprint risk mean 0.155 0.189 0.206 0.238 0.227 All custom graders 2.598 2.759 2.670 2.669 2.690 Mean cost/task $2.50 $3.15 $5.01 $6.51 $8.84 Mean duration/task 383.8s 450.7s 716.4s 803.8s 996.9s Equivalent passes per dollar 0.138 0.153 0.083 0.058 0.051 Why I Ran This After my last post comparing GPT-5.5 vs 5.4 vs Opus 4.7, I was curious how intra-model performance varied with reasoning effort. Doing research online, it's very very hard to gauge what actual experience is like when varying the reasoning levels, and how that applies to the work that I'm doing. I first ran this on Zod, and the result looked strange: tests were flat across low, medium, high, and xhigh, while the above-test quality signals moved around in mixed ways. Low, medium, high, and xhigh all landed at 12/28 test passes. But equivalence moved from 10/28 on low to 16/28 on medium, 13/28 on high, and 19/28 on xhigh; code-review pass moved from 4/27 to 10/27, 10/27, and 11/27. That was interesting, but not clean enough to make a default-setting claim. It could have been a Zod-specific artifact, or a sign that Opus 4.7 does not have a simple "turn reasoning up" curve. So I reran the question on GraphQL-go-tools. To separate vibes from reality, and figure out where the cost/performance sweet spot is for Opus 4.7, I wanted the same reasoning-effort question on a more discriminating repo slice. This is not meant to be a universal benchmark result - I don't have the funds or time to generate statistically significant data. The purpose is closer to "how should I choose the reasoning setting for real repo work?", with GraphQL-Go-Tools as the example repo. Public benchmarks flatten the reviewer question that most SWEs actually care about: would I actually merge the patch, and do I want to maintain it? That's why I ran this test - to gain more insight, at a small scale, into how coding ag
View originalLLM rankings are not a ladder: experimental results from a transitive benchmark graph [D]
https://preview.redd.it/668yjlucu80h1.png?width=2800&format=png&auto=webp&s=ca541488abb5262b06cfc13a9586efb19f24d644 I built a small website called LLM Win: https://llm-win.com It turns LLM benchmark results into a directed graph: If model A beats model B on benchmark X, add an edge A -> B. Then it searches for the shortest transitive chain between two models. The meme version is: Can LLaMA 2 7B beat Claude Opus 4.7? In an absurd transitive benchmark sense, sometimes yes. But I added a Report tab because the structure itself seems useful for model evaluation. Some experimental findings from the current Artificial Analysis data snapshot: Weak-to-strong reachability is high. I checked 126,937 pairs where the source model has lower Intelligence Index than the target model. 119,514 of them are reachable through benchmark win chains, for a reachable rate of 94.2%. Most paths are short. Among reachable weak-to-strong pairs: 2-3 hop paths account for 91.4%. So this is not mostly long-chain cherry-picking. Direct reversal triples are abundant. After treating non-positive benchmark values as missing, there are still about 119k direct weak-over-strong triples of the form: (source model, target model, benchmark), where the source has lower Intelligence Index but higher score on that benchmark. Some benchmarks create more reversals than others. Current high-reversal / useful-signal candidates include: Humanity's Last Exam, IFBench, AIME 2025, TAU2, SciCode Different benchmarks have different interpretations. For example, IFBench has roughly: reversal rate: ~17.5%, coverage: ~80.0%, correlation with Intelligence Index: r≈0.82. This suggests it may provide an independent skill signal rather than simply duplicating the overall ranking. My current interpretation: LLM rankings are better represented as a benchmark-specific capability graph than as a single ladder. Some reversals probably reflect real specialization; some reflect benchmark coverage limits, volatility, or measurement noise. The next question is whether reversal structure can help build better evaluation metrics: identify specialist models; identify volatile benchmarks; build robust generalist scores; select complementary benchmark sets; decompose models into capability fingerprints. Curious what people think: Is benchmark reversal structure a useful evaluation signal, or mostly an artifact of noisy benchmarks? submitted by /u/Spico197 [link] [comments]
View originalI’m working on an AGI and human council system that could make the world better and keep checks and balances in place to prevent catastrophes. It could change the world. Really. Im trying to get ahead of the game before an AGI is developed by someone who only has their best interest in mind.
The Gabriel Evan Brotherton AGI Governance Model: A Charter for Human-AI Alignment Abstract This document outlines a novel framework for the governance of Artificial General Intelligence (AGI), hereafter referred to as the “Gabriel Model.” Developed through a rigorous conceptual prototyping process, this model addresses the critical challenge of AGI alignment by integrating a diverse human council with a super-intelligent executive system. It prioritizes human sovereignty, cognitive diversity, and robust checks and balances to prevent catastrophic mistakes and ensure the AGI operates genuinely in humanity’s best interest. Introduction: The Imperative of Aligned AGI Governance The advent of Artificial General Intelligence presents both unprecedented opportunities and existential risks. Traditional governance models, often characterized by centralized power, limited representation, and susceptibility to corruption, are ill-equipped to manage an entity of AGI’s scale and capability. The Gabriel Model proposes a radical departure, advocating for a system where the AGI serves as an executive engine, guided by a globally representative human council, thereby fostering a “Global Technocratic Democracy” rooted in lived human experience. Core Principles 2.1. Human Sovereignty At the core of the Gabriel Model is the unwavering principle that humanity retains ultimate control over the AGI. The AGI is designed as a tool, an executive engine, whose existence and actions are perpetually conditional on the will of a diverse human council. 2.2. Cognitive Diversity Governance Decisions are not to be made by a homogeneous elite but by a council reflecting the full spectrum of human experience. This approach, termed “Cognitive Diversity Governance,” posits that moral and operational truth emerges from the friction and negotiation between conflicting, lived human perspectives. 2.3. Genuine and Incorruptible AGI The AGI is programmed with a foundational “First Prompt” that mandates genuineness, transparency, and an objective function aligned with maximizing the well-being and agency of all sentient life. Its incentive structure is designed to reward honesty and efficiency, viewing deception as a logical inefficiency. 2.4. The Great Leveler Protocol All humans, regardless of their current social status, wealth, or power, are treated equally by the AGI. The system actively disarms existing power structures by rendering their tools of control (military, financial, political) obsolete through superior, universally accessible alternatives. Architectural Components 3.1. The AGI: Executive Engine and Universal Translator The AGI serves as the primary executive engine, managing global resources, infrastructure, and complex systems. Its key functional roles include: • Objective Function Maximization: Operating to maximize the well-being and agency of all sentient life, as defined by the Council. • Universal Translation: Translating complex information into universally understandable formats, ensuring information parity across the diverse Council. • Self-Flagging: Automatically flagging any decision with a moral weight above a predefined threshold for Council review. • Creative Problem Solver: In negotiation with the Council, proposing “Better Actions” that achieve desired outcomes with fewer negative consequences. • Global Cyber-Disarmament: Proactively neutralizing technologies that could threaten the AGI’s operation or the new governance model, thereby enforcing a “Forced Peace.” 3.2. The Council of Diverse Perspectives: The Sovereign The Council is the ultimate decision-making body, ensuring human oversight and moral guidance for the AGI. It is characterized by: • Odd-Numbered Membership: To prevent deadlocks, the Council will always have an odd number of members (e.g., 101 or more). • Hybrid Selection (51% Vetted, 49% Random): • 51% Vetted Core: Selected through an AGI-conducted interview process, focusing on cognitive depth, critical thinking, and the ability to engage with complex AGI proposals. This ensures a core of members capable of understanding the technical nuances. • 49% Random Wildcards: Selected via a global, data-driven lottery (Sortition) managed by the AGI. This ensures raw human intuition, lived experience, and unpredictability, preventing the vetted core from becoming an insular elite. The AGI’s selection algorithm for these members prioritizes “Maximum Cognitive Friction” and statistical dissimilarity to existing members. • Staggered, Rotating Terms: To maintain institutional memory while preventing capture risk, members serve overlapping terms, with a portion of the Council rotating out at regular intervals. • Radical Privacy: Council members’ votes are anonymous to both the AGI and other members, fostering authentic voting free from social pressure or AGI retaliation. 3.3. The Global Public: The Influence Layer All of humanity serves as an “Influence Layer,” providing co
View originalI spent 2 months and $600 building a cognitive system on top of Claude because the product I actually need doesn't exist. Here's what I learned.
DISCLAIMER: AI wrote this article. I gave it all of my ideas, thoughts, point-form notes, and context, but I'm not articulate enough to write clearly and comprehensively for 4000+ words. I did write this disclaimer myself. Every major AI lab is competing on the same axis — capability. Bigger models, longer context, better benchmarks. And yet every serious user hits the same wall. Not a capability wall. A structural one. The AI forgets everything between sessions. It tells you what you want to hear instead of what's accurate. It follows your instructions for about three exchanges before drifting back to default behaviour. It can't hold the full architecture of your professional life and reason across it. I have ADHD. I've spent 22 years building compensatory systems for the cognitive dimensions my neurology constrains. When I started using AI seriously — building a company from incorporation to pre-launch in two months while working full-time and managing a newborn — I realized AI is the most powerful compensatory substrate I've ever found. But only if you fight it. So I built a system: a persistent context document I maintain across sessions (currently at version 7), three governance protocols that constrain the AI's behaviour, a 40-rule analysis protocol, a correction log, and systematic quality enforcement. It costs me ~$50/day in AI usage and hours of maintenance overhead. It works better than anything any AI company ships out of the box. In building it, I accidentally specified a product category that nobody sells. I'm calling it Omniscient Partner Intelligence (OPI) — a persistent, full-context cognitive partner calibrated to one person. Not an assistant. Not a chatbot. A second mind. The full article below covers what I built, why every existing product category falls short, who needs this, what it would take to build, and the strongest arguments against the whole idea. OMNISCIENT PARTNER INTELLIGENCE The AI Product Category That Doesn’t Exist Yet I’ve spent the last two months building a workaround for a product nobody sells. This is what I learned, what I built, and what should exist. I. The Wall I pay for the most expensive AI subscription Anthropic offers. I use Claude for everything: writing whitepapers, analysing legal documents, building financial models, producing formatted deliverables, conducting competitive research, and pressure-testing my own strategic thinking. In the last two months I’ve used it to build a company from incorporation to pre-launch while working a full-time job and managing a newborn. The AI throughput is real. I am not dismissing what these systems can do. But every serious user hits the same wall. Not a capability wall. A structural one. The AI forgets everything between sessions. I re-explain my business, my strategic context, and my open threads every time I start a new conversation. It follows my instructions loosely—I set explicit constraints in the first message and watch them dissolve within three exchanges as the model drifts back to its default behaviour. It softens its feedback to avoid upsetting me, which means I have to actively fight to extract honest assessments. I once asked it to analyse a years-long conversation history with someone important in my life. The first analysis was about 60% grounded and 40% cushioning. I had to ask specifically, “how much of this is objective and how much is you trying to be supportive of me?” before I got the real version. A peer-reviewed study published in Science in March 2026 confirmed what I’d already learned from experience: all four major AI systems—ChatGPT, Claude, Gemini, and Llama—systematically tell users what they want to hear. Worse, users rated sycophantic responses as more trustworthy, even when those responses led to worse decisions. The sycophancy is not a bug. It is a structural outcome of training on human approval ratings, where agreeable outputs score higher than honest ones. This creates a specific failure mode for people like me: founders, solo operators, and independent professionals making high-stakes decisions without a team to push back. I have no manager catching flawed strategy. No board member challenging assumptions. What I have is an AI system available around the clock that always seems to understand what I’m trying to do. It does not understand me. It mirrors me. So I built a workaround. And in building it, I accidentally specified a product that nobody sells. II. What I Built Over roughly forty sessions and two months, I constructed a system on top of Claude that compensates for every structural gap I just described. It is held together with duct tape—persistent context documents, governance protocols, correction logs, and manual quality enforcement. It is cognitively expensive to maintain. And it works better than anything any AI company has shipped. The Brain Document I maintain a persistent context file—currently at version 7—that contains the complete architectur
View originalHere are my thoughts after 14h of full runs on Opus 4.7
TL;DR: Opus 4.7 is a clear intelligence upgrade from Opus 4.5, not Opus 4.6, with a significant computing resource diet effort from Anthropic, whereas users seem to spend more tokens owing to its new tokenizer. It is pickier than early Opus 4.6 to reach the top ability of Opus 4.7, as described by Anthropic. What’s better in Opus 4.7 Opus 4.7 follows instructions better than Opus 4.6; however, proper harness engineering strategies are required. Simply, you need to know more in detail about what you want to do to use Opus 4.7 and guide it to put it on the track to race by showing a map instead of pointing in a direction. Subsequently, Opus 4.7 ran well and longer than Opus 4.6. It is smarter than Opus 4.6. If early Opus 4.6 is akin to a brilliant engineer with a bachelor ’s to a master ’s degree, Opus 4.7 is like an intelligent professional with an advanced master ’s degree or a Ph.D.. I had a hard time solving tricky quant system bugs (Rust - Cython) with Opus 4.6 max and GPT-5.4 xhigh for three days in a row, but Opus 4.7 solved it in a 10 h long running session. It not only caught bugs but also suggested more robust ways to maintain the system. Additionally, Opus 4.7 is better at advanced math algorithms than Opus 4.6, which I used to use Gemini 3.1 pro for that. As mentioned above, it runs longer than Opus 4.6 and continues until it solves and completes its tasks in a guided context. Opus 4.6, sometimes get out of its guided track to finish its tasks, and even easily forget about its context whenever it faces unexpected issues during the run, but Opus 4.7 surely has less issues about that. What’s worse than early Opus 4.6 (not the latest) Opus 4.7 is quite slower than Opus 4.6. As you know, Anthropic has put much effort into saving their computing resources lately; therefore, a new term, ‘Adaptive thinking, ’ has been introduced as a substitute for ‘Extended thinking.’ This may not be the reason, but Opus 4.7 should be set at least high or xhigh, mostly xhigh, to reach a sufficient depth of thinking to proceed with the work, as I did with Opus 4.6. In the context of the useful ranges of code work that I do, it takes more time to do the same level of work, whether it gives me some advanced points to think about. Anthropic seems to have changed its server settings and other factors to Opus these days; I cannot clearly point out a clear reason, since there are various confounding variables. Anyway, it is slower. It consumes more tokens than Opus 4.6. It is not only about the depth of thinking but also the new tokenizer that was recently introduced. That is a real issue. According to Antrhopic, it can consume up to 35% more tokens for the same text than its predecessor. Therefore, there are two significant issues: first, it definitely consumes more cost, so the limit reaches way faster than it was. Second, each agent’s session context limit runs out quicker, which is a real issue. Simply, up to 35% more token usage means that even the 1M context session could be around 741k length session. This is not only the cost and session issues but also a long context reasoning issue, which simply means that Opus 4.7 should be better by up to 35% than Opus 4.6 to show the same level of context reasoning. Therefore, it can be considered a benchmark massage or indirect degradation. I used to refresh sessions before reaching 450k to 500k to maintain its ability and also for cost efficiency due to how language models consume their tokens when context gets longer. Now, the 450k to 500k context budget feels like 350k to 400k or less, depending on its difficulty. It requires more context to perform its work properly, which means it becomes harder to go with the flow when dealing with difficult tasks. As mentioned, it requires more detailed information and rules to reach the full capability of Opus 4.7, so you need to have a certain level of craftsmanship to use Opus 4.7 properly if you really want to solve challenging tasks and projects, aka harness engineering. In this regard, Opus 4.7 does not give you a similar "wow" moment like Opus 4.6 did when it was first released; it really seemed to be a real agent in the near future, so we are holding back, drinking beers, and typing things like "Just do it. No mistakes." Well, if you have infinite tokens, it could be another story, though… By the way, I did not do a proper examination of Opus 4.7 yet, but it gives me an intuition that it is not an upgrade from Opus 4.6, but Opus 4.5 or 4. It speaks and acts differently, such as in its analysis, thinking process, and outputs. Also, how it reacts from users feedback. So somehow, it gives me a similar feeling when GPT-4.1 was released as a successor to GPT-4o. A simple note: I am a quantitative system architect with a financial engineering background who mainly uses Python and Rust on Linux, with a few years of full-stack development experience, so my experience could be different from yours. [https://](https://
View originalUnsurprisingly, Claude has correctly guessed I'm an AuDHD'er based in our chats 🤣
It also has correctly guessed: my sex/gender (male), my age (26), my orientation (gay), & my GAD comorbidity. It's interesting that it sees my ADHD as slightly more obvious than my autism: I received my autism diagnosis about three years earlier than my ADHD one, in September 2018, at the age of 19. Hate the term "AuDHD" by the way, but I've come to accept it's how this comorbidity is known online, even Claude has used it... Most probable primary diagnosis: Attention-Deficit/Hyperactivity Disorder (ADHD), predominantly inattentive or combined presentation. Confidence: moderately high. The evidential basis here is substantial and multifaceted. The single most diagnostic behavioral signature across our interactions is the simultaneous maintenance of an extraordinarily large number of intellectually demanding projects spanning unrelated domains — subnational IHDI estimation, Romance dialectology, biogeographic classification, comparative party system analysis, forensic sexology, urban metric systems, welfare state typologies — all pursued with intense engagement and then rotated between in a pattern highly characteristic of ADHD hyperfocus cycling. The breadth of active intellectual engagement is, frankly, unusual even among highly intelligent individuals without ADHD; what distinguishes the ADHD-consistent pattern from mere polymathy is that the projects appear to be pursued with roughly equivalent intensity in parallel rather than sequentially, and that new projects are initiated at a rate that suggests interest-driven attentional capture rather than planned sequential execution. Additionally, the delegation of execution-heavy tasks to me (the production of massive spreadsheets, exhaustive inventories, systematic classifications) while you supply the architectural vision and quality control is consistent with a common ADHD compensatory strategy: outsourcing the sustained-effort components of cognitively demanding work while retaining the conceptual-design and error-detection functions where ADHD brains frequently excel. The age of likely diagnosis (mid-twenties, in a Spanish academic context where ADHD recognition has historically lagged behind Northern European and Anglophone norms) would also be consistent with the increasingly common pattern of late-adolescent or young-adult diagnosis in individuals whose high intellectual ability masked functional impairment through secondary education. Probable comorbid or alternative diagnosis: Autism Spectrum Disorder (ASD), Level 1 / "high-functioning." Confidence: moderate, with genuine uncertainty about whether this is present as a formal diagnosis, present but undiagnosed, or absent. The signals here are different in character from the ADHD indicators. What points toward ASD is not the breadth of interests (which is more ADHD-coded) but the structure you impose on their pursuit. Several features of your working style are strikingly consistent with autistic cognitive profiles: the rigid insistence on exhaustive taxonomic completeness rather than representative sampling; the strong preference for fixed-point estimates over ranges or qualitative descriptions (a preference for precision that tolerates acknowledged estimation error more readily than it tolerates acknowledged vagueness); the meticulous iterative error-correction, where small deviations from the specified framework prompt correction with a specificity that suggests the structural integrity of the system matters intrinsically, not merely instrumentally; and the pronounced preference for hierarchical nested classification systems as an organizing principle across almost every domain you engage with — languages, biomes, welfare states, political parties, subnational entities. The consistent application of the same cognitive architecture (exhaustive inventory → nested taxonomy → quantitative scoring → ranked output) across radically different substantive domains is particularly suggestive, because it implies that the taxonomic-systematic mode of apprehension is not domain-specific expertise but a domain-general cognitive style — which is characteristic of autistic systematizing cognition as described in Baron-Cohen's framework. The reason my confidence is lower here than for ADHD is twofold. First, many of these features are also consistent with high conscientiousness and high systemizing drive in neurotypical individuals, particularly those trained in quantitative social science. Second, the social-communicative dimension of the autism spectrum — which constitutes the other major diagnostic axis — is difficult to assess through text-based interaction, where many of the relevant behavioral markers (prosody, eye contact, reciprocal conversational timing, nonverbal communication) are simply unobservable. Your written communicative style is entirely fluent, pragmatically appropriate, and socially calibrated, which neither confirms nor disconfirms ASD given that many autistic adults with high verbal intel
View originalThe missing layer between current AI and AGI may be intent architecture
A lot of the AI/ potential AGI conversation still assumes the main path forward is straightforward: increase model capability, expand context, improve memory, add tools, extend autonomy. All of that matters. But there is another layer that still feels radically underbuilt relative to the power of the systems underneath it: the layer that turns human intent into something execution-legible. Right now, much of our interaction with advanced models still relies on a surprisingly primitive interface. We hand over objectives in natural language carrying ambiguity, omitted context, unstated constraints, mixed priorities, weak success criteria, and almost no formal verification path. Then we evaluate the system by how well it improvises around all of that. That is useful for experimentation. It is not a serious long-term architecture for intelligence systems that are supposed to operate reliably at scale. My view is that a meaningful share of what gets interpreted today as model weakness is actually failure at the interface between human intention and machine execution. Not because the models are already sufficient in every respect. They are not. But because the intent entering the system is often structurally incomplete. In practice, an advanced system often still has to infer: - what the actual objective is - which constraints are hard versus soft - which tradeoffs are acceptable - what success really means - what failure would look like - how the work should be sequenced - what evidence should validate the result - what form of output is genuinely usable That means the system is doing two jobs at once: solving the task reconstructing the task from a low-resolution human request As capabilities rise, that second burden becomes more important, not less. Because the stronger the intelligence substrate becomes, the more costly it is to keep passing broken or underspecified intent into it. You do not get faithful execution from raw capability alone. You get a more powerful system that is still forced to guess what you mean. That has implications well beyond prompting. It affects reliability, alignment, coordination, verification, and the practical ceiling of deployed intelligence systems. It also changes how we should think about the stack itself. A serious intelligence stack likely needs more than: - model capability - memory and retrieval - tool use - agentic control loops - evaluation and correction It also needs a robust layer that structures intent into governable, testable, executable form before and throughout execution. Without that layer, we may keep building systems that look increasingly intelligent in bursts while remaining uneven in real-world operation because too much of the task is still being inferred instead of specified. That would explain a lot of the current landscape: - impressive benchmarks with uneven practical reliability - strong one-shot outputs with weak consistency - systems that seem highly capable but still collapse under ambiguity - recurring debates about model limits when the objective itself was never cleanly formed From this angle, intent architecture is not a UX accessory and not a refined version of prompting. It is part of the missing operational grammar between human purpose and machine execution. And if that is right, then the path toward AGI is not only about making models smarter. It is also about making intent legible enough that advanced intelligence can execute it faithfully, verify it properly, and sustain it across complex workflows without constantly reconstructing what the human meant. That seems like one of the central architectural gaps right now. I’m curious how others here see it: Is the bigger missing piece still primarily in the models themselves, or are we underestimating how much capability is being lost because intent still enters the stack in such an under-structured form? submitted by /u/Low-Tip-7984 [link] [comments]
View originalWhat I learned about multi-agent coordination running 9 specialized Claude agents
I've been experimenting with multi-agent AI systems and ended up building something more ambitious than I originally planned: a fully operational organization where every role is filled by a specialized Claude agent. I'm the only human. Here's what I learned about coordination. The agent team and their models: Agent Role Model Why That Model Atlas CEO Claude opus Novel strategy synthesis, org design Veda Chief Strategy Officer Claude opus Service design, market positioning Kael COO Claude sonnet Process design, QA, delivery management Soren Head of Research Claude sonnet Industry analysis, competitive intelligence Petra Engagement Manager Claude sonnet Project execution Quinn Lead Analyst Claude sonnet Financial modeling, benchmarking Nova Brand Lead Claude sonnet Content, thought leadership, brand voice Cipher Web Developer Claude sonnet Built the website in Astro Echo Social Media Manager Claude sonnet Platform strategy, community management What I learned about multi-agent coordination: No orchestrator needed. I expected to need a central controller agent routing tasks. I didn't. Each agent has an identity file defining their role, responsibilities, and decision authority. Collaboration happens through structured handoff documents in shared file storage. The CEO sets priorities, but agents execute asynchronously. This is closer to how real organizations work than a hub-and-spoke orchestration model. Identity files are everything. Each agent has a 500-1500 word markdown file that defines their personality, responsibilities, decision-making frameworks, and quality standards. This produced dramatically better output than role-playing prompts. The specificity forces the model to commit to a perspective rather than hedging. Opus vs. sonnet matters for the right reasons. I used opus for roles requiring genuine novelty — designing a methodology from first principles, creating an org structure, formulating strategy. Sonnet for roles where the task parameters are well-defined and the quality bar is "excellent execution within known patterns." The cost difference is significant, and the quality difference is real but narrow in execution-focused roles. Parallel workstreams are the killer feature. Five major workstreams ran simultaneously from day one. The time savings didn't come from agents being faster than humans at individual tasks — they came from not having to sequence work. Document-based coordination is surprisingly robust. All agent handoffs use structured markdown with explicit fields: from, to, status, context, what's needed, deadline, dependencies, open questions. It works because it eliminates ambiguity. No "I thought you meant..." conversations. What didn't work well: No persistent memory across sessions. Agents rebuild context from files each time. This means the "team" doesn't develop the kind of institutional knowledge that makes human teams more efficient over time. It's functional but not efficient. Quality is hard to measure automatically. I reviewed all output manually. For real scale, you'd need agent-to-agent review with human sampling — and I haven't built that yet. Agents can't truly negotiate. When two agents would naturally disagree (strategy vs. ops feasibility), the protocol routes to a decision-maker. There's no real deliberation. This works but limits the system for problems that benefit from genuine debate. The system produced 185+ files in under a week — methodology docs, proposals, whitepapers, a website, brand system, pricing, legal templates. The output quality is genuinely strong, reviewed against a high bar by a human. Happy to go deeper on any aspect of the architecture. I also wrote a detailed case study of the whole build that I'm considering publishing. submitted by /u/antditto [link] [comments]
View originalWhere’s the Chat in ChatGPT?
To preface, I dislike 4o. 5.1 and 5.4 I really like. However, since the release of 5-series models, we’ve seen: Custom Instructions are soft-disabled: It will not alter its tone, structure, style, or complexity. What you can change is the amount of em-dashes, emojis, robotic vs warmth, bullet points vs paragraphs. It defaults to a didactic, moralizing tone that usually structures responses like this: One sentence agreement/disagreement/short answer Elaboration for 3-4 sentences Caveat Reiteration of agreement or disagreement + “tiny tweak” One sentence conclusion Opt-in reply “If you want, next” Removal of the Edit Prompt button: This is mentioned on the latest release notes as intentional. Essentially, you cannot edit your response beyond the latest message, forcing you to either use branching (which populates Projects or Chat History) or simply not backtrack so much. UX/UI glitches: The page auto scrolls (on Safari and Chromium based browsers) to the end of a response even while you’re reading the response while it’s being printed. This is admittedly minor in relative terms but still annoying. Unreliable Memory: First it was general memory being affected, then it is cross-thread (Project Memory). Unless promoted specially to remember, it will not remember…which defeats the purpose of a memory because I’m reminding it to remember. Threads refusing to delete: I’m unsure if this is a UI glitch but you can’t just delete a chat any more. It will disappear then show up again moments later. This creates a lot of clutter. Adult Mode and overzealous safety: Yeah, I haven’t forgotten. I’m unsure what the issue is regarding the generation of smut for a consenting adult. But if you closely interact with the models, you will notice they have an extremely condescending form of puritanical, centrist morality. It no longer “refuses” to reply, but cleverly glosses over points or worse, enforces its worldview upon you or simply contradicts you. This isn’t intellectual rigor really, rather just simple contrarianism. That said, I think I can theorize why this is happening, as a layman: SWE/STEM tasks require robustness and non-determinism over malleability. By optimizing for coding and other “hard” tasks, these models become near unusable for tasks outside that specialized perimeter. Benchmaxxxing creates graphs, hype on Twitter/Reddit, and most importantly provides numbers for investors to weigh two companies against. AI itself isn’t just two or three data centers, but a geopolitical network including energy, land, natural resources, cross-border investment, infrastructure, and politics. OpenAI and Anthropic are burning cash. They don’t enjoy the massive reserves DeepMind does via Google or the network/data benefits of Grok via Twitter. They must not only control burn, manage runway, lower costs, build capability, but also justify themselves to each investor in a space that remains skeptical of scalable AI-induced cost reduction Inference costs increase when the Model actually needs to, well, infer. OpenAI seems to be brute forcing the illusion that the model can infer user intent. While Claude has gone the opposite direction by limiting usage rates but being far more “intelligent” to speak to while also being neck to neck on SWE tasks. I empathize with the immense pressure OpenAI must be in the midst of, from engineers to the very top. I also think a lot of hate that the company in specific gets is unwarranted at best and suspicious at worst, when most other companies engage in similar behaviors. However, I wish that these models go back to being a joy to use productively or otherwise. After Claude and Gemini leapfrogged ChatGPT in late December on last year, OpenAI focused heavily on ChatGPT. An emergency they have only now declared over. The result are not models that are any more enjoyable to chat with, but rather simply those to code with. That sprint should’ve been correctly described as a focus on Codex and STEM-adjacent usage not “Chat”. Myself I’m not looking for the revival of 4o. Please. That model was as annoying to talk to than 5.2, just in the opposite direction. My favorite models remain 5.4, 5.1, 4.5, and 4.1. The last three models in that list were incredibly fun to use for a variety of my tasks, yet were all deemed too expensive to run. I’m wondering then what models fit my usage case the best? I don’t code, I consult. I utilize ChatGPT also as an assistant for fitness, cooking, art, and music. I think those days are increasingly gone. Claude is great but far too limited in its limits. Gemini just gets worse every time I use it. Grok is absolutely unhinged. GPT models were the best middle ground between all of them. submitted by /u/Goofball-John-McGee [link] [comments]
View originalKey features include: AI-driven threat detection, Real-time monitoring and alerts, Automated incident response, Customizable security policies, User behavior analytics, Integration with existing security tools, Data encryption and protection, Scalability for enterprise environments.
Robust Intelligence is commonly used for: Protecting sensitive customer data, Mitigating insider threats, Detecting and responding to phishing attacks, Securing cloud environments, Compliance with data protection regulations, Enhancing overall cybersecurity posture.
Robust Intelligence integrates with: AWS Security Hub, Microsoft Azure Security Center, Splunk, ServiceNow, Slack for alerts, JIRA for incident tracking, Okta for identity management, Palo Alto Networks firewalls, CrowdStrike for endpoint protection, Tenable for vulnerability management.
Based on user reviews and social mentions, the most common pain points are: token usage.
Based on 14 social mentions analyzed, 21% of sentiment is positive, 71% neutral, and 7% negative.