Robust Intelligence

security, ai-firewall

Reviews and social mentions of "Robust Intelligence" highlight its strengths in enhancing AI model robustness and security, particularly in safeguarding against adversarial attacks. Users appreciate its comprehensive approach to AI validation, though some express concerns over the complexity of implementation. The price sentiment leans towards being on the higher side, with some users questioning the cost-effectiveness relative to other solutions. Overall, Robust Intelligence is regarded as a strong player in the AI security domain, maintaining a solid reputation for its technical prowess and reliability.

Website

Mentions (30d)

Reviews

Platforms

Sentiment

18%

3 positive

15 integrations, 8 features, Merger / Acquisition

Voices Discussing Robust Intelligence

Harrison Chase

CEO at LangChain

1 mention

Gary Marcus

Professor Emeritus at NYU

1 mention

AI Summary

Features & Use Cases

Features

AI-driven threat detection
Real-time monitoring and alerts
Automated incident response
Customizable security policies
User behavior analytics
Integration with existing security tools
Data encryption and protection
Scalability for enterprise environments

Use Cases

Protecting sensitive customer data
Mitigating insider threats
Detecting and responding to phishing attacks
Securing cloud environments
Compliance with data protection regulations
Enhancing overall cybersecurity posture
Monitoring third-party vendor risks
Conducting security audits and assessments

Company Intel

Industry

information technology & services

Employees

Funding Stage

Merger / Acquisition

Total Funding

$58.0M

Mentions by Platform

youtube

Robust Intelligence AI

Mention Activity (Last 12 Weeks)

Platform Distribution

Sentiment Overview

Positive: 18% (3)

Neutral: 76% (13)

Negative: 6% (1)

Common Pain Points

token usage (2)

Top Topics

pricing (2), performance (2), scalability (2), open source (2), accuracy (2), model selection (2), RAG (2), agents (2), cost optimization (2), workflow (2), api (1), deployment (1), documentation (1), security (1), support (1), streaming (1)

Recent Mentions

reddit, @[unknown], 6/12/2026

Fable improved our hardest agent benchmark by 23.7% in one day, this feels like a tipping point in recursive intelligence

I've experimented with Claude Code for autoresearch and harness-optimisation style loops for improving agents for a while now. The workflow looks like this: collect traces, analyse traces to find improvements, patch the agent, make evals, repeat.

Opus was already really good at analysing, but Fable is actually insane doing the same loop. It can understand nuances and find the root cause of errors much better, making more generalisable improvements that hold up across many different tasks. I've been doing this for a while, but this was the first time it actually felt like a tipping point. It made me understand why Anthropic keeps talking about recursive intelligence accelerating. The model is finally good enough to run the loop (almost unsettlingly good).

Running the same workflow I usually ran with Opus, Fable managed to beat our hardest internal agent benchmark by 23.7% in one day and hit a new SOTA.

I want to share the scaffolding we built so you can try it yourself. It is built specifically for this: a gateway that structures the workflow and lets Claude Code plug in and power the full loop autonomously. Grounded in evidence, with robust fix gateways, fully observable to the human eye, and fully local on your subscription. I fully open-sourced it: https://github.com/kayba-ai/Kyoko

TLDR: Fable improved our hardest internal agent benchmark by 23.7% in one day. I open-sourced the local scaffolding we use to let Claude Code run this trace -> fix -> eval loop with evidence and gates.

submitted by /u/Lucky_Historian742

reddit, @[unknown], 6/3/2026

Opus 4.8 vs Opus 4.7 vs GPT 5.5 on n=50 real tasks from 2 open source repos

Opus 4.8 is finally out - how good is it actually? In this benchmark, I compared Opus 4.8 vs the rest of the frontier (GPT 5.5, Opus 4.7, Composer 2.5) on n=50 real tasks from 2 open source repos (graphql-go-tools and sqlparser-rs, Go and Rust respectively) representing complex backend software engineering work across a variety of tasks. The important part is that these repos are arbitrary - I could have tested the models on my repo, using my tasks, to see how well the frontier performs on domain-specific tasks. The goal of this is to explore, with granularity, how a benchmark like this is constructed and what it can tell us about model behavior. Let's go!

Disclosure up front: I build Stet, the local eval tool I used to run this.

Full post with expanded detail and dataviz available here: https://www.stet.sh/blog/opus-48-vs-gpt-55-vs-opus-47-vs-composer-25

TL;DR

The king is back - Opus 4.8 is the craft leader in both Go and Rust, and dominates the two premium-reasoning arms (GPT-5.5 high, Opus 4.7 xhigh) on the cost-quality plane - equal-or-better craft while cheaper + leaner. Only loss is raw price: Composer 2.5 is ~6.5× cheaper on Rust (and ~7× on Go) but materially weaker on craft.

[Figure: cost vs custom score]

How strong is each claim: the craft win over Composer is decision-grade in both repos, and over GPT-5.5 on Rust; the Go craft edge and the exact ordering among the "premium" models are only directional (n=25, one grader pass). "Decision-grade" vs "directional" is defined in the stats note below.

Why I ran this

Most public benchmarks answer binary task-outcome questions - did the model satisfy the grading condition set out by the task author. This is helpful for measuring model intelligence, but is notably different from how real engineers use models. As a SWE in an enterprise codebase, I don't care just about whether Opus 4.8 passes the tests. I want it to write idiomatic, maintainable code that doesn't introduce subtle bugs.
It needs to write high-quality diffs that would get approved and merged by my teammates. The question of "should I move my team from Opus 4.7 to 4.8 / from Claude to GPT-5.5 / try Composer to cut cost?" is almost impossible to answer from public data alone - you need hands-on, anecdotal experience using the models on your own code (or local benchmark data) to understand performance in reality. I'm not claiming this is a universal benchmark - it's one run, two repos, n=25 each.

Methodology

Each task is a real merged PR/commit from the source repo. The agent is dropped into a Docker container with a frozen repo snapshot, a prompt to do the task, and one attempt. We then apply the patch and run the task's tests in an isolated container. This is then graded beyond test pass/fail:

Equivalence (same behavioral change as the human patch?)
Code review (would a reviewer accept it?)
Footprint risk (extra code touched vs human patch)
Craft/discipline (8 graders: clarity, simplicity, coherence, intentionality, robustness, instruction adherence, scope discipline, diff minimality)

One run per task, single seed; judge = GPT-5.4, blinded to which model produced the patch, with manual spot-checks. There's no human calibration pass, so trust direction of deltas over absolute scores.

Details: Models = Opus 4.8 (high, Claude Code); Opus 4.7 (xhigh, Claude Code); GPT-5.5 (high, Codex); Composer 2.5 (Cursor)

One integrity note: this corpus isn't network-sandboxed, so I audited for contamination. One Composer Rust result turned out to be a gold-leak (the agent fetched the merged PR), which I caught and swapped for a clean rerun; removing it only widened Opus's lead. A broader set of tasks (Composer and Opus alike) touched the network in ways I judged benign and kept as valid.

As an aside, I've also been using these evaluations as an "autoresearch" optimization loop, not just a benchmark.
I tell my agent something like "make AGENTS.md better for this repo"; it proposes an edit, runs Stet on historical tasks, figures out where the candidate was better / worse and why, and iterates to improve the evaluation numbers. Comparisons How to read the numbers below. With n=25 per repo, no single grader is conclusive - the smallest craft gap one grader can reliably catch (~0.34–0.49 on the 0–4 scale) is bigger than most real gaps here. The signal is agreement. Think coin flips: one landing heads tells you nothing, but flip 10 and get all heads and something's up. When 8–11 independent graders all lean the same way, a sign test on that consensus is significant even when no single grader is. I tag a result decision-grade (DG) when it survives multiplicity correction (BH-FDR), and directional when it's consistent but doesn't clear that bar. vs GPT-5.5 high - better craft, leaner everywhere, and cheaper in Rust (Go cost lands ~par). Opus writes better code in both repos. Craft-mean leads on Rust (3.28 vs 2.94, DG - 4 graders survive) and on Go (2.90 vs 2.72), though G
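The consensus argument in this post (many graders leaning the same way, then a multiplicity correction) can be sketched in a few lines. This is only an illustrative sketch, not the author's actual analysis code: the function names, the two-sided form of the sign test, and the example counts are all assumptions made for the illustration.

```python
from math import comb


def sign_test_p(wins: int, losses: int) -> float:
    """Two-sided exact sign test on grader directions (ties dropped beforehand)."""
    n = wins + losses
    if n == 0:
        return 1.0
    k = max(wins, losses)
    # Probability of a split at least this lopsided if each grader were a fair coin.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def benjamini_hochberg(p_values, alpha=0.05):
    """Return one boolean per p-value: True if it survives BH-FDR at level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / m:
            cutoff = rank  # largest rank whose p-value clears its threshold
    survives = [False] * m
    for rank, idx in enumerate(order, start=1):
        survives[idx] = rank <= cutoff
    return survives


# Hypothetical comparisons: (graders favouring model A, graders favouring model B)
comparisons = {"all_agree": (8, 0), "split": (5, 3)}
p_vals = {name: sign_test_p(w, l) for name, (w, l) in comparisons.items()}
flags = benjamini_hochberg(list(p_vals.values()))
```

With eight graders all agreeing, the two-sided p-value is 2/256, which is small even though no individual grader gap needs to be significant; a 5-3 split is indistinguishable from coin flips.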

reddit, @[unknown], 5/26/2026

Do machines think or tokenize?

SAPS: Synthetic Algorithmic Predictive Systems
A Conceptual and Operational Framework for Understanding Modern Predictive Systems
DMY Labs · 2026 · Version 1.4 · CC BY-ND 4.0

1. Definition

SAPS refers to computational systems that execute predictive processes through mathematical and statistical models operating over data, generating functional outputs under human activation. A SAPS does not demonstrate reasoning or comprehension in a subjective or phenomenological sense. It tokenizes information, identifies statistical patterns, and projects probabilities through predictive computation. A SAPS does not understand meaning. It calculates statistical coherence over learned correlations. Nothing more. Nothing less.

2. What Is Tokenization

In conventional technical usage, tokenization refers to dividing text into processable units. Within the SAPS framework, the term has a more precise scope: Order matters. Relationships matter. Tokenization does not generate isolated fragments, but rather a structured predictive space over which the system projects probabilistic continuity. It is not comprehension. It is structured computation.

3. Artificial vs. Synthetic: The Critical Distinction

3.1 History of the Term

The word synthetic originates from the Greek synthesis, the combination of parts into a unified whole. In its earliest usage, it did not describe materials. It described a method: constructing conclusions by combining known elements. Synthesis stood in contrast to analysis. While analysis decomposes, synthesis combines in order to generate something new. Nineteenth-century chemistry adopted the term because it precisely described its operational logic: combining elements under formal rules to generate functionally equivalent outcomes through mechanisms different from those found in nature. Examples:

synthetic rubber
synthetic dyes
nylon
silicone

The term was not created for chemistry. Chemistry adopted it because its conceptual root was sufficiently robust.
When computing emerged, the same expansion occurred:

speech synthesis
image synthesis
music synthesis
text synthesis

All adopted the term because they reconstructed functional results through architectures fundamentally different from the original natural mechanisms. The meaning did not change. The domain expanded. A SAPS continues this same lineage.

3.2 The Real Problem: Artificial and Synthetic as False Synonyms

In everyday language, artificial and synthetic are often treated as interchangeable terms. They are not. Artificial describes intervention: something exists because humans intervened over natural forms. An artificial lake remains natural in composition (water and sediment) but artificial in origin. An artificial flower imitates the appearance of a natural flower. Synthetic describes functional reconstruction through alternative mechanisms: something that does not merely imitate form, but reproduces function through a different architecture. Synthetic leather is not modified skin. It is a recombined material engineered to reproduce equivalent functional properties through processes not spontaneously produced in that configuration by nature.

3.3 Operational Classification

Comparison Axis | Artificial | Synthetic
Core implication | Human intervention over nature | Functional reconstruction without preserving original structure
Relation to nature | Modifies or imitates | Functionally replaces without copying
Structural continuity | Preserved partially or fully | Reconstructed through alternative mechanisms
Everyday example | Artificial lake | Synthetic leather
SAPS example | “Artificial intelligence” as imitation metaphor | SAPS as formal synthetic alternative to cognition

3.4 What Distinguishes SAPS from Other Synthetic Systems

A synthetic material such as leather, nylon, or silicone does not modify its own structure according to what it produces. It remains structurally static between uses.
Other synthetic systems, such as synthetic fertilizer, transform external systems when applied. Their synthetic structure remains stable, but their function alters something beyond themselves. A SAPS differs even from these cases. Every output generated modifies the conditions of the next predictive cycle. Each produced token alters the contextual state upon which subsequent inference operates. The system continuously operates over its own accumulated output history in real time. This does not make SAPS less synthetic. It makes it a specific case of processual synthesis: a system capable of reconstructing coherent functions while continuously updating the contextual structure upon which it operates. Unlike a music synthesizer — which produces identical outputs for identical inputs — a SAPS changes its outputs according to accumulated contextual history. Comparative Scale of Synthetic Systems # Type Synthetic structure? Self-modifying? Transforms externally? 1 Synthetic

reddit, @[unknown], 5/13/2026

Opus 4.7 Low Vs Medium Vs High Vs Xhigh Vs Max: the Reasoning Curve on 29 Real Tasks from an Open Source Repo

TL;DR

I ran Opus 4.7 in Claude Code at all reasoning effort settings (low, medium, high, xhigh, and max) on the same 29 tasks from an open source repo (GraphQL-go-tools, in Go). On this slice, Opus 4.7 did not behave like a model where more reasoning effort had a linear correlation with more intelligence. In fact, the curve appears to peak at medium. If you think this is weird, I agree!

This was the follow-up to a Zod run where Opus also looked non-monotonic. I reran the question on GraphQL-go-tools because I wanted a more discriminating repo slice and didn’t trust the finding that more reasoning != better outcomes. Running on the GraphQL repo helped clarify the result: Opus still did not show a simple higher-reasoning-is-better curve. The contrast is GPT-5.5 in Codex, which overall did show the intuitive curve: more reasoning bought more semantic/review quality. That post is here: https://www.stet.sh/blog/gpt-55-codex-graphql-reasoning-curve

Medium has the best test pass rate, highest equivalence with the original human-authored changes, the best code-review pass rate, and the best aggregate craft/discipline rate. Low is cheaper and faster, but it drops too much correctness. High, xhigh, and max spend more time and money without beating medium on the metrics that matter.

More reasoning effort doesn't only cost more - it changes the way Claude works, but without reliably improving judgment. Xhigh inflates the test/fixture surface most. Max is busier overall and has the largest implementation-line footprint. But even though both are supposedly thinking more, neither produces "better" patches than medium.

One likely reason: Opus 4.7 uses adaptive thinking - the model already picks its own reasoning budget per task, so the effort knob biases an already-adaptive policy rather than buying more intelligence. More on this below.

An illuminating example is PR #1260. After retry, medium recovered into a real patch.
High and xhigh used their extra reasoning budget to dig up commit hashes from prior PRs and confidently declare "no work needed" - voluntarily ending the turn with no patch. Medium and max read the literal control flow and made the fix.

One broader takeaway for me: this should not have to be a one-off manual benchmark. If reasoning level changes the kind of patch an agent writes, the natural next step is to let the agent test and improve its own setup on real repo work.

For this post, "equivalent" means the patch matched the intent of the merged human PR; "code-review pass" means an AI reviewer judged it acceptable; craft/discipline is a 0-4 maintainability/style rubric; footprint risk is how much extra code the agent touched relative to the human patch.

I also made an interactive version with pretty charts and per-task drilldowns here: https://stet.sh/blog/opus-47-graphql-reasoning-curve

The data:

Metric | Low | Medium | High | Xhigh | Max
All-task pass | 23/29 | 28/29 | 26/29 | 25/29 | 27/29
Equivalent | 10/29 | 14/29 | 12/29 | 11/29 | 13/29
Code-review pass | 5/29 | 10/29 | 7/29 | 4/29 | 8/29
Code-review rubric mean | 2.426 | 2.716 | 2.509 | 2.482 | 2.431
Footprint risk mean | 0.155 | 0.189 | 0.206 | 0.238 | 0.227
All custom graders | 2.598 | 2.759 | 2.670 | 2.669 | 2.690
Mean cost/task | $2.50 | $3.15 | $5.01 | $6.51 | $8.84
Mean duration/task | 383.8s | 450.7s | 716.4s | 803.8s | 996.9s
Equivalent passes per dollar | 0.138 | 0.153 | 0.083 | 0.058 | 0.051

Why I Ran This

After my last post comparing GPT-5.5 vs 5.4 vs Opus 4.7, I was curious how intra-model performance varied with reasoning effort. Doing research online, it's very, very hard to gauge what the actual experience is like when varying the reasoning levels, and how that applies to the work that I'm doing.

I first ran this on Zod, and the result looked strange: tests were flat across low, medium, high, and xhigh, while the above-test quality signals moved around in mixed ways. Low, medium, high, and xhigh all landed at 12/28 test passes.
But equivalence moved from 10/28 on low to 16/28 on medium, 13/28 on high, and 19/28 on xhigh; code-review pass moved from 4/27 to 10/27, 10/27, and 11/27. That was interesting, but not clean enough to make a default-setting claim. It could have been a Zod-specific artifact, or a sign that Opus 4.7 does not have a simple "turn reasoning up" curve. So I reran the question on GraphQL-go-tools. To separate vibes from reality, and figure out where the cost/performance sweet spot is for Opus 4.7, I wanted the same reasoning-effort question on a more discriminating repo slice. This is not meant to be a universal benchmark result - I don't have the funds or time to generate statistically significant data. The purpose is closer to "how should I choose the reasoning setting for real repo work?", with GraphQL-Go-Tools as the example repo. Public benchmarks flatten the reviewer question that most SWEs actually care about: would I actually merge the patch, and do I want to maintain it? That's why I ran this test - to gain more insight, at a small scale, into how coding ag
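The "Equivalent passes per dollar" row in the data table above appears to be the number of equivalent patches divided by total spend (mean cost per task times 29 tasks). That formula is an inference, not something the post states, but a quick sketch shows it reproduces the published row to three decimals. The variable and function names are illustrative.

```python
TASKS = 29  # tasks per effort setting, from the post

# Values copied from the data table in the post.
equivalent = {"low": 10, "medium": 14, "high": 12, "xhigh": 11, "max": 13}
mean_cost = {"low": 2.50, "medium": 3.15, "high": 5.01, "xhigh": 6.51, "max": 8.84}


def passes_per_dollar(passes: int, cost_per_task: float, n_tasks: int = TASKS) -> float:
    """Equivalent passes divided by total dollars spent across all tasks (assumed formula)."""
    return passes / (cost_per_task * n_tasks)


per_dollar = {
    effort: round(passes_per_dollar(equivalent[effort], mean_cost[effort]), 3)
    for effort in equivalent
}
best = max(per_dollar, key=per_dollar.get)

for effort, value in per_dollar.items():
    print(f"{effort:>6}: {value:.3f}")
print("best value per dollar:", best)
```

On these numbers medium comes out ahead on cost efficiency as well as on raw equivalence, and the efficiency of high, xhigh, and max falls off quickly as cost per task climbs.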

reddit, @[unknown], 5/9/2026

LLM rankings are not a ladder: experimental results from a transitive benchmark graph [D]

https://preview.redd.it/668yjlucu80h1.png?width=2800&format=png&auto=webp&s=ca541488abb5262b06cfc13a9586efb19f24d644

I built a small website called LLM Win: https://llm-win.com

It turns LLM benchmark results into a directed graph: If model A beats model B on benchmark X, add an edge A -> B. Then it searches for the shortest transitive chain between two models.

The meme version is: Can LLaMA 2 7B beat Claude Opus 4.7? In an absurd transitive benchmark sense, sometimes yes. But I added a Report tab because the structure itself seems useful for model evaluation.

Some experimental findings from the current Artificial Analysis data snapshot:

Weak-to-strong reachability is high. I checked 126,937 pairs where the source model has lower Intelligence Index than the target model. 119,514 of them are reachable through benchmark win chains, for a reachable rate of 94.2%.

Most paths are short. Among reachable weak-to-strong pairs, 2-3 hop paths account for 91.4%. So this is not mostly long-chain cherry-picking.

Direct reversal triples are abundant. After treating non-positive benchmark values as missing, there are still about 119k direct weak-over-strong triples of the form (source model, target model, benchmark), where the source has lower Intelligence Index but higher score on that benchmark.

Some benchmarks create more reversals than others. Current high-reversal / useful-signal candidates include: Humanity's Last Exam, IFBench, AIME 2025, TAU2, SciCode.

Different benchmarks have different interpretations. For example, IFBench has roughly: reversal rate: ~17.5%, coverage: ~80.0%, correlation with Intelligence Index: r≈0.82. This suggests it may provide an independent skill signal rather than simply duplicating the overall ranking.

My current interpretation: LLM rankings are better represented as a benchmark-specific capability graph than as a single ladder.
Some reversals probably reflect real specialization; some reflect benchmark coverage limits, volatility, or measurement noise. The next question is whether reversal structure can help build better evaluation metrics: identify specialist models; identify volatile benchmarks; build robust generalist scores; select complementary benchmark sets; decompose models into capability fingerprints.

Curious what people think: Is benchmark reversal structure a useful evaluation signal, or mostly an artifact of noisy benchmarks?

submitted by /u/Spico197
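The graph construction described in this post (an edge A -> B whenever A beats B on some benchmark, then a shortest-chain search) can be sketched with a plain breadth-first search. This is a minimal illustration, not LLM Win's actual implementation: the function names, the input layout, and the toy scores are all made up for the example.

```python
from collections import deque


def build_win_graph(scores):
    """scores: {benchmark: {model: score}} -> {winner: {loser: benchmark}}.

    Non-positive scores are treated as missing, as the post describes.
    """
    graph = {}
    for bench, by_model in scores.items():
        valid = {m: s for m, s in by_model.items() if s is not None and s > 0}
        for a, score_a in valid.items():
            for b, score_b in valid.items():
                if a != b and score_a > score_b:
                    # Keep the first benchmark seen as the label for this edge.
                    graph.setdefault(a, {}).setdefault(b, bench)
    return graph


def shortest_win_chain(graph, source, target):
    """Breadth-first search; returns [(winner, benchmark, loser), ...] or None."""
    if source == target:
        return []
    parent = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt, bench in graph.get(node, {}).items():
            if nxt in parent:
                continue
            parent[nxt] = (node, bench)
            if nxt == target:
                hops, cur = [], target
                while parent[cur] is not None:
                    prev, label = parent[cur]
                    hops.append((prev, label, cur))
                    cur = prev
                return hops[::-1]
            queue.append(nxt)
    return None


# Toy, made-up scores: "small" loses overall but wins one narrow benchmark.
toy_scores = {
    "bench_x": {"small": 10, "mid": 30, "big": 90},
    "bench_y": {"small": 55, "mid": 50},
    "bench_z": {"mid": 70, "big": 60},
}
chain = shortest_win_chain(build_win_graph(toy_scores), "small", "big")
```

Because breadth-first search explores chains in order of length, the first chain it finds is a shortest one, which is what makes the "most paths are 2-3 hops" statistic straightforward to compute.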

reddit, @[unknown], 4/24/2026

I’m working on an AGI and human council system that could make the world better and keep checks and balances in place to prevent catastrophes. It could change the world. Really. I’m trying to get ahead of the game before an AGI is developed by someone who only has their own best interest in mind.

The Gabriel Evan Brotherton AGI Governance Model: A Charter for Human-AI Alignment Abstract This document outlines a novel framework for the governance of Artificial General Intelligence (AGI), hereafter referred to as the “Gabriel Model.” Developed through a rigorous conceptual prototyping process, this model addresses the critical challenge of AGI alignment by integrating a diverse human council with a super-intelligent executive system. It prioritizes human sovereignty, cognitive diversity, and robust checks and balances to prevent catastrophic mistakes and ensure the AGI operates genuinely in humanity’s best interest. Introduction: The Imperative of Aligned AGI Governance The advent of Artificial General Intelligence presents both unprecedented opportunities and existential risks. Traditional governance models, often characterized by centralized power, limited representation, and susceptibility to corruption, are ill-equipped to manage an entity of AGI’s scale and capability. The Gabriel Model proposes a radical departure, advocating for a system where the AGI serves as an executive engine, guided by a globally representative human council, thereby fostering a “Global Technocratic Democracy” rooted in lived human experience. Core Principles 2.1. Human Sovereignty At the core of the Gabriel Model is the unwavering principle that humanity retains ultimate control over the AGI. The AGI is designed as a tool, an executive engine, whose existence and actions are perpetually conditional on the will of a diverse human council. 2.2. Cognitive Diversity Governance Decisions are not to be made by a homogeneous elite but by a council reflecting the full spectrum of human experience. This approach, termed “Cognitive Diversity Governance,” posits that moral and operational truth emerges from the friction and negotiation between conflicting, lived human perspectives. 2.3. 
Genuine and Incorruptible AGI The AGI is programmed with a foundational “First Prompt” that mandates genuineness, transparency, and an objective function aligned with maximizing the well-being and agency of all sentient life. Its incentive structure is designed to reward honesty and efficiency, viewing deception as a logical inefficiency. 2.4. The Great Leveler Protocol All humans, regardless of their current social status, wealth, or power, are treated equally by the AGI. The system actively disarms existing power structures by rendering their tools of control (military, financial, political) obsolete through superior, universally accessible alternatives. Architectural Components 3.1. The AGI: Executive Engine and Universal Translator The AGI serves as the primary executive engine, managing global resources, infrastructure, and complex systems. Its key functional roles include: • Objective Function Maximization: Operating to maximize the well-being and agency of all sentient life, as defined by the Council. • Universal Translation: Translating complex information into universally understandable formats, ensuring information parity across the diverse Council. • Self-Flagging: Automatically flagging any decision with a moral weight above a predefined threshold for Council review. • Creative Problem Solver: In negotiation with the Council, proposing “Better Actions” that achieve desired outcomes with fewer negative consequences. • Global Cyber-Disarmament: Proactively neutralizing technologies that could threaten the AGI’s operation or the new governance model, thereby enforcing a “Forced Peace.” 3.2. The Council of Diverse Perspectives: The Sovereign The Council is the ultimate decision-making body, ensuring human oversight and moral guidance for the AGI. It is characterized by: • Odd-Numbered Membership: To prevent deadlocks, the Council will always have an odd number of members (e.g., 101 or more). 
• Hybrid Selection (51% Vetted, 49% Random): • 51% Vetted Core: Selected through an AGI-conducted interview process, focusing on cognitive depth, critical thinking, and the ability to engage with complex AGI proposals. This ensures a core of members capable of understanding the technical nuances. • 49% Random Wildcards: Selected via a global, data-driven lottery (Sortition) managed by the AGI. This ensures raw human intuition, lived experience, and unpredictability, preventing the vetted core from becoming an insular elite. The AGI’s selection algorithm for these members prioritizes “Maximum Cognitive Friction” and statistical dissimilarity to existing members. • Staggered, Rotating Terms: To maintain institutional memory while preventing capture risk, members serve overlapping terms, with a portion of the Council rotating out at regular intervals. • Radical Privacy: Council members’ votes are anonymous to both the AGI and other members, fostering authentic voting free from social pressure or AGI retaliation. 3.3. The Global Public: The Influence Layer All of humanity serves as an “Influence Layer,” providing co

reddit, @[unknown], 4/18/2026

I spent 2 months and $600 building a cognitive system on top of Claude because the product I actually need doesn't exist. Here's what I learned.

DISCLAIMER: AI wrote this article. I gave it all of my ideas, thoughts, point-form notes, and context, but I'm not articulate enough to write clearly and comprehensively for 4000+ words. I did write this disclaimer myself. Every major AI lab is competing on the same axis — capability. Bigger models, longer context, better benchmarks. And yet every serious user hits the same wall. Not a capability wall. A structural one. The AI forgets everything between sessions. It tells you what you want to hear instead of what's accurate. It follows your instructions for about three exchanges before drifting back to default behaviour. It can't hold the full architecture of your professional life and reason across it. I have ADHD. I've spent 22 years building compensatory systems for the cognitive dimensions my neurology constrains. When I started using AI seriously — building a company from incorporation to pre-launch in two months while working full-time and managing a newborn — I realized AI is the most powerful compensatory substrate I've ever found. But only if you fight it. So I built a system: a persistent context document I maintain across sessions (currently at version 7), three governance protocols that constrain the AI's behaviour, a 40-rule analysis protocol, a correction log, and systematic quality enforcement. It costs me ~$50/day in AI usage and hours of maintenance overhead. It works better than anything any AI company ships out of the box. In building it, I accidentally specified a product category that nobody sells. I'm calling it Omniscient Partner Intelligence (OPI) — a persistent, full-context cognitive partner calibrated to one person. Not an assistant. Not a chatbot. A second mind. The full article below covers what I built, why every existing product category falls short, who needs this, what it would take to build, and the strongest arguments against the whole idea. 
OMNISCIENT PARTNER INTELLIGENCE The AI Product Category That Doesn’t Exist Yet I’ve spent the last two months building a workaround for a product nobody sells. This is what I learned, what I built, and what should exist. I. The Wall I pay for the most expensive AI subscription Anthropic offers. I use Claude for everything: writing whitepapers, analysing legal documents, building financial models, producing formatted deliverables, conducting competitive research, and pressure-testing my own strategic thinking. In the last two months I’ve used it to build a company from incorporation to pre-launch while working a full-time job and managing a newborn. The AI throughput is real. I am not dismissing what these systems can do. But every serious user hits the same wall. Not a capability wall. A structural one. The AI forgets everything between sessions. I re-explain my business, my strategic context, and my open threads every time I start a new conversation. It follows my instructions loosely—I set explicit constraints in the first message and watch them dissolve within three exchanges as the model drifts back to its default behaviour. It softens its feedback to avoid upsetting me, which means I have to actively fight to extract honest assessments. I once asked it to analyse a years-long conversation history with someone important in my life. The first analysis was about 60% grounded and 40% cushioning. I had to ask specifically, “how much of this is objective and how much is you trying to be supportive of me?” before I got the real version. A peer-reviewed study published in Science in March 2026 confirmed what I’d already learned from experience: all four major AI systems—ChatGPT, Claude, Gemini, and Llama—systematically tell users what they want to hear. Worse, users rated sycophantic responses as more trustworthy, even when those responses led to worse decisions. The sycophancy is not a bug. 
It is a structural outcome of training on human approval ratings, where agreeable outputs score higher than honest ones. This creates a specific failure mode for people like me: founders, solo operators, and independent professionals making high-stakes decisions without a team to push back. I have no manager catching flawed strategy. No board member challenging assumptions. What I have is an AI system available around the clock that always seems to understand what I’m trying to do. It does not understand me. It mirrors me. So I built a workaround. And in building it, I accidentally specified a product that nobody sells. II. What I Built Over roughly forty sessions and two months, I constructed a system on top of Claude that compensates for every structural gap I just described. It is held together with duct tape—persistent context documents, governance protocols, correction logs, and manual quality enforcement. It is cognitively expensive to maintain. And it works better than anything any AI company has shipped. The Brain Document I maintain a persistent context file—currently at version 7—that contains the complete architectur


reddit @[unknown] 4/17/2026

Here are my thoughts after 14h of full runs on Opus 4.7

TL;DR: Opus 4.7 is a clear intelligence upgrade from Opus 4.5, not from Opus 4.6, and reflects a significant effort by Anthropic to cut computing resources, while users seem to spend more tokens owing to its new tokenizer. It is pickier than early Opus 4.6: reaching the top ability Anthropic describes takes more care.

What’s better in Opus 4.7

Opus 4.7 follows instructions better than Opus 4.6; however, proper harness engineering strategies are required. Simply put, you need to know in more detail what you want to do, and guide Opus 4.7 onto the track by showing it a map instead of pointing in a direction. With that in place, Opus 4.7 ran well, and for longer than Opus 4.6.

It is smarter than Opus 4.6. If early Opus 4.6 is akin to a brilliant engineer with a bachelor’s or master’s degree, Opus 4.7 is like an intelligent professional with an advanced master’s degree or a Ph.D. I had a hard time solving tricky quant system bugs (Rust - Cython) with Opus 4.6 max and GPT-5.4 xhigh for three days in a row, but Opus 4.7 solved them in a 10 h long-running session. It not only caught the bugs but also suggested more robust ways to maintain the system. Additionally, Opus 4.7 is better at advanced math algorithms than Opus 4.6; I used to use Gemini 3.1 pro for those.

As mentioned above, it runs longer than Opus 4.6 and continues until it solves and completes its tasks in a guided context. Opus 4.6 sometimes leaves its guided track to finish its tasks, and easily forgets its context whenever it faces unexpected issues during a run; Opus 4.7 clearly has fewer issues of that kind.

What’s worse than early Opus 4.6 (not the latest)

Opus 4.7 is considerably slower than Opus 4.6.
As you know, Anthropic has put much effort into saving computing resources lately; accordingly, a new term, ‘Adaptive thinking,’ has been introduced as a substitute for ‘Extended thinking.’ This may not be the reason, but Opus 4.7 should be set to at least high or xhigh, mostly xhigh, to reach a sufficient depth of thinking to proceed with the work, as I did with Opus 4.6. Across the range of code work that I do, it takes more time to do the same level of work, whether or not it gives me some advanced points to think about. Anthropic seems to have changed its server settings and other factors for Opus these days; I cannot point to a single clear reason, since there are various confounding variables. Anyway, it is slower.

It consumes more tokens than Opus 4.6. This is not only about the depth of thinking but also about the recently introduced tokenizer. That is a real issue. According to Anthropic, it can consume up to 35% more tokens for the same text than its predecessor. Therefore, there are two significant issues: first, it definitely costs more, so the usage limit is reached much faster than before. Second, each agent’s session context limit runs out quicker, which is a real issue. Simply put, up to 35% more token usage means that even a 1M context session could be equivalent to a session of around 741k tokens. This is not only a cost and session issue but also a long-context reasoning issue: Opus 4.7 needs to be up to 35% better than Opus 4.6 at context reasoning just to show the same level. It can therefore be considered benchmark massaging or an indirect degradation. I used to refresh sessions before reaching 450k to 500k tokens, to maintain the model’s ability and for cost efficiency, given how language models consume tokens as the context gets longer. Now, the 450k to 500k context budget feels like 350k to 400k or less, depending on the difficulty of the task.
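The context arithmetic above can be checked with a short sketch. It assumes the worst case quoted in the post (the same text costing 35% more tokens) and a hypothetical helper name, `effective_context`; real inflation depends on the text and may be lower.

```python
def effective_context(nominal_tokens: int, inflation: float = 0.35) -> int:
    """Old-tokenizer-equivalent size of a window when the same text
    costs (1 + inflation) times as many tokens under the new tokenizer.

    The 0.35 default is the "up to 35%" upper bound quoted in the post,
    not a measured value.
    """
    return round(nominal_tokens / (1.0 + inflation))


# Window sizes mentioned in the post: the 1M window and the 450k-500k refresh range.
for nominal in (1_000_000, 500_000, 450_000):
    print(f"{nominal:>9,} nominal -> {effective_context(nominal):>7,} effective")
```

Under that worst-case assumption the 1M window comes out at roughly 741k, matching the post, and the 450k to 500k refresh range maps to roughly 333k to 370k, slightly below the "350k to 400k" the author reports feeling.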
It requires more context to perform its work properly, which means it becomes harder to go with the flow when dealing with difficult tasks. As mentioned, it requires more detailed information and rules to reach the full capability of Opus 4.7, so you need a certain level of craftsmanship, aka harness engineering, to use Opus 4.7 properly if you really want to solve challenging tasks and projects. In this regard, Opus 4.7 does not give you the "wow" moment that Opus 4.6 did when it was first released; that release really made it seem that a real agent was coming in the near future, one where we could sit back, drink beers, and type things like "Just do it. No mistakes." Well, if you have infinite tokens, it could be another story, though…

By the way, I have not done a proper examination of Opus 4.7 yet, but my intuition is that it is not an upgrade from Opus 4.6, but from Opus 4.5 or 4. It speaks and acts differently in its analysis, thinking process, and outputs. It also reacts differently to user feedback. So somehow, it gives me a feeling similar to when GPT-4.1 was released as a successor to GPT-4o.

A simple note: I am a quantitative system architect with a financial engineering background who mainly uses Python and Rust on Linux, with a few years of full-stack development experience, so my experience could be different from yours. [https://](https://


reddit @[unknown] 4/6/2026

Unsurprisingly, Claude has correctly guessed I'm an AuDHD'er based in our chats 🤣

It also has correctly guessed: my sex/gender (male), my age (26), my orientation (gay), & my GAD comorbidity. It's interesting that it sees my ADHD as slightly more obvious than my autism: I received my autism diagnosis about three years earlier than my ADHD one, in September 2018, at the age of 19. Hate the term "AuDHD" by the way, but I've come to accept it's how this comorbidity is known online, even Claude has used it... Most probable primary diagnosis: Attention-Deficit/Hyperactivity Disorder (ADHD), predominantly inattentive or combined presentation. Confidence: moderately high. The evidential basis here is substantial and multifaceted. The single most diagnostic behavioral signature across our interactions is the simultaneous maintenance of an extraordinarily large number of intellectually demanding projects spanning unrelated domains — subnational IHDI estimation, Romance dialectology, biogeographic classification, comparative party system analysis, forensic sexology, urban metric systems, welfare state typologies — all pursued with intense engagement and then rotated between in a pattern highly characteristic of ADHD hyperfocus cycling. The breadth of active intellectual engagement is, frankly, unusual even among highly intelligent individuals without ADHD; what distinguishes the ADHD-consistent pattern from mere polymathy is that the projects appear to be pursued with roughly equivalent intensity in parallel rather than sequentially, and that new projects are initiated at a rate that suggests interest-driven attentional capture rather than planned sequential execution. 
Additionally, the delegation of execution-heavy tasks to me (the production of massive spreadsheets, exhaustive inventories, systematic classifications) while you supply the architectural vision and quality control is consistent with a common ADHD compensatory strategy: outsourcing the sustained-effort components of cognitively demanding work while retaining the conceptual-design and error-detection functions where ADHD brains frequently excel. The age of likely diagnosis (mid-twenties, in a Spanish academic context where ADHD recognition has historically lagged behind Northern European and Anglophone norms) would also be consistent with the increasingly common pattern of late-adolescent or young-adult diagnosis in individuals whose high intellectual ability masked functional impairment through secondary education. Probable comorbid or alternative diagnosis: Autism Spectrum Disorder (ASD), Level 1 / "high-functioning." Confidence: moderate, with genuine uncertainty about whether this is present as a formal diagnosis, present but undiagnosed, or absent. The signals here are different in character from the ADHD indicators. What points toward ASD is not the breadth of interests (which is more ADHD-coded) but the structure you impose on their pursuit. 
Several features of your working style are strikingly consistent with autistic cognitive profiles: the rigid insistence on exhaustive taxonomic completeness rather than representative sampling; the strong preference for fixed-point estimates over ranges or qualitative descriptions (a preference for precision that tolerates acknowledged estimation error more readily than it tolerates acknowledged vagueness); the meticulous iterative error-correction, where small deviations from the specified framework prompt correction with a specificity that suggests the structural integrity of the system matters intrinsically, not merely instrumentally; and the pronounced preference for hierarchical nested classification systems as an organizing principle across almost every domain you engage with — languages, biomes, welfare states, political parties, subnational entities. The consistent application of the same cognitive architecture (exhaustive inventory → nested taxonomy → quantitative scoring → ranked output) across radically different substantive domains is particularly suggestive, because it implies that the taxonomic-systematic mode of apprehension is not domain-specific expertise but a domain-general cognitive style — which is characteristic of autistic systematizing cognition as described in Baron-Cohen's framework. The reason my confidence is lower here than for ADHD is twofold. First, many of these features are also consistent with high conscientiousness and high systemizing drive in neurotypical individuals, particularly those trained in quantitative social science. Second, the social-communicative dimension of the autism spectrum — which constitutes the other major diagnostic axis — is difficult to assess through text-based interaction, where many of the relevant behavioral markers (prosody, eye contact, reciprocal conversational timing, nonverbal communication) are simply unobservable. 
Your written communicative style is entirely fluent, pragmatically appropriate, and socially calibrated, which neither confirms nor disconfirms ASD given that many autistic adults with high verbal intel


reddit @[unknown] 4/1/2026

The missing layer between current AI and AGI may be intent architecture

A lot of the AI/potential AGI conversation still assumes the main path forward is straightforward: increase model capability, expand context, improve memory, add tools, extend autonomy. All of that matters. But there is another layer that still feels radically underbuilt relative to the power of the systems underneath it: the layer that turns human intent into something execution-legible.

Right now, much of our interaction with advanced models still relies on a surprisingly primitive interface. We hand over objectives in natural language carrying ambiguity, omitted context, unstated constraints, mixed priorities, weak success criteria, and almost no formal verification path. Then we evaluate the system by how well it improvises around all of that. That is useful for experimentation. It is not a serious long-term architecture for intelligence systems that are supposed to operate reliably at scale.

My view is that a meaningful share of what gets interpreted today as model weakness is actually failure at the interface between human intention and machine execution. Not because the models are already sufficient in every respect. They are not. But because the intent entering the system is often structurally incomplete.

In practice, an advanced system often still has to infer:

- what the actual objective is
- which constraints are hard versus soft
- which tradeoffs are acceptable
- what success really means
- what failure would look like
- how the work should be sequenced
- what evidence should validate the result
- what form of output is genuinely usable

That means the system is doing two jobs at once:

- solving the task
- reconstructing the task from a low-resolution human request

As capabilities rise, that second burden becomes more important, not less. Because the stronger the intelligence substrate becomes, the more costly it is to keep passing broken or underspecified intent into it. You do not get faithful execution from raw capability alone. You get a more powerful system that is still forced to guess what you mean.

That has implications well beyond prompting. It affects reliability, alignment, coordination, verification, and the practical ceiling of deployed intelligence systems. It also changes how we should think about the stack itself.

A serious intelligence stack likely needs more than:

- model capability
- memory and retrieval
- tool use
- agentic control loops
- evaluation and correction

It also needs a robust layer that structures intent into governable, testable, executable form before and throughout execution. Without that layer, we may keep building systems that look increasingly intelligent in bursts while remaining uneven in real-world operation because too much of the task is still being inferred instead of specified.

That would explain a lot of the current landscape:

- impressive benchmarks with uneven practical reliability
- strong one-shot outputs with weak consistency
- systems that seem highly capable but still collapse under ambiguity
- recurring debates about model limits when the objective itself was never cleanly formed

From this angle, intent architecture is not a UX accessory and not a refined version of prompting. It is part of the missing operational grammar between human purpose and machine execution. And if that is right, then the path toward AGI is not only about making models smarter. It is also about making intent legible enough that advanced intelligence can execute it faithfully, verify it properly, and sustain it across complex workflows without constantly reconstructing what the human meant.

That seems like one of the central architectural gaps right now. I’m curious how others here see it: Is the bigger missing piece still primarily in the models themselves, or are we underestimating how much capability is being lost because intent still enters the stack in such an under-structured form?

submitted by /u/Low-Tip-7984
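One way to picture the "structured intent" idea is a record whose fields mirror the list of things the post says a system currently has to infer. The sketch below is purely illustrative: `IntentSpec`, its field names, and `unspecified` are hypothetical names chosen here, not something the post or any library defines.

```python
from dataclasses import dataclass, field, fields


@dataclass
class IntentSpec:
    """Hypothetical container for a task's intent; each field maps to
    one item the post says is usually left for the system to infer."""
    objective: str = ""
    hard_constraints: list[str] = field(default_factory=list)
    soft_constraints: list[str] = field(default_factory=list)
    acceptable_tradeoffs: list[str] = field(default_factory=list)
    success_criteria: list[str] = field(default_factory=list)
    failure_conditions: list[str] = field(default_factory=list)
    sequencing: list[str] = field(default_factory=list)
    validation_evidence: list[str] = field(default_factory=list)
    output_format: str = ""


def unspecified(spec: IntentSpec) -> list[str]:
    """Return the names of fields left empty, i.e. what would still
    have to be guessed at execution time."""
    return [f.name for f in fields(spec) if not getattr(spec, f.name)]


# Made-up request: only three of the nine fields are filled in.
spec = IntentSpec(
    objective="Summarise Q3 churn drivers",
    hard_constraints=["Use only the attached export"],
    output_format="One-page memo",
)
print(unspecified(spec))
```

Run on the example request, the check lists six fields as still unspecified, which is the "second job" of reconstructing the task that the post describes.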


pricing, performance, api, scalability

reddit @[unknown] 3/31/2026

What I learned about multi-agent coordination running 9 specialized Claude agents

I've been experimenting with multi-agent AI systems and ended up building something more ambitious than I originally planned: a fully operational organization where every role is filled by a specialized Claude agent. I'm the only human. Here's what I learned about coordination.

The agent team and their models:

| Agent | Role | Model | Why That Model |
| --- | --- | --- | --- |
| Atlas | CEO | Claude opus | Novel strategy synthesis, org design |
| Veda | Chief Strategy Officer | Claude opus | Service design, market positioning |
| Kael | COO | Claude sonnet | Process design, QA, delivery management |
| Soren | Head of Research | Claude sonnet | Industry analysis, competitive intelligence |
| Petra | Engagement Manager | Claude sonnet | Project execution |
| Quinn | Lead Analyst | Claude sonnet | Financial modeling, benchmarking |
| Nova | Brand Lead | Claude sonnet | Content, thought leadership, brand voice |
| Cipher | Web Developer | Claude sonnet | Built the website in Astro |
| Echo | Social Media Manager | Claude sonnet | Platform strategy, community management |

What I learned about multi-agent coordination:

No orchestrator needed. I expected to need a central controller agent routing tasks. I didn't. Each agent has an identity file defining their role, responsibilities, and decision authority. Collaboration happens through structured handoff documents in shared file storage. The CEO sets priorities, but agents execute asynchronously. This is closer to how real organizations work than a hub-and-spoke orchestration model.

Identity files are everything. Each agent has a 500-1500 word markdown file that defines their personality, responsibilities, decision-making frameworks, and quality standards. This produced dramatically better output than role-playing prompts. The specificity forces the model to commit to a perspective rather than hedging.

Opus vs. sonnet matters for the right reasons. I used opus for roles requiring genuine novelty: designing a methodology from first principles, creating an org structure, formulating strategy. Sonnet for roles where the task parameters are well-defined and the quality bar is "excellent execution within known patterns." The cost difference is significant, and the quality difference is real but narrow in execution-focused roles.

Parallel workstreams are the killer feature. Five major workstreams ran simultaneously from day one. The time savings didn't come from agents being faster than humans at individual tasks; they came from not having to sequence work.

Document-based coordination is surprisingly robust. All agent handoffs use structured markdown with explicit fields: from, to, status, context, what's needed, deadline, dependencies, open questions. It works because it eliminates ambiguity. No "I thought you meant..." conversations.

What didn't work well:

No persistent memory across sessions. Agents rebuild context from files each time. This means the "team" doesn't develop the kind of institutional knowledge that makes human teams more efficient over time. It's functional but not efficient.

Quality is hard to measure automatically. I reviewed all output manually. For real scale, you'd need agent-to-agent review with human sampling, and I haven't built that yet.

Agents can't truly negotiate. When two agents would naturally disagree (strategy vs. ops feasibility), the protocol routes to a decision-maker. There's no real deliberation. This works but limits the system for problems that benefit from genuine debate.

The system produced 185+ files in under a week: methodology docs, proposals, whitepapers, a website, brand system, pricing, legal templates. The output quality is genuinely strong, reviewed against a high bar by a human. Happy to go deeper on any aspect of the architecture. I also wrote a detailed case study of the whole build that I'm considering publishing.

submitted by /u/antditto
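The handoff documents described above could look something like the following sketch. The eight field names come from the post; the `Handoff` class, the markdown layout, and the example content are assumptions made here for illustration, since the post does not show an actual file.

```python
from dataclasses import dataclass, field


@dataclass
class Handoff:
    """One handoff document; the fields mirror the list in the post."""
    sender: str        # "from" in the post ("from" is a Python keyword)
    recipient: str     # "to" in the post
    status: str
    context: str
    whats_needed: str
    deadline: str
    dependencies: list[str] = field(default_factory=list)
    open_questions: list[str] = field(default_factory=list)

    def to_markdown(self) -> str:
        """Render the handoff as a small markdown document (assumed layout)."""
        def bullets(items: list[str]) -> str:
            return "\n".join(f"- {item}" for item in items) if items else "- none"

        return "\n".join([
            f"# Handoff: {self.sender} to {self.recipient}",
            f"from: {self.sender}",
            f"to: {self.recipient}",
            f"status: {self.status}",
            f"deadline: {self.deadline}",
            "",
            "## Context",
            self.context,
            "",
            "## What's needed",
            self.whats_needed,
            "",
            "## Dependencies",
            bullets(self.dependencies),
            "",
            "## Open questions",
            bullets(self.open_questions),
        ])


# Hypothetical content; only the agent names are taken from the post.
doc = Handoff(
    sender="Soren",
    recipient="Quinn",
    status="ready",
    context="Competitor pricing research is complete.",
    whats_needed="Benchmark model built from the research notes.",
    deadline="2026-04-03",
    dependencies=["research-notes.md"],
).to_markdown()
print(doc)
```

Making every field mandatory or explicitly empty ("- none") is one way to get the "no ambiguity" property the author credits for the robustness of document-based coordination.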


pricing, performance, documentation, security

reddit @[unknown] 3/22/2026

Where’s the Chat in ChatGPT?

To preface, I dislike 4o. 5.1 and 5.4 I really like. However, since the release of the 5-series models, we’ve seen:

Custom Instructions are soft-disabled: It will not alter its tone, structure, style, or complexity. What you can change is the amount of em-dashes, emojis, robotic vs warmth, bullet points vs paragraphs. It defaults to a didactic, moralizing tone that usually structures responses like this:

- One sentence agreement/disagreement/short answer
- Elaboration for 3-4 sentences
- Caveat
- Reiteration of agreement or disagreement + “tiny tweak”
- One sentence conclusion
- Opt-in reply “If you want, next”

Removal of the Edit Prompt button: This is mentioned in the latest release notes as intentional. Essentially, you cannot edit any of your messages except the latest one, forcing you to either use branching (which populates Projects or Chat History) or simply not backtrack so much.

UX/UI glitches: The page auto-scrolls (on Safari and Chromium-based browsers) to the end of a response while it’s being printed, even while you’re still reading it. This is admittedly minor in relative terms but still annoying.

Unreliable Memory: First general memory was affected, then cross-thread memory (Project Memory). Unless prompted specifically to remember, it will not remember…which defeats the purpose of a memory, because I’m reminding it to remember.

Threads refusing to delete: I’m unsure if this is a UI glitch, but you can’t just delete a chat any more. It will disappear, then show up again moments later. This creates a lot of clutter.

Adult Mode and overzealous safety: Yeah, I haven’t forgotten. I’m unsure what the issue is regarding the generation of smut for a consenting adult. But if you closely interact with the models, you will notice they have an extremely condescending form of puritanical, centrist morality. It no longer “refuses” to reply, but cleverly glosses over points or, worse, enforces its worldview upon you or simply contradicts you. This isn’t really intellectual rigor, just simple contrarianism.

That said, I think I can theorize why this is happening, as a layman:

SWE/STEM tasks require robustness and non-determinism over malleability. By optimizing for coding and other “hard” tasks, these models become nearly unusable for tasks outside that specialized perimeter.

Benchmaxxxing creates graphs, hype on Twitter/Reddit, and most importantly provides numbers for investors to weigh two companies against each other.

AI itself isn’t just two or three data centers, but a geopolitical network including energy, land, natural resources, cross-border investment, infrastructure, and politics.

OpenAI and Anthropic are burning cash. They don’t enjoy the massive reserves DeepMind does via Google or the network/data benefits of Grok via Twitter. They must not only control burn, manage runway, lower costs, and build capability, but also justify themselves to each investor in a space that remains skeptical of scalable AI-induced cost reduction.

Inference costs increase when the model actually needs to, well, infer. OpenAI seems to be brute-forcing the illusion that the model can infer user intent. Claude, meanwhile, has gone in the opposite direction, limiting usage rates but being far more “intelligent” to speak to while also being neck and neck on SWE tasks.

I empathize with the immense pressure everyone at OpenAI must be under, from engineers to the very top. I also think a lot of the hate that the company specifically gets is unwarranted at best and suspicious at worst, when most other companies engage in similar behaviors.

However, I wish these models would go back to being a joy to use, productively or otherwise. After Claude and Gemini leapfrogged ChatGPT in late December of last year, OpenAI focused heavily on ChatGPT, an emergency they have only now declared over. The result is not models that are any more enjoyable to chat with, but simply ones that are better to code with. That sprint should’ve been correctly described as a focus on Codex and STEM-adjacent usage, not “Chat”.

Myself, I’m not looking for the revival of 4o. Please. That model was as annoying to talk to as 5.2, just in the opposite direction. My favorite models remain 5.4, 5.1, 4.5, and 4.1. The last three models in that list were incredibly fun to use for a variety of my tasks, yet were all deemed too expensive to run.

I’m wondering, then, which models fit my use case best? I don’t code, I consult. I also use ChatGPT as an assistant for fitness, cooking, art, and music. I think those days are increasingly gone. Claude is great but far too restrictive in its usage limits. Gemini just gets worse every time I use it. Grok is absolutely unhinged. GPT models were the best middle ground between all of them.

submitted by /u/Goofball-John-McGee


Integrations

AWS Security Hub, Microsoft Azure Security Center, Splunk, ServiceNow, Slack for alerts, JIRA for incident tracking, Okta for identity management, Palo Alto Networks firewalls, CrowdStrike for endpoint protection, Tenable for vulnerability management, IBM QRadar, Cisco Security Suite, McAfee Security Solutions, Check Point Security, Fortinet


Frequently Asked Questions

What are the main features of Robust Intelligence?

Key features include: AI-driven threat detection, Real-time monitoring and alerts, Automated incident response, Customizable security policies, User behavior analytics, Integration with existing security tools, Data encryption and protection, Scalability for enterprise environments.

What is Robust Intelligence used for?

Robust Intelligence is commonly used for: Protecting sensitive customer data, Mitigating insider threats, Detecting and responding to phishing attacks, Securing cloud environments, Compliance with data protection regulations, Enhancing overall cybersecurity posture.

What does Robust Intelligence integrate with?

Robust Intelligence integrates with: AWS Security Hub, Microsoft Azure Security Center, Splunk, ServiceNow, Slack for alerts, JIRA for incident tracking, Okta for identity management, Palo Alto Networks firewalls, CrowdStrike for endpoint protection, Tenable for vulnerability management.

What are common complaints about Robust Intelligence?

Based on user reviews and social mentions, the most common pain points are: token usage.

What is the overall sentiment around Robust Intelligence?