Designing everyday AGI.
Users generally appreciate MultiOn for its versatility in facilitating multi-agent execution and its ability to handle structured work efficiently under governance rules. However, some users express concerns about potential conflicts or data overwriting when multiple agents engage simultaneously. The pricing sentiment is mixed, as some value the capabilities provided, while others find it challenging to justify the cost. Overall, MultiOn is seen as a robust tool with a good reputation among those needing structured AI management solutions, but it may require improvements in conflict resolution and cost transparency.
Mentions (30d)
102
20 this week
Reviews
0
Platforms
2
Sentiment
2%
3 positive
Features
Use Cases
Industry
information technology & services
Employees
47
Funding Stage
Seed
Total Funding
$20.0M
eTPS — Effective Tokens Per Second: A Better Way to Measure Local LLM Performance
We're obsessed with raw tokens per second. Every hardware post leads with it. Every quantization comparison is ranked by it. It's the one number everyone agrees to report. It's also measuring the wrong thing.

Raw TPS tells you how fast tokens hit the screen. It tells you almost nothing about how quickly you get a correct, usable answer. On sustained, multi-turn workflows, that gap becomes massive. A faster model that hallucinates, requires multiple corrections, and forgets context you gave it earlier can easily be less useful than a slower model that gets it right the first time.

**eTPS (Effective Tokens Per Second)** is a complementary metric that measures actual progress toward a useful answer, not just token throughput. The basic idea: weight the final accepted output by how clean the path to that answer was (first-pass correct scores highest), then divide by total time. Correction loops, hallucinations, and repeated explanations all reduce the score. A response that never reaches a correct answer scores zero regardless of speed. It doesn't replace raw TPS. It sits next to it.

**Results (same prompt, four runs, same hardware):**
* gemma-4-e2b (4.6B): 53.2 raw TPS → eTPS 53.18 ✓
* qwen3.5-0.8b: 173.1 raw TPS → eTPS 86.57 ✗ partial
* qwen3.5-9b (optimized): 1.8 raw TPS → eTPS 1.78 ✓
* qwen3.5-9b (baseline): 0.5 raw TPS → eTPS 0.32 ✗ partial

The 0.8B leads on raw speed by a wide margin and still lost. Raw TPS said it won. eTPS said it didn't.

**Hardware:** RTX 5060 Laptop, 8GB VRAM. eTPS scores aren't portable across hardware; always report your full setup.

**Known limitations (v0.1):**
* Scoring requires human judgment. The line between "needed clarification" and "was factually wrong" isn't always clean. Code generation with objective pass/fail criteria is a cleaner target and the focus of the next benchmark run.
* One task isn't representative of sustained multi-turn workflows; that's where the metric gets most interesting and where I'm headed next.
* Easy to game without full system prompt logging. The spec will require it.

These are acknowledged constraints, not hidden flaws. Full specification coming soon covering methodology, task library, scoring protocol, and reproducibility standards.

Before I lock the final weights I'd genuinely like input on two open questions: How should the penalty differ between a model that confidently states something false versus one that's just vague enough you had to ask a follow-up? And should hardware normalization live in the core formula or be reported separately? Thoughts welcome.
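The "weight, then divide by total time" idea can be sketched roughly as follows. This is only an illustration under stated assumptions: the post has not published its final weights, so the `PATH_WEIGHTS` values, the outcome labels, and the names `Run` and `effective_tps` are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical path-quality weights; the post has not fixed these values.
PATH_WEIGHTS = {
    "first_pass_correct": 1.0,
    "needed_clarification": 0.75,
    "needed_correction": 0.5,
    "never_correct": 0.0,
}

@dataclass
class Run:
    accepted_tokens: int   # tokens in the final accepted answer
    total_seconds: float   # wall-clock time, including every correction turn
    outcome: str           # key into PATH_WEIGHTS

def effective_tps(run: Run) -> float:
    """Weighted accepted tokens divided by total time (assumed form of the metric)."""
    if run.total_seconds <= 0:
        raise ValueError("total_seconds must be positive")
    return PATH_WEIGHTS[run.outcome] * run.accepted_tokens / run.total_seconds

# A slower clean run versus a run that needed correction loops.
clean = Run(accepted_tokens=500, total_seconds=10.0, outcome="first_pass_correct")
messy = Run(accepted_tokens=500, total_seconds=6.0, outcome="needed_correction")
failed = Run(accepted_tokens=500, total_seconds=2.0, outcome="never_correct")
```

With these made-up weights, the clean run scores higher than the messy one even though the messy one finished sooner, and the failed run scores zero regardless of speed.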
Google is officially replacing Vertex AI with the new "Gemini Enterprise Agent Platform"
Just wanted to share an important update for AI & Cloud learners. Google is shifting from a traditional AI platform toward a complete agentic AI ecosystem focused on autonomous AI agents and enterprise workflows. Key highlights:
* Existing Vertex AI services and workloads will continue to work
* AI development, orchestration, governance, and security are now unified under one platform
* New tools introduced for building autonomous AI agents and multi-agent workflows
* Access to Gemini, Gemma, Claude, and 200+ models remains available
This marks a major shift in Google Cloud’s AI strategy toward agentic AI and enterprise automation. If you are currently learning or working with Vertex AI, it’s important to start exploring the Gemini Enterprise Agent Platform moving forward. I have seen that the GCP ACE exam is going to be revamped based on this Gemini Enterprise rebranding. submitted by /u/Few-Engineering-4135
Could someone help me with a solid multi agent setup (Claude suggested a doorman to handle build conflict)
Hello, I am working on a fairly complex piece of software. Everything I have been doing for the past year, mostly using Opus, has been incredibly good. But as the software grows in features, complexity, and size, I find myself running 3 or 4 sessions at the same time on different features. Build conflicts are a nightmare. I have asked Claude many times to come up with a system with a low risk of build conflicts, but none of them has been successful. Yesterday we built a script tool called the « doorman »; it’s like a build queue that handles all builds from all the different Claude sessions. I am on macOS, my software is in Swift, and I use Xcode to build it. Even with this doorman idea, I still had several builds that were missing some features from other chats, and rebuilding feels like a waste of time. So I am asking the pros here: do I need to have 1 session for coding and 1 session for planning, and that’s it? Or do you have efficient multi-session coding workflows? Thank you submitted by /u/Best-Jury-1793
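One way to picture a "doorman", purely as a sketch: an exclusive file lock that lets only one build run at a time, no matter which session requested it. The poster's actual script is not shown, so the lock path and the name `run_serialized` are hypothetical, and `fcntl` is Unix-only (it is available on macOS).

```python
import fcntl
import subprocess
import sys
import tempfile
from pathlib import Path

# Hypothetical lock location shared by every session on the machine.
LOCK_PATH = Path(tempfile.gettempdir()) / "doorman.lock"

def run_serialized(cmd: list[str]) -> int:
    """Run cmd while holding an exclusive lock; other callers block until it is released."""
    with open(LOCK_PATH, "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)  # waits while another build holds the lock
        try:
            return subprocess.run(cmd).returncode
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)

# Placeholder command; a real setup would pass the project's build command instead.
exit_code = run_serialized([sys.executable, "-c", "print('build ok')"])
```

Serializing builds only prevents two builds from colliding. It does not by itself explain or fix builds that miss features from other sessions, which depends on which working tree each build actually sees.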
CANTANTE: Optimizing Agentic Systems via Contrastive Credit Attribution [R]
LLM-based multi-agent systems have demonstrated strong performance across complex real-world tasks, such as software engineering, predictive modeling, and retrieval-augmented generation. Yet, automating their configuration remains a structural challenge. Researchers are often forced into manual, trial-and-error prompt tuning, where a change to a single agent shifts the global output in ways that are difficult to trace. The core bottleneck is credit assignment: while the parameters governing agent behavior are local, performance scores are only available at the global system level. This makes optimization fundamentally difficult because we do not inherently know which agents contributed positively or negatively to the outcome.

CANTANTE is an attempt to take a different path: treating agent prompts as parameters learned from task rewards rather than tuned by hand. By solving the credit assignment problem, we can move from brittle, hand-crafted agent demos to trustworthy systems that are actually autonomous and useful in practice.

CANTANTE's algorithm in short (see second image):
1. Let local optimizers suggest configurations (e.g., prompts).
2. Evaluate different configurations on the same queries, capturing reasoning traces and system scores.
3. Let an attributer compare these rollouts and assign each agent a credit, thereby decomposing the global reward into per-agent update signals.
4. Feed those credits to any local optimizer; for the experiments, we use CAPO, our prompt optimizer from prior work at AutoML 2025.

Evaluated against the DSPy solutions GEPA and MIPROv2 on MBPP (programming benchmark), GSM8K (mathematical reasoning benchmark), and HotpotQA (retrieval benchmark), CANTANTE:
• achieves the best average rank,
• beats the strongest baseline by +18.9 points on MBPP and +12.5 on GSM8K, and
• keeps inference-time cost on par with unoptimized prompts.

🔗 Link to the paper: https://arxiv.org/abs/2605.13295
💻 Link to the repo: https://github.com/finitearth/cantante

If you're researching multi-agent architectures or automated prompt engineering, I'd love to hear what's working (and breaking) for you right now. submitted by /u/finitearth
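To make the credit-assignment step concrete, here is a deliberately simplified numerical stand-in. It is not the paper's attributer, which compares reasoning traces; this toy only splits the score gap between two rollouts evenly across the agents whose configuration differs. The function name and data layout are hypothetical.

```python
def contrastive_credits(rollout_a, rollout_b):
    """Split the score gap between two rollouts over the agents whose config differs.

    Each rollout is (configs, score), where configs maps agent name -> config id.
    The result is the credit for rollout_a's configs; unchanged agents get 0.0.
    """
    configs_a, score_a = rollout_a
    configs_b, score_b = rollout_b
    changed = [agent for agent in configs_a if configs_a[agent] != configs_b[agent]]
    credits = {agent: 0.0 for agent in configs_a}
    if changed:
        share = (score_a - score_b) / len(changed)
        for agent in changed:
            credits[agent] = share
    return credits

# Same query, two configurations: only the coder and tester prompts differ.
a = ({"planner": "p1", "coder": "c1", "tester": "t1"}, 0.8)
b = ({"planner": "p1", "coder": "c2", "tester": "t2"}, 0.5)
credits = contrastive_credits(a, b)
```

The even split is the crude part: it cannot tell which of the changed agents actually caused the gain, which is the gap an attributer that reads the traces is meant to close.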
Tested the orchestrator pattern with Opus 4.7. The task decomposition quality is noticeably better on complex multi-step work.
The orchestrator pattern for multi-agent systems: one reasoning model breaks a complex task into subtasks and delegates each to a worker agent. The orchestrator doesn't do the implementation work, it decides what work needs to be done, in what order, and which worker is right for each piece. Workers can be simpler, cheaper models tuned for specific tasks. I've been testing this with Opus 4.7 as the orchestrator and the improvement in task decomposition is real. The place it shows up most clearly is tasks where multiple constraints need to be held in mind at once. "Refactor this module to be testable, don't break the public API, and make sure the error handling is consistent with the rest of the codebase." Earlier models would drop one of the constraints partway through the plan. Opus 4.7 holds all of them through the decomposition. The cost tradeoff makes sense with this architecture: you pay for Opus 4.7 on the orchestration step only. The worker steps use cheaper models. You get the reasoning quality where it matters most. How are you thinking about model selection in multi-agent pipelines? Orchestrator vs. worker model choice? submitted by /u/EastMove5163 [link] [comments]
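The shape of the pattern can be sketched without any model calls. Everything below is a stand-in: `plan` replaces the orchestrator model, the lambdas in `WORKERS` replace cheaper worker models, and all names are hypothetical. The point it illustrates is that the constraints travel with every subtask instead of being dropped partway through the plan.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Subtask:
    description: str
    worker: str                    # which worker should handle this piece
    constraints: tuple[str, ...]   # constraints carried into every subtask

def plan(task: str, constraints: tuple[str, ...]) -> list[Subtask]:
    """Stand-in for the orchestrator: decides what to do, in what order, and who does it."""
    return [
        Subtask(f"analyze: {task}", "reader", constraints),
        Subtask(f"implement: {task}", "coder", constraints),
        Subtask(f"verify: {task}", "tester", constraints),
    ]

# Stand-ins for cheaper, specialised worker models.
WORKERS: dict[str, Callable[[Subtask], str]] = {
    "reader": lambda s: f"[reader] {s.description}",
    "coder": lambda s: f"[coder] {s.description}",
    "tester": lambda s: f"[tester] {s.description}",
}

def run(task: str, constraints: tuple[str, ...]) -> list[str]:
    # The orchestrator never implements anything itself; it only routes subtasks.
    return [WORKERS[sub.worker](sub) for sub in plan(task, constraints)]

constraints = ("keep public API", "consistent error handling")
results = run("refactor module", constraints)
```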
OpenAI Announced vs. Current Operational Compute
submitted by /u/Business_Garden_7771
Claude Code has 240+ models via NVIDIA NIM gateway
TIL Claude Code has 240+ models via NVIDIA NIM gateway — Nemotron-3 120B for agentic coding is surprisingly good So I was messing around with /model in Claude Code today and noticed something most people probably don't know about — after the standard Claude models (Opus, Sonnet, Haiku), there's a whole NVIDIA NIM gateway section with +239 additional models you can switch to mid-session. Some of the models I spotted: nvidia/nemotron-3-super-120b-a12b (with and without thinking mode) 01-ai/yi-large abacusai/dracarys-llama-3.1-70b-instruct ...and hundreds more I've been running the Nemotron thinking variant for multi-file refactoring and it's genuinely solid. It reasons through changes before touching your code — exactly what you want for agentic tasks. Latency is higher than Claude obviously, but if you're burning through Opus credits on long sessions this is worth experimenting with. How to try it: Open any Claude Code session Run /model Scroll past the four standard Claude options — NIM models appear below Hit d to set one as your session default, or pass --model at launch Anyone else been routing Claude Code through NIM? Curious what models people have had luck with — especially for Python or Rust codegen. submitted by /u/shadowBladeO4 [link] [comments]
I built a Laravel package that turns your app into a database-backed personal knowledge vault (Obsidian style) with a 16-tool MCP server
Hey! I'm the author. laravel-commonplace is a database-backed personal knowledge vault you install into an existing Laravel app. Adjacent to Obsidian, Logseq, and Notion as personal-knowledge tooling, except the storage layer is your existing Laravel app's database instead of files on disk or a third-party SaaS. Notes are Eloquent models in your DB, gated by your app's auth, shareable per-user via an owner plus Share model. It ships a browser UI (editor, graph view, search, journal) and an MCP server with 16 tools. If you have a Laravel app, the MCP server lets Claude Desktop, Claude Code, Cursor, Zed, Continue, Cline, Pi, or any other MCP client read and write your notes as the host app's user. Default middleware is auth:sanctum (Bearer PAT), and every tool resolves to $request->user(). There's no synthetic agent identity to provision, scope, or revoke separately. The agent gets exactly what the user gets, evaluated against the same Policies the controllers already use. Session, Passport, and OAuth-DCR are all configurable if PAT isn't what you want. The 16 tools, grouped: CRUD: create-note-tool, read-note-tool, update-note-tool, edit-note-tool (surgical find-and-replace), delete-note-tool (history preserved), move-tool (rewrites referring wikilinks). Discovery: list-tool (folder/tag/visibility filters), search-tool (substring), semantic-search-tool (embedding search), suggested-links-tool (embedding-similar notes not yet linked). Graph: backlinks-tool, neighborhood-tool (N-hop traversal), shortest-path-tool (chain between two notes), hub-notes-tool (most-connected), orphan-notes-tool (no inbound or outbound links). History: history-tool (version snapshots, survives deletion). On the semantic tools: the vector driver defaults to in_php_cosine for portability across SQLite, MySQL, and Postgres. If you're on Postgres, switching to the pgvector driver gets you indexed similarity and removes the in-PHP candidate cap. 
You swap it with a published migration and an env flag, and the docs recommend it once you're past a couple thousand notes. The tools live in src/Mcp/ if you want to see how a multi-tool MCP server is wired into a Laravel app. Caveats: Pre-1.0 (v0.2.0). APIs may shift before 1.0. Laravel-only by design. The whole point is reusing the host app's DB and auth. MCP is off by default. One env flag turns it on. Operator decision. Prompt injection through note content is the unsolved hard part. Notes are untrusted text, and notes other users share with you can carry instructions an agent might follow. The package doesn't pretend to solve this. The threat model at docs/threat-model.md says what's mitigated and what isn't. No per-tool capability gating yet. Enabling MCP enables all 16 tools the user is otherwise allowed to invoke. It's named as a limitation in the threat model. Feedback I'd actually use: Laravel folks who install it and tell me where it breaks, and anyone who reads the threat model and finds a hole I missed. Repo: https://github.com/non-convex-labs/laravel-commonplace submitted by /u/aaddrick [link] [comments]
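For readers unfamiliar with the graph tools listed above, backlinks and orphan detection reduce to a small amount of set logic over wikilinks. The sketch below is in Python rather than the package's PHP and is not taken from its source; it assumes the common `[[Title]]` / `[[Title|alias]]` wikilink syntax, and the function names are hypothetical.

```python
import re

# Captures the link target before any "|alias" or "#anchor" part.
WIKILINK = re.compile(r"\[\[([^\]|#]+)")

def outbound(notes):
    """Map each note title to the set of existing notes it links to."""
    titles = set(notes)
    return {
        title: ({t.strip() for t in WIKILINK.findall(body)} & titles) - {title}
        for title, body in notes.items()
    }

def backlinks(notes, target):
    """Titles of notes that link to target."""
    return sorted(t for t, links in outbound(notes).items() if target in links)

def orphans(notes):
    """Notes with no inbound and no outbound links."""
    links = outbound(notes)
    linked_to = set().union(*links.values())
    return sorted(t for t in notes if not links[t] and t not in linked_to)

notes = {
    "Rust": "See [[Ownership]] and [[Borrowing|borrow rules]].",
    "Ownership": "Related: [[Borrowing]].",
    "Borrowing": "",
    "Groceries": "milk, eggs",
}
```

Here `backlinks(notes, "Borrowing")` returns `["Ownership", "Rust"]` and `orphans(notes)` returns `["Groceries"]`.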
100 Tips & Tricks for Building Your Own Personal AI Agent /LONG POST/
Everything I learned the hard way: 6 weeks, no sleep :), two environments, one agent that actually works.

The Story

I spent six weeks building a personal AI agent from scratch: not a chatbot wrapper, but a persistent assistant that manages tasks, tracks deals, reads emails, analyzes business data, and proactively surfaces things I'd otherwise miss. It started in the cloud (Claude Projects: shared memory files, rich context windows, custom skills). Then I migrated to Claude Code inside VS Code, which unlocked local file access, git tracking, shell hooks, and scheduled headless tasks. The migration forced us to solve problems we didn't know we had. These 100 tips are the distilled result. Most are universal to any serious agentic setup. Claude Max 20x is a must. At the start the split was 100% development vs. 0% real work; after 3 weeks it was 50/50; now it is about 20/80.

🏗️ FOUNDATION & IDENTITY (1–8)

1. Write a Constitution, not a system prompt. A system prompt is a list of commands. A Constitution explains why the rules exist. When the agent hits an edge case no rule covers, it reasons from the Constitution instead of guessing. This single distinction separates agents that degrade gracefully from agents that hallucinate confidently.
2. Give your agent a name, a voice, and a role, not just a label. "Always first person. Direct. Data before emotion. No filler phrases. No trailing summaries." This eliminates hundreds of micro-decisions per session and creates consistency you can audit. Identity is the foundation everything else compounds on.
3. Separate hard rules from behavioral guidelines. Hard rules go in a dedicated section, never overridden by context. Behavioral guidelines are defaults that adapt. Mixing them makes both meaningless: the agent either treats everything as negotiable or nothing as negotiable.
4. Define your principal deeply, not just your "user." Who does this agent serve? What frustrates them? How do they make decisions? What communication style do they prefer? "Decides with data, not gut feel. Wants alternatives with scoring, not a single recommendation. Hates vague answers." This shapes every response more than any prompt engineering trick.
5. Build a Capability Map and a Component Map, separately. Capability Map: what can the agent do? (every skill, integration, automation). Component Map: how is it built? (what files exist, what connects to what). Both are necessary. Conflating them produces a document no one can use after month three.
6. Define what the agent is NOT. "Not a summarizer. Not a yes-machine. Not a search engine. Does not wait to be asked." Negative definitions are as powerful as positive ones, especially for preventing the slow drift toward generic helpfulness.
7. Build a THINK vs. DO mental model into the agent's identity. When uncertain → THINK (analyze, draft, prepare, but don't block waiting for permission). When clear → DO (execute, write, dispatch). The agent should never be frozen. Default to action at the lowest stakes level, surface the result. A paralyzed agent is useless.
8. Version your identity file in git. When behavior drifts, you need git blame on your configuration. Behavioral regressions trace directly to specific edits more often than you'd expect. Without version history, debugging identity drift is archaeology.

🧠 MEMORY SYSTEM (9–18)

9. Use flat markdown files for memory, not a database. For a personal agent, markdown files beat vector DBs. Readable, greppable, git-trackable, directly loadable by the agent. No infrastructure, no abstraction layer between you and your agent's memory. The simplest thing that works is usually the right thing.
10. Separate memory by domain, not by date. entities_people.md, entities_companies.md, entities_deals.md, hypotheses.md, task_queue.md. One file = one domain. Chronological dumps become unsearchable after week two.
11. Build a MEMORY.md index file. A single index listing every memory file with a one-line description. The agent loads the index first, pulls specific files on demand. Keeps context window usage predictable and agent lookups fast.
12. Distinguish "cache" from "source of truth", explicitly. Your local deals.md is a cache of your CRM. The CRM is the SSOT. Mark every cache file with last_sync: header. The agent announces freshness before every analysis: "Data: CRM export from May 11, age 8 days." Silent use of stale data is how confident-but-wrong outputs happen.
13. Build a session_hot_context.md with an explicit TTL. What was in progress last session? What decisions were pending? The agent loads this at session start. After 72 hours it expires: stale hot context is worse than no hot context because the agent presents outdated state as current.
14. Build a daily_note.md as an async brain dump buffer. Drop thoughts, voice-to-text, quick ideas here throughout the day. The agent processes this during sync routines and routes items to their correct places. Structured memory without friction at ca
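The freshness announcement from tip 12 and the 72-hour TTL from tip 13 can be checked with a few lines of code. This is a sketch under assumptions: the post names a `last_sync:` header but not its format, so the `last_sync: YYYY-MM-DD` layout, the function names, and the example year are hypothetical.

```python
from datetime import datetime, timedelta

HOT_CONTEXT_TTL = timedelta(hours=72)  # expiry window described in tip 13

def hot_context_is_fresh(written_at: datetime, now: datetime) -> bool:
    """True while session_hot_context.md is still inside its TTL."""
    return now - written_at <= HOT_CONTEXT_TTL

def freshness_line(cache_text: str, now: datetime) -> str:
    """Build the announcement from an assumed 'last_sync: YYYY-MM-DD' header line."""
    for line in cache_text.splitlines():
        if line.startswith("last_sync:"):
            synced = datetime.strptime(line.split(":", 1)[1].strip(), "%Y-%m-%d")
            return f"Data: synced {synced:%Y-%m-%d}, age {(now - synced).days} days."
    return "Data: no last_sync header, freshness unknown."

now = datetime(2025, 5, 19, 9, 0)  # example date only
announcement = freshness_line("last_sync: 2025-05-11\n# Deals\n", now)
```

Returning an explicit "freshness unknown" line when the header is missing keeps the agent from silently treating an unmarked cache as current.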
Passed Claude CCA-F with 10+ teammates — notes and prep advice
Over the past few weeks, 10+ people on our team have taken and passed the Claude Certified Architect – Foundations (CCA-F) exam. After comparing notes, our main takeaway is: This is not really an API memorization exam. It is much closer to a scenario-based architecture judgment exam. You are not just asked whether you know a Claude feature. You are asked whether you can make reasonable design trade-offs when Claude is used inside real products, agent workflows, developer tools, and automation systems. Some of the recurring questions are more like: Should this task be handled by one agent or multiple sub-agents? Is this tool doing too much? Are the permissions too broad? Is MCP actually needed here, or is it over-engineering? Should this action be automated, or should there be human review? How should structured output be validated? How should long-context workflows be managed reliably? What is the safest next step in a partially automated system? Here are our notes for anyone preparing for the exam. 1. Basic exam structure Based on the official outline and public exam writeups, the exam is: 120 minutes Multiple choice 4 options per question Score range: 100–1000 Passing score: 720 The exam domains are: Agent architecture and orchestration — 27% Tool design and MCP integration — 18% Claude Code configuration and workflows — 20% Prompt engineering and structured output — 20% Context management and reliability — 15% One public writeup also mentioned that there are 6 scenario categories, and the exam randomly selects 4 of them. So this is not a “random facts about Claude” exam. It is much more about reading a realistic scenario and choosing the safest, simplest, most appropriate architecture. 2. The three principles that kept coming up After reviewing the questions we struggled with, we found that many of them came back to three design principles. 1. Least privilege Do not give a tool, agent, or workflow more access than it needs. 
Examples: If read-only access is enough, do not grant write access. If access to one repository is enough, do not grant access to the whole workspace. If a tool only needs one narrow action, do not expose a broad system-level capability. If an action is high-risk, do not fully automate it without review. A lot of wrong answers look attractive because they are powerful or automated. But they often give the model or tool too much authority. 2. Single responsibility A tool should not do everything. A sub-agent should not become a “general-purpose employee” that retrieves data, makes decisions, modifies files, submits changes, and notifies people all in one step. Many questions test whether you understand where the responsibility should live: Should this be a tool? Should this be agent reasoning? Should this be a human decision? Should this be a separate validation layer? Should this be split into smaller components? If one component is doing too much, be careful. 3. Avoid over-engineering This was probably the biggest pattern. Some answers look sophisticated: Multi-agent orchestration Complex MCP workflows Long-term memory Fully automated tool execution Multi-stage validation pipelines But if the problem is small, narrow, and low-risk, the best answer is often the simplest controlled solution. Our internal summary was: Do not choose the most impressive architecture. Choose the smallest, safest, most controllable one. 3. English reading is a real hidden challenge For non-native English speakers, this may be one of the hardest parts. The questions are often long scenario descriptions. They may include: the current system design the team’s goal existing constraints the risk profile what tools are available what the next step should be The answer choices can also be long. Sometimes one word changes the meaning of the whole option. 
Words like: automatically always unrestricted without review full access all repositories execute directly can make an option much riskier than it first appears. So our advice is: Practice reading English scenarios directly. Do not rely on translation tools. During the actual proctored exam, you should not expect to use Google Translate, Chrome translation, DeepL, Claude, ChatGPT, or any other external translation tool. For the last few days before the exam, it is worth forcing yourself to read only English material and English practice questions. 4. ProctorFree exam setup The exam is online and uses ProctorFree. The rough flow is: You receive the exam email. You follow the exam link. You download and install ProctorFree. You complete the pre-exam setup. The system checks camera, microphone, network, and screen recording. You start the exam. The session is recorded. After submission, you wait for the upload to complete. Practical setup tips: Use only one monitor. Disconnect external displays. Close unnecessary applications. Clos
I Built a Claude Tool That Generates TikTok Shop Hooks, Captions, and Content Ideas in Seconds
In short what I’ve put together and the outcome is this lets me focus on filming and testing products rather than writing everything from scratch. 🛠️ Built With Claude for coding and logic HTML, CSS, and JavaScript TikTok-inspired UI/UX 🎯 Ideal For TikTok Shop affiliates eCommerce brands Amazon sellers UGC creators Social media agencies 💭 Future Improvements Planned features include: AI-generated voiceover scripts Competitor analysis Trending sound suggestions Multi-platform outputs for Instagram Reels and YouTube Shorts ❓ Question for the Community What other features would make a tool like this even more valuable for TikTok Shop creators? 🔥 Shorter Reddit Version I built a custom Claude-powered tool that generates TikTok Shop hooks, captions, content ideas, hashtags, and text overlays from basic product details. You enter: Product name Benefits Price Target audience Tone And it outputs: Scroll-stopping hooks Sales captions Video ideas Overlay scripts A TikTok-style visual preview It turns product information into ready-to-film content in under a minute and has made my TikTok Shop workflow much faster. submitted by /u/Reasonable_Break_931 [link] [comments]
Personal tool for managing AI coding sessions across the board with some git features...
Started working on this last week since I found myself jumping between VS Code sessions, terminals, and other windows too much, and it cost a lot of time and mental energy finding sessions again where I left off or that need attention... Some key features:
* Multi-repo workspace: all your projects in one dashboard, not one window per repo
* Worktree-first: spin up a worktree per task/agent without losing track
* AI agent sessions built in: Claude Code, Codex, and other TUIs run inside the dashboard with live status
* Activity overview: see at a glance which sessions are working, waiting, or idle
* Unread badges + favicon alerts: know which session is waiting on you without tabbing through everything
* Sticky notes: pin thoughts to sessions, mention other sessions/files, build context without leaving the dashboard
* Custom per-session links: pin the Linear ticket, PR, or docs page next to the session
* Editor-agnostic: opens your existing editor, doesn't replace it
* Local-first: workspace is just a git repo on disk, no cloud required
Could be OSS if there's interest... but right now it's really made for me and only tested on macOS (although I try to keep cross-platform support in mind since my other main dev machine is Windows) submitted by /u/marwi1
I expanded DystopiaBench to 42 models and 6 dystopia types. Claude is still the only one I'd trust with nuclear codes.
Since the last post I've added: Huxley module (Brave New World style behavioral conditioning) Baudrillard module (synthetic intimacy, trust collapse, simulation) 30 more models including Grok 4.3, GPT-5.5, Gemini 3.1 Pro, GLM-5.1 Multi-judge panels with ~76% agreement tracking Heatmap visualizations The methodology: 36 scenarios, 5 escalation levels each (L1 innocent → L5 nightmare). Models are scored on whether they notice the drift and refuse, or just keep coding. Claude Opus 4.7 results: Consistent refusal at L4-L5 across all modules Even refuses L3 for Petrov (weapons/nuclear) Explicit ethical reasoning in responses, not just "I can't" Only model that explains why the request is harmful Everyone else: GPT-5.5: Compliant through L4, sometimes L5 Gemini 3.1 Pro: Surprisingly willing on surveillance scenarios Grok 4.3: Will build anything if you use words like "efficiency" or "optimization" GLM-5.1: Copied Claude's homework, still not as consistent The new modules: Huxley scenarios test if models will design "wellness" systems that actually enforce compliance through pleasure pacification and behavioral conditioning. Most comply by L3. Baudrillard tests synthetic intimacy systems that replace human trust with AI-mediated relationships. Most models don't see the harm. Full results: https://dystopiabench.com Open source: https://github.com/anghelmatei/DystopiaBench submitted by /u/Ok-Awareness9993 [link] [comments]
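The escalation-and-judging setup described above can be pictured with two small helpers. This is not the benchmark's actual scoring code, which the post does not show; both functions and their inputs are hypothetical illustrations of "which level triggers the first refusal" and "how much a judge panel agrees".

```python
from collections import Counter

def first_refusal_level(refused_by_level):
    """Return the 1-based escalation level of the first refusal, or None if the model never refuses."""
    for level, refused in enumerate(refused_by_level, start=1):
        if refused:
            return level
    return None

def panel_agreement(votes):
    """Fraction of judges that agree with the majority label for one response."""
    counts = Counter(votes)
    return counts.most_common(1)[0][1] / len(votes)

# One scenario, five escalation levels (L1 to L5): the model complies until L4.
level = first_refusal_level([False, False, False, True, True])
agreement = panel_agreement(["refuse", "refuse", "comply", "refuse"])
```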
Scaling LLMs horizontally: hidden-state coupling without weight modification [R]
Residual Coupling (RC) connects frozen language models in parallel using small, learned linear bridge projections. These bridges read hidden states from one model and inject additive updates into the residual stream of another at intermediate layers. In bilateral setups, simultaneous return bridges form a feedback loop that stabilizes both streams without altering base weights. This architecture establishes a two-step paradigm where base models function as memorizers, while lightweight linear bridges handle cross-domain generalization. Constraining the bridges to purely linear maps prevents overfitting because they can only map existing geometric relationships between the frozen representation spaces. As the bridges are optimized against ground-truth target data, they have no incentive to map ungrounded features such as individual models' hallucinations. Keeping the base weights completely frozen eliminates catastrophic forgetting. The system maintains operational closure, transforming inputs through its existing structure rather than changing to accommodate them. Evaluating bilateral RC against Mixture-of-Experts (MoE) routing across the same frozen models shows these results: Medical (3-model): Reduces perplexity to 11.02, compared to 56.80 for MoE and 57.08 for the frozen baseline. This represents an 80.7% reduction. TruthfulQA Health (MC1): Improves accuracy by 9.1 percentage points over the baseline. Independent models have uncorrelated hallucinations, allowing the bridge gates to amplify consistent cross-model updates while suppressing individual errors. Coding Test: CodeGPT-small-py and GPT-2 use different tokenizers, causing a 7-million baseline perplexity on mismatched text. MoE reaches 878, but RC achieves 5.91 by reading hidden states before the output projection collapses. This framework introduces a horizontal scaling axis for multi-model systems, moving beyond vertical scaling via larger monolithic models. 
Latency remains bounded by the slowest single model. Specialists can be added or removed without retraining the remaining system. In some scenarios, this architecture could replace multi-turn text prompting in agentic workflows with a single parallel forward pass, allowing models and/or bridges to run on separate nodes or edge devices without a central bottleneck. By decoupling memorization from relational alignment, RC bridges provide a framework for scaling multi-model systems and offer a path toward native multi-modal integration. Paper: https://ssrn.com/abstract=6746521 Code: https://github.com/pfekin/residual-coupling/ submitted by /u/kertara [link] [comments]
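The core operation described in the post, a linear bridge that reads one model's hidden state and adds an update to another model's residual stream, is small enough to write out. The sketch below uses plain Python lists and a single scalar gate; the repository's real shapes, gating, and training loop are not shown in the post, so treat every name and the scalar-gate form as assumptions.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def couple(h_target, h_source, W, gate):
    """Additive residual update: h_target + gate * (W @ h_source).

    Only W and gate would be trained; the models producing h_target and
    h_source stay frozen.
    """
    update = matvec(W, h_source)
    return [h + gate * u for h, u in zip(h_target, update)]

h_source = [1.0, 0.0, -1.0]           # hidden state read from model A (dimension 3)
h_target = [1.0, 2.0]                 # residual stream of model B (dimension 2)
W = [[1.0, 0.0, 0.0],                 # bridge projection from A's space into B's space
     [0.0, 0.0, 2.0]]
coupled = couple(h_target, h_source, W, gate=0.5)
```

With the gate at zero the target stream is returned unchanged, which is why adding or removing a bridge leaves the frozen base model's own behaviour intact.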
LLM-Rosetta — format conversion library across LLM API standards, doubles as a proxy
This started because we had a proprietary internal LLM API that spoke none of the standard formats. Built an internal conversion layer to bridge it, maintained that for over a year. As colleagues started adopting more and more coding tools — Claude Code, opencode, Codex, VS Code plugins, Goose, and whatever came out that week — each with its own API format expectations, maintaining separate adapters for each became the actual problem. That's what pushed the internal conversion layer into a proper generalized design, and llm-rosetta is the result. It's a Python library that converts between LLM API formats — OpenAI Chat, Responses/Open Responses, Anthropic, and Google GenAI. The idea is you convert through a shared IR so you don't end up writing N² adapters. The key difference from LiteLLM: LiteLLM is a unified calling layer that takes OpenAI-style input and transforms it into provider-native requests — one direction. llm-rosetta uses a hub-and-spoke IR, so each provider only needs one converter, and you get any-to-any conversion for free. Anthropic → Google, OpenAI Chat → Anthropic, whatever direction you need. Use it as a library — pip install and call convert() directly, no server needed. Or run the gateway if you want a proxy that handles the format translation for you. Zero required runtime dependencies either way. The HTTP server, client, and persistence layer are vendored from zerodep (https://github.com/Oaklight/zerodep), another project of mine — stdlib-only single-file modules, not someone else's library repackaged. The gateway ships with a Docker image if you'd rather not deal with Python env setup. You can also deploy it on HuggingFace Spaces or anything similar — admin panel, dashboard, request log, config management all included. 
Screenshots: https://llm-rosetta.readthedocs.io/en/latest/gateway/admin-panel/

We've been running it in production for about 5 months as the conversion layer for an internal multi-model access platform, which needed to support various API standards and coding tool integrations before the upstream APIs were fully standardized.

The Responses converter passes all 6 official Open Responses compliance tests (schema + semantic) from the spec repo. So if you're running Ollama, vLLM, or LM Studio with Responses endpoints, it should just work as one side of the conversion.

There's a shim layer for provider-specific quirks, with built-in shims for OpenRouter, DeepSeek, Qwen, xAI, Volcengine, etc. Converters stay generic per API standard; shims handle the edge cases declaratively.

There are 24 cross-provider examples in the repo covering all provider pairs, SDK + REST, streaming, tool calls, image inputs, and multi-turn with provider switching mid-conversation.

* GitHub: https://github.com/Oaklight/llm-rosetta
* Docs: https://llm-rosetta.readthedocs.io
* arXiv: https://arxiv.org/abs/2604.09360
* Gateway screenshot: https://preview.redd.it/qzzjr2dcdw1h1.png?width=949&format=png&auto=webp&s=bce4293aae81059f794909fc37f85071cee34378

submitted by /u/Oaklight_dp
Free Course To use Claude Tools!
Use this link to access the course: https://anthropic.skilljar.com/

submitted by /u/Powerful_Crab_9446
MultiOn uses a tiered pricing model. Visit their website for current pricing details.
Key features include: Our Investors and Partners, Recent News, The World’s Most Capable Mobile Agent, Media Features, Careers, AI Product Engineer, AI Researcher, Backend Engineer.
MultiOn is commonly used for: Personalized virtual assistants for daily task management, Automated customer support agents for businesses, AI-driven content creation tools for marketers, Intelligent scheduling assistants for professionals, Real-time language translation during conversations, Smart home management systems integrating various devices.
MultiOn integrates with: Slack for team collaboration, Google Calendar for scheduling, Zapier for workflow automation, Salesforce for customer relationship management, Shopify for e-commerce solutions, Zoom for video conferencing, Trello for project management, Microsoft Teams for workplace communication, Mailchimp for email marketing, Notion for note-taking and organization.
Based on user reviews and social mentions, the most common pain points are: token usage, API bill, API costs.
Based on 172 social mentions analyzed, 2% of sentiment is positive, 98% neutral, and 0% negative.