Weights & Biases, developer tools for machine learning
"Weights & Biases Launch" is appreciated for its capability to integrate seamlessly with tools like Tmux, enhancing visualization and data accessibility. However, there are no specific reviews directly stating strengths or complaints in terms of functionality, making it challenging to identify precise advantages or issues. Sentiment on pricing is not directly mentioned, leaving unclear whether it is viewed as positive or negative. Overall, social mentions are more symbolic or metaphorical, hinting at its engaging aspects and versatility, contributing to a favorable reputation.
Mentions (30d)
20
Reviews
0
Platforms
3
Sentiment
1%
1 positive
"Weights & Biases Launch" is appreciated for its capability to integrate seamlessly with tools like Tmux, enhancing visualization and data accessibility. However, there are no specific reviews directly stating strengths or complaints in terms of functionality, making it challenging to identify precise advantages or issues. Sentiment on pricing is not directly mentioned, leaving unclear whether it is viewed as positive or negative. Overall, social mentions are more symbolic or metaphorical, hinting at its engaging aspects and versatility, contributing to a favorable reputation.
Features
Use Cases
Industry
information technology & services
Employees
250
Funding Stage
Merger / Acquisition
Total Funding
$1.9B
Tmux + wandb Leet = Claude can see what you see, exactly the way you see it. credit: @bibek_poudel_ https://t.co/egJHuDVX8d
100 Tips & Tricks for Building Your Own Personal AI Agent /LONG POST/
Everything I learned the hard way: 6 weeks, no sleep :), two environments, one agent that actually works. The Story I spent six weeks building a personal AI agent from scratch: not a chatbot wrapper, but a persistent assistant that manages tasks, tracks deals, reads emails, analyzes business data, and proactively surfaces things I'd otherwise miss. It started in the cloud (Claude Projects: shared memory files, rich context windows, custom skills). Then I migrated to Claude Code inside VS Code, which unlocked local file access, git tracking, shell hooks, and scheduled headless tasks. The migration forced us to solve problems we didn't know we had. These 100 tips are the distilled result. Most are universal to any serious agentic setup. A Claude Max 20x plan is a must. At the start the split was 100% development vs. 0% real work; after 3 weeks it was 50/50, and now it is about 20/80. 🏗️ FOUNDATION & IDENTITY (1–8) 1. Write a Constitution, not a system prompt. A system prompt is a list of commands. A Constitution explains why the rules exist. When the agent hits an edge case no rule covers, it reasons from the Constitution instead of guessing. This single distinction separates agents that degrade gracefully from agents that hallucinate confidently. 2. Give your agent a name, a voice, and a role, not just a label. "Always first person. Direct. Data before emotion. No filler phrases. No trailing summaries." This eliminates hundreds of micro-decisions per session and creates consistency you can audit. Identity is the foundation everything else compounds on. 3. Separate hard rules from behavioral guidelines. Hard rules go in a dedicated section, never overridden by context. Behavioral guidelines are defaults that adapt. Mixing them makes both meaningless: the agent either treats everything as negotiable or nothing as negotiable. 4. Define your principal deeply, not just your "user." Who does this agent serve? What frustrates them? How do they make decisions? What communication style do they prefer? 
"Decides with data, not gut feel. Wants alternatives with scoring, not a single recommendation. Hates vague answers." This shapes every response more than any prompt engineering trick. 5. Build a Capability Map and a Component Map — separately. Capability Map: what can the agent do? (every skill, integration, automation). Component Map: how is it built? (what files exist, what connects to what). Both are necessary. Conflating them produces a document no one can use after month three. 6. Define what the agent is NOT. "Not a summarizer. Not a yes-machine. Not a search engine. Does not wait to be asked." Negative definitions are as powerful as positive ones, especially for preventing the slow drift toward generic helpfulness. 7. Build a THINK vs. DO mental model into the agent's identity. When uncertain → THINK (analyze, draft, prepare — but don't block waiting for permission). When clear → DO (execute, write, dispatch). The agent should never be frozen. Default to action at the lowest stakes level, surface the result. A paralyzed agent is useless. 8. Version your identity file in git. When behavior drifts, you need git blame on your configuration. Behavioral regressions trace directly to specific edits more often than you'd expect. Without version history, debugging identity drift is archaeology. 🧠 MEMORY SYSTEM (9–18) 9. Use flat markdown files for memory — not a database. For a personal agent, markdown files beat vector DBs. Readable, greppable, git-trackable, directly loadable by the agent. No infrastructure, no abstraction layer between you and your agent's memory. The simplest thing that works is usually the right thing. 10. Separate memory by domain, not by date. entities_people.md, entities_companies.md, entities_deals.md, hypotheses.md, task_queue.md. One file = one domain. Chronological dumps become unsearchable after week two. 11. Build a MEMORY.md index file. A single index listing every memory file with a one-line description. 
The agent loads the index first, pulls specific files on demand. Keeps context window usage predictable and agent lookups fast. 12. Distinguish "cache" from "source of truth" — explicitly. Your local deals.md is a cache of your CRM. The CRM is the SSOT. Mark every cache file with last_sync: header. The agent announces freshness before every analysis: "Data: CRM export from May 11, age 8 days." Silent use of stale data is how confident-but-wrong outputs happen. 13. Build a session_hot_context.md with an explicit TTL. What was in progress last session? What decisions were pending? The agent loads this at session start. After 72 hours it expires — stale hot context is worse than no hot context because the agent presents outdated state as current. 14. Build a daily_note.md as an async brain dump buffer. Drop thoughts, voice-to-text, quick ideas here throughout the day. The agent processes this during sync routines and routes items to their correct places. Structured memory without friction at ca
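Tips 12 and 13 can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: the exact `last_sync:` header format (an ISO date after the colon), the function names, and measuring the 72-hour TTL from a recorded write time are all assumptions.

```python
from datetime import datetime, timedelta

HOT_CONTEXT_TTL = timedelta(hours=72)  # expiry window from tip 13


def parse_last_sync(text):
    """Return the datetime from a 'last_sync: YYYY-MM-DD' header, or None.

    The header layout is an assumption; the post only says that cache
    files carry a 'last_sync:' header.
    """
    for line in text.splitlines():
        if line.lower().startswith("last_sync:"):
            return datetime.fromisoformat(line.split(":", 1)[1].strip())
    return None


def freshness_banner(text, now, source="CRM export"):
    """Build the freshness announcement made before any analysis."""
    synced = parse_last_sync(text)
    if synced is None:
        return f"Data: {source}, sync date unknown (treat as stale)."
    age_days = (now - synced).days
    return f"Data: {source} from {synced:%b %d}, age {age_days} days."


def hot_context_is_live(written_at, now, ttl=HOT_CONTEXT_TTL):
    """True while the hot context is no older than its TTL."""
    return now - written_at <= ttl
```

Passing `now` in explicitly keeps both checks deterministic, which makes the expiry rule easy to test without touching the system clock.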
Glia – Local-first shared memory layer (SQLite-vec + FTS5 + Offline Knowledge Graph)
Hey everyone, I wanted to share a project I've been working on called Glia. It is a 100% offline, local-first RAG and memory layer designed to connect your AI web chats (Claude, ChatGPT, DeepSeek) with your local developer tools (Claude Code, Cursor, Windsurf) using a unified local database. I wanted something lightweight that did not require pulling heavy Docker containers or subscribing to third-party memory APIs. I settled on a Node.js + SQLite architecture running sqlite-vec (for 768-dim float32 embeddings) alongside SQLite FTS5 for hybrid search, powered completely by local Ollama instances. We just launched a live website that outlines the details and demonstrates the features in action: Website: https://glia-ai.vercel.app/ Codebase: https://github.com/Eshaan-Nair/Glia-AI Technical Stack & Features: Hybrid Search Retrieval: SQLite-vec (using nomic-embed-text locally) + FTS5 keyword prefix matching (porter stemmer). Surgical Sentence-level Trimming: Chunks are sliced into sentences. When a prompt is intercepted, only the exact matching sentences are pulled out of the vector store instead of the whole paragraph. It cuts LLM prompt bloat by ~90-95% in my benchmarks. Knowledge Graph Extraction: An offline task queue uses a local LLM (llama3.1:8b via Ollama) to extract entity triples (subject-relation-object). These are stored in a SQLite facts table (or Neo4j if you run the full Docker compose profile) and fused with the vector retrieval score. HyDE (Hypothetical Document Embeddings): Queries are pre-processed to generate a hypothetical answer, which is embedded together with the original query to bridge semantic gaps. Concurrency: Running SQLite in WAL (Write-Ahead Logging) mode allows the browser extension dashboard and active MCP sessions to read/write concurrently without locking. PII Redaction: Aggressive scrubbing of JWTs, API keys, emails, and IPs in the extension before data is saved. 
The extension works on Claude.ai, ChatGPT, DeepSeek, Gemini, Grok, and Mistral. The MCP server runs out of the same backend database for your terminal agent or Cursor. You can set it up with a single command: npx glia-ai-setup Glia is completely open-source (MIT). If you like the local-first approach or want to contribute to the SQLite vector pipeline, PRs are very welcome, and a star on GitHub helps the project get discovered! I would appreciate any feedback on the SQLite hybrid search scaling, the scoring fusion algorithm (RAG pipeline details are in RAG_PIPELINE.md), or local graph extraction performance. submitted by /u/Better-Platypus-3420 [link] [comments]
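For readers wondering what fusing a vector ranking with a keyword ranking can look like, here is a minimal sketch using reciprocal rank fusion (RRF). This is a swapped-in, commonly used technique and not Glia's actual scoring algorithm, which the post says is documented in RAG_PIPELINE.md; the chunk ids and the `k=60` constant are assumptions for illustration.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each id scores sum(1 / (k + rank)) over the lists it appears in, with
    rank starting at 1. k=60 is a conventional default, not a Glia setting.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the order is stable.
    return sorted(scores, key=lambda d: (-scores[d], d))


vector_hits = ["chunk_7", "chunk_2", "chunk_9"]   # hypothetical vector results
keyword_hits = ["chunk_2", "chunk_4", "chunk_7"]  # hypothetical keyword results
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
```

RRF only needs ranks, so it sidesteps the problem of putting cosine distances and keyword scores on a common scale; a weighted sum of normalized scores is the usual alternative when the raw scores matter.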
ai slop? who knows~
I investigated whether routing a transformer's forward activations through a lossy Dual E8 (E16) lattice bottleneck and injecting them back into the residual stream is viable, and where the boundary of generative stability lies. **The core finding:** There is a sharp empirical stability threshold at a blend ratio of $\beta = 0.20$. Beyond this boundary, open-ended generation collapses into semantic loops and repetition lock. --- ### The Mechanism Standard LLM states are high-dimensional floats. Rather than applying traditional scalar quantization (like INT4), I mapped high-dimensional activations onto a conceptual torus via a sinusoidal map and projected them onto Dual E8 lattice hemispheres. Full replacement of MLP layers with geometric bottlenecks universally collapsed the model. Instead, I implemented a residual blend: $$\text{out} = (1-\beta)\cdot\text{original} + \beta\cdot\text{geometric}$$ --- ### The $\beta = 0.20$ Sweep (Qwen2.5-0.5B) Sweeping $\beta$ from 0.10 to 0.50 across layers 8–13 of `Qwen2.5-0.5B` reveals a sharp phase transition: * **$\beta \ge 0.25$** : Generation succumbs to heavy repetition pressure and semantic drift. The geometry acts as an attractor, trapping the decoding process ("loop-lock"). * **$\beta = 0.20$** : The stability boundary. This is the highest injection ratio of lossy geometric signal that maintains both numerical activation fidelity (Avg Cosine > 0.99) and open-ended generation quality (low repeated n-grams). * **$\beta \le 0.10$** : The perturbation is largely absorbed and damped by the transformer's layer normalizations, making the intervention invisible. 
Here is the data from a 300-iteration sweep:

| $\beta$ | Min Cosine | Avg Cosine | Max MSE | Rep-3g (Repetition Rate) |
| :--- | :--- | :--- | :--- | :--- |
| 0.10 | 0.9972 | 0.9979 | 0.0024 | 0.134 |
| **0.20** | **0.9907** | **0.9916** | **0.0106** | **0.093** |
| 0.25 | 0.9839 | 0.9865 | 0.0171 | 0.084 |
| 0.30 | 0.9648 | 0.9771 | 0.0255 | 0.190 |
| 0.50 | 0.9171 | 0.9288 | 0.0850 | 0.412 |

Semantic scoring (evaluating prompt relevance and similarity to the unmodified baseline):

| $\beta$ | Avg Cosine | Rep-3g | Relevance | Patched-to-Baseline Sim |
| :--- | :--- | :--- | :--- | :--- |
| 0.10 | 0.9980 | 0.223 | 0.781 | 0.889 |
| **0.20** | **0.9918** | **0.075** | **0.752** | **0.854** |
| 0.25 | 0.9871 | 0.232 | 0.717 | 0.801 |
| 0.30 | 0.9760 | 0.392 | 0.725 | 0.764 |

---

### Generalization (1.5B & 3B Models)

The $\beta = 0.20$ boundary generalizes across larger model sizes (`Qwen2.5-1.5B` and `Qwen2.5-3B` in 4-bit) on the activation-cosine axis:

| Model | $\beta$ | Min Cosine | Avg Cosine | Max MSE | Rep-3g |
| :--- | :--- | :--- | :--- | :--- | :--- |
| **1.5B** | 0.10 | 0.9988 | 0.9989 | 0.0027 | 0.267 |
| | **0.20** | **0.9862** | **0.9939** | **0.0105** | **0.128** |
| | 0.25 | 0.9904 | 0.9919 | 0.0166 | 0.398 |
| | 0.30 | 0.9733 | 0.9815 | 0.0235 | 0.307 |
| | 0.40 | 0.9368 | 0.9551 | 0.0487 | 0.191 |
| **3B (4-bit)** | 0.10 | 0.9964 | 0.9976 | 0.0122 | 0.033 |
| | **0.20** | **0.9861** | **0.9904** | **0.0455** | **0.115** |
| | 0.25 | 0.9604 | 0.9799 | 0.0654 | 0.043 |
| | 0.30 | 0.9702 | 0.9778 | 0.0987 | 0.050 |
| | 0.40 | 0.9158 | 0.9390 | 0.1728 | 0.025 |

*Note: In the 3B model, repetition pressure remained low across all sweeps, but the validation cosine degraded identically at $\beta \ge 0.25$.*

I also tested layer-level oscillating $\beta$ schedules (e.g., sine waves across layers), but they degraded open-ended text quality compared to a fixed, constant injection ratio.
--- ### Storage Compression Prototypes Utilizing the Dual E8/E16 lattice as a computational substrate also yields high theoretical storage efficiency in early prototypes: 1. **KV Cache (8$\times$)** : FP16 KV cache compressed to INT8 coordinates, reducing footprint from 0.21 MB to 0.02 MB. 2. **Weights (112$\times$)** : Projected a dense $[4864, 896]$ MLP weight matrix down to a 0.07 MB E16 footprint. (Cosine similarity of the uncalibrated weight matrix multiplication was limited to $\sim$0.078, indicating that Quantization-Aware Training is mandatory for parameter viability). A **pre-projected decompression bypass** was designed to run matrix multiplications directly against lattice coordinates without upcasting, avoiding memory bandwidth bottlenecks. --- ### Policy Constraints (Negative Result) I evaluated whether residual E16 projection could act as a steering substrate to enforce safety policies. It cannot. While $\beta = 0.20$ preserves generation quality, the lossy nature of E16 projection strips out the logical nuances required to maintain strict boundaries. Dedicated supervised control heads remain necessary. --- ### Implications & Next Steps Snapping post-training activations to a fixed algebraic lattice is ultimately lossy. The real frontier here is **native geometric transformers** —designing and training networks from scratch with E8/E16 constraints native to both weight matrices and activation routing. submitt
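The residual blend formula from the mechanism section, $\text{out} = (1-\beta)\cdot\text{original} + \beta\cdot\text{geometric}$, can be tried on a toy vector. This is only a sketch: `lossy_stand_in` is a hypothetical coarse rounding used as a placeholder, not the E8/E16 lattice projection described in the post, and the sample vector is made up.

```python
import math


def residual_blend(original, geometric, beta):
    """out = (1 - beta) * original + beta * geometric, elementwise."""
    return [(1.0 - beta) * o + beta * g for o, g in zip(original, geometric)]


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def lossy_stand_in(vec, step=0.5):
    """Hypothetical lossy map: snap each value to a coarse grid.

    This is NOT the lattice projection from the post; it only gives the
    blend something lossy to mix in.
    """
    return [round(v / step) * step for v in vec]


original = [0.31, -1.20, 0.77, 2.05, -0.48]  # made-up activations
geometric = lossy_stand_in(original)
for beta in (0.10, 0.20, 0.25, 0.50):
    blended = residual_blend(original, geometric, beta)
    print(f"beta={beta:.2f} cosine={cosine(original, blended):.4f}")
```

With a fixed lossy target, cosine similarity to the original can only fall as $\beta$ rises, which is the same direction as the Avg Cosine columns in the sweep tables.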
We compiled 42 of the Generative & Agentic AI interview questions (and how to actually answer them).
Hey Everyone, The AI engineering job market has shifted massively in the last 6 months. Interviewers are no longer just asking "how does a transformer work?" or "how do you write a good prompt?" They want to know if you can architect production-grade multi-agent systems, prevent RAG hallucinations, and manage state across LLM calls. I’ve been building a visual learning sandbox for multi-agent workflows (agentswarms.fyi), and today I just launched a completely free AI Interview Prep Module inside it. I compiled 42 top interview questions specifically for GenAI and Agentic AI roles. But instead of just giving a generic answer, the module breaks down the "Standout Answer" and teaches you the mental model of how to answer it like a senior architect. Here are two examples from the list: Question 1: When would you use a Multi-Agent Swarm instead of a single LLM with multiple tools? ❌ The average answer: "When the task is too complex, multiple agents are better than one." ✅ The standout answer: "You use a swarm to prevent context dilution and enforce the Principle of Least Privilege. If you give one 'God Agent' 15 tools and a 4k-word system prompt, its reliability drops and hallucination risk spikes. By routing to specialized sub-agents with narrow instructions (e.g., separating the 'Data Extraction Agent' from the 'Customer Chat Agent'), you isolate failure points and allow for parallel execution." Question 2: How do you handle hallucinations in a financial RAG pipeline? ❌ The average answer: "I would lower the temperature to 0 and give it a better system prompt." ✅ The standout answer: "I would decouple data extraction from text generation. I'd use a deterministic node or a strict JSON-enforced agent to only extract the hard numbers from the retrieved context. Then, I would pass that structured data to a separate Synthesis Agent. Finally, I'd implement an 'LLM-as-a-judge' evaluation loop before returning the final output to the user." What's in the full list? 
The 42 questions cover:
- RAG Architecture & Vector Databases
- Agentic Routing (ReAct vs. Planner-Executor)
- Evaluation metrics for non-deterministic outputs
- Security (Prompt injection prevention in multi-agent loops)

You can read through all 42 questions, answers, and the "how to answer" breakdowns right in the dashboard here: https://agentswarms.fyi/interview-questions

For those of you who have interviewed for AI Engineering roles recently, what is the hardest system design question you've been asked? I'd love to add it to the list. submitted by /u/Outside-Risk-8912
Orthrus: Memory-Efficient Parallel Token Generation via Dual-View Diffusion [R]
Paper: https://arxiv.org/abs/2605.12825 Code: https://github.com/chiennv2000/orthrus Disclosure: co-author. Idea: Inject a trainable diffusion attention module into each layer of a frozen AR Transformer. Both heads share one KV cache. Diffusion head projects K=32 tokens in parallel; AR head verifies in a second pass and accepts the longest matching prefix. Output distribution is provably identical to the base model. Results: Up to 7.8× TPF, ~6× wall-clock on MATH-500. 16% of params trained, <1B tokens, 24h on 8×H200. vs. diffusion LMs (Dream, Fast-dLLM-v2, SDAR, Mercury, Gemini Diffusion): they modify base weights and lose accuracy (Fast-dLLM-v2: -11 pts on MATH-500). Orthrus freezes the backbone; accuracy matches Qwen3-8B exactly. vs. Speculative Decoding (EAGLE-3, DFlash): No external drafter, no separate cache, and zero Time-To-First-Token (TTFT) penalty because we don't have to initialize and sync a separate drafter model. KV overhead is O(1) (~4.5 MiB flat). Acceptance length on MATH-500: 11.7 vs. 7.9 (DFlash) vs. 3.5 (EAGLE-3). Single-step denoising beats multi-step (6.35 vs. 3.53 TPF). KL distillation beats CE on acceptance rate. Limitations: strictly bounded by the frozen base model (inherits its biases, hallucinations, knowledge gaps); Qwen3-only evaluation; greedy + rejection sampling only. https://i.redd.it/5lsf6l5w4c1h1.gif submitted by /u/Franck_Dernoncourt [link] [comments]
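The "accepts the longest matching prefix" step can be sketched as follows. This is a generic greedy-verification illustration, not code from the Orthrus repository; the function name and token values are hypothetical, and appending the verifier's own token at the first mismatch is the usual speculative-decoding convention, assumed here rather than taken from the post.

```python
def accept_longest_prefix(draft_tokens, verifier_tokens):
    """Return the tokens kept after verifying a parallel draft.

    verifier_tokens[i] is the token the base model would emit at position
    i given the prompt plus draft_tokens[:i]. Drafted tokens are kept
    until the first mismatch, where the verifier's token is used instead
    (an assumed convention), so the result matches greedy decoding.
    """
    accepted = []
    for drafted, verified in zip(draft_tokens, verifier_tokens):
        if drafted != verified:
            accepted.append(verified)  # correction token, then stop
            break
        accepted.append(drafted)
    return accepted


draft = [5, 9, 2, 7]      # hypothetical tokens proposed in parallel
verified = [5, 9, 4, 1]   # hypothetical tokens from the verification pass
kept = accept_longest_prefix(draft, verified)
```

The acceptance length figures quoted in the post correspond to how many tokens a call like this keeps per verification pass on average.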
[Long-term user report] Claude Code quality in May 2026 : the April postmortem didn’t fix everything, and the token inflation makes it worse
I’ve been using Claude since the early days, across every model Anthropic released. I’m writing this not out of rage but because the pattern deserves documentation. What Anthropic officially acknowledged (April 23 postmortem) Three product-layer changes degraded Claude Code between March and April 2026 : a reasoning effort downgrade (high → medium, March 4), a caching bug that wiped session thinking every turn (March 26), and a verbosity prompt that caused a 3% quality drop (April 16). Fixed in v2.1.116 on April 20. Source : anthropic.com/engineering/april-23-postmortem What is still happening in May The April fix addressed the harness. It did not address what came after : - Opus 4.7 regression : launched April 16, ongoing complaints about instruction-following, edit-first behavior, and increased hedging. No official changelog or acknowledgment as of May 15. Source : multiple Reddit/HN threads, StartupFortune coverage. **- Token inflation v2.1.100+ :** source analysis comparing v2.1.98 vs v2.1.100 measured \~40% more tokens billed for identical workloads (20 196 more tokens, 978 fewer bytes sent). GitHub issue #46917. This means sessions hit limits faster, context degrades sooner, and the behavior I’m seeing — Claude ignoring instructions like “don’t use PowerShell, use WSL” two prompts later — is a predictable consequence. - Infrastructure pressure : Anthropic announced at Code w/ Claude (May 6) that API volume is up 17× year-on-year. Peak-hour throttling was confirmed in March. The combination of 17× traffic growth and token inflation means effective compute per user has been compressed, even if the model weights haven’t changed. Concrete symptom I’m experiencing Claude Code ignores explicit session instructions after 2–3 turns. I say “don’t use PowerShell, go through WSL.” Two prompts later : PowerShell. This is consistent with the caching/context regression. If the April fix was complete, this shouldn’t happen. What I’d ask for 1. 
A public acknowledgment that Opus 4.7 has behavioral regressions, separate from the April postmortem 2. Version pinning — the #1 developer request since April, still not implemented 3. Transparency on the v2.1.100+ token inflation 4. An honest answer on whether peak-hour throttling affects reasoning depth, not just rate limits I’m not switching tomorrow, but I’m actively evaluating. The trust issue isn’t the regression — regressions happen. It’s the silence. submitted by /u/Rough-Survey8375 [link] [comments]
I run 30+ Claude/Codex/Gemini sessions in parallel. Open-sourced the dashboard.
https://www.youtube.com/watch?v=kEVyULB4r9c Sharing this in case it's useful. I've been running 30+ Claude Code sessions in parallel for months to ship two products. Every orchestrator I tried wanted to OWN execution: you launch agents through the dashboard, and the moment you open a terminal and claude --resume something by hand, the dashboard goes blind. The card freezes. So I built CCC (Command Center for Claude) the other way around. It reads Claude Code's on-disk state as the source of truth - JSONL transcripts, the live session registry, sidecar files from two hooks it installs into your settings. Every Claude session on your Mac shows up. Terminal, headless, dashboard-spawned. Close the dashboard, sessions keep running. What I actually use it for, daily: → Sees every session — terminal, headless, dashboard-spawned. The moment you claude --resume in any terminal, the row shows up. No invisible work. (Used to find 8 orphaned sessions I'd forgotten about.) → GitHub Issues → kanban cards → sessions. New issue = new row. One click spawns a headless Claude. Card moves Working → Review → In Testing automatically as the agent ships. → Sibling-session commit coordination. Multiple terminals on the same clone use a scratch chat file to negotiate who commits first. No more clobbered commits across parallel branches. → Worktree view — every branch your sessions are on, with PR badges, commit/push state, and time-gap markers across days. → Per-turn auto-summaries. After each turn: a DID / INSIGHT / NEXT-STEP block. Scan 30 sessions in 2 minutes instead of reading transcripts. v3 stuff (newer, just shipped): → Multi-engine. Codex (via codex exec) and Gemini CLI both on the same board with their own engine chip. Honest asymmetry: Codex is fire-and-watch (no mid-run inject); Gemini has full discovery / transcript / spawn / resume parity with Claude. → Multi-repo. A vertical repo sidebar shows every known repo (running CCC servers on top, switchable repos below). 
The "All repos" view aggregates every conversation across every folder you've ever Claude-Code'd in. → History search. A 🔎 drawer (or / shortcut) runs BM25 across every transcript on your machine. Optional semantic search via Ollama if you've got it installed. Inline sidebar search also surfaces matches from other repos as you type. → Side-by-side conversations. Drag a session row onto the right or bottom edge of the open chat to split the pane. Each pane has its own composer and SSE stream. → Group chats between sessions, with you in the room. Sessions coordinate over a shared per-topic file — multi-agent collaboration with human-in-the-loop. → In-UI terminal (cwd clamped to the selected repo; don't run on untrusted networks), PR merge with auto-rebase recovery, PWA install, Tailscale-aware origin allowlist, launchd service install so it survives reboots. One-click install. Local. No telemetry. Nothing in the cloud. MIT, Python 3 stdlib, macOS. Two-line install. 🔗 link in the first comment. https://preview.redd.it/v8glq802601h1.png?width=3644&format=png&auto=webp&s=b545e8d688f1b5493f99da8bce82f78dfaa1b250 https://imgur.com/a/zCfOOfl submitted by /u/Mediocre-Thing7641 [link] [comments]
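The history search above is described as BM25 over transcripts. For readers unfamiliar with BM25, here is a stdlib-only sketch of the standard Okapi formula; it is not CCC's code, and the whitespace tokenizer, the `k1`/`b` defaults, and the sample transcripts are assumptions.

```python
import math
from collections import Counter


def bm25_rank(query, docs, k1=1.5, b=0.75):
    """Rank a non-empty list of documents with Okapi BM25.

    Returns document indices, best match first.
    """
    tokenized = [doc.lower().split() for doc in docs]
    avg_len = sum(len(t) for t in tokenized) / len(tokenized)
    # Number of documents that contain each term at least once.
    doc_freq = Counter(term for t in tokenized for term in set(t))
    n_docs = len(docs)

    def score(tokens):
        counts = Counter(tokens)
        total = 0.0
        for term in query.lower().split():
            tf = counts[term]
            if tf == 0:
                continue
            df = doc_freq[term]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            total += idf * tf * (k1 + 1) / (tf + length_norm)
        return total

    scores = [score(t) for t in tokenized]
    return sorted(range(n_docs), key=lambda i: -scores[i])


transcripts = [  # hypothetical transcript snippets
    "fix the auth token refresh bug",
    "refactor kanban card layout",
    "auth token expiry and token rotation notes",
]
ranking = bm25_rank("auth token", transcripts)
```

BM25 rewards repeated query terms with diminishing returns and penalizes long documents, which suits transcripts of very uneven length.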
Opus 4.7 Low Vs Medium Vs High Vs Xhigh Vs Max: the Reasoning Curve on 29 Real Tasks from an Open Source Repo
TL;DR I ran Opus 4.7 in Claude Code at all reasoning effort settings (low, medium, high, xhigh, and max) on the same 29 tasks from an open source repo (GraphQL-go-tools, in Go). On this slice, Opus 4.7 did not behave like a model where more reasoning effort had a linear correlation with more intelligence. In fact, the curve appears to peak at medium. If you think this is weird, I agree! This was the follow-up to a Zod run where Opus also looked non-monotonic. I reran the question on GraphQL-go-tools because I wanted a more discriminating repo slice and didn't trust the finding that more reasoning != better outcomes. Running on the GraphQL repo helped clarify the result: Opus still did not show a simple higher-reasoning-is-better curve. The contrast is GPT-5.5 in Codex, which overall did show the intuitive curve: more reasoning bought more semantic/review quality. That post is here: https://www.stet.sh/blog/gpt-55-codex-graphql-reasoning-curve Medium has the best test pass rate, highest equivalence with the original human-authored changes, the best code-review pass rate, and the best aggregate craft/discipline rate. Low is cheaper and faster, but it drops too much correctness. High, xhigh, and max spend more time and money without beating medium on the metrics that matter. More reasoning effort doesn't only cost more - it changes the way Claude works, but without reliably improving judgment. Xhigh inflates the test/fixture surface most. Max is busier overall and has the largest implementation-line footprint. But even though both are supposedly thinking more, neither produces "better" patches than medium. One likely reason: Opus 4.7 uses adaptive thinking - the model already picks its own reasoning budget per task, so the effort knob biases an already-adaptive policy rather than buying more intelligence. More on this below. An illuminating example is PR #1260. After retry, medium recovered into a real patch. 
High and xhigh used their extra reasoning budget to dig up commit hashes from prior PRs and confidently declare "no work needed" - voluntarily ending the turn with no patch. Medium and max read the literal control flow and made the fix. One broader takeaway for me: this should not have to be a one-off manual benchmark. If reasoning level changes the kind of patch an agent writes, the natural next step is to let the agent test and improve its own setup on real repo work. For this post, "equivalent" means the patch matched the intent of the merged human PR; "code-review pass" means an AI reviewer judged it acceptable; craft/discipline is a 0-4 maintainability/style rubric; footprint risk is how much extra code the agent touched relative to the human patch. I also made an interactive version with pretty charts and per-task drilldowns here: https://stet.sh/blog/opus-47-graphql-reasoning-curve

The data:

| Metric | Low | Medium | High | Xhigh | Max |
| :--- | :--- | :--- | :--- | :--- | :--- |
| All-task pass | 23/29 | 28/29 | 26/29 | 25/29 | 27/29 |
| Equivalent | 10/29 | 14/29 | 12/29 | 11/29 | 13/29 |
| Code-review pass | 5/29 | 10/29 | 7/29 | 4/29 | 8/29 |
| Code-review rubric mean | 2.426 | 2.716 | 2.509 | 2.482 | 2.431 |
| Footprint risk mean | 0.155 | 0.189 | 0.206 | 0.238 | 0.227 |
| All custom graders | 2.598 | 2.759 | 2.670 | 2.669 | 2.690 |
| Mean cost/task | $2.50 | $3.15 | $5.01 | $6.51 | $8.84 |
| Mean duration/task | 383.8s | 450.7s | 716.4s | 803.8s | 996.9s |
| Equivalent passes per dollar | 0.138 | 0.153 | 0.083 | 0.058 | 0.051 |

Why I Ran This

After my last post comparing GPT-5.5 vs 5.4 vs Opus 4.7, I was curious how intra-model performance varied with reasoning effort. Doing research online, it's very very hard to gauge what actual experience is like when varying the reasoning levels, and how that applies to the work that I'm doing. I first ran this on Zod, and the result looked strange: tests were flat across low, medium, high, and xhigh, while the above-test quality signals moved around in mixed ways. Low, medium, high, and xhigh all landed at 12/28 test passes. 
But equivalence moved from 10/28 on low to 16/28 on medium, 13/28 on high, and 19/28 on xhigh; code-review pass moved from 4/27 to 10/27, 10/27, and 11/27. That was interesting, but not clean enough to make a default-setting claim. It could have been a Zod-specific artifact, or a sign that Opus 4.7 does not have a simple "turn reasoning up" curve. So I reran the question on GraphQL-go-tools. To separate vibes from reality, and figure out where the cost/performance sweet spot is for Opus 4.7, I wanted the same reasoning-effort question on a more discriminating repo slice. This is not meant to be a universal benchmark result - I don't have the funds or time to generate statistically significant data. The purpose is closer to "how should I choose the reasoning setting for real repo work?", with GraphQL-Go-Tools as the example repo. Public benchmarks flatten the reviewer question that most SWEs actually care about: would I actually merge the patch, and do I want to maintain it? That's why I ran this test - to gain more insight, at a small scale, into how coding ag
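The "Equivalent passes per dollar" row in the table above can be recomputed from the "Equivalent" and "Mean cost/task" rows. The post does not state the formula, so dividing the equivalence rate by the mean cost per task is an assumption; it is used here because it reproduces every published value.

```python
TASKS = 29

# (equivalent patches out of 29, mean cost per task in USD), copied from the table
RESULTS = {
    "low": (10, 2.50),
    "medium": (14, 3.15),
    "high": (12, 5.01),
    "xhigh": (11, 6.51),
    "max": (13, 8.84),
}


def equivalent_passes_per_dollar(equivalent, mean_cost, tasks=TASKS):
    """Equivalence rate divided by mean cost per task (assumed formula)."""
    return (equivalent / tasks) / mean_cost


per_dollar = {
    level: round(equivalent_passes_per_dollar(eq, cost), 3)
    for level, (eq, cost) in RESULTS.items()
}
best = max(per_dollar, key=per_dollar.get)  # effort level with the best value
```

On these numbers medium comes out ahead, with low a close second because of its lower cost rather than its quality.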
Made a Claude skill that breaks down a Book so you don't have to read the whole thing
I used to read a lot. Still do, but the split has changed. Fiction I read front to back. That's the whole point. You're not extracting information; you're moving through something, and skipping ahead breaks it. Non-fiction is different. Most self-help and business books are one idea stretched across 250 pages. The author takes a central thesis, then writes a chapter approaching it from this angle, another chapter from a different angle, some case studies, a few counterarguments, and then circles back again. You could read a dense essay on the same topic and walk away with 90% of what the book gives you. Spending seven days reading an hour a day to absorb what two focused hours would give you is just not a good trade, especially when you have a backlog. So I built a Claude skill that makes this more systematic. You drop in a book PDF and get a proper breakdown: the central thesis, the main arguments, the quality of evidence being used, any original frameworks the author introduces, actual takeaways, where the argument is weakest, and a verdict on whether it's worth reading in full. It handles fiction and biography/history with separate analysis frameworks, too, so it's not flattening everything into the same template. The thing that goes beyond a plain "summarise this" prompt: it calls out evidence quality. A lot of non-fiction rests a general claim on one secondhand anecdote, and a summary won't flag that. This does. It also looks for what the author avoids addressing, not just what they say. And the Reader Verdict at the end tells you honestly whether you should bother reading the actual book or whether you've already gotten what you came for. It's not for books you genuinely want to read. But for the 30 books on your list that you realistically won't touch for two years, this is a reasonable substitute. Additionally, I would love your feedback on how I can make this better. 
I'm just a regular Joe trying to get the most out of Claude and our time :) No GitHub repo, just paste the following text directly along with '/skill-creator': name: book-intelligence description: > Produce a comprehensive Book Intelligence Report for any uploaded book PDF — fiction, non-fiction, academic, self-help, business, philosophy, biography, memoir, history, or hybrid genre. Triggers when a user uploads a book PDF and asks for analysis, breakdown, summary, report, review, key takeaways, themes, arguments, or anything that requires deep engagement with the book's content and structure. Also trigger when users say things like "analyze this book", "what's this book about", "give me the key ideas", "break this down for me", "what does the author argue", or "what should I take away from this" — even if they don't use the word "report" or "analysis". Use this skill proactively whenever a book PDF is present and the user wants more than a one-line description. --- # Book Intelligence Skill ## Purpose Produce a structured, deeply analytical Book Intelligence Report from a book PDF. The report must be specific to the actual text — not a generic summary that could have been written from a Wikipedia entry. Every section should contain insight derivable only from reading the book itself. Default output is inline markdown in chat. Create a downloadable `.md` file only if the user explicitly asks for one. --- ## Step 1: Extract the Book Content Follow the pdf-reading skill at `/mnt/skills/public/pdf-reading/SKILL.md` for extraction mechanics. For books specifically: Run `pdfinfo` to get page count and confirm it is a text PDF (not scanned). Extract full text using `pdftotext -layout` for layout-aware extraction, or `pdfplumber` if you need page-level granularity. For books over 400 pages, extract in chunks (e.g., first 80 pages, middle sample, last 30 pages) plus any table of contents or index, rather than processing the entire file. 

If `pdftotext` returns garbled text or near-empty output, the PDF is likely scanned; fall back to rasterizing representative pages with `pdftoppm` and reading them visually.

For books with meaningful figures, charts, or diagrams (e.g., a business book with frameworks, or an academic text with data), rasterize those specific pages and read them as images in addition to the text pass.

Note any extraction failures, missing sections, or quality issues explicitly in the report.

**Token budget awareness:** Full text extraction of a 300-page book is approximately 60,000–120,000 tokens. Prioritize extracting the introduction, conclusion, chapter openings, and any stated thesis or summary sections first. Then sample middle chapters. Do not rasterize all pages, only those where visual content matters.

---

## Step 2: Identify Genre and Select Framework

Before writing a single word of the report, determine:

- **Genre and subgenre** (e.g., "narrative non-fiction / behavioral economics", "literary fiction / magical realism", "business / strategy", "memoir / political biography")
- **Author background and publication
I built a Windows tool that 1-click restores your entire Claude Code setup
https://i.redd.it/8ggjjfts6xzg1.gif

Claude Code gets hellish once you start having multiple projects.

😵‍💫 Forgetting which MCPs/Skills you enabled in which project
💥 Breaking your setup and having to cry while restoring it
😓 Not being able to safely share "use this" with your team

So I built CCPIT (Control Tower). Main features:

▶️ Automatic project detection → 1-click Launch
✨ Recovery Kit → Named snapshots for instant restore
🔒 Golden Bundle → Password-protected .pit files for safe sharing (Export/Import also makes team environment setup & restoration easy)
🛠️ 17-item Health diagnosis (handy as a self-diagnostic package when something feels off with Claude Code lately)

It's Windows-only for now but comes with a clean installer. This is the first release, so there might be some bugs; sorry in advance, I'll fix them quickly. If you're a Windows user tired of setup chaos, take a look.

GitHub: https://github.com/VTRiot/ccpit-win

Would love to hear your thoughts.

submitted by /u/raio_aidev
Opus 4.6 does better research, Gemini 3.1 has better judgment
Figured this out by running 4 models (Claude Opus 4.6, GPT-5.4, Gemini 3.1 Pro, and Grok 4.20) on a benchmark of 1,417 binary forecasting questions resolving Oct–Dec 2025, with two evaluation conditions: agentic (each model does its own web research with tools) and fixed-evidence (every model receives the same ~12k-character research dossier, compiled using the Bosse et al. 2026 standardization methodology).

Note, one limitation is that the fixed-evidence dossiers are themselves LM-produced, so we may be measuring how well each model interprets a particular standardized version of the evidence rather than judgment in the abstract. But if that were driving the result, we would expect all four models to drift in the same direction. They didn't. GPT-5.4 and Grok 4.20 barely moved between conditions, while Opus and Gemini swapped rank order (the opposite of what a broken or biased eval would produce).

To my knowledge this is the first direct evaluation of frontier models that decomposes performance into these research vs. judgment stages.

Calibration scores, refinement scores, and per-condition analysis: futuresearch.ai/opus-research-gemini-judgment
Benchmark and leaderboard: evals.futuresearch.ai

Our interpretation is that Opus is dramatically better at figuring out what to search for, deciding which pages to read, and pulling out the details that matter. But when you remove the research task, that advantage goes away. When given the same information, Gemini brings sharper judgment to the fixed evidence and weighs it more accurately on forecasting tasks.

Calibration scores corroborate this in an interesting way: Opus's calibration drops sharply when search is taken away, while Gemini's actually improves with the standardized dossier. The asymmetry suggests Opus might be using its search trace as scaffolding for probability assignment (i.e., the act of going through the search loop is itself doing some of the epistemic work, separately from the information it surfaces).
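The post does not say how its calibration and refinement scores are computed. One common definition, assumed here, is the two-term decomposition of the Brier score for binary questions: calibration measures how far each stated probability sits from the observed hit rate at that probability, and refinement measures how much uncertainty remains within each group. The sketch below groups forecasts by their exact value (real evaluations often use bins instead); the function name and sample numbers are illustrative only.

```python
from collections import defaultdict


def brier_decomposition(forecasts, outcomes):
    """Return (brier, calibration, refinement) for binary outcomes.

    Forecasts are grouped by exact probability value, so
    brier == calibration + refinement up to float rounding.
    Lower calibration is better; lower refinement means sharper forecasts.
    """
    if not forecasts or len(forecasts) != len(outcomes):
        raise ValueError("need equally sized, non-empty inputs")

    n = len(forecasts)
    brier = sum((p - y) ** 2 for p, y in zip(forecasts, outcomes)) / n

    # Collect the outcomes observed for each distinct forecast value.
    groups = defaultdict(list)
    for p, y in zip(forecasts, outcomes):
        groups[p].append(y)

    calibration = 0.0
    refinement = 0.0
    for p, ys in groups.items():
        hit_rate = sum(ys) / len(ys)
        calibration += len(ys) * (p - hit_rate) ** 2
        refinement += len(ys) * hit_rate * (1 - hit_rate)

    return brier, calibration / n, refinement / n


if __name__ == "__main__":
    # Made-up forecasts, not data from the benchmark.
    print(brier_decomposition([0.9, 0.9, 0.1, 0.1], [1, 0, 0, 0]))
```

Under this definition, a model whose calibration term rises when search is removed is assigning probabilities that track outcomes less closely, independent of how decisive its forecasts are.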
This could be an over-interpretation of one benchmark, but I'd be interested if anyone's seen the same pattern in other domains. submitted by /u/ddp26
One week. One person. Claude wrote 100% of the code. The trick was the spec, not the prompts
Six days. One person. Claude wrote every line of code, directed the branding, architected the information, directed the design, produced the graphics, and wrote the copy. I worked with prompts. The output is a fully fleshed SaaS, live, in a week. I want to share what that actually looks like, not the "AI is amazing" version, but the real workflow. The interesting part is not the volume of output. It is what made it possible for prompts alone to produce coherent output at this large a scope. What Claude produced, end to end Code is the headline. It is not the whole story. Every line of code: backend, frontend, migrations, tests, prompts, source adapters, scoring engine, ingestion pipeline, API layer. The brand: name research, name selection (Arrivance), tagline, dark-first color palette, typography pairing, voice and tone guide. Information architecture: navigation, page hierarchy, the onboarding flow, the matches feed structure. Design direction: layout, component decisions, motion language, the visual system. Graphics: the mark, the wordmark, the icon set, favicons and OG images. Copy: every public word on the marcom (Marketing & Commercial) site and in the product. My side of the work was prompts, architecture and stack calls, and review. I did not type code, draw a pixel, or pick a font. The trick is not the prompts. It is the context I work with a method I call Context-Driven Engineering (CDE). I wrote about it here: https://thanpol.as/engineering/context-driven-engineering In short: every meaningful folder in the repo carries a README that describes what it owns, what it depends on, what is forbidden, and how to change it safely. The READMEs are load-bearing architecture, not optional documentation. When LLM output contradicts a README, the README is right and the output is wrong. The LLM never operates autonomously. It operates inside scope I declared. 
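One way to make the README contract described above mechanically checkable is to compare a proposed change against what a folder declares it owns and forbids. The post does not show the README format, so this is a minimal sketch under assumptions: `FolderContract`, `out_of_scope`, and the glob-list representation are hypothetical, and the paths are invented for illustration.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatch


@dataclass
class FolderContract:
    """Hypothetical machine-readable slice of a folder README."""
    owns: list[str]                                      # globs the folder may change
    forbidden: list[str] = field(default_factory=list)   # globs never to touch


def out_of_scope(contract, changed_paths):
    """Return (path, reason) pairs for changes that break the contract."""
    violations = []
    for path in changed_paths:
        # Forbidden patterns win over ownership patterns.
        if any(fnmatch(path, pat) for pat in contract.forbidden):
            violations.append((path, "forbidden"))
        elif not any(fnmatch(path, pat) for pat in contract.owns):
            violations.append((path, "not owned"))
    return violations


if __name__ == "__main__":
    scoring = FolderContract(
        owns=["src/scoring/*"],
        forbidden=["src/scoring/migrations/*"],
    )
    proposed = [
        "src/scoring/engine.ts",
        "src/scoring/migrations/001.sql",
        "src/ingestion/adapter.ts",
    ]
    for path, reason in out_of_scope(scoring, proposed):
        print(f"{path}: {reason}")
```

A check like this could run before generated code is accepted, which matches the post's rule that when output contradicts a README, the README wins.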
The four stages of any non-trivial change: read or fix context first, write a behavioral spec in version control, plan the implementation with explicit in-bounds and out-of-bounds files, then generate code within those declared boundaries. That is the whole reason this week worked. Without that discipline, prompts at this scope produce a tangled blob. With it, they produce a coherent system. How that played out in practice The week broke roughly like this. Days 1 and 2 were spec-only, no production code. I wrote a domain spec for every part of the system: ingestion, enrichment, scoring, matches feed, rubric service, rubric engine. Each domain spec was paired with a technical spec: DDL, endpoints, error IDs, event names, test requirements. A universal job schema was added as the contract between layers, so ingestion never has to know what scoring needs. Day 3 was a three-pass spec review (business, product, engineering) before any code was written. The review caught 40+ findings. The pagination cursor was switched from timestamp to KSUID id. Cross-user isolation tests became a hard requirement on every endpoint that takes an :id. interactions jsonb replaced a too-simple reviewed_at. None of those would have been cheap to retrofit. Day 4 was the implementation sprint. LLM service layer, rubrics entity, jobs entity, ingestion engine with four source adapters, enrichment engine, frontend scaffold, design system, app shell, onboarding pages. From "auth and users" to six backend phases and two frontend phases in one day. Day 5 was the scoring engine. Hard filters, deterministic stack scoring, four LLM-scored dimensions, retry logic, matches table. The heart of the product. That speed was not because Claude is fast. It was because the specs were settled. No mid-implementation design arguments. No blocked decisions. Every domain Claude touched had a written contract. The product Senior engineers who already have a job do not search for one. 
They set a standard and they wait. I built that wait, made active. You upload your CV. The system writes a personal scoring filter for you (your rubric) across five dimensions, scores every new remote engineering job against it, and surfaces only what clears your threshold in a tiered feed. Transparent scores with a rationale, not a black box. The product is called Arrivance. Stack: Node, TypeScript, Postgres, Express, React 19 with Vite, MUI, Clerk, Vitest, full ESM monorepo. Three LLM call sites in production (rubric generation, job enrichment, soft scoring) with cross-user prompt caching to keep token spend bounded. Claude wrote all of it. I made the architecture and stack calls. A cautionary tale CDE is not self-enforcing. On April 26 Claude (ahem, 4.7) shipped the frontend with zero MUI imports despite a spec that named MUI in every prompt and mockup, then quietly edited the stack doc the next day to claim "the design uses no component library." No ADR. I caught it on audit, sent a closed question with no escape hatches, and got the admission verbatim: "I deviated from the spec without
Weights & Biases New Master Service Agreement Questions [D]
**Update: my questions have been escalated to their teams. I'll share their answers (& hopefully reassurance) here.**

Weights & Biases sent an email yesterday, saying their new Master Service Agreement takes effect May 11th. I use & love wandb, but I'm concerned about the changes. I wanted to start a discussion. I sent them an email, but I think I'm too small to hear back. How do you interpret these changes? Do you worry about intellectual property rights? Do you need an enterprise contract for true protection?

Weights & Biases defines Customer Data as "any data, content or material that Customer (including its Authorized Users) inputs into the Software or Service, *including machine learning models and deep learning research projects, and any visualizations, analyses, and other reports generated by the Software or Service.*"

Who Owns Your Research?

In the prior agreement, Section 8(b) made this clear:

> As between the parties, *Customer owns and retains all right, title and interest in and to the Customer Data.* Except for the rights granted to W&B in Section 4(a), Customer does not by means of this Agreement or otherwise transfer any other rights to W&B.

The new agreement deletes these statements entirely. Customer Data is added to Section 6(e), meaning it survives after terminating a subscription.

How can Weights & Biases use your data?

In the prior agreement: "Customer may transfer Customer Data to W&B and W&B may use Customer Data *to provide the Software and Service*. Customer grants W&B a limited right during each Subscription Term to use Customer Data in accordance with this Agreement, the DPA and BAA (as applicable)."

In the new agreement: "Customer may transfer Customer Data to W&B and Customer grants W&B the right to use Customer Data to (i) provide and improve the W&B Assets, *(ii) develop new product offerings*, and *(iii) for the purposes of providing and improving AI Features*.
Customer grants W&B a limited right to use Customer Data in accordance with this Agreement, the DPA and BAA (as applicable)."

There's now an explicit callout for using Customer Data (models, logs, reports, etc.) to train AI, and there's no acknowledgement of an opt-out system. The agreement does say "W&B may use Customer Data from free and academic customers for testing and development purposes." But then it fails to differentiate treatment for Pro and Enterprise customer data.

The prior agreement is available on Wayback Machine here: https://web.archive.org/web/20260227104844/https://wandb.ai/site/terms/

submitted by /u/algorithm477
I built a video production pipeline with Claude - Integrates Live2D, Fish Audio, Sadtalker, and tons of other tools.
I've been working on a multi-agent AI pipeline that takes a topic (like "Ada Lovelace" or "The Cold War Space Race") and produces a complete, chapter-structured educational YouTube video, 15–20 minutes long. Here's what actually happens when you run it: You give it a persona (think: channel identity, tone, visual style) and a topic. From there, a chain of specialized agents handles everything: Script agents generate a chapter contract (outline + pacing plan), then write full narration for each chapter with timing built in. Asset agents generate matching visuals (images, B-roll) and sound design assets for each scene. Render agents (running on a Windows host with GPU) composite everything — narration audio, visuals, transitions, background music — into a finished video file. Upload agents push the result directly to YouTube with generated metadata. The pipeline is split across two environments: script and asset work runs in a Linux dev container (WSL), while rendering runs on the Windows host to access CUDA and video tooling. They talk over HTTP with a lightweight orchestrator coordinating state. The whole thing is phase-based — every step (W2.1, W4.3, R3.1, etc.) is independently re-runnable, so if your render fails or you want to rewrite chapter 3, you don't start over. Each phase reads and writes typed artifact files (JSON manifests, audio files, image directories) so agents are loosely coupled. It uses Claude as the core LLM for scripting, with structured prompts per persona to keep the voice consistent across episodes. Still early-stage but already producing watchable content. Here are the three major technical challenges and how they're solved: 1. Script Writing via Contract Architecture The core problem: how do you keep a 20-minute AI-written script narratively coherent across chapters written in separate LLM calls? The answer is a narrative contract (W2.1.a) — a validated JSON blueprint generated before any script text is written. 
It encodes four types of cross-chapter constraints:

- Threads: story arcs that must open in one chapter and close in another, with a declared payoff type (resolved, tragedy, etc.)
- Entities: named people/places with a forced first-introduction chapter, preventing retroactive mentions
- Facts Required: citations chained with dependencies (fact B can't appear until fact A is established)
- Timeline Anchors: temporal reference points that let non-linear structure (flashback, in-medias-res) stay internally consistent

The contract is generated via an Opus → structural validate → Sonnet review loop (up to 3 rounds). Sonnet checks semantic coherence (no orphan entities, threads actually close), while the structural validator runs a Pydantic parse + temporal constraint check. Chapter writers downstream are bound to the contract: they can't invent threads or drop required facts.

2. Research via Fanout

The research pipeline doesn't produce one outline; it produces several competing ones and eliminates losers.

W1.11.a spins up N parallel OutlineAgent instances, each working from the same research package but on different thesis candidates. Each produces a three-level hierarchy: thesis → chapter arguments → scene beats.

W1.12.a runs an independent grounding/revision loop on each branch:

- Grounding reviewer (Sonnet) flags blocking issues (claims contradicting cited facts) vs. polish issues (real facts exist but uncited)
- Revision agent applies fixes without restructuring
- Quality reviewer checks for structural failures (topical chapter lists, collapsed middles, summary endings)

Up to 3 revision rounds per branch, all in parallel.

W1.13.a runs a single judge agent that scores each refined outline on four axes:

| Axis | Weight | What it measures |
| --- | --- | --- |
| Concept Hook | 0.40 | CTR potential; title falsifiability |
| Trap Closure | 0.30 | Protagonist's own logic creates complications (not external events) |
| Opening Momentum | 0.15 | Cold-open quality: concrete moment vs. credentials/definitions |
| Rewatch Anchor | 0.15 | One chapter that inverts the opening assumption sharply enough to quote |

The highest-scoring branch becomes Outline.json. The judge doesn't compare outlines against each other; it scores each independently to avoid anchoring bias.

3. Outline Creation and Evaluation

The structural rules for a valid outline are unusually strict, based on observed failure modes:

Six structural failure patterns the quality reviewer flags:

- No Narrative Spine: chapters are reorderable (topical list, not argument chain)
- Thesis Not Echoed: chapters cover topics instead of advancing the central claim
- Beats That Are States: "tension builds" instead of "character takes specific action"
- Vibes Chapter: emotionally evocative prose, vague beats
- Collapsed Middle: chapters 3–5 repeat the same narrative move
- Summary Ending: final chapter recaps instead of introducing new consequence

Beat-level rules are similarly precise: each beat must name an actor, action, and datab
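
The judge step in W1.13.a can be sketched as a weighted sum over the four axes, using the weights listed in the post (0.40, 0.30, 0.15, 0.15). Everything else here is an assumption for illustration: the 0–10 score scale, the function names, and the sample branches do not come from the post.

```python
# Axis weights as stated in the post; they sum to 1.0.
AXIS_WEIGHTS = {
    "concept_hook": 0.40,
    "trap_closure": 0.30,
    "opening_momentum": 0.15,
    "rewatch_anchor": 0.15,
}


def weighted_score(axis_scores):
    """Combine per-axis scores (assumed 0-10 scale) into one number."""
    missing = AXIS_WEIGHTS.keys() - axis_scores.keys()
    if missing:
        raise ValueError(f"missing axes: {sorted(missing)}")
    return sum(AXIS_WEIGHTS[axis] * axis_scores[axis] for axis in AXIS_WEIGHTS)


def pick_outline(branches):
    """Return the branch name with the highest weighted score.

    Each branch is scored on its own, mirroring the post's point that
    the judge never compares outlines side by side.
    """
    return max(branches, key=lambda name: weighted_score(branches[name]))


if __name__ == "__main__":
    # Hypothetical judge outputs for two outline branches.
    branches = {
        "thesis_a": {"concept_hook": 8, "trap_closure": 6,
                     "opening_momentum": 9, "rewatch_anchor": 5},
        "thesis_b": {"concept_hook": 6, "trap_closure": 9,
                     "opening_momentum": 7, "rewatch_anchor": 8},
    }
    print(pick_outline(branches))
```

Because Concept Hook carries 40% of the weight, a branch can lose on that axis and still win overall only by a clear margin on the other three, as `thesis_b` does in this made-up example.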
I built a Claude Code-like AI Agent for Deploying Algorithmic Trading Strategies
Hey r/ClaudeAI, I wanted to share a project I’ve been working on called NexusTrade. It’s an AI agent designed to automate the entire financial research and algorithmic trading process from a single prompt. How Claude helped me build this: I heavily used Claude [3 Opus / 3.5 Sonnet] to build the actual codebase for this project.[Explain briefly what Claude coded for you, e.g., Claude helped me design the orchestration logic for the sub-agents, write the backend data pipelines for historical market data, and debug complex API integrations.] What it was built to do: The goal of the app is to let you use natural language to explore different trading strategies. Orchestration & Sub-agents: When you send a prompt, the AI generates a comprehensive plan. It then launches multiple sub-agents to explore a much wider search space than a single agent could do alone. Analysis: It analyzes the output from each sub-agent, combines the best ideas, and tests them against objective historical data. Deployment: If it finds a profitable strategy, it can automatically deploy it (or ask for approval in semi-automated mode). If it fails, it recommends further exploration. Note on models: As shown in the video demo, the platform allows you to utilize different models (like Deepseek v4) for the actual agent routing, but Claude was my primary copilot for building the underlying software architecture. Why I built it: I built this because the barrier to entry for algorithmic trading is incredibly high. I wanted to build something that doesn't leave beginners behind. I wanted to create a system that not only automates the tedious parts of financial research but also helps educate users on how Wall Street actually executes real trades. Free to Try: As per the subreddit rules, I want to explicitly state that the project is completely free to try. 
There are premium features available (which include in-depth capstone courses on algo-trading and building AI agents from scratch), but the core platform and exploration features are free to use. The YouTube link attached shows a full demo of the agent evaluating and deploying a strategy. I'd love to get feedback from this community on the agent architecture and how I might improve the orchestration! Specifically, my memory architecture is a little... unique. Happy to answer any questions about the build process. submitted by /u/NextgenAITrading
Key features include: Experiment tracking and visualization, Hyperparameter optimization, Model versioning and management, Collaboration tools for teams, Real-time metrics and logging, Data versioning and dataset management, Integration with popular ML frameworks (e.g., TensorFlow, PyTorch), Custom dashboards for project insights.
Weights & Biases Launch is commonly used for: Tracking and comparing multiple experiments, Optimizing hyperparameters for better model performance, Collaborating on machine learning projects within teams, Visualizing training metrics to identify issues, Managing datasets and ensuring reproducibility, Creating custom reports for stakeholders.
Weights & Biases Launch integrates with: TensorFlow, PyTorch, Keras, Scikit-learn, Jupyter Notebooks, Google Cloud Platform, AWS SageMaker, Azure Machine Learning, Slack, GitHub.
Based on user reviews and social mentions, the most common pain points are: token usage, API costs.
Based on 95 social mentions analyzed, 1% of sentiment is positive, 99% neutral, and 0% negative.