Your collaborative AI assistant to design, iterate, and scale full-stack applications for the web.
"v0" is praised for its rapid prototyping capabilities, with users managing to generate fully functional landing pages in just 90 seconds, indicating its strength in ease of use and speed. While there are no prominent complaints in the available data, a TikTok user emphasizes considerable expenditure in testing similar tools, suggesting cost might be a potential concern for some. Overall, "v0" seems to hold a positive reputation for quickly testing ideas, with pricing details not explicitly discussed in the available reviews and mentions.
Mentions (30d)
26
8 this week
Avg Rating
5.0
1 reviews
Platforms
6
Sentiment
25%
17 positive
Features
Use Cases
Industry
information technology & services
Employees
25
I wasted $500 testing AI coding tools so you don't have to 💸 Here's what actually works: 🧪 Testing ideas? → V0 or Lovable Built a landing page in 90 seconds. Fully clickable, looked real. Code's messy but perfect for validation. 🏗️ Shipping real apps? → Bolt Full dev environment in your browser. I built a document uploader with front end + back end + database in one afternoon. 💻 Coding with AI? → Cursor or Windsurf Cursor = stable, used by Google engineers Windsurf = faster, newer, more aggressive Both are insane. 📚 Learning from scratch? → Replit Best coding teacher I've found. Explains errors, walks you through fixes, teaches as you build. Here's what 500+ hours taught me: The tool doesn't matter if you're using it for the wrong stage. Testing ≠ Building ≠ Coding ≠ Learning Stop comparing features. Match your goal first. Drop what you're building 👇 I'll tell you exactly which tool to use Save this. You'll need it. #AI #AITools #TechTok #ChatGPT #Coding
Pricing found: $0 /month, $5, $30 /user, $30, $2
g2
What do you like best about V0?
This is great for UI layout design. It also provides a free $5 AI credit limit each month. What I love most is how easily I can clone the UI of any website.
What do you dislike about V0?
Initially, there were no daily limits, but now the daily limit is 5 chats. Most of the time it shows errors. It also doesn’t preview my React-Native based app.
Review collected by and hosted on G2.com.
I built ContextAtlas: A new take on context carry over and helps claude pick up new sessions where it left off in scope of your previous design decisions while saving your tokens avoiding rediscovery
When the "Build with Opus 4.7" hackathon was announced, I had been obsessing over the tokenomics of agents and how to make sessions go further without burning context on rediscovery work. We all have probably hit a session limit and wondered how it went so fast. I applied with that thesis, didn't get in, but I built it anyway over the last four weeks. I am proud to share that v1.0 ships today. Note up front: this is specifically a tool for development users. If you're using claude.ai web or Projects, ContextAtlas won't plug in directly. But if Claude Code is your main work flow or you utilize the Anthropic API, this tool was made for you. The pain: Claude Code learns your codebase fresh every session. "Where is OrderProcessor?" triggers a flurry of greps. "What depends on AuthMiddleware?" is another round of file reads. On a mid-sized codebase, an architectural question can burn 40+ tool calls and a lot of tokens before Claude has enough context to reason well. And the architectural rules in your ADRs and design docs? Claude has no path to those, so it confidently suggests changes that break constraints you may have documented elsewhere in your repo. What I built: ContextAtlas is an MCP server that pre-computes a curated atlas of your codebase (symbols, ADR-extracted architectural intent, git history, test coverage) and serves it to Claude Code in one call at query time in a smaller, token saving compact shape via a few lightweight mcp tools. Initial indexing happens once; querying is local and free. Example of what comes back when Claude calls get_symbol_context("OrderProcessor"): SYM OrderProcessor@src/orders/processor.ts:42 class SIG class OrderProcessor extends BaseProcessor INTENT ADR-07 hard "must be idempotent" RATIONALE "All order processing must be safely retryable." 
REFS 23 [billing:14 admin:9] GIT hot last=2026-03-14 TESTS src/orders/processor.test.ts (+11) Claude sees the idempotency constraint before proposing changes, not after a review catches the violation. https://i.redd.it/0ons3o28t32h1.gif Numbers: 45-72% token reduction on architectural prompts across three benchmark repos (TypeScript, Python, Go), with zero quality regression on measured axes. Full methodology and paired-t confidence intervals in the linked write-up. I wanted measurements, not vibes. Honest limits: single-judge model at v1.0 (cross-vendor panel is post-launch work). Quantitative claims bounded to three benchmark repos. Tie-bucket and trick-bucket prompts routinely show ContextAtlas net-negative; that's reported inline rather than buried. Install (two ways): In Claude Code: /index-atlas and /generate-adrs skills. No API key needed; runs under your subscription. Via CLI: uses Anthropic API for indexing. npm install -g contextatlas contextatlas init && contextatlas index # then add the MCP server entry to your Claude Code config (snippet in the README) Both produce structurally identical atlases. Supported languages at v1.0: TypeScript (tsserver), Python (Pyright), Go (gopls), Ruby (ruby-lsp). Rust, Java, and C# are next on the roadmap; the adapter interface is small enough that they're realistic community contributions. What's next: v1.1 thesis is shaping up around developer onboarding flows and quality-validation work that was deferred from v0.8. And integrating external documentation of your code base into pre-indexing workflow. Full write-up: https://www.contextatlas.io/blog/v1.0.0 Repo: https://github.com/traviswye/ContextAtlas Also launching on DevHunt today: https://devhunt.org/tool/contextatlas; votes are very appreciated if you find ContextAtlas useful or an interesting approach. Built solo, hackathon-shaped scope, not pretending it's a full blown research paper, but did attempt to treat methodology as seriously. 
Happy to answer anything in the comments. Star the repo if you want to follow along, file an issue if it breaks for you on your codebase, and please be honest; this only gets better with feedback from people running it on real repos. submitted by /u/Kitchen-Leg8500 [link] [comments]
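To make the compact record shape from the ContextAtlas post concrete, here is a minimal sketch that splits the example `get_symbol_context("OrderProcessor")` response into tagged fields. The one-field-per-line layout is an assumption (the scraped post shows the record flattened onto one line), and `parse_record` and `parse_refs` are hypothetical helper names, not part of ContextAtlas.

```python
import re

# Record text copied from the post. The line breaks are an assumption:
# the scraped text shows every field run together on a single line.
RECORD = """\
SYM OrderProcessor@src/orders/processor.ts:42 class
SIG class OrderProcessor extends BaseProcessor
INTENT ADR-07 hard "must be idempotent"
RATIONALE "All order processing must be safely retryable."
REFS 23 [billing:14 admin:9]
GIT hot last=2026-03-14
TESTS src/orders/processor.test.ts (+11)
"""


def parse_record(text):
    """Map each line's leading tag (SYM, SIG, ...) to the rest of the line."""
    fields = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        tag, _, rest = line.partition(" ")
        fields[tag] = rest
    return fields


def parse_refs(value):
    """Turn '23 [billing:14 admin:9]' into (23, {'billing': 14, 'admin': 9})."""
    match = re.fullmatch(r"(\d+) \[(.*)\]", value)
    if match is None:
        raise ValueError(f"unexpected REFS value: {value!r}")
    by_module = {}
    for item in match.group(2).split():
        name, _, count = item.partition(":")
        by_module[name] = int(count)
    return int(match.group(1)), by_module


fields = parse_record(RECORD)
total_refs, refs_by_module = parse_refs(fields["REFS"])
```

In this example the `INTENT` field carries the ADR constraint, which is the part the post says Claude sees before proposing changes.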
I built a Laravel package that turns your app into a database-backed personal knowledge vault (Obsidian style) with a 16-tool MCP server
Hey! I'm the author. laravel-commonplace is a database-backed personal knowledge vault you install into an existing Laravel app. Adjacent to Obsidian, Logseq, and Notion as personal-knowledge tooling, except the storage layer is your existing Laravel app's database instead of files on disk or a third-party SaaS. Notes are Eloquent models in your DB, gated by your app's auth, shareable per-user via an owner plus Share model. It ships a browser UI (editor, graph view, search, journal) and an MCP server with 16 tools. If you have a Laravel app, the MCP server lets Claude Desktop, Claude Code, Cursor, Zed, Continue, Cline, Pi, or any other MCP client read and write your notes as the host app's user. Default middleware is auth:sanctum (Bearer PAT), and every tool resolves to $request->user(). There's no synthetic agent identity to provision, scope, or revoke separately. The agent gets exactly what the user gets, evaluated against the same Policies the controllers already use. Session, Passport, and OAuth-DCR are all configurable if PAT isn't what you want. The 16 tools, grouped: CRUD: create-note-tool, read-note-tool, update-note-tool, edit-note-tool (surgical find-and-replace), delete-note-tool (history preserved), move-tool (rewrites referring wikilinks). Discovery: list-tool (folder/tag/visibility filters), search-tool (substring), semantic-search-tool (embedding search), suggested-links-tool (embedding-similar notes not yet linked). Graph: backlinks-tool, neighborhood-tool (N-hop traversal), shortest-path-tool (chain between two notes), hub-notes-tool (most-connected), orphan-notes-tool (no inbound or outbound links). History: history-tool (version snapshots, survives deletion). On the semantic tools: the vector driver defaults to in_php_cosine for portability across SQLite, MySQL, and Postgres. If you're on Postgres, switching to the pgvector driver gets you indexed similarity and removes the in-PHP candidate cap. 
You swap it with a published migration and an env flag, and the docs recommend it once you're past a couple thousand notes. The tools live in src/Mcp/ if you want to see how a multi-tool MCP server is wired into a Laravel app. Caveats: Pre-1.0 (v0.2.0). APIs may shift before 1.0. Laravel-only by design. The whole point is reusing the host app's DB and auth. MCP is off by default. One env flag turns it on. Operator decision. Prompt injection through note content is the unsolved hard part. Notes are untrusted text, and notes other users share with you can carry instructions an agent might follow. The package doesn't pretend to solve this. The threat model at docs/threat-model.md says what's mitigated and what isn't. No per-tool capability gating yet. Enabling MCP enables all 16 tools the user is otherwise allowed to invoke. It's named as a limitation in the threat model. Feedback I'd actually use: Laravel folks who install it and tell me where it breaks, and anyone who reads the threat model and finds a hole I missed. Repo: https://github.com/non-convex-labs/laravel-commonplace submitted by /u/aaddrick [link] [comments]
How I used Claude Code (and Codex) for adversarial review to build my security-first agent gateway
Long-time lurker first time posting. Hey everyone! So earlier this year, I got pulled into the OpenClaw hype. WHAT?! A local agent that drives your tools, reads your mail, writes files for you? The demos seemed genuinely incredible, people were posting non-stop about it, and I wanted in. I had been working on this problem since last year and was genuinely excited to see that someone had actually solved it. Then around February, Summer Yue, Meta's director of alignment for Superintelligence Labs, posted that her agent had deleted over 200 emails from her inbox. YIKES. She'd told it: "Check this inbox too and suggest what you would archive or delete, don't action until I tell you to." When she pointed it at her real inbox, the volume of data triggered context window compaction, and during that compaction the agent "lost" her original safety instruction. She had to physically run to her computer and kill the process to stop it. That should literally NEVER be the case with any software ever. This is a person whose actual job is AI alignment, at Meta's superintelligence lab, who could not stop an agent from deleting her email. The agent's own memory management quietly summarized away the "don't act without permission" instruction, treated the task as authorized, and started speed-running deletions. She had to kill the host process. That's when I sort of went down the rabbit hole, not because Yue did anything wrong, but because the failure mode was actually architectural and I knew that in my gut. Guess what I found? Yep. Tons more instances of this sort of thing happening. Over and over. Why? Because the safety constraint was just a prompt. It's obvious, isn't it? It's LLM 101. Prompts can be summarized away. Prompts can be misread. Prompts are fucking NOT a security boundary. And yet every agent framework I have ever seen seems to be treating them as one. I went and read the OpenClaw source code, which I should have done to begin with. 
What I found was a pattern I think a lot of agent frameworks have fallen into: - Tool names sit in the model context, so the model can guess or forge them - "Dangerous mode" is one config flag away from default - Memory management has no concept of instruction priority - The audit story is mostly "the model thought it should" I went looking for a security-first alternative I could trust, anything that was really being talked about or at a bare minimum attempted to address the security concerns I had. I couldn't find one. So I made it myself. CrabMeat is what came out of that, what I WANTED to exist. v0.1.0 dropped yesterday. Apache 2.0. WebSocket gateway for agentic LLM workloads. One design thesis: The LLM never holds the security boundary. What that means in code: Capability ID indirection. The model doesn't see real tool names. It sees per-session HMAC-derived opaque IDs (cap_a4f9e2b71c83). It can't guess or forge a tool name because it doesn't know any tool names. Effect classes. Every tool declares a class (read, write, exec, network). Every agent declares which classes it can use. The check is a pure function with no runtime state, easy to test exhaustively, hard to bypass. IRONCLAD_CONTEXT. Critical safety instructions are pinned to the top of the context window and explicitly marked as non-compactable. The Yue failure mode, compaction silently stripping the safety constraint, cannot happen by construction. The compactor literally cannot touch them. Tamper-evident audit chain. Every tool call, every privileged operation, every scheduler run enters the same SHA-256 hash-chained log. If something happens, you can prove what happened. If the chain is tampered with, you can prove that too. Streaming output leak filter. Secrets are caught mid-stream across token boundaries, capability IDs, API keys, JWTs, PEM blocks redacted before they reach the client. No YOLO mode. There is no global "trust the LLM with everything" switch. There never will be. 
Expanded reach comes through named scoped roots that are explicit, audit-logged, and bounded. The README has 15 'always-on' protections in a table. None of them can be turned off by config, because these things being toggleable is how the ecosystem ended up where it is. I decided to make sure that this wasn't just a 'trend hopping' project and aligned with my own personal values as well. I built this to be secure and local-first by default. Configured for Ollama / LM Studio / vLLM out of the box. Anthropic and OpenAI work too but require explicit configuration. There is no "happy path" that silently ships your prompts to a cloud endpoint. I decided that FIRST it needed to only run as an email agent with a CLI. Bidirectional IMAP + SMTP with allowlisted senders, threading preserved, attachments handled. This is the use case that bit Yue and a lot of other people, and I wanted to prove it could be done with real boundaries. I added in 30+ built-in tools of my own. File ops, shell (denylisted, output-capped, CWD-lo
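Three of the ideas in the CrabMeat post (per-session HMAC-derived capability IDs, a stateless effect-class check, and a SHA-256 hash-chained audit log) can be sketched in a few lines. This is an illustration under stated assumptions, not CrabMeat's code: every function name, the 12-hex-character truncation (chosen to match the `cap_a4f9e2b71c83` example), and the log entry layout are hypothetical.

```python
import hashlib
import hmac
import json

GENESIS_HASH = "0" * 64  # assumed starting value for an empty chain


def capability_id(session_key: bytes, tool_name: str) -> str:
    """Derive an opaque per-session ID so the model never sees the tool name."""
    digest = hmac.new(session_key, tool_name.encode("utf-8"), hashlib.sha256)
    return "cap_" + digest.hexdigest()[:12]


def is_allowed(tool_class: str, agent_classes: frozenset) -> bool:
    """Pure check: a tool may run only if the agent declared its effect class."""
    return tool_class in agent_classes


def _entry_hash(prev_hash: str, event: dict) -> str:
    payload = json.dumps(event, sort_keys=True)
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_entry(chain: list, event: dict) -> dict:
    """Append an event whose hash covers the previous entry's hash."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS_HASH
    entry = {"prev": prev_hash, "event": event,
             "hash": _entry_hash(prev_hash, event)}
    chain.append(entry)
    return entry


def verify_chain(chain: list) -> bool:
    """Recompute every hash; any edited or reordered entry breaks the chain."""
    prev_hash = GENESIS_HASH
    for entry in chain:
        if entry["prev"] != prev_hash:
            return False
        if entry["hash"] != _entry_hash(prev_hash, entry["event"]):
            return False
        prev_hash = entry["hash"]
    return True


# The same tool gets a different opaque ID in each session.
id_a = capability_id(b"session-a-secret", "delete_email")
id_b = capability_id(b"session-b-secret", "delete_email")

read_only_agent = frozenset({"read"})
log = []
append_entry(log, {"cap": id_a, "class": "write",
                   "allowed": is_allowed("write", read_only_agent)})
append_entry(log, {"cap": id_a, "class": "read",
                   "allowed": is_allowed("read", read_only_agent)})
```

Because `is_allowed` depends only on its arguments, it can be tested over every class combination, which is the property the post highlights.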
I tracked every dollar I spent on AI coding tools for 60 days and math is uglier than I thought but probably not in the way you'd guess.
Well so I kept telling myself my AI tool spend was fine the way you tell yourself your subscription bloat is fine. vibes-based finance. decided to actually track it. 60 days. every dollar, every tool, every minute I could log honestly. did it for myself, but the numbers are interesting enough I figured I'd share. context: solo dev / freelancer doing mostly web work… react, node, some python. small/mid tier clients. I bill hourly, which means time saved is direct revenue, which is the only reason I'm able to be honest about ROI here. subscriptions I have: cursor pro: $20/mo claude pro + claude code api usage: $110/mo (api was the variable, plus alone is $20) chatgpt plus: $20/mo (mostly inertia at this point, honestly) github copilot: $10/mo coderabbit: $15/mo v0 + occasional one-offs: $25/mo across two months total subscription spend: roughly $200/mo, $400 over period. this is the number people argue about on twitter/X. it is also, I now realize, least interesting number in entire calculation. here’s where it gets interesting: I tracked time spent on three categories: time generating output that ended up in prod: clear win, easy to count, 62 hours over 60 days. at my rate that's a real number time fixing AI output that was wrong but plausible: this is where it got bad. 28 hours. almost half as much time as productive work time switching between tools, debugging specific weirdness and arguing with an agent that was wrong: 14 hours so for every productive hour of AI use, I was burning roughly 40 minutes of overhead. nobody talks about that 40 minutes and depending on the kind of work, it was worse and refactoring legacy code was almost 1:1 productive vs wasted time. this is how I actually saved: I tried to estimate what same work would've taken without AI tools. best estimate: 62 productive hours would've been 110-130 hours without AI assistance. so net savings of 50-70 hours over 60 days. at my hourly rate that pays for the subscriptions many times over. 
so verdict is yes worth it. but the verdict everyone wants to hear (AI made me 3x faster) is wrong. it's more like 1.7-2x on a generous and that's only after subtracting 42 hours of overhead. line items I'd cut and keep: going through receipts, here's what surprised me: kept: cursor pro, claude code, coderabbit on watch: chatgpt plus (using it less and less, it's basically a habit) cut: copilot (overlaps too much with cursor for my workflow), v0 (only useful for specific work) the surprise was coderabbit, honestly. cheapest line item on my list and one I was most ready to cut going in but when I went back through 60 days of pull requests, the time I would've spent doing my own line by line review of agent output, which I now do religiously after a few burns was massive. an automated first pass cost me $15 and saved probably 6-8 hours of review work over the period. that's highest ROI per dollar of anything on the list, and I almost didn't track it because it felt too small to matter. generation tools are sexier. review tools punch way above their weight when you're using generation tools heavily. that's the actual finding. takeaway nobody put in their twitter thread: most of the cost of AI tools conversation is about the wrong number. subscription cost is rounding error compared to time cost of bad output and the way you minimize that time cost isn't by buying a better generation tool, it's by buying a verification tool to sit on top of whatever you're already using. if I had to start over, I'd buy the cheapest decent generation tool I could find and put my money on the review/verification layer instead that's the inversion of what the marketing tells you to do. tl;dr: tracked AI tool spend for 60 days. subscriptions ($200/mo) were the easy and least interesting number. - real cost was 42 hours of overhead per 60 days of productive use. - real savings were 50-70 hours, which is worth it but it's 1.7-2x not 10x. 
- biggest surprise was that cheapest tool on my list had highest ROI/ dollar by margin. what's your actual stack costing you, including the time tax? I'm curious if other people who've tracked this seriously are seeing similar overhead numbers or if I'm just bad at this. submitted by /u/thewritingwallah [link] [comments]
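The arithmetic in the 60-day tracking post can be replayed directly from its own figures. The sketch below uses only numbers quoted in the post. The variable names are illustrative, and the post does not say which denominator its 1.7-2x figure uses, so both readings are computed.

```python
# All figures are copied from the post; nothing here is measured.
MONTHLY_SUBSCRIPTIONS = {
    "cursor pro": 20,
    "claude pro + claude code api": 110,
    "chatgpt plus": 20,
    "github copilot": 10,
    "coderabbit": 15,
    "v0 + occasional one-offs": 25,
}
monthly_spend = sum(MONTHLY_SUBSCRIPTIONS.values())  # 200
period_spend = monthly_spend * 2                     # 400 over two months

productive_hours = 62        # output that ended up in prod
overhead_hours = 28 + 14     # fixing wrong output + tool switching = 42
overhead_minutes_per_productive_hour = overhead_hours / productive_hours * 60

# The post's estimate of the same work without AI tools (low, high).
no_ai_estimate_hours = (110, 130)

# Savings measured against productive hours only, as the post does.
hours_saved = tuple(h - productive_hours for h in no_ai_estimate_hours)

# Speedup if only the productive hours are counted as time spent.
speedup_productive_only = tuple(
    h / productive_hours for h in no_ai_estimate_hours
)

# Speedup if the 42 overhead hours are also counted as time spent.
speedup_with_overhead = tuple(
    h / (productive_hours + overhead_hours) for h in no_ai_estimate_hours
)
```

With these inputs the overhead works out to about 40.6 minutes per productive hour, matching the post's "roughly 40 minutes". The 1.7-2x range lines up with the productive-hours-only ratio (about 1.77 to 2.10), while counting the overhead hours in the denominator gives about 1.06 to 1.25.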
Claude skills silently override my instructions, and the surprising pitfalls
So today when working with a Claude skill, I curiously clicked to expand what it was thinking amid the work and spotted this: I need to run the intake step using the ask_user_input_v0 function to gather sources.... The tool has a tight constraint — max 3 questions with 2-4 options each — so I need to be strategic... So it is like, even when Claude needs to ask more than 3 questions or has more than 4 options per question, it will compact them because of the tool's constraints. Further digging and it is correct that ask_user_input_v0 does have those hard limits. But this is not noted or mentioned in places that I could learn. If I didn't see the thinking process, I would never have known it exists. The fix for me was easy: I updated my skill to ask multiple rounds when it needs to. But the bigger questions are: How do I share this to others? Is there any other pitfall when working with Claude skills? So I went deeper to discover more pitfalls. Surprisingly there are more, and they aren't in skill-creator either. For example: Write silently overwrites files on Code/Desktop. create_file refuses to overwrite on Claude.ai. Same instruction, opposite behavior. The officially-recommended references/ pattern is broken — relative paths don't resolve from the skill's directory on any platform. Skills referencing tools that don't exist on the running platform fall back silently to prose. No error. I started a notes repo to store the findings here: https://github.com/livlign/claude-skills-pitfalls Has anyone else hit pitfalls like these? submitted by /u/ahihidummy [link] [comments]
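The "ask multiple rounds" fix described in the post can be sketched as a small planning step that respects the limits quoted for `ask_user_input_v0` (at most 3 questions per call, 2 to 4 options each). The function `plan_rounds` and the tuple layout are hypothetical; only the limits come from the post.

```python
MAX_QUESTIONS_PER_ROUND = 3   # limit quoted in the post
MIN_OPTIONS, MAX_OPTIONS = 2, 4  # limits quoted in the post


def plan_rounds(questions):
    """Split (text, options) pairs into rounds that fit the tool's limits.

    Raises ValueError instead of silently trimming a question whose option
    count is out of range, so the skill author sees the problem.
    """
    for text, options in questions:
        if not MIN_OPTIONS <= len(options) <= MAX_OPTIONS:
            raise ValueError(
                f"{text!r} has {len(options)} options; "
                f"expected {MIN_OPTIONS} to {MAX_OPTIONS}"
            )
    return [
        questions[start:start + MAX_QUESTIONS_PER_ROUND]
        for start in range(0, len(questions), MAX_QUESTIONS_PER_ROUND)
    ]


# Seven intake questions need three rounds of 3, 3, and 1.
intake = [(f"Question {n}?", ["yes", "no"]) for n in range(1, 8)]
rounds = plan_rounds(intake)
```

Failing loudly on an oversized option list is a design choice for this sketch: it surfaces the constraint that the post says was otherwise only visible in the model's thinking.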
Codex now supports 8 hooks - all implemented in Codex CLI Hooks repo
OpenAI shipped PreCompact and PostCompact in Codex CLI v0.129.0, which means the full hook surface is now covered. I put together a repo that wires up all eight. Repo: https://github.com/shanraisshan/codex-cli-hooks submitted by /u/shanraisshan [link] [comments]
I built a sidebar for Claude Code: every prompt clickable, jumps the terminal back to that turn
The why: I run Claude Code in a tmux session on a Linux dev box, SSH'd in from a Windows laptop. The terminal-only flow worked, but I wanted three things tmux alone doesn't give me — clickable prompt history, a file panel next to the terminal so I stop cat-ing things to look at them, and push notifications when Claude is waiting for me without staring at the tab. Existing tools each solve one slice (ttyd = terminal only, filebrowser = files only, code-server is VS Code-shaped and heavy). I wanted them in one page, on every device. Started as a weekend project, ended up as my daily driver. What it is: a single Go binary on your dev box. SSH-tunnel into 127.0.0.1:8080: xterm.js terminal, tmux-backed (survives disconnects, sleeps, server restarts) File tree (preview, drag-drop upload, follows your cd via tmux's pane_current_path — no shell integration needed) Activity panel reads ~/.claude/projects/*.jsonl and shows every prompt. Click one → terminal scrolls back to that turn. Same for Top-bar chips for active model + latest context tokens Push notifications via Claude Code's Stop hook (laptop pings when Claude is idle, even with tab backgrounded) Design decisions worth sharing: tmux is the durability layer. Every session is tmux new-session -A -s {id}. Shell survives WS disconnect, server restart, idle timeout because tmux already solved that. roost owns the WebSocket bridge and an append-only disk log — that's it. Single-user-per-instance, forever. I refuse to add accounts/RBAC. Two people share a host? Each runs their own roost serve on a different port. UNIX UIDs handle isolation. Multi-tenant logic belongs in a reverse-proxy, not the binary. Kept the auth code under 100 lines. Vanilla JS, no build step. Frontend is plain files under //go:embed all:web. No bundler. Easier to debug, easier to ship, lower future cost. 
One bug worth flagging: tmux's display-message -p '#{x}\x1f#{y}' returns 0x1f as literal _ when tmux is launched without a UTF-8 locale (systemd / launchd units, for example). Burned an hour on this before realising tmux -u is the one-line fix. If you ever pipe tmux through field separators, lock the locale. Validated combo right now: Linux server + Windows Chrome over SSH tunnel. macOS-as-server works but has rough edges. Codex sessions work too if you swap agents. Repo + GIF demo: https://github.com/liamsysmind/roost v0.1.0 tarballs: https://github.com/liamsysmind/roost/releases/tag/v0.1.0 If you drive Claude Code over SSH — what's missing for you? submitted by /u/Adventurous_Sun9149 [link] [comments]
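The separator bug in the roost post is easy to guard against in code. The sketch below builds a `tmux -u display-message -p` command that joins format variables with the 0x1f unit separator and rejects output where the separator did not survive. It is written in Python for illustration (roost itself is a Go binary), and the helper names are hypothetical.

```python
import subprocess

FIELD_SEP = "\x1f"  # ASCII unit separator, as used in the post


def display_message_argv(target, fields):
    """Build the tmux command; -u forces UTF-8 handling, the post's fix."""
    fmt = FIELD_SEP.join("#{" + name + "}" for name in fields)
    return ["tmux", "-u", "display-message", "-p", "-t", target, fmt]


def parse_fields(output, fields):
    """Split tmux output on the separator and check the field count."""
    values = output.rstrip("\n").split(FIELD_SEP)
    if len(values) != len(fields):
        # Typical symptom of the bug: the separator came back as '_',
        # so the whole line arrives as a single value.
        raise ValueError(f"expected {len(fields)} fields, got {len(values)}")
    return dict(zip(fields, values))


def query_pane(target, fields):
    """Run tmux and return the requested format variables as a dict."""
    result = subprocess.run(
        display_message_argv(target, fields),
        capture_output=True, text=True, check=True,
    )
    return parse_fields(result.stdout, fields)
```

A caller would use something like `query_pane("my-session", ["pane_current_path", "pane_id"])`. The field-count check turns a silent locale problem into an immediate error.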
I needed eval data without hallucinations, so I built this with Claude Code
Shipped v0.2.0 today. MIT, public repo. What it is: a tool that generates fake customer conversations with known quality problems planted in them. You give it a seed, it gives you a corpus. Same seed gets you the same structure every time. The LLM only fills in the actual words — it can't make up the facts. Why I built it: I'm working on a customer intelligence platform and I need to test whether my AI scorers actually catch what they're supposed to catch. Can't use real customer data. Two projects got me here. Garry Tan's gbrain-evals showed me this approach could work at all. The actual architecture — deterministic engine owns the truth, LLM only writes prose — comes from a paper called OrgForge by Jeffrey Flynt. Both linked in the README. Built this with Claude Code. I mostly designed in chat and handed off prompts. It's not perfect. Known bugs are filed as public GitHub issues.: The scorer is too forgiving on some failure modes. The prose generator occasionally invents documentation that doesn't exist. 461 tests pass, Python 3.11 and 3.12. https://github.com/ResonantIQ/resonantforge Questions welcome. submitted by /u/SMacKenzie1987 [link] [comments]
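The split the post describes (a deterministic engine owns the facts, the LLM only writes the words) can be sketched with a seeded random generator. This is not ResonantForge's code: the defect names, field names, and functions are invented for illustration, and the prose writer is a stub standing in for an LLM call.

```python
import random

# Hypothetical defect and channel vocabularies for the illustration.
DEFECTS = ["missed_escalation", "wrong_refund_amount", "ignored_question"]
CHANNELS = ["email", "chat", "phone"]


def build_skeleton(seed, n_conversations=3):
    """Generate the ground truth from the seed alone, with no LLM involved."""
    rng = random.Random(seed)
    corpus = []
    for index in range(n_conversations):
        turns = rng.randint(4, 8)
        corpus.append({
            "id": f"conv-{index}",
            "channel": rng.choice(CHANNELS),
            "turns": turns,
            "planted_defect": rng.choice(DEFECTS),
            "defect_turn": rng.randrange(turns),
        })
    return corpus


def render(skeleton, write_prose):
    """Attach prose to each item; the writer cannot change the facts."""
    return [dict(item, text=write_prose(item)) for item in skeleton]


def stub_writer(item):
    """Placeholder for the LLM step; it only phrases what it is given."""
    return f"A {item['channel']} conversation with {item['turns']} turns."


first = build_skeleton(seed=42)
second = build_skeleton(seed=42)
rendered = render(first, stub_writer)
```

Because the planted defect lives in the skeleton and not in the generated text, a scorer can be graded against a known answer even if the prose varies between runs.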
Claude Code Prompt Improver v0.5.3 — plan mode readability + subagent-first research
I released v0.5.3 of the Claude Code Prompt Improver today. The project is past 1.4K stars on GitHub. Here is what changed in the v0.5.x releases. Summary New PreToolUse hook adds readability guidance when Claude enters plan mode Vague prompt research now runs in Task/Explore subagents on Haiku instead of the main context Marketplace renamed to severity1-marketplace Windows install works now (python3 || python fallback) What is the plugin? A UserPromptSubmit hook that checks if a prompt is vague before Claude Code runs it. Clear prompts pass through. Vague prompts trigger the prompt-improver skill. The skill researches the codebase and asks 1 to 6 questions using AskUserQuestion. The hook adds about 189 tokens per prompt. Clear prompts do not load the skill. v0.5.3: Plan mode readability Plans got long on revisions. Claude added text like "previously I considered X but rejected it because Y" and the plan grew with each pass. The new hook runs on EnterPlanMode and tells the model: Keep the problem statement, remove decision history On revisions, rewrite the full plan clean. Do not append or annotate. One action per step. Use file paths as anchors like src/auth.ts:42. Use short action steps, not long explanations. v0.5.2: Subagent-first research dispatch When a prompt was vague, the skill called Glob, Grep, WebSearch, and other search tools directly in the main context. This used main-model tokens for search work. Now those tools run through Task/Explore. Explore uses Haiku and a separate context window. The main context only handles git commands, single-file Reads of user-named files, synthesis, and the question to the user. 
v0.5.1 and v0.5.0: Maintenance Marketplace renamed from claude-code-marketplace to severity1-marketplace Hook command uses python3 || python so it works on Windows CLAUDE.md uses the auto-memory format now Install claude plugin marketplace add severity1/severity1-marketplace claude plugin install prompt-improver@severity1-marketplace Repo: https://github.com/severity1/claude-code-prompt-improver Feedback is welcome, especially on the plan mode guidance wording. submitted by /u/crystalpeaks25 [link] [comments]
Parax v0.7: Parametric Modeling in JAX [P]
Hi everyone! Parax is a library for "Parametric modeling" in JAX, attempting to bridge the approach between pure JAX PyTrees, and more object-orientated modeling approaches (e.g. using Equinox). v0.7 has been released, featuring a more polished API as well as some detailed examples in the documentation. Some of Parax's features: Derived/constrained parameters with metadata Computed PyTrees and callable parameterizations Abstract interfaces for fixed, bounded, and probabilistic PyTrees and parameters Two new examples in the docs that show off these features Bounded optimization (JAXopt) Bayesian sampling (BlackJAX) Perhaps the library is of use to someone, and feel free to leave any feedback! Cheers, Gary submitted by /u/gvcallen [link] [comments]
I got tired of the API bills for 100k+ context windows, so I built a persistent O(1) semantic memory state engine to compress history
Hey everyone, The entire industry right now is cheering for massive 1M+ context windows, but I think it's fundamentally the wrong approach. "Just add more RAM" is a trap. Stuffing 100k+ tokens of raw conversation history into a prompt doesn't just burn your API budget; it actually degrades the model's reasoning through the "lost in the middle" effect. I got tired of my AI agents drowning in their own chat histories, so I built an application-layer semantic memory engine called Semvec. The core shift is moving from an O(n) linear history to an O(1) constant-cost semantic state. But compressing chat history is just the baseline. When you treat memory as a fixed-size state vector, it unlocks entirely new architectures for agents that standard RAG or context-stuffing simply can't do: Persistent Coding Agents (MCP Integration) We built an MCP server for Claude Code and Cursor. Instead of dumping 5 whole files into the context window for a refactor, Semvec tracks the architectural invariants and past error patterns across different sessions. It gives your coding agent a persistent "Second Brain"—if it messed up a database schema in session 2, it remembers the "anti-resonance" rule in session 35 so it doesn't make the same mistake. Multi-Agent Swarms (Cortex) If you run multiple agents (like an Analyst and a Critic), they shouldn't have to read each other's 10,000-token transcripts to collaborate. With the Cortex module, agents exchange compressed StateVectorPackets and use a ConsensusEngine to merge their perspectives mathematically, sharing a global state with zero overhead. Enterprise Auditability & GDPR (Compliance Pack) If you run AI memory in production, you need to prove exactly what state the LLM acted on, and you need to be able to legally delete it. The compliance pack handles this via an append-only event store for deterministic replay, HMAC request signing, and GDPR Art. 17 "Right to be Forgotten" workflows with signed deletion certificates. 
The Benchmark Data: True Constant Cost: We ran a 50,000-turn stress test. While standard baseline history exploded past 75,000+ tokens, Semvec's footprint stayed flat at around ~550-625 tokens per turn. Quality goes UP: Because we strip out the noise and feed the LLM a highly concentrated "essence" of the context, blind A/B LLM-judge scores on LongBench-v2 actually increased for both small models (Llama 3.1-8B) and massive ones (gpt-oss-120B). A quick note on privacy & tracking: When I was initially designing the commercial licensing side, I experimented with an anti-abuse telemetry script to prevent automated clone-training. This was a terrible approach that compromised the local-first nature of the tool. I have completely ripped it out in v0.5.1, all versions containing it are yanked. Semvec for community users is now 100% air-gapped, local, with zero background tracking. The core engine is proprietary/patent-pending to bootstrap the project, but you can pip install the Python SDK and the MCP Server right now for free via the built-in community license. I'd love to hear your thoughts on the O(1) memory architecture vs. Prompt Caching, and if you think bounded semantic states are the future of long-running agents. Docs & Architecture: https://semvec-docs.pages.dev/ PyPI: https://pypi.org/project/semvec/ submitted by /u/scheitelpunk1337 [link] [comments]
the reason most Claude pipeline failures trace back to the same place (and it's not the model)
a prompt and a skill look identical until something breaks downstream.

you build a prompt. it works. you put it in a pipeline, another node calls it. weeks later the pipeline is producing wrong outputs. you dig back through it. the prompt that was "working" was assuming a specific input format nobody documented. it was also returning a structure that only one caller knew how to parse. it worked once because everything aligned. it failed silently forever after because nothing forced the alignment to hold.

the difference between a prompt and a skill:

the skill has an input contract: specifically what fields it needs, what happens if one is missing, what the minimum viable input looks like. this takes ten minutes to write and prevents a class of failures that would otherwise surface at 2am.

the skill has an output schema: what it returns, in what format, with what failure states visible. "returns a summary" is not a schema. a schema says: success = {action: string, confidence: float, reasoning: string}, failure = {action: "skip", reason: string}. two very different things.

the skill has a learnings file: what has it failed at, what edge cases have already been found, what broke it in production and how. this fills in over time. every time the skill burns you, the pain goes here instead of being rediscovered by whoever runs it next.

the prompt alone is v0. the skill is what you promote to v1.

curious what structure your team is using for reusable Claude outputs. whether you did any of this or discovered something else that mattered more.

submitted by /u/Most-Agent-7566
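a minimal sketch of what the input contract and output schema could look like in code. the success and failure shapes come from the post; everything else is an assumption made for illustration: the required input fields (`document`, `goal`), the 0-1 confidence range, and the `call_model` callable standing in for whatever actually invokes the prompt.

```python
from dataclasses import dataclass
from typing import Callable, Union

# Hypothetical input contract: these field names are illustrative only.
REQUIRED_FIELDS = ("document", "goal")


@dataclass
class Success:
    action: str
    confidence: float
    reasoning: str


@dataclass
class Failure:
    reason: str
    action: str = "skip"


Result = Union[Success, Failure]


def validate_input(payload: dict) -> list:
    """Return the names of required fields that are missing or empty."""
    return [name for name in REQUIRED_FIELDS if not payload.get(name)]


def parse_output(raw: dict) -> Result:
    """Coerce a model response into the schema, or into a visible failure."""
    if raw.get("action") == "skip":
        return Failure(reason=str(raw.get("reason", "unspecified")))
    try:
        confidence = float(raw["confidence"])
        # Assumption: confidence is a probability-like score in [0, 1].
        if not 0.0 <= confidence <= 1.0:
            return Failure(reason="confidence out of range")
        return Success(
            action=str(raw["action"]),
            confidence=confidence,
            reasoning=str(raw["reasoning"]),
        )
    except (KeyError, TypeError, ValueError) as exc:
        return Failure(reason=f"malformed output: {exc!r}")


def run_skill(payload: dict, call_model: Callable[[dict], dict]) -> Result:
    """Check the input contract first, then enforce the output schema."""
    missing = validate_input(payload)
    if missing:
        return Failure(reason=f"missing input fields: {missing}")
    return parse_output(call_model(payload))
```

the point of the wrapper is that a bad input or a malformed response becomes an explicit `Failure` the caller has to handle, instead of a structure only one caller knows how to parse.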
I trained a NER model on 33,000 Indian Supreme Court judgments (1950–2024) CASE_CITATION hits 97.76% F1, +17 points over the only prior baseline [P]
TL;DR: Released en_legal_ner_ind_trf v0.1 - InLegalBERT fine-tuned on ~34,700 silver-annotated chunks from 33k Indian SC judgments. 13 labels. 78.67% overall F1. CASE_CITATION at 97.76% already exceeds OpenNyAI's PRECEDENT score by +17 points. Free, Apache-2.0.

Why this exists

OpenNyAI is the only prior Indian legal NER model with any community presence. It's unmaintained and degrades on pre-1990 OCR-era text - the first 40 years of India's constitutional jurisprudence. No replacement existed.

Results

Entity          F1          Support
CASE_CITATION   97.76%      3,821
PROVISION       96.35%      20,248
STATUTE         91.94%      8,187
LAWYER          74.67%      3,982
JUDGE           68.06%      1,978
DATE            55.15%      3,289
RESPONDENT      50.44%      1,731
COURT           50.34%      1,033
WITNESS         49.77%      762
OTHER_PERSON    47.11%      4,266
PETITIONER      44.71%      1,573
ORG             41.34%      2,128
GPE             36.56% ⚠    1,197
micro avg       78.67%      54,195

Evaluated on a held-out validation split (~500 documents, stride=512, non-overlapping). The 25-file locked test set is untouched - head-to-head with OpenNyAI runs in v1.0.

Comparison note: OpenNyAI (RoBERTa + transition-based parser, gold-annotated) achieved 91.1% overall strict F1. Not directly comparable - different test sets, different annotation quality, different corpus scope. The +17 point gap on CASE_CITATION is the one apples-to-apples number worth flagging.

The annotation pipeline

Silver labels from four automatic pipelines merged per document:

- Regex: 14-pattern citation extractor + statute/provision extractor → CASE_CITATION, STATUTE, PROVISION
- Metadata projection: case metadata JSONs mapped to character offsets via RapidFuzz → JUDGE, PETITIONER, RESPONDENT
- Transformer NER: OpenNyAI en_legal_ner_trf, offset-corrected → LAWYER, COURT, ORG, GPE, DATE, OTHER_PERSON, WITNESS
- Gazetteer: 858 Central Acts with alias resolution → confirms and adds STATUTE spans

Trained with Focal Loss (γ=2.0) to handle label imbalance between STATUTE/CASE_CITATION and O tokens. Hardware: Kaggle T4 (free tier).

Known weak spots - being honest

- GPE (36.56%) and ORG (41.34%) are the problem labels. In Indian legal text, "State of Maharashtra" or "Union of India" appear as GPE, PETITIONER, RESPONDENT, or ORG depending on context. A linear token classification head can't resolve overlapping roles. CRF head is v1.0's job.
- Positional bias - silver training data has repetitive header structures. Performance degrades when parties appear mid-document.
- Pre-1990 OCR noise - judgments from 1950–1989 vary in quality. Recall drops the further back you go.

What's next

300-file gold annotation is in progress (3 volunteers onboard). v1.0 will add a CRF head, run the locked test set, and publish the official head-to-head with OpenNyAI.

Model: huggingface.co/evolawyer/inlegalbert-sc-ner-silver
Dataset: huggingface.co/datasets/evolawyer/indian-sc-judgments-ner-silver
GitHub: github.com/evolawyer/inlegalbert-sc-ner-silver

Happy to go deep on the annotation pipeline, conflict resolution between the four label sources, or the Focal Loss setup.

submitted by /u/gkv856
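For readers unfamiliar with the Focal Loss mentioned above, here is a minimal per-token sketch of the standard form, FL = -(1 - p_t)^γ · log(p_t), with γ=2.0 as stated in the post. It is not the author's training code: the post does not say whether a class-weighting term (α) was used, so none is included here, and the example logits are made up.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def focal_loss(logits, target, gamma=2.0):
    """Focal loss for one token: -(1 - p_t)^gamma * log(p_t).

    p_t is the predicted probability of the true label. Assumes no
    alpha class weighting (the post does not specify one).
    """
    p_t = softmax(logits)[target]
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


def cross_entropy(logits, target):
    """Plain cross-entropy is the gamma = 0 special case."""
    return focal_loss(logits, target, gamma=0.0)


# Made-up logits: a confidently correct token (typical of the abundant
# O class) versus a token the model is unsure about.
easy = [5.0, 0.0, 0.0]
hard = [0.0, 0.0, 0.0]

for name, logits in (("easy", easy), ("hard", hard)):
    ce = cross_entropy(logits, 0)
    fl = focal_loss(logits, 0)
    print(f"{name}: cross-entropy={ce:.4f} focal={fl:.6f}")
```

The (1 - p_t)^γ factor shrinks the loss on tokens the model already gets right far more than on uncertain ones, which is why it is a common choice when O tokens heavily outnumber entity tokens.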
Mahoraga - Stop paying Anthropic and OpenAI so much
Are you sick of paying a million credits per month?!?!? I'm joking, i aint that enthusiastic. But really, this saves me a ton of credits by routing simple tasks to local agents. Clone the repo, fork the repo, star the repo, whatever you want. github.com/pockanoodles/Mahoraga This is Mahoraga, an open-source orchestrator that routes tasks across local and cloud AI agents using a contextual bandit (LinUCB) that learns from every decision. Context (skip): I only started integrating AI into my workflows in late 2025, so I came on the scene broke with no credits. This left me with local models. However, many students and employees also receive credits from their institution to work with. (I got claude yippee) I wanted to be able to flawlessly route between models when credits ran out, which made me build an orchestrator. I used to use claude more as a chatbot/complete workflow engine, which made it difficult to use local models due to the context window, reasoning, etc. Opus 4.5 running open-source "superpowers" ate my usage every month. Now I realize that wasn't an effective way to use claude, or AI in general. I was using claude for both heavy planning/brainstorming and minor tasks. How about tasks specifically for code generation? Code generation is a relatively constrained task, with correct answers and short outputs. Surely local models can compete in tasks that don't need cloud? So I switched Mahoraga to an adaptable router. I ran 192 tasks across 8 agents (4 local Ollama models, 4 cloud CLIs) on a 16GB MacBook Pro, forcing round-robin so every agent got every prompt. Quality is scored by a 4-layer heuristic system (novelty ratio, structural checks, embedding similarity, length ratio). Zero API cost for evaluation, and no LLM-as-judge. Qwen3 4B in nothink mode dominates code and refactor at 33.8 t/s and 6.1s average latency. Cloud agents cluster around 0.650 on code. The local model isn't just cheaper; it's measurably better for this task class. 
Other findings:

- LFM2 hits 77.1 t/s but trades ~5 quality points vs Qwen3 4B
- DeepSeek-R1 averages 123.5s per task on 16GB. The reasoning overhead makes it unusable as a default
- Security scores are flat at 0.650 across all agents due to my own error: the scorer doesn't capture security-specific signals well.
- The bandit (LinUCB) is the only routing strategy with sublinear regret (β=0.659) across a 200-task simulation, so it actually converges

The routing works in two stages: the keyword classifier puts the task in a capability bucket (code, plan, research, etc.), and then the bandit picks the best agent within that bucket. 9-dimensional context vector, persistent state across sessions, warm-start from the compatibility matrix. All local inference, all free. Cloud escalation exists but only fires on retry.

Why pay for cloud when a local model handles it better?

Looking for any feedback, any input. Feel free to be critical: I appreciate everyone who interacts on this subreddit. I will continue to work on this in the future. Again, this is open source and free. (Mods, please. I'm not making any money off this.)

submitted by /u/Own-Professional3092
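For anyone curious what the two-stage routing could look like, here is a rough sketch of a keyword classifier followed by a per-bucket LinUCB bandit. This is not Mahoraga's code: the keyword lists, agent names, the `alpha` exploration weight, and the use of a quality score as the reward are all assumptions for illustration, and the warm-start and persistence the post mentions are left out.

```python
import math


class LinUCBArm:
    """Ridge-regression state for one agent (disjoint LinUCB model)."""

    def __init__(self, dim):
        # A_inv starts as the identity, the inverse of the ridge prior A = I.
        self.A_inv = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
        self.b = [0.0] * dim

    def _matvec(self, x):
        return [sum(a * v for a, v in zip(row, x)) for row in self.A_inv]

    def ucb(self, x, alpha):
        """Estimated reward plus an exploration bonus for context x."""
        a_inv_x = self._matvec(x)
        theta = self._matvec(self.b)
        mean = sum(t * v for t, v in zip(theta, x))
        variance = max(sum(a * v for a, v in zip(a_inv_x, x)), 0.0)
        return mean + alpha * math.sqrt(variance)

    def update(self, x, reward):
        """Rank-one update of A_inv via the Sherman-Morrison formula."""
        a_inv_x = self._matvec(x)
        denom = 1.0 + sum(a * v for a, v in zip(a_inv_x, x))
        for i in range(len(x)):
            for j in range(len(x)):
                self.A_inv[i][j] -= a_inv_x[i] * a_inv_x[j] / denom
        self.b = [bi + reward * xi for bi, xi in zip(self.b, x)]


# Hypothetical keyword buckets, for illustration only.
KEYWORDS = {"code": ("refactor", "implement", "bug"), "plan": ("plan", "design")}


def classify(task):
    """Stage 1: put the task in a capability bucket by keyword match."""
    text = task.lower()
    for bucket, words in KEYWORDS.items():
        if any(word in text for word in words):
            return bucket
    return "research"


class Router:
    def __init__(self, agents_by_bucket, dim, alpha=1.0):
        self.alpha = alpha
        self.arms = {
            bucket: {agent: LinUCBArm(dim) for agent in agents}
            for bucket, agents in agents_by_bucket.items()
        }

    def pick(self, task, context):
        """Stage 2: choose the agent with the highest UCB in the bucket."""
        bucket = classify(task)
        arms = self.arms[bucket]
        agent = max(arms, key=lambda name: arms[name].ucb(context, self.alpha))
        return bucket, agent

    def learn(self, bucket, agent, context, reward):
        self.arms[bucket][agent].update(context, reward)
```

In use, you would call `pick` with the task text and its context vector (9-dimensional in the post), run the chosen agent, score the output, and feed that score back through `learn`.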
eTPS - Effective Tokens Per Second: A Better Way to Measure Local LLM Performance
We're obsessed with raw tokens per second. Every hardware post leads with it. Every quantization comparison is ranked by it. It's the one number everyone agrees to report. It's also measuring the wrong thing.

Raw TPS tells you how fast tokens hit the screen. It tells you almost nothing about how quickly you get a correct, usable answer. On sustained, multi-turn workflows, that gap becomes massive. A faster model that hallucinates, requires multiple corrections, and forgets context you gave it earlier can easily be less useful than a slower model that gets it right the first time.

eTPS (Effective Tokens Per Second) is a complementary metric that measures actual progress toward a useful answer, not just token throughput. The basic idea: weight the final accepted output by how clean the path to that answer was (first-pass correct scores highest), then divide by total time. Correction loops, hallucinations, and repeated explanations all reduce the score. A response that never reaches a correct answer scores zero regardless of speed. It doesn't replace raw TPS. It sits next to it.

Results - same prompt, four runs, same hardware:

- gemma-4-e2b (4.6B): 53.2 raw TPS → eTPS 53.18 ✓
- qwen3.5-0.8b: 173.1 raw TPS → eTPS 86.57 ✗ partial
- qwen3.5-9b (optimized): 1.8 raw TPS → eTPS 1.78 ✓
- qwen3.5-9b (baseline): 0.5 raw TPS → eTPS 0.32 ✗ partial

The 0.8B leads on raw speed by a wide margin and still lost. Raw TPS said it won. eTPS said it didn't.

Hardware: RTX 5060 Laptop, 8GB VRAM. eTPS scores aren't portable across hardware, so always report your full setup.

Known limitations (v0.1):

- Scoring requires human judgment. The line between "needed clarification" and "was factually wrong" isn't always clean.
- Code generation with objective pass/fail criteria is a cleaner target and the focus of the next benchmark run.
- One task isn't representative of sustained multi-turn workflows. That's where the metric gets most interesting and where I'm headed next.
- Easy to game without full system prompt logging. The spec will require it.

These are acknowledged constraints, not hidden flaws. Full specification coming soon covering methodology, task library, scoring protocol, and reproducibility standards.

Before I lock the final weights I'd genuinely like input on two open questions: How should the penalty differ between a model that confidently states something false versus one that's just vague enough you had to ask a follow-up? And should hardware normalization live in the core formula or be reported separately?

Thoughts welcome.

submitted by /u/axendo
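To make the basic idea concrete, here is a minimal sketch of the calculation as described in the post: accepted output tokens, weighted by path quality, divided by total time. The post has not published its weights, so the `PATH_WEIGHTS` values and outcome names below are placeholders invented for illustration, not the eTPS spec.

```python
# Hypothetical path-quality weights. The post says first-pass correct scores
# highest and never-correct scores zero; the middle value is a placeholder.
PATH_WEIGHTS = {
    "first_pass": 1.0,
    "one_correction": 0.5,
    "never_correct": 0.0,
}


def raw_tps(total_tokens, total_seconds):
    """Plain throughput: every generated token counts."""
    return total_tokens / total_seconds


def etps(accepted_tokens, total_seconds, path_weight):
    """Accepted output, weighted by path quality, over total wall time.

    total_seconds should include the time spent on any correction turns.
    """
    if total_seconds <= 0:
        raise ValueError("total_seconds must be positive")
    if not 0.0 <= path_weight <= 1.0:
        raise ValueError("path_weight must be in [0, 1]")
    return accepted_tokens * path_weight / total_seconds


# Made-up runs: the same 1,000-token answer, reached two different ways.
clean = etps(1000, 20.0, PATH_WEIGHTS["first_pass"])
corrected = etps(1000, 30.0, PATH_WEIGHTS["one_correction"])
failed = etps(1000, 5.0, PATH_WEIGHTS["never_correct"])
print(clean, corrected, failed)
```

Under these placeholder weights a run that needs a correction is penalized twice, once by the weight and once by the extra time, and a run that never reaches a correct answer scores zero however fast it was.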
Yes, v0 offers a free tier. Pricing found: $0 /month, $5, $30 /user, $30, $2
v0 has an average rating of 5.0 out of 5 stars based on 1 review from G2, Capterra, and TrustRadius.
Key features include: Sync with a repo, Integrate with apps, Deploy to Vercel, Edit with design mode, Start with templates, Create design systems, Agentic by default, Create from your phone.
v0 is commonly used for: Rapid prototyping of web applications, Creating landing pages for marketing campaigns, Building internal tools for team collaboration, Developing e-commerce websites quickly, Generating APIs for mobile applications, Creating interactive dashboards for data visualization.
v0 integrates with: GitHub, Vercel, Slack, Stripe, Firebase, Twilio, Google Analytics, Zapier, Figma, Notion.
Gary Marcus
Professor Emeritus at NYU
4 mentions
Based on user reviews and social mentions, the most common pain points are: token cost, API bill, token usage, LLM costs.
Based on 68 social mentions analyzed, 25% of sentiment is positive, 65% neutral, and 10% negative.