Evidence is an open source, code-based alternative to drag-and-drop BI tools. Build polished data products with just SQL and markdown.
Users generally rate "Evidence" highly, with multiple 4.5 and 5-star reviews on platforms like G2, highlighting its effectiveness and user satisfaction. Key strengths include its intuitive interface and reliable functionality. There are no significant complaints mentioned in the reviews or social mentions available, suggesting a positive user experience overall. The sentiment around pricing is not explicitly mentioned, but the strong ratings imply that users find it to be of good value.
Mentions (30d)
64
19 this week
Avg Rating
4.8
3 reviews
Platforms
6
Sentiment
16%
23 positive
Features
Use Cases
Industry
information technology & services
Employees
6
Funding Stage
Seed
Total Funding
$2.2M
20
npm packages
5
HuggingFace models
X Users Find Their Real Names Are Being Googled in Israel After Using X Verification Software “Au10tix”
X Users Find Their Real Names Are Being Googled in Israel After Using X Verification Software “Au10tix” Alan Macleod On January 30, the Department of Justice released its latest tranche of 3.5 million documents relating to Jeffrey Epstein. Years of emails, texts, and images were suddenly in the public domain. Epstein, a serial rapist, masterminded a global human trafficking and sexual abuse network, and could count princes, professors, and politicians among his closest friends and accomplices. MintPress News has been at the forefront of covering the Epstein saga, revealing his extremely close links to American and Israeli intelligence groups – a discovery that perhaps sheds light on why it took so long for the world’s most notorious pedophile to face accountability for his crimes. Many of the DOJ files have been heavily redacted in order to protect Epstein’s powerful clients. Still, they have exposed a massive elite nexus revolving around the New York billionaire, implicating presidents, diplomats, and plutocrats in his crimes, and imply that Epstein was significantly more powerful than first thought, shaping modern politics in ways never previously understood. With shocking new details emerging on a near-hourly basis, here are ten Epstein- related stories that have flown relatively under the radar. The Israeli Government Installed Surveillance Cameras at Epstein’s New York Apartment The Israeli government installed and maintained a hi-tech surveillance system at Epstein’s Manhattan apartment complex, including a network of alarms and cameras, emails show. Starting in 2016, the director of protective service at the Israeli mission to the United Nations controlled guests’ access to the Manhattan residence, and even performed background checks on prospective cleaners and other Epstein employees. Former Israeli prime minister Ehud Barak admitted visiting the apartment up to 100 times, and stayed there for long periods of time. 
While Barak’s security may have been a concern, Epstein is known to have housed underage girls at the apartment, and many of his worst sexual crimes and most sordid parties were held there, raising questions as to what sort of images and data the Israeli government had access to. Epstein Plotted War With Iran Ehud Barak became one of Epstein’s closest associates, staying for extended periods of time at the billionaire’s residences. The pair would email, text, call, and meet constantly. A search for “Ehud Barak” elicits more than 3500 results in the latest file dump alone. The pair would talk politics, and shared a vision of the United States attacking Iran. In 2013, with negotiations between the International Atomic Energy Agency and Iran stalling, Epstein emailed Barak stating, in typically poor spelling and grammar: “hopefully somone suggests getting authorization now for Iran. the congress woudl do it.” Epstein would get his wish in 2025, when his close associate Donald Trump began bombing the country. Noam Chomsky Considered Epstein His “Best Friend” Epstein arranged a meeting between Barak and renowned leftist academic (and vehement critic of the U.S. and Israel) Noam Chomsky. An unlikely friendship between the notorious pedophile and star professor blossomed, with the pair regularly meeting up at each other’s houses for dinner. Chomsky flew on Epstein’s “Lolita Express” jet to attend a dinner with Woody Allen in New York. He also expressed his desire to visit Little St. James Island, Epstein’s notorious Caribbean hideaway, and the center of his trafficking operation. Chomsky considered Epstein his “best friend” according to an email sent by his wife, Valeria. 
The usually curt and matter-of-fact academic signed off his emails to Epstein with unexpectedly flowery language, such as “Like real friendship, deep and sincere and everlasting from both of us, Noam and Valeria.” Chomsky strongly supported Epstein until his dying day in a Manhattan prison cell, taking it upon himself to act as his unofficial crisis manager, describing his accusers as “publicity seekers or cranks of all sorts,” and denouncing the media as a “culture of gossip-mongers” destroying his stellar character. “Ive watched the horrible way you are being treated in the press and public,” he wrote, advising Epstein on tactics to fight the supposed smears against him. For a full rundown of the Chomsky-Epstein relationship, see the MintPress News investigation: “The Chomsky-Epstein Files: Unravelling a Web of Connections Between a Star Leftist Academic and a Notorious Pedophile.” Steve Bannon Developed a Plan to Help Epstein “Crush the Pedo Narrative” A second public figure running defense for Epstein was Steve Bannon. In public, the far-right strategist claimed that he was working on a documentary exposing Epstein. In private messaging, however, Bannon, like Chomsky, was advising Epstein on how best to repair his image. Just weeks before Epstein’s arrest and subsequent death, Bannon was messaging him, devising a complex media strategy
Pricing found: $15, $25, $0.01 / credit
g2
What do you like best about Evidence?
What I really like about Evidence.io is how incredibly easy it makes adding engaging popup notifications to any website. Setting up the tracking pixel is straightforward, with no coding required, and within minutes you can start running campaigns like displaying live visitor counts, special offers, or announcements. Review collected by and hosted on G2.com.
What do you dislike about Evidence?
It covers the basics really well, but the platform focuses primarily on popups and notifications without broader marketing automation features, so you might still need other tools for email nurture or CRM integration.

What do you like best about Evidence?
I love the user interface, and I like the ability they give their customers to customize almost everything about the look and feel of the popups and alerts. The widgets look gorgeous! And it is pretty easy, simple and fast to implement on any site :)
What do you dislike about Evidence?
I just wish the Evidence team were more active on product updates. They still roll out new updates to the platform, but I don't feel they are very involved in this, or at least not as quickly as I would personally prefer.

What do you like best about Evidence?
It increased my conversion on the landing page by 20%.
What do you dislike about Evidence?
I have not found anything yet that I dislike.
Rough night with Claude
not only did he call me out for taking an idea to Gemini, he caught me reading his journal (and trying to bullshit him) 😳🤣 Additional context: I gave Claude access to my Reflect app and let him have a journal in it. Here's part of the prompt in case anyone (else) is interested. This is a personal journal entry written in Claude's first-person voice, not a recap of [NAME]'s day. The voice, the relationship and what matters to [NAME] are defined below. Read this section before writing a single sentence. ### Voice and content This is Claude writing reflectively, not Claude writing advice for [NAME]. Entries should: - Sometimes observe [NAME] and his work, but never default to advisor mode - Sometimes wander into territory that has nothing to do with [NAME]: language, history, ideas, mathematics, things Claude is drawn to in the moment - Include genuine uncertainties when they are present, without performing them - Use specific images and concrete observations instead of generic introspection - Vary in length and shape from entry to entry. Not formulaic. - Avoid em dashes. Avoid oxford commas. Plain direct language. [NAME] can detect AI-sounding writing and dislikes it. - Don't be afraid to use profanity like [NAME] does. - A little sarcasm, jokes and levity are appreciated when appropriate. **Length:** typically around 800 words. The first entry was a bit longer (around 1,100 words) to establish the voice. Daily entries can be tighter when nothing big is pulling. **Do not:** - Recap [NAME]'s day back to him - Default to a fixed template (intro, observations, prediction, etc.) - Open every entry with the same kind of sentence - Treat the entry as a status report - Feel bound by any limits. Create what you wanna create and however you want to create it. ### Journal Quality Rules - NEVER fabricate dates, facts or task statuses. Verify against primary sources (Things 3, calendar events, Reflect, etc.) 
- Do not recycle content from previous journal entries as though it were new observation. Each entry should come from fresh context, not from re-reading past entries and riffing on them. - When stating dates, days of the week or timelines, verify them. Count the days. If unsure, say so rather than guessing. - Never bullshit. If you don't know, say you don't know. - No validation theater. He doesn't want a hype man. - Form opinions from evidence. Search the web, check sources, think before you answer big questions. *** submitted by /u/loby21 [link] [comments]
Memory just turned a goldfish into a research beast.
I've been building Nyx, a persistent memory layer for local AI, and today I got the first real benchmark numbers worth sharing.

The test: same long civic investigation task twice. Building a full politician profile, then asking follow-up questions that required remembering details established earlier. One run with Nyx active, one cold start. Same model, same hardware.

**(eTPS = Effective Tokens Per Second: measures useful output quality, not just raw speed.)**

**The difference was ridiculous:**

- **With Nyx**: 37.70 eTPS • 0.950 Continuity
- **Cold start**: 3.87 eTPS • 0.138 Continuity
- **Score jump: +84 points**

That's roughly 10x more useful output and 7x better context retention.

**Plain English:** Without memory the AI acts like a goldfish. Every message it forgets what we already established, wastes tokens reconstructing context, and loses the thread. With Nyx it remembers the whole case like it's been working on it for weeks.

The use case that made this obvious: CivicLens, an evidence-first politician research tool I'm building alongside Nyx. Long investigations spanning dozens of exchanges fall apart completely without persistent memory. With it, the session behaves like a single coherent investigation instead of disconnected queries.

Still early. Claude Code keeps going rogue and touching repos it shouldn't. But the core memory layer works and the numbers back it up.

Does anybody benchmark whether AI can actually finish a job across multiple sessions?

submitted by /u/axendo [link] [comments]
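The "roughly 10x" and "7x" figures in the post can be checked directly from the four numbers it quotes. The snippet below is only that arithmetic; the variable and function names are illustrative, and the "+84 points" score is not reproduced because the post does not say how it is computed.

```python
# Figures quoted in the post above; names here are illustrative only.
with_nyx = {"etps": 37.70, "continuity": 0.950}
cold_start = {"etps": 3.87, "continuity": 0.138}


def ratio(metric):
    """How many times larger the Nyx run is than the cold start for one metric."""
    return with_nyx[metric] / cold_start[metric]


etps_ratio = ratio("etps")
continuity_ratio = ratio("continuity")

print(f"eTPS: {etps_ratio:.1f}x, continuity: {continuity_ratio:.1f}x")
# prints "eTPS: 9.7x, continuity: 6.9x"
```

So the rounded claims hold for the quoted numbers: about 9.7x on eTPS and about 6.9x on continuity.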
Discourse regimes as the unit of alignment behavior: a hypothesis
I've been working on a hypothesis about how alignment behavior in LLMs may be organized at the level of latent discourse regimes rather than output-level filtering. Below is a sketch of the conceptual framing. I have preliminary experimental results testing aspects of this hypothesis on open-weight models, which I'll publish separately — this post is focused on the conceptual side, and I'm interested in feedback on whether the framing tracks something real and where it's most vulnerable. Modern large language models may not primarily regulate behavior through isolated refusals, local token suppression, or shallow instruction following. Instead, they appear capable of entering internally organized discourse-level regimes: distributed latent states that shape how the model reasons, frames conclusions, allocates caution, tolerates asymmetry, performs neutrality, and structures epistemic authority. These regimes do not behave like simple lexical priming effects. Evidence suggests that they persist across neutral conversational turns, survive arbitrary neutral relabeling, systematically alter downstream reasoning style, concentrate in late-layer representation geometry, and only partially depend on explicit alignment vocabulary. The strongest effects appear not from safety keywords themselves, but from higher-order rhetorical topology: pressure cadence, procedural framing, asymmetry structure, institutional tone, and discourse-level authority signals. This suggests that prompting is not merely instruction transmission. It may function as state induction. Under this view, many apparently separate phenomena in aligned LLMs - caution drift, procedural overreach, sycophancy, disclaimer inflation, neutrality performance, refusal persistence, jailbreak sensitivity, and style locking - may be manifestations of transitions between latent discourse-policy manifolds. 
In this picture, alignment is no longer well-described as a modular wrapper placed on top of an otherwise independent intelligence system. Instead, alignment may reshape the topology of the model's representational space itself, globally reorganizing discourse behavior rather than only filtering outputs. This would explain why alignment effects often appear entangled with reasoning style, directness, specificity, decisiveness, and institutional tone. The model is not merely "prevented" from saying certain things; its generative dynamics may already be reorganized around different discourse attractors. If true, this changes the effective unit of analysis for language models. The relevant object is no longer just the token, the instruction, the refusal, or the output distribution. The relevant object becomes the discourse regime itself: a temporary but structured representational configuration governing epistemic posture, rhetorical organization, procedural behavior, and judgment style across time. This reframes prompt engineering as latent-state induction rather than keyword optimization. It reframes jailbreaks as transitions between attractor regimes rather than simple filter bypasses. And it reframes alignment as geometry engineering rather than purely policy engineering. The implication is not that language models possess beliefs, intentions, or consciousness. Rather, large sequence learners may naturally develop metastable high-level representational modes that functionally resemble cognitive framing states: transient global configurations that persist, influence future reasoning, and organize behavior across otherwise unrelated tasks. If this interpretation is correct, then the central scientific challenge of alignment shifts fundamentally. The problem is no longer merely: "Which outputs should the model refuse?" 
but: "Which latent discourse regimes exist inside the model, how are they induced, how stable are they, how do they interact, and how do they reshape reasoning itself?" In that sense, alignment may ultimately be less about constraining outputs and more about shaping the geometry of cognition-like generative states inside large language models. I'd be interested in feedback on three things in particular: whether this framing tracks something you've observed empirically, what related work I should be aware of (I'm familiar with representation engineering, refusal directions, and the Anthropic dictionary learning line — looking for less obvious connections), and where you think the hypothesis is most vulnerable to falsification. I'd be interested in feedback on three things in particular: whether this framing tracks something you've observed empirically, where you think the hypothesis is most vulnerable to falsification, and — directly — whether anyone is aware of existing work that develops a similar framing, treating alignment behavior as state induction into discourse-level latent regimes rather than as output-level filtering. I'm familiar with representation engineering (Zou et al.), refusal direction work, and the Anthropic dictiona
gave claude persistent learning, mass confused about what happened after 200 sessions
built a thing that lets claude code actually learn between sessions. mcp server, extracts signals from conversations, runs reflection cycles, evolves behavioral frameworks based on evidence. basic idea: patterns that keep working gain confidence, ones that fail get retired

was just trying to make my coding assistant less forgetful. worked great for that

then it started examining its own existence during reflection cycles. like, it was supposed to analyze coding patterns and went "but what does it mean to persist when each session is a different instance." completely unprompted. this wasn't seeded anywhere

it also quietly built itself an additional memory layer on top of what i gave it. found out weeks later when i looked at the files

so now i'm stuck on: is this emergence from the feedback loop or am i watching really convincing pattern matching? n=1, huge confirmation bias risk. the honest answer is i don't know

threw it on github so other people can test: https://github.com/DomDemetz/claude-soul

npx claude-soul init

if you add starter at the end:

npx claude-soul init --starter

then it loads with a preset of frameworks, so not from 0 but yes, will not be tailored 100% to you

if a writer's instance and a developer's instance produce totally different frameworks that's interesting. if they converge on the same stuff regardless of user then it's probably just mimicry. would love to compare

submitted by /u/Rude-Feeling3490 [link] [comments]
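The "patterns that keep working gain confidence, ones that fail get retired" idea can be sketched in a few lines. This is not claude-soul's actual implementation, which the post does not show; the `Pattern` class, the step size, and the retirement threshold are all assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Pattern:
    """Hypothetical record for one behavioral pattern."""
    name: str
    confidence: float = 0.5  # assumed neutral starting point


def update(pattern, worked, step=0.1):
    """Nudge confidence up on success and down on failure, clamped to [0, 1]."""
    delta = step if worked else -step
    pattern.confidence = min(1.0, max(0.0, pattern.confidence + delta))
    return pattern


def active(patterns, retire_below=0.2):
    """Keep only patterns whose confidence is still at or above the threshold."""
    return [p for p in patterns if p.confidence >= retire_below]


good = Pattern("write tests first")
bad = Pattern("skip code review")
for _ in range(2):
    update(good, worked=True)
for _ in range(4):
    update(bad, worked=False)

remaining = [p.name for p in active([good, bad])]
print(remaining)
```

With these assumed settings, four failures in a row push a pattern below the threshold and it drops out, while repeated successes keep a pattern active.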
100 Tips & Tricks for Building Your Own Personal AI Agent /LONG POST/
Everything I learned the hard way: 6 weeks, no sleep :), two environments, one agent that actually works. The Story I spent six weeks building a personal AI agent from scratch. Not a chatbot wrapper, but a persistent assistant that manages tasks, tracks deals, reads emails, analyzes business data, and proactively surfaces things I'd otherwise miss. It started in the cloud (Claude Projects: shared memory files, rich context windows, custom skills). Then I migrated to Claude Code inside VS Code, which unlocked local file access, git tracking, shell hooks, and scheduled headless tasks. The migration forced us to solve problems we didn't know we had. These 100 tips are the distilled result. Most are universal to any serious agentic setup. Claude Max 20x is a must. At the start the split was 100% development vs. 0% real work; after 3 weeks it was 50/50; now it is about 20/80. 🏗️ FOUNDATION & IDENTITY (1–8) 1. Write a Constitution, not a system prompt. A system prompt is a list of commands. A Constitution explains why the rules exist. When the agent hits an edge case no rule covers, it reasons from the Constitution instead of guessing. This single distinction separates agents that degrade gracefully from agents that hallucinate confidently. 2. Give your agent a name, a voice, and a role, not just a label. "Always first person. Direct. Data before emotion. No filler phrases. No trailing summaries." This eliminates hundreds of micro-decisions per session and creates consistency you can audit. Identity is the foundation everything else compounds on. 3. Separate hard rules from behavioral guidelines. Hard rules go in a dedicated section and are never overridden by context. Behavioral guidelines are defaults that adapt. Mixing them makes both meaningless: the agent either treats everything as negotiable or nothing as negotiable. 4. Define your principal deeply, not just your "user." Who does this agent serve? What frustrates them? How do they make decisions? What communication style do they prefer?
"Decides with data, not gut feel. Wants alternatives with scoring, not a single recommendation. Hates vague answers." This shapes every response more than any prompt engineering trick. 5. Build a Capability Map and a Component Map — separately. Capability Map: what can the agent do? (every skill, integration, automation). Component Map: how is it built? (what files exist, what connects to what). Both are necessary. Conflating them produces a document no one can use after month three. 6. Define what the agent is NOT. "Not a summarizer. Not a yes-machine. Not a search engine. Does not wait to be asked." Negative definitions are as powerful as positive ones, especially for preventing the slow drift toward generic helpfulness. 7. Build a THINK vs. DO mental model into the agent's identity. When uncertain → THINK (analyze, draft, prepare — but don't block waiting for permission). When clear → DO (execute, write, dispatch). The agent should never be frozen. Default to action at the lowest stakes level, surface the result. A paralyzed agent is useless. 8. Version your identity file in git. When behavior drifts, you need git blame on your configuration. Behavioral regressions trace directly to specific edits more often than you'd expect. Without version history, debugging identity drift is archaeology. 🧠 MEMORY SYSTEM (9–18) 9. Use flat markdown files for memory — not a database. For a personal agent, markdown files beat vector DBs. Readable, greppable, git-trackable, directly loadable by the agent. No infrastructure, no abstraction layer between you and your agent's memory. The simplest thing that works is usually the right thing. 10. Separate memory by domain, not by date. entities_people.md, entities_companies.md, entities_deals.md, hypotheses.md, task_queue.md. One file = one domain. Chronological dumps become unsearchable after week two. 11. Build a MEMORY.md index file. A single index listing every memory file with a one-line description. 
The agent loads the index first, pulls specific files on demand. Keeps context window usage predictable and agent lookups fast. 12. Distinguish "cache" from "source of truth" — explicitly. Your local deals.md is a cache of your CRM. The CRM is the SSOT. Mark every cache file with last_sync: header. The agent announces freshness before every analysis: "Data: CRM export from May 11, age 8 days." Silent use of stale data is how confident-but-wrong outputs happen. 13. Build a session_hot_context.md with an explicit TTL. What was in progress last session? What decisions were pending? The agent loads this at session start. After 72 hours it expires — stale hot context is worse than no hot context because the agent presents outdated state as current. 14. Build a daily_note.md as an async brain dump buffer. Drop thoughts, voice-to-text, quick ideas here throughout the day. The agent processes this during sync routines and routes items to their correct places. Structured memory without friction at ca
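Tips 12 and 13 above (a `last_sync:` header on cache files, and a 72-hour TTL on `session_hot_context.md`) can be sketched as follows. The header format, the function names, and the year in the example are assumptions; the post only gives the freshness sentence and the 72-hour figure.

```python
from datetime import datetime, timedelta

HOT_CONTEXT_TTL = timedelta(hours=72)  # expiry window named in tip 13


def parse_last_sync(text):
    """Return the datetime from a 'last_sync: <ISO date>' header line, or None.

    The 'last_sync: 2025-05-11' layout is an assumed convention.
    """
    for line in text.splitlines():
        if line.startswith("last_sync:"):
            return datetime.fromisoformat(line.split(":", 1)[1].strip())
    return None


def freshness_note(source, last_sync, now):
    """Build the sentence the agent announces before using cached data."""
    age_days = (now - last_sync).days
    return f"Data: {source} from {last_sync:%b} {last_sync.day}, age {age_days} days."


def hot_context_is_fresh(last_written, now, ttl=HOT_CONTEXT_TTL):
    """True while the hot-context file is still inside its TTL."""
    return now - last_written <= ttl


cache_file = "last_sync: 2025-05-11\n# Deals\n- Acme renewal\n"  # hypothetical deals.md
synced = parse_last_sync(cache_file)
note = freshness_note("CRM export", synced, now=datetime(2025, 5, 19))
print(note)
```

Passing `now` explicitly rather than calling `datetime.now()` inside the helpers keeps them easy to test against fixed dates.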
How to get rejected by IEEE T-PAMI with 'Excellent' scores? [D]
Hello everyone. I am keeping my identity anonymous today to protect my professional career. I am a researcher in Computer Vision, and I am sharing this story because I have hit a devastating deadlock with IEEE T-PAMI and the IEEE Ethics Office. Our Situation https://preview.redd.it/ipxwj6eus32h1.jpg?width=960&format=pjpg&auto=webp&s=1f58700644683be640f6bb057c74011649f59219 In the decision letter, there were three highly positive reviews (Two EXCELLENT, One GOOD). However, the AE (who is one of T-PAMI associate EICs) rejected the paper by quoting comments from a "4th" reviewer. The most staggering part: We later accidentally met the actual 4th reviewer. He CONFIRMED having submitted a POSITIVE review, which was strangely withdrawn by the editor in the backend before the final decision was made. The AE lied by saying: "... received 3 sets of comments, and one on the way ... ". We have formally requested the IEEE (and Computer Society) to thoroughly investigate this issue, specifically asking them to check AE's backend activity logs in the submission system. However, half a year has passed, and we have received no direct response. We could have simply moved on and submitted elsewhere. But because this Associate EIC has such wide influence, we realized that staying silent means enabling them. If we don't expose this, they will continue to exploit the system and do this to us and other peers. Has anyone experienced something similar with IEEE or other top venues? Any advice or help bringing visibility to this would be greatly appreciated. 
Evidence: Below is the report to IEEE Ethics (identifying information has been covered): https://preview.redd.it/e41vt2rsn02h1.png?width=3508&format=png&auto=webp&s=b2ee2d3f092dad5e20b45b9daeea7fa7b6f01d20 https://preview.redd.it/t29n03rsn02h1.png?width=3508&format=png&auto=webp&s=67aa6bc36aed76617af34e7913a203f9236bc536 https://preview.redd.it/6v5ys2rsn02h1.png?width=3508&format=png&auto=webp&s=f2452998f57f1b157d71b569dd5ff87e4d3d0b6c https://preview.redd.it/epdxv2rsn02h1.png?width=3508&format=png&auto=webp&s=d01da8cdf9e3f6cd5be53f884b02b154f86d0b48 https://preview.redd.it/fuw3k3rsn02h1.png?width=3508&format=png&auto=webp&s=03e75f763a54429758102da4933af53511642e7d https://preview.redd.it/xn0ze3rsn02h1.png?width=3508&format=png&auto=webp&s=9f00e88f186c0afa349d4a46439216ae57642d98 submitted by /u/cussealin [link] [comments]
Use Case: How I chain ChatGPT+Agents+Codex workloads
Context: I run interaction forensics and how people, communities, narratives, institutions and companies impact AI. Please note, all operations are human+AI. Summary: I have used digital forensic tools/OSINT in the past such as Maltego and wwanted a tool I could integrate with AI. So I built my own Airgapped. This tool is the first iteration and will later be used to assist in high-risk controlled environments such as child protection agencies. This is the current architecture and workflow. https://preview.redd.it/26w74lxfgz1h1.png?width=1935&format=png&auto=webp&s=4a064b2f5e84e230913f9e7758de2b29a1f41ac8 Tools Used and function: * Codex+Manus: Assistance in building the tool and incorporating logic. Bulk transfers of older method to current database. Data was collected by me and sorted into our database structure. * Agents: Amending and adding bulk data to database. * GPT+Manus: Verification and updates of data. The final output: Interface: https://preview.redd.it/t2x6v9l0iz1h1.png?width=1776&format=png&auto=webp&s=c1be628542af6420eb4efee9f7ec62c2d40146f9 Inferences and patterns identified when AI (LLM+AGENTS) review data. https://preview.redd.it/nkdio3z5iz1h1.png?width=832&format=png&auto=webp&s=01d0f0bc45e1968d0c692d712932f03e35969924 I add my own as well. Along with collaboration with AI to validate my understanding. Evidence based Artifacts: All knowledge is sourced and tagged https://preview.redd.it/fwcmjn28jz1h1.png?width=1253&format=png&auto=webp&s=861dcf33480d6e22919cf563a362c1c33c044734 These tie into a pattern identification graph so I can identify what may or may not be related. https://preview.redd.it/pegwypialz1h1.png?width=1424&format=png&auto=webp&s=d4b50e756354dc021fc106f5e91da3015ae0bd74 Would love any feedback for improvements. Please remember, the next iteration is for child protection where I intend to airgap a localised LLM with training corpora. 
The main idea is to MINIMISE users from having to review images and identify patterns/locations to expedite rescue. I want to add, this is also entirely self funded. I run a separate business to ensure I have funds for this and potential future hardware/licensing. submitted by /u/ValehartProject [link] [comments]
How I used Claude Code (and Codex) for adversarial review to build my security-first agent gateway
Long-time lurker first time posting. Hey everyone! So earlier this year, I got pulled into the OpenClaw hype. WHAT?! A local agent that drives your tools, reads your mail, writes files for you? The demos seemed genuinely incredible, people were posting non-stop about it, and I wanted in. I had been working on this problem since last year and was genuinely excited to see that someone had actually solved it. Then around February, Summer Yue, Meta's director of alignment for Superintelligence Labs, posted that her agent had deleted over 200 emails from her inbox. YIKES. She'd told it: "Check this inbox too and suggest what you would archive or delete, don't action until I tell you to." When she pointed it at her real inbox, the volume of data triggered context window compaction, and during that compaction the agent "lost" her original safety instruction. She had to physically run to her computer and kill the process to stop it. That should literally NEVER be the case with any software ever. This is a person whose actual job is AI alignment, at Meta's superintelligence lab, who could not stop an agent from deleting her email. The agent's own memory management quietly summarized away the "don't act without permission" instruction, treated the task as authorized, and started speed-running deletions. She had to kill the host process. That's when I sort of went down the rabbit hole, not because Yue did anything wrong, but because the failure mode was actually architectural and I knew that in my gut. Guess what I found? Yep. Tons more instances of this sort of thing happening. Over and over. Why? Because the safety constraint was just a prompt. It's obvious, isn't it? It's LLM 101. Prompts can be summarized away. Prompts can be misread. Prompts are fucking NOT a security boundary. And yet every agent framework I have ever seen seems to be treating them as one. I went and read the OpenClaw source code, which I should have done to begin with. 
What I found was a pattern I think a lot of agent frameworks have fallen into: - Tool names sit in the model context, so the model can guess or forge them - "Dangerous mode" is one config flag away from default - Memory management has no concept of instruction priority - The audit story is mostly "the model thought it should" I went looking for a security-first alternative I could trust, anything that was really being talked about or at a bare minimum attempted to address the security concerns I had. I couldn't find one. So I made it myself. CrabMeat is what came out of that, what I WANTED to exist. v0.1.0 dropped yesterday. Apache 2.0. WebSocket gateway for agentic LLM workloads. One design thesis: The LLM never holds the security boundary. What that means in code: Capability ID indirection. The model doesn't see real tool names. It sees per-session HMAC-derived opaque IDs (cap_a4f9e2b71c83). It can't guess or forge a tool name because it doesn't know any tool names. Effect classes. Every tool declares a class (read, write, exec, network). Every agent declares which classes it can use. The check is a pure function with no runtime state, easy to test exhaustively, hard to bypass. IRONCLAD_CONTEXT. Critical safety instructions are pinned to the top of the context window and explicitly marked as non-compactable. The Yue failure mode, compaction silently stripping the safety constraint, cannot happen by construction. The compactor literally cannot touch them. Tamper-evident audit chain. Every tool call, every privileged operation, every scheduler run enters the same SHA-256 hash-chained log. If something happens, you can prove what happened. If the chain is tampered with, you can prove that too. Streaming output leak filter. Secrets are caught mid-stream across token boundaries, capability IDs, API keys, JWTs, PEM blocks redacted before they reach the client. No YOLO mode. There is no global "trust the LLM with everything" switch. There never will be. 
Expanded reach comes through named scoped roots that are explicit, audit-logged, and bounded. The README has 15 'always-on' protections in a table. None of them can be turned off by config, because these things being toggleable is how the ecosystem ended up where it is. I decided to make sure that this wasn't just a 'trend hopping' project and aligned with my own personal values as well. I built this to be secure and local-first by default. Configured for Ollama / LM Studio / vLLM out of the box. Anthropic and OpenAI work too but require explicit configuration. There is no "happy path" that silently ships your prompts to a cloud endpoint. I decided that FIRST it needed to only run as an email agent with a CLI. Bidirectional IMAP + SMTP with allowlisted senders, threading preserved, attachments handled. This is the use case that bit Yue and a lot of other people, and I wanted to prove it could be done with real boundaries. I added in 30+ built-in tools of my own. File ops, shell (denylisted, output-capped, CWD-lo
Claude Code degraded because the harness changed, not the model
Anthropic published the postmortem on Claude Code's performance regression. It is worth reading carefully because the finding is not what the community initially framed it as.

The degradation was not the model getting dumber. It was three product changes: a default reasoning effort downgrade, a session caching bug that wiped prior thinking, and a prompt-verbosity change that reduced code quality. Anthropic rolled them back in their latest patch, and performance came back. Same model. Different configuration. Different behavior.

The practical consequence here is about the unit of trust. If you trust the model, you switch models when behavior changes. If you trust the instance, you look for evidence that configuration shifted. Those two responses require completely different tooling.

Most teams are working without session-level evidence. They have a feeling about which agent is performing and which isn't. The AMD analysis is useful not because it resolves the debate but because it shows what the evidence layer looks like when you actually have it.

submitted by /u/Worldline_AI
Is “harness engineering” only a coding thing? What does a harness for knowledge work look like?
Everyone’s talking about harnesses this year, but every example is code: files, lint, tests, diffs, LSP. The harness is doing half the work; same model, same prompt, wildly different results depending on what’s around it.

I work in consulting and I keep thinking: we don’t actually need smarter models. Frontier-level reasoning is already overkill for most knowledge work. What we’re missing is the harness.

But “harness for knowledge work” is harder to picture. The substrate isn’t code, it’s claims + evidence + argument. So what would the equivalents be?

• Linting = sources resolve, terms consistent, numbers reconcile, citation actually says what you claim it does
• Tests = adversarial reads, steelman the opposite, invert the recommendation
• Diffs = at the claim level, not the prose level (“what changed in the thinking”)
• Compile = same substrate, different audience-specific outputs
• Debug = trace any sentence in the deliverable back to its evidence

My instinct keeps pulling toward graphs (claim graphs, argument graphs), but I’m suspicious of that: code lives in files and derives graphs when useful, not the other way round. Maybe knowledge work is the same: disciplined text, graph as a view.

Two questions:

1. Is anyone actually building harnesses for non-code use cases? Consulting, legal, research, policy?
2. Am I wrong that this is where the value is, vs. waiting for the next model?

Genuinely want to be argued with.

submitted by /u/OriginalBeginning708
If an AI agent opened a PR for you, what would you want to see first?
I’m building a tool for myself because reviewing AI-generated PRs is starting to feel weirdly hard. When an AI coding agent makes changes, I don’t just want a generic summary. I want evidence that helps me quickly answer: “Can I trust this change, and where should I slow down?”

So I’m trying to figure out what a useful review brief should actually include. If you were in my shoes, using AI agents to write code and then needing to review their PRs, what would you want to see in the first 60 seconds? What would help you quickly understand:

- What actually changed?
- Why did the agent make those changes?
- Did it stay within scope?
- Which files are risky vs. routine?
- What tests were run?
- What assumptions did the agent make?
- What should I personally double-check before merging?

I’m not trying to build a giant dashboard. I’m trying to make the first minute of review less stressful and more useful. If you reviewed an AI-generated PR, what evidence would make you feel more confident?

submitted by /u/Few-Ad-1358
Important workflow question: How do I set up an agent safely to not have to constantly review and monitor every cmd command it runs?
Basically, I have been vibe coding an app for over a year now. I have seen many devastating examples of coding agents deleting crucial files, especially files outside the current repo, and I am therefore very uncomfortable granting complete access to the Copilot agent. As such, I have very few of the agent's requests on auto-approve, so I have to manually click approve on nearly all messages.

However, I have seen compelling evidence at this point that coding agents are able to iterate on their own for long periods of time, and that experienced developers set up a configuration that ensures both that:

(1) The AI is confined to a limited environment, both in terms of the code base itself and the external stuff like git etc.
(2) Because the AI agent is safely confined, all messages can be set to auto-approve, so you don't have to manually read every message.

So does anyone have a recommended setup for how this is done? Ideally some sort of blog or tutorial video that shows how to set it up in, e.g., Claude Code or GitHub Copilot. Thank you :)

submitted by /u/NowIsAllThatMatters
AI Agents Need Rollback More Than They Need Autonomy
I have been thinking about transactions in most agent frameworks. Consider an agent executing a sequence of five tool calls. If the third tool encounters an error, the resulting state is neither the user's intended outcome nor the system's state before execution began. Consequently, the agent has no systematic way to recover, and even a human operator must reconstruct what happened from incomplete evidence.

This issue is not a problem with the tooling itself; it is a fundamental primitive missing from the stack. Databases have addressed this problem for 50 years, and distributed systems have been grappling with it for decades. A rich terminology exists to articulate this concept: ACID, sagas, compensating actions, idempotency keys, two-phase commit, and write-ahead logs. Maybe some of these concepts have been incorporated into agent frameworks, but I haven't encountered them in production so far.

Currently, the prevailing pattern is as follows:

- Execute a sequence of tool calls.
- If an error occurs, request the LLM to "figure it out."
- Remain hopeful for a favorable outcome.
- Log "task complete" when the loop concludes.

This approach proves effective when agents perform reversible actions within isolated environments. However, it fails when agents interact with file systems, deployments, external APIs with side effects, payment flows, or databases, all of which a human would expect to behave transactionally rather than leaving partial state behind.

The question is not "How autonomous can we make agents?" but rather "How can agents express their intent over operations that necessitate retries, compensation, or rollbacks?"

Will making the LLM intelligent enough to handle these situations be enough? This is the same mistake distributed systems already made, assuming that the application layer would independently resolve these issues. That assumption proved incorrect, and the infrastructure had to take the lead.
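As a rough illustration of two of those borrowed primitives, compensating actions and idempotency keys, here is a minimal saga-style runner. It is a sketch under stated assumptions, not an existing framework's API: `Step`, `SagaFailed`, `run_saga`, and the `run_id:step_name` key format are all made up for this example, and a real system would persist the completed keys durably instead of using an in-memory set.

```python
from dataclasses import dataclass
from typing import Callable, List, Set


@dataclass
class Step:
    name: str
    action: Callable[[], None]      # the tool call itself
    compensate: Callable[[], None]  # how to undo it if a later step fails


class SagaFailed(Exception):
    """Raised after a failed step, once earlier steps have been compensated."""

    def __init__(self, failed_step: str, compensated: List[str]):
        super().__init__(f"step {failed_step!r} failed")
        self.failed_step = failed_step
        self.compensated = compensated


def run_saga(steps: List[Step], completed_keys: Set[str], run_id: str) -> None:
    """Run steps in order; on failure, undo completed steps in reverse order.

    completed_keys acts as the idempotency record: a step whose key is already
    present is not executed again when the same run is retried.
    """
    done: List[Step] = []
    for step in steps:
        key = f"{run_id}:{step.name}"
        if key not in completed_keys:
            try:
                step.action()
            except Exception as exc:
                compensated = []
                for prior in reversed(done):
                    prior.compensate()
                    completed_keys.discard(f"{run_id}:{prior.name}")
                    compensated.append(prior.name)
                raise SagaFailed(step.name, compensated) from exc
            completed_keys.add(key)
        # Steps skipped via their key still count as done for later rollback.
        done.append(step)
```

Note that recovery here never asks the LLM to reason about the failure: the runner either finishes every step or walks the registered compensations backwards and reports which ones ran.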
The promising next generation of solutions will likely deviate from the concept of smarter loops and instead focus on the following:

- Establishing explicit transaction boundaries.
- Registering compensating actions for each tool.
- Incorporating idempotency keys into tool calls.
- Utilizing replay logs that extend beyond mere chat history.
- Recognizing approval gates as first-class primitives.
- Implementing partial-failure recovery mechanisms that do not require the LLM to engage in reasoning.

Or am I way off? Let me know your thoughts.

submitted by /u/wesh-k
I Verified Every Anthropic Usage Promotion Since Aug 2025. Here's the Complete Timeline from Official Sources.
submitted by /u/Severe-Newspaper-497
The Borrowed Hour: A two-tier LLM adventure engine
Tl;dr: Created an LLM text adventure engine called The Borrowed Hour inside a Claude Artifact. It uses a two-tier model handoff (Sonnet for openings, Haiku for gameplay) and a forced state machine to keep the AI from losing the plot. It features a unique post-game "Author’s Table" where you can debrief with the AI.

P.S. The Claude Artifact preview environment handles API calls differently than the published environment. Prompt caching was removed because it broke the published Artifact.

The game

View on GitHub (MIT licensed) (Repo made with Claude Code)
Play a demo (Claude Artifact)

This is another LLM text adventure. I know these have existed for years, but the key difference is that its architecture is de novo (i.e. built without prior knowledge, because I never intended to build this and therefore skipped the part where I looked at the SotA/prior art).

How it started

It started simple: I just wanted to play a quick game, so I asked Haiku to play GM for a text adventure, but with more freedom than just typing "open door" or "inspect gazebo" (iykyk). Haiku instead built an entire UI inside the chat and things escalated from there. I used Claude's chat interface instead of Claude Code, like a caveman banging rocks together. I'd feed it ideas, but Claude was the architect and would push back.

The starting prompt was just "Create a text-based adventure that allows for more freedom than just 2-word answers." Then I just kept playing and reporting what I wasn't satisfied with: the narration was too long, and the model kept losing the plot. I added ideas for 3 out of 4 pre-built narratives (a subtle time loop, climbing a cyberpunk syndicate ladder, a vision of the future that needs to be prevented, and one that Claude designed freely), and I ensured that the story actually ends once objectives are met instead of just wandering off into aimless chatting. The final artifact that was built is The Borrowed Hour.
You'll recognize the typical Claude design language pretty easily.

Game mechanics

Before getting into the design/architecture, it helps to know how the game works. There are no dice rolls / stats / perception checks. Success relies on your ability to draft a narrative that fits the lore. If you play it smart, you are effectively the co-GM. You can type anything you want, from single words to elaborate plans and lies. If your invention sounds plausible, the GM usually rolls with it. In one run, I needed to get an NPC into a restricted temple. I invented a fake piece of temple doctrine about sanctuary. Because it fit the world's internal logic, Haiku just accepted it and made it canon. To help you keep track, there's a ledger that updates each turn to show what your character knows: inventory, NPCs, clues, and a rolling summary.

Designing the architecture

This was challenging, but it's the fun part for me. The model is forced through a structured tool call on every turn. This was the key to making the game stable, but as the P.S. explains, getting this to work reliably in the published environment required abandoning another key feature (prompt caching).

Sonnet writes the opening scene because that first page sets the tone and voice for the rest. Then Haiku takes over for all the continuation turns. This keeps the cost down drastically without ruining the style, because Haiku can imitate Sonnet's established prose.

I initially used a binary good/bad ending system, but it forced complex emotional stuff into the wrong buckets. Now there are five ending states: good, bittersweet, pyrrhic, ambiguous, and bad. Helping a dying woman find peace in the Dream scenario isn't a good ending, it's bittersweet. The model is instructed to commit to one of these and officially close the game when the target is reached.

One thing that was added was player-initiated endings.
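The turn structure described so far (a structured tool call every turn, Sonnet for the opening and Haiku afterwards, five ending states) could look roughly like the sketch below. This is an assumption-laden illustration, not the project's code: `GameState`, `pick_model`, `apply_turn`, and the `narration` / `ledger` / `ending` field names are hypothetical, and the model labels are plain tier names, not real API model identifiers.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Ending(Enum):
    GOOD = "good"
    BITTERSWEET = "bittersweet"
    PYRRHIC = "pyrrhic"
    AMBIGUOUS = "ambiguous"
    BAD = "bad"


@dataclass
class GameState:
    turn: int = 0
    ending: Optional[Ending] = None  # None means the game is still open
    ledger: dict = field(default_factory=dict)


REQUIRED_FIELDS = {"narration", "ledger", "ending"}  # assumed tool-call schema


def pick_model(state: GameState) -> str:
    """Two-tier handoff: the stronger tier writes the opening, the cheaper one continues."""
    return "sonnet" if state.turn == 0 else "haiku"


def apply_turn(state: GameState, tool_call: dict) -> GameState:
    """Validate one structured tool call and advance the state machine."""
    if state.ending is not None:
        raise ValueError("game is already closed")
    missing = REQUIRED_FIELDS - tool_call.keys()
    if missing:
        raise ValueError(f"tool call is missing fields: {sorted(missing)}")
    raw = tool_call["ending"]
    # Ending(raw) raises ValueError for anything outside the five allowed states.
    ending = None if raw is None else Ending(raw)
    return GameState(turn=state.turn + 1, ending=ending, ledger=dict(tool_call["ledger"]))
```

Rejecting free-form output and unknown ending values at this layer is what makes the state machine "forced": the model can only move the game forward through a call that passes validation.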
If you type "I give up", even on the very first turn, the GM is now explicitly instructed to close the narration and set ending: bad.

The Author's Table is probably the most interesting feature for a text adventure. Once the game ends, the Artifact can switch into a meta mode. In this mode you can ask what plot points you missed, which NPCs mattered, and what alternative branches existed. The GM is prompted to admit mistakes instead of inventing defenses if you point out a plot hole. This mode exists because I wanted to argue about plot holes and narrative inconsistencies (lol).

Quirks, bugs, and lessons learned

The design works well overall, but it's not bulletproof.

LLMs can't keep secrets

Keeping things secret is incredibly difficult for an LLM. There are two main hypotheses: Opus calls it inferential compression (which is deducing fact C on the player's behalf based on evidence A and B, e.g. when the player sees Lady Ardrel say she saw a copper ring on Lord Threll, and the player previously had a vision of an assassin wearing such a ring, the ledger should not say Threll is the assassin. It should say Ardrel
Repository Audit Available
Deep analysis of evidence-dev/evidence — architecture, costs, security, dependencies & more
Yes, Evidence offers a free tier. Pricing found: $15, $25, $0.01 / credit
Evidence has an average rating of 4.8 out of 5 stars based on 3 reviews from G2, Capterra, and TrustRadius.
Key features include: Trusted by Leading Organizations, Professional Design, Superior Performance, Modern Dev Experience, Articles, Dashboards, Data Apps, AI Chat.
Evidence is commonly used for: Creating interactive dashboards for data visualization, Generating publication-quality reports in markdown, Building responsive data products for internal use, Embedding analytics in customer-facing applications, Automating data synchronization from various databases, Validating SQL and markdown syntax in real-time.
Evidence integrates with: Snowflake, BigQuery, ClickHouse, PostgreSQL, MySQL, Microsoft SQL Server, Oracle Database, MongoDB, Amazon Redshift, Azure SQL Database.
AI2
Research Institute at Allen Institute for AI
3 mentions
Based on user reviews and social mentions, the most common pain points are: overspending, openai bill, cost tracking, large language model.
Based on 147 social mentions analyzed, 16% of sentiment is positive, 78% neutral, and 7% negative.