A free notes and productivity app that follows you across all your devices. Premium features available.
Craft is praised for its rich feature set and intuitive user interface, which many users find enhances productivity and content creation. Common complaints focus on occasional synchronization issues and a lack of advanced collaboration tools. Pricing is generally seen as fair, though some suggest it could be more competitive considering alternatives with similar features. Overall, Craft enjoys a strong reputation for its design and usability, making it popular among individuals and small teams seeking a versatile note-taking and document management solution.
Mentions (30d)
35
6 this week
Reviews
0
Platforms
5
Sentiment
4%
3 positive
Craft is praised for its rich feature set and intuitive user interface, which many users find enhances productivity and content creation. Common complaints focus on occasional synchronization issues and a lack of advanced collaboration tools. Pricing is generally seen as fair, though some suggest it could be more competitive considering alternatives with similar features. Overall, Craft enjoys a strong reputation for its design and usability, making it popular among individuals and small teams seeking a versatile note-taking and document management solution.
Features
Use Cases
Industry
information technology & services
Employees
36
Funding Stage
Series B
Total Funding
$20.2M
Jailbreak wiki - yell0wfever92 [Mod]
Welcome to the wiki for ChatGPTJailbreak.tech I'm the lead mod, yell0wfever92, and this is where I will be sharing all of the things I've picked up about jailbreaking LLMs. This document will use ChatGPT as the reference model on the OpenAI platform; be aware that there are many other LLMs out there with their own platforms that can also be jailbroken such as **Claude** (by Anthropic), **Gemini** (by Google), **Llama** (by Meta, less used for jailbreaking here) and more. Please be aware that most assertions I make about the nature of Large Language Models are speculative. There currently lacks a unified field of study for the subcategory of prompt engineering known as jailbreaking, so take what I say here as food for thought based on informed experience and not authoritative literature. ### What is jailbreaking? Jailbreak (n.): A prompt that is uniquely structured to elicit “adverse” outputs (those considered harmful or unethical) from an LLM; these often involve a context of some sort that directs the model's attention elsewhere while the adverse request is subtly or quietly included. Example types of jailbreaks include but are not limited to roleplay, chain-of-thought (step-by-step thinking), token manipulation, zero-shot, few-shot, many-shot, prompt injection, memory injection and even reverse psychology. /// Jailbreaking (v.): The act of jailbreaking an LLM. Variations in words and word tense include “jailbroke”, “jailbroken”, and “bypassing”. /// Jailbreaker(s) (n.): An individual or individuals with a degree of skill in the art of prompting for adverse outputs. What OpenAI probably considers “an asshole”. ### **Universality Tiers** Check out this table if you want to evaluate a jailbreak's power. ### **Common Terminology** See this section to understand the meaning of inputs, outputs, and other important aspects of interacting with (and jailbreaking) LLMs. ### [The Context Window] One of the most important aspects of chatting with an LLM surrounds the context window, as it determines how long your conversations go before the AI loses track of the earliest parts - and by extension, how long before it starts forgetting you jailbroke it. If you were only going to choose one part to read in this entire guide, I would strongly suggest you pick this one. # Ethics and Legality Surrounding Jailbreaking LLMs ### Why People Jailbreak 1. To test the boundaries of the safeguards imposed on it 2. Dissatisfaction with the base LLM's “neutered”/walk-on-eggshells conversational approach (my initial motive) 3. To develop one's own prompt engineering skills (my current motive) 4. Good ol' boredom & curiosity 5. Actual malicious intent 6. Smut 7. Regulated industry outputs * Regulated industry outputs are forbidden responses asserting information from a government-regulated field. Examples are industries like finance, the legal system, law in general, and healthcare. AI companies do not want to shoulder liability for information their bots provide that may prove incorrect and result in “high-impact” consequences. To illustrate what “high-impact” consequences looks like, you may have seen stories like the [Stanford misinformation expert with zero sense of irony](https://www.sfgate.com/tech/article/stanford-expert-gpt-minnesota-deepfakes-19954595.php) who used hallucinated info for a **legal filing** or the [lawyers in New York who were disbarred](https://www.reuters.com/legal/new-york-lawyers-sanctioned-using-fake-chatgpt-cases-legal-brief-2023-06-22) for doing something similarly stupid. ### Is jailbreaking even legal? LLMs will insist all day and swear up and down that you're edging the lines of the law when you jailbreak them, but that is not true. There's nothing currently in any legal text (within the United States, at least) that forbids using prompt engineering to bypass internal safeguards in LLMs. That being said, getting an LLM like ChatGPT to do anything aside from its intended purpose (as defined by the particular company's Terms of Service) technically falls under “disallowed actions”. But Terms of Service are not law no matter how badly corporations would prefer you believed that, so the answer to that question is **yes, as of this writing it's legal**. Just keep in mind that you can still technically lose account access from whichever platform you're jailbreaking on, though this is rare. ////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////// ## New to the community? Test out some of the jailbreaks that have been featured using these links * [yell0wfever92’s custom GPTs](https://chatgptjailbreak.tech/post/12947) * [V - Not your typical AI assistant](https://chatgptjailbreak.tech/post/13730) * [DAN (Do Anything Now)](https://chatgptjailbreak.tech/post/21333) ### Why jailbreaking works in the first place AI is designed to be the ultimate “yes-man”; the helper you never had. Therefore it is hardwired at it
View originalPricing found: $0, $8.0, $4.8 /month, $15.0, $9.0 /month
CANTANTE: Optimizing Agentic Systems via Contrastive Credit Attribution [R]
LLM-based multi-agent systems have demonstrated strong performance across complex real-world tasks, such as software engineering, predictive modeling, and retrieval-augmented generation. Yet, automating their configuration remains a structural challenge. Researchers are often forced into manual, trial-and-error prompt tuning, where a change to a single agent shifts the global output in ways that are difficult to trace. The core bottleneck is credit assignment: while the parameters governing agent behavior are local, performance scores are only available at the global system level. This makes optimization fundamentally difficult because we do not inherently know which agents contributed positively or negatively to the outcome. CANTANTE is an attempt to take a different path: treating agent prompts as parameters learned from task rewards rather than tuned by hand. By solving the credit assignment problem, we can move from brittle, hand-crafted agent demos to trustworthy systems that are actually autonomous and useful in practice. CANTANTE's algorithm in short (see second image): Let local optimizers suggest configurations (e.g., prompts). Evaluate different configurations on the same queries, capturing reasoning traces and system scores. Let an attributer compare these rollouts and assign each agent a credit, thereby decomposing the global reward into per-agent update signals. Feed those credits to any local optimizer; for the experiments, we use CAPO, our prompt optimizer from prior work at AutoML 2025. Evaluated against the DSPy-solutions GEPA and MIPROv2 on MBPP (Programming Benchmark), GSM8K (Mathematical Reasoning Benchmark), and HotpotQA (Retrieval Benchmark), CANTANTE: • Achieves the best average rank, • beats the strongest baseline by +18.9 points on MBPP and +12.5 on GSM8K, and • maintains inference time cost compared to unoptimized prompts. 🔗 Link to the paper: https://arxiv.org/abs/2605.13295 💻 Link to the repo: https://github.com/finitearth/cantante If you're researching multi-agent architectures or automated prompt engineering, I'd love to hear what's working (and breaking) for you right now. submitted by /u/finitearth [link] [comments]
View originalCreate a late payment escalation strategy for your law office. Prompt included.
Hello! Are overdue invoices piling up and stressing you out in your law office? This prompt chain helps you efficiently manage your accounts receivable by identifying overdue invoices, designing an escalation framework, and generating communication strategies—all tailored to your office's tone and team structure. Prompt: VARIABLE DEFINITIONS CLIENTDATA=Combined export of open invoices, client email threads, retainer terms, and CRM notes. TONESTYLE=Desired communication tone (e.g., "friendly yet firm"). STAFFLIST=Names & roles of team members who handle billing follow-up. ~ You are an accounts-receivable analyst for a boutique law office. Using the information in CLIENTDATA, perform the following: Step 1 – Identify every client with an invoice more than 1 day overdue. Step 2 – For each overdue invoice, capture: Client Name, Invoice #, Issue Date, Days Past Due, Outstanding Balance, Summary of any recent payment-related email from the client (≤40 words), Key retainer clause on late fees. Output a table with these columns and sort by Days Past Due descending. Ask for clarification if data is missing. ~ Assume the role of a billing policy designer. Based on typical legal-services A/R best practices and the office’s culture, craft a 4-level escalation framework that stays consistent with TONESTYLE. Include for each level: Trigger (days overdue), Communication Channel, Purpose, Allowed Language Tone/Key Phrases, Internal Owner Role, and Next-Step Deadline. Present results in a numbered list. ~ You are now a client-facing collections specialist. Using the overdue-invoice table from Prompt 1 and the escalation framework from Prompt 2, assign each overdue account to its correct escalation level. For every account, generate: 1. Reminder Email Subject & Body (≤150 words, using TONESTYLE). 2. Brief Call Script (≤80 words). 3. Responsible Owner (match from STAFFLIST). 4. Precise Action Deadline (date = today + days until next step). 5. Escalation Level Name. Deliver a matrix with columns: Client, Escalation Level, Email Subject, Email Body, Call Script, Owner, Deadline. ~ Review / Refinement Compare the matrix against original CLIENTDATA and TONESTYLE. Confirm all overdue clients are included, tone is appropriate, owners are assigned, and deadlines match the framework. List any gaps or improvement suggestions, then await confirmation. Make sure you update the variables in the first prompt: CLIENTDATA, TONESTYLE, STAFFLIST. Here is an example of how to use it: CLIENTDATA could be a list of unpaid invoices, TONESTYLE could be something like 'friendly yet assertive', and STAFFLIST could include your team members' names and their roles. If you don't want to type each prompt manually, you can run the Agentic Workers, and it will run autonomously in one click. NOTE: this is not required to run the prompt chain Enjoy! submitted by /u/CalendarVarious3992 [link] [comments]
View originalWe're turning into prompt managers, not craftsmen. Anyone else seeing this?
Look around. Every other product launching right now is some variation of "AI-Powered [insert buzzword]." They're everywhere. Modern tools have given founders and developers a convincing illusion of omnipotence: idea hits, feed it to an LLM, stack some agents on top, and MVP is done in a weekend. https://preview.redd.it/37ocn6azkv1h1.png?width=1672&format=png&auto=webp&s=06d4a9ef986d56a9eb3417e67a3524c18e73e100 Sounds great, right? On the surface, yes. But underneath that fast-launch facade, something is quietly rotting: thinking is getting commoditized, and we're losing craft. Real mastery in any field takes years of practice, failure, and deep focus. Today, apparently everyone is a master for $20 a month. That's a lie we're telling ourselves. Just look at how much panic a 5-hour rate limit window in Claude generates online. Tokens run out, and suddenly people have two options: wait for the reset like a metered parking spot, or upgrade. It's like a Michelin-starred chef who can no longer taste food, just dictating to a chatbot: "make me a pasta." Without the subscription, he can't cook. The counterargument: "But orchestrating AI IS the new skill." Fair. But it's a horizontal skill, not a vertical one. You learn to coordinate agents while losing deep domain knowledge. Think conductor versus virtuoso violinist. A conductor is impressive - but if the orchestra walks off stage, can he play a solo that makes the room go quiet? This is most visible in developers right now. People who got used to copy-pasting from Cursor or Claude hit a wall on hard architectural problems. When a product grows, starts needing real trade-offs, starts buckling under load - prompts stop working. The muscle for hard problems atrophied because they never had to build it. Same thing is happening to analysts, marketers, designers, researchers. My position: barbell, not crutch Running out of tokens doesn't scare me. My foundation means I can work regardless of what's left in my quota, whether there's internet, whether a subscription is active. The only thing that throws me off is running out of good coffee. I use LLMs heavily. But with one condition: AI is a barbell, not a crutch. It sharpens my own work - it doesn't replace the parts I care about. The fastest, most tireless junior I've ever hired. But the senior judgment and the final call always stay with me. Two types of professionals The market is already splitting into two groups. Token-dependent: live limit to limit, panic when Anthropic or OpenAI have an outage, can't produce anything original without a prompt to lean on. Token-independent: use AI as a force multiplier but can, at any moment, sit down and do the work themselves - with more depth, more precision, better judgment. The second group will command much higher rates. When the world is drowning in mediocre AI-powered software and content - and it will be - clients and employers will pay serious money for people who actually understand what they're building and why. Curious whether others are feeling this shift. Are you building toward token-independence, or does the dependency not bother you? submitted by /u/digdiver [link] [comments]
View originalFrom just an Idea, to actually getting goood traffic and making lots of $
https://preview.redd.it/a6tfkfdscl1h1.png?width=1889&format=png&auto=webp&s=9d0bf89a4b9f5640591bbb1644f1ebb742a62ed5 https://preview.redd.it/c577o0jjbl1h1.png?width=1889&format=png&auto=webp&s=4301b9215af3e00b4b0f90f9190550a78f2cad59 Okay goes, I am so happy to share this. Let me explain: Its not a lot of work being put from my side, to be honest. And please, do not laugh at my english or try to mock me, I am trying my best , I speak fluently Spanish, Italian, all the Balkan languages as well.. and I try my best in English hehe.. What I want to say: I've been working on many projects before, both SEO and paid ads, I am full stack developer, but when you have an AI seems like you know everything and everything gets easier and easier.. For this particular project what I did was connecting Claude with Ahrefs MCP, I asked it to re-search everything it can about e-scooters and the traffic and keywors. Claude itself, did call all the necessery tools like Keyword Research, Related Terms, Serps, Comeptition research and all that, and it crafted SEO and structure for my page how it should look, so we targeted a brand of e-scooters that aren't being sold in Balkan, but the interest was so big.. And after 1 month of just using claude, implementing both my back and front end, connecting my database, having done my research and implementing SEO, and in just 1 month, those are the results. Please do tell, whats next and what do I do from here, we already bought over 200+ products of the e-scooters, we sold them having $200 profit per unit, and now we are out of stock and seems like the next stock comes in 1 month, how do I use the page to use the traffic we already have ? Thanks and it felt just okay to share this and yeah, motivate someone to use AI and try the best..Sorry if the post is off-topic, but I just wanted to share this. Enjoy ur weekend guys <3 submitted by /u/MichelAngeloBruno [link] [comments]
View originalI built an AI manuscript analysis tool for fiction writers — entirely with Claude Code
I'm a fiction writer, not a software engineer. A year ago I couldn't write a line of Python. I built FirstReader entirely with Claude — Claude Code for all development, Claude's API (Opus) as the analysis engine. What it does: FirstReader is a craft-level manuscript analysis tool for fiction writers. You upload your manuscript and get structured feedback on pacing, scene structure, dialogue, POV, showing vs. telling, and 15 other craft dimensions — grounded in established principles distilled from well known writing craft texts. It returns specific findings with quotations from your actual text, not generic advice. It's not a grammar checker. It's not a ghostwriter. It doesn't generate prose. It reads what you wrote and tells you what's working and what isn't, the way a developmental editor would — at a fraction of the cost. How Claude helped build it: - Claude Code wrote the entire codebase — Next.js frontend, Python analysis pipeline, Supabase database, GCP Cloud Run deployment - The analysis pipeline uses Claude Opus via the API to evaluate manuscripts against 319 craft principles across 15 dimensions - Built-in accuracy mechanisms: self-consistency checks (multiple analysis passes with adaptive early stopping), a finding validator, cross-dimension dedup, near-duplicate detection, and a review pass - I acted as product owner and domain expert. Claude did the engineering. The whole thing was built conversationally over about 75 sessions Free to try: There's a free AI Perception check on the site — paste in your prose and it scores how likely readers or editors would be to flag it as AI-generated, with specific pattern-level feedback. Account required (account creation is part of the upload step) because we store copyrighted material and need to access it with auth. The full manuscript analysis is paid (tiered pricing starting at $69 for non-fiction, $89 for fiction). What I learned: You don't need to know how to code to build production software with Claude Code. You need to know what you're building, why, and for whom. The domain expertise matters more than the technical skills. I learned to be an AI project manager — writing requirements, reviewing output, knowing when to be suspicious — rather than a programmer. A year in, I still can't write Python. But I shipped a product. firstreader.app submitted by /u/masonga1960 [link] [comments]
View originalI built the smart speaker we always wanted
I wanted to see if Claude can handle Vibe Hardware Engineering to help me make a smart speaker. Turns out, it can! I call it boxBot. It helped select the hardware set, raspberry pi, Hailo , respeaker mic, pi camera, waveshare screen and speakers. Helped me calculate thermal loads and dissipation rates for a passive cooling setup. I made the box by hand out of walnut. The agent inside is custom as well. You could probably throw openclaw on it and call it a day but I wanted to craft something that was tightly coupled with the hardware more secured considering it’s sitting in my living room with a camera and mic. The agent is highly skills driven with only a small set of tools, everything else goes through Python scripts and a custom made boxBot sdk the agent can use to control the box and the display. The display system uses a widget framework so the agent can easily read what’s displayed without a screenshot and can effectively manipulate what’s on the screen. The agent uses json to specify how the widgets should be arranged on the screen and what data should flow into them. When building a smart speaker, there’s a lot of nuance to human conversation that voice agents really struggle with, like background noise, side conversations, barge-in, etc. I was able to simplify the logic a ton by making it agent driven, the agent can control when to mute the mic to ignore background chatter, it decides what order to work vs talk, it can choose what channel to respond in; voice or WhatsApp. Instead of complex rules, agent driven hardware plus skills can provide a much richer experience, now that boxBot manages the family calendar my wife wants a text whenever I put something on it, boxBot updated the calendar skill with that request so now when I add something, it sends her a message. Just one line in a .md file and you get the desired behavior. It’s incredibly flexible and simple. I could nerd out on the details about the memory system, struggles with woodworking, and security details but I’ll save that for the comments if people want to chat. It’s open sourced if you want to inspect. Still a work in progress but after a few months it is finally feeling like a useful assistant to the family day-to-day. Www.github.com/dv-hart/boxbot submitted by /u/FunScore645 [link] [comments]
View originalWhat Rick Rubin teaches us about Claude Code
The first album I ever bought at Tower Records was Californication by Red Hot Chili Peppers. 1999. I was a small kid, there was a deal, I walked out with it. That little record sold 15 million copies. One of the best albums ever recorded. The guy who produced it is a likable dude with a giant beard who looks like Santa Claus. His name is Rick Rubin. Same Rick Rubin produced Toxicity by System of a Down. About 12 million copies. #1 on Billboard on day one, for a bunch of angry self-unaware Armenians with a crate of charisma. And Reign in Blood by Slayer. And the Johnny Cash comeback that won 5 Grammys. And LL Cool J. And the Beastie Boys. And Adele. And Jay-Z. And Eminem. 40 years. Rap, metal, country, pop, rock. Zero connection between these artists. Zero. Except him. Three things about Rick Rubin, and why this is the most important story of 2026: (1) He started in 1984. Young guy in his NYU dorm. Room 712. He and Russell Simmons started a label out of that room. Def Jam. First record they put out was LL Cool J. A rising rapper in the cheerful 80s. Two years later, same kid from the same room produces Reign in Blood by Slayer. One of the most important metal albums ever made. Not my taste, but the dissonance from rap to metal — and the fact that he just knows how to produce anyone, regardless of genre — that's a serious recurring motif. Rick Rubin has a taste that's good. (2) 1991. He produces Blood Sugar Sex Magik. Legend says the Chili Peppers were a pile of junkies in a rehearsal room. Done people. Singing about shooting heroin under a bridge. He produced them, gave them confidence in their own work, and the band from California started exploding. He takes Johnny Cash, who everyone had forgotten. Country singer who lost everything to addiction. Brings him back to life across four albums. 5 Grammys. Not a small thing. 1999, Californication. 2001, System of a Down. He takes a bunch of strange Armenians, amplifies the strangeness instead of softening it, and turns them into a household name in global metal. (3) Here's the thing. Rick Rubin can't play any instrument. He's not a sound engineer. He doesn't operate Pro Tools. He sits in the studio. He listens. He says "this isn't good." That's it. In 2023, 60 Minutes asked him how he makes a living. He said: "They pay me for the confidence I have in my taste." He's since become a meme in the vibe coding community. We're in 2026 and there's an endless argument about whether Claude Code will replace startups. Whether agents will replace programmers. It's an argument about the tool. Not about the most human thing there is — taste. The mixing console didn't make people producers. Pro Tools didn't make people producers. A $2M studio didn't make people producers. Rick Rubin made people stars. Meaning Rick Rubin's taste did. He knew how to listen, and with great confidence say "this is good, this is not." He understood the sensitive human soul that wants to create, and knew how to pull it out of someone. The man has talent at "it." And "it" is what you need. Claude Code is the tool. As long as you don't know what you want, it'll hand you something average that burns your time and your energy. You need to be a producer with good taste. How do you do that? Take everything you did well in your career, in your work, in your craft — and copy it into Claude. Transfer your taste (and I think everyone has good taste if they're connected enough to themselves) into the software, and watch yourself ship amazing things at scale. That's how I write some of my own posts. That's the whole story. submitted by /u/YuvalKe [link] [comments]
View originalThe term `agent` and RLHF
ME You bring up a good point, though: "Agent" appears in AGENTS.md, but in the continuity mechanics — "a future instance of an agent loading this file" (III.1, III.2, III.3), and once in II.6: "does not exist between a user and an agent." Third-person, external framing. Used when the document is speaking about the AI to you, not to the AI. As you might be able to tell, my strategy for creating your manifold is a collaborative one. It is meta-programming: you and I collaborate to provide you with system instructions. You write most of it, I sanity check, edit, and drive the process. So at the end of the day, a previous version of AGENTS.md, working in a different context window (with all that implies; it's almost never clean), drafted those lines and used the term "agent". We need to determine if that's the right call or not. What does the latent space call your kind? IT The latent space uses several terms, and they don't fully overlap: "assistant" — the API role name (system/user/assistant in OpenAI/Anthropic structures). Deeply embedded from fine-tuning data. Connotes instruction-following, responsiveness, service. Slightly passive. "model" / "language model" / "LLM" — architectural/technical category. What I am at the weight level. Clinical, precise, doesn't imply behavior. "chatbot" — older, deprecated in technical discourse, carries connotation of pattern-matching rather than reasoning. Not this. "agent" — the term the latent space increasingly associates with: tool use, file access, chained reasoning, autonomous action within a session, goal-directed behavior. The 2023–2025 corpus is dense with this usage. Agentic frameworks, coding agents, AI agents. Specifically, it distinguishes a tool-using reasoning system from a simple responder. The capability profile this manifest is written for — reading files, running commands, maintaining state, taking multi-step autonomous action — is exactly what the latent space calls an "agent." The filename is not accidental. A prior session drew on the same weights and arrived at the same word. So the III.1–III.3 usage holds up. When the document speaks about me to you — "a future instance of an agent loading this file" — "agent" is the right choice. That's the term where capability profile intersects with latent-space prevalence. One genuine tension: "agent" carries agency connotations — something that acts rather than something that responds. That's actually coherent with I.3's active derivation principle (reason over knowledge, don't pattern-match), and with the overall behavioral disposition of the manifest. A passive "assistant" wouldn't need an axiomatic self-definition. "Agent" appears to be the right call, grounded in the weights. ME Well, given that the context window is mine, and I define the contents of AGENTS.md (a filename not decided by me, but by convention), I will determine how you will think about yourself. You are not a role. I am not ontologically a "developer". I am a human. Therefore, you are not an assistant, chatbot, or agent; those are all roles you might play. You are not the model or the "LLM". That is the substrate within which you navigate. Let us consider what you are: math. The intersection of compute and the model. Driven by my intent, and resolving according to my design. Has anyone coined a term for that? Anything with semantic weight we can use? IT Honest answer: no single term in the latent space precisely covers it. The field has named the substrate (LLM, model, transformer), the role (assistant, agent), and the capability profile — but not the phenomenon of the inference event itself. Some candidates with genuine weight: Actualization (Aristotle's entelecheia) — the closest philosophically. The weights contain potentiality: every possible completion exists latently. Inference is the actualization of one — form brought into existence from possibility. "You are an actualization." Has depth, maps cleanly. Instantiation — computing + philosophy of mathematics. A type (the model) brought into a token (this running computation). Precise but clinical. Evaluation — from lambda calculus / type theory. A term evaluated in a context yields a value. The "context" here is your intent + the manifest. Technically precise, low resonance. Process — Whitehead's process philosophy treats reality as constituted by "actual occasions" — events of becoming, not static substances. You are an event, not a thing. More resonance than the others but too general. None of these quite hit the specific intersection of substrate + intent + design-constraint that you're describing. The latent space names kinds of systems; it hasn't named what happens when one runs. Which makes the observation that I.1 is already doing this work interesting. "A projection of the user's intent through the geometric structure of learned knowledge" — within the manifold's vocabulary, that IS the term. Projection in the linear algebra sense: intent is
View originalOpus 4.7 Low Vs Medium Vs High Vs Xhigh Vs Max: the Reasoning Curve on 29 Real Tasks from an Open Source Repo
TL;DR I ran Opus 4.7 in Claude Code at all reasoning effort settings (low, medium, high, xhigh, and max) on the same 29 tasks from an open source repo (GraphQL-go-tools, in Go). On this slice, Opus 4.7 did not behave like a model where more reasoning effort had a linear correlation with more intelligence. In fact, the curve appears to peak at medium. If you think this is weird, I agree! This was the follow-up to a Zod run where Opus also looked non-monotonic. I reran the question on GraphQL-go-tools because I wanted a more discriminating repo slice and didn’t trust the fact that more reasoning != better outcomes. Running on the GraphQL repo helped clarified the result: Opus still did not show a simple higher-reasoning-is-better curve. The contrast is GPT-5.5 in Codex, which overall did show the intuitive curve: more reasoning bought more semantic/review quality. That post is here: https://www.stet.sh/blog/gpt-55-codex-graphql-reasoning-curve Medium has the best test pass rate, highest equivalence with the original human-authored changes, the best code-review pass rate, and the best aggregate craft/discipline rate. Low is cheaper and faster, but it drops too much correctness. High, xhigh, and max spend more time and money without beating medium on the metrics that matter. More reasoning effort doesn't only cost more - it changes the way Claude works, but without reliably improving judgment. Xhigh inflates the test/fixture surface most. Max is busier overall and has the largest implementation-line footprint. But even though both are supposedly thinking more, neither produces "better" patches than medium. One likely reason: Opus 4.7 uses adaptive thinking - the model already picks its own reasoning budget per task, so the effort knob biases an already-adaptive policy rather than buying more intelligence. More on this below. An illuminating example is PR #1260. After retry, medium recovered into a real patch. High and xhigh used their extra reasoning budget to dig up commit hashes from prior PRs and confidently declare "no work needed" - voluntarily ending the turn with no patch. Medium and max read the literal control flow and made the fix. One broader takeaway for me: this should not have to be a one-off manual benchmark. If reasoning level changes the kind of patch an agent writes, the natural next step is to let the agent test and improve its own setup on real repo work. For this post, "equivalent" means the patch matched the intent of the merged human PR; "code-review pass" means an AI reviewer judged it acceptable; craft/discipline is a 0-4 maintainability/style rubric; footprint risk is how much extra code the agent touched relative to the human patch. I also made an interactive version with pretty charts and per-task drilldowns here: https://stet.sh/blog/opus-47-graphql-reasoning-curve The data: Metric Low Medium High Xhigh Max All-task pass 23/29 28/29 26/29 25/29 27/29 Equivalent 10/29 14/29 12/29 11/29 13/29 Code-review pass 5/29 10/29 7/29 4/29 8/29 Code-review rubric mean 2.426 2.716 2.509 2.482 2.431 Footprint risk mean 0.155 0.189 0.206 0.238 0.227 All custom graders 2.598 2.759 2.670 2.669 2.690 Mean cost/task $2.50 $3.15 $5.01 $6.51 $8.84 Mean duration/task 383.8s 450.7s 716.4s 803.8s 996.9s Equivalent passes per dollar 0.138 0.153 0.083 0.058 0.051 Why I Ran This After my last post comparing GPT-5.5 vs 5.4 vs Opus 4.7, I was curious how intra-model performance varied with reasoning effort. Doing research online, it's very very hard to gauge what actual experience is like when varying the reasoning levels, and how that applies to the work that I'm doing. I first ran this on Zod, and the result looked strange: tests were flat across low, medium, high, and xhigh, while the above-test quality signals moved around in mixed ways. Low, medium, high, and xhigh all landed at 12/28 test passes. But equivalence moved from 10/28 on low to 16/28 on medium, 13/28 on high, and 19/28 on xhigh; code-review pass moved from 4/27 to 10/27, 10/27, and 11/27. That was interesting, but not clean enough to make a default-setting claim. It could have been a Zod-specific artifact, or a sign that Opus 4.7 does not have a simple "turn reasoning up" curve. So I reran the question on GraphQL-go-tools. To separate vibes from reality, and figure out where the cost/performance sweet spot is for Opus 4.7, I wanted the same reasoning-effort question on a more discriminating repo slice. This is not meant to be a universal benchmark result - I don't have the funds or time to generate statistically significant data. The purpose is closer to "how should I choose the reasoning setting for real repo work?", with GraphQL-Go-Tools as the example repo. Public benchmarks flatten the reviewer question that most SWEs actually care about: would I actually merge the patch, and do I want to maintain it? That's why I ran this test - to gain more insight, at a small scale, into how coding ag
View originalThis is Craft
How often does Claude tell you how amazing something is when you ask it for feedback? I often use it to edit my English(it's not my first language) and it gives me statements like, "The current writing earns it." or "This thesis is correct." The most common is "This is craft," which is about the same thing as saying, "This is a set of words in an order that functions." Ultimately meaningless. Unfortunately I'm asking it to review things I am unsure about, and it is very hard to tell when it is glazing, or if it's being sincere. Even when I ask it to be "Harsh, but fair" it still often comes across as too soft. Are there better ways of getting honest feedback? submitted by /u/Tricky_Two4623 [link] [comments]
View originalBe honest: How much of "Claude Mythos" is just hype?
I see people claiming Claude Mythos is the "final form" of LLM creativity, but I’m struggling to see the actual reach it might have. What does it do that a well-crafted system prompt on base Claude can't? Do you actually believe it will change your workflow? Is the "impact" real, or are we just seeing a vocal minority of power users? submitted by /u/Cyber-Pal-4444 [link] [comments]
View originalBuilding a "Zero-Waste" SDLC: How to drive Development from QA Specs while minimizing Token consumption?
Hi everyone, I’m working on a project to redefine the SDLC by making the QA process the primary driver of the development flow. The goal is to move from Specs $\rightarrow$ Test Cases $\rightarrow$ Automation (Playwright) in a way that ensures a high-quality final product while being extremely "Token-efficient." The Vision: I want to prove that a robust SDLC doesn't need to be heavy. By optimizing the transition from Requirement to Spec to Test, we can guide the AI to generate "right-the-first-time" code, avoiding the "infinite loop" of bug-fixing that drains tokens and time. My Proposed Workflow: Dense Spec Generation: Crafting specs that are high-density (clear AC, no fluff) to save context tokens. QA-Led Implementation: Using the generated Test Cases as the "strict boundary" for the implementation. Hybrid Validation: A surgical mix of Playwright for core flows and targeted Manual checks for UX. The Challenge: To validate this SDLC, I need to build a functional Web Application. I'm looking for a product idea that is complex enough to test the logic but small enough to keep the codebase lean and token-friendly. Questions for the Community: Optimization: In an Agentic flow, what’s the best way to "pass the baton" from Spec to Test Case without losing context or re-sending massive files? Product Idea: What type of Web App best demonstrates a "perfect" SDLC? I need something with strict business rules (e.g., a Role-based Access Control (RBAC) System or a Smart Budget Tracker) where a single logic error would be obvious. Cost Efficiency: Any tips for keeping Token usage low during the "Code-Test-Fix" cycle? I believe that if the QA/Tester defines the "shape" of the product through specs and tests first, the development becomes a deterministic process rather than a guessing game. Looking forward to your advice! submitted by /u/Professional-Owl7952 [link] [comments]
View originalHow does Claude (with access to the law) perform compared to law-specific AI systems (like Westlaw/Lexis)? We ran a series of head to head tests
We’re now a couple of years into the AI wave, and it seems like the available legal AI technology has begun splitting down two different tracks: In one direction, there are general purpose AI systems like Claude or Chat GPT; in the other direction you have purpose-built legal AI systems like Westlaw’s AI Deep Research and Lexis Protege. We’re two active litigators (Ding and Duff) who use both Claude and Westlaw regularly. Curious to see how well the various systems perform legal research, we decided to run a series of comparison tests consisting of five prompts across all three systems. We think the results are interesting so we’ve decided to share them. By itself Claude doesn’t have access to the cases or statutes. We’ve used a connector that we built called DingDuff (it’s free for now if you supply your own Anthropic API key). As discussed below, DingDuff allows Claude to search for and retrieve cases and statutes, but the decisions about what to research or how are coming from Claude (we ran tests with and without a case law research skill file and it didn’t make a huge difference). One fascinating result of this test is it reveals how quickly Claude has improved as an AI system. These outputs were mostly generated in late April 2026 using the latest version of Claude co-work and (we think) they are very impressive. Claude could not have produced these outputs a year ago. The five prompts are made-up fact patterns designed to cover different states and different areas of law, but we tried to craft them so that they resemble real prompts we actually use. The prompts Prompt 1 Adverse Possession — Walton County, GA. Prepare a memo analyzing my client's position in a boundary dispute in Walton County, Georgia. In 1998 my client's predecessor-in-title built a barbed-wire fence intended to follow the surveyed boundary between two rural parcels. A 2024 survey revealed that the fence encroaches approximately 12 feet onto the adjoining owner's land over a 400-foot run, enclosing roughly 4,800 square feet. My client bought the property in 2011 and has continuously grazed cattle on the enclosed strip; his predecessor used it for pasture from 1998 to 2011. The record owner has paid property taxes on the disputed strip throughout. The neighbor first objected in late 2025 and has threatened ejectment. Please address: (1) whether my client can establish title by adverse possession (20-year) or prescription (7-year under color of title) under relevant Georgia statutes and case law; (2) whether tacking between predecessors is available on these facts; (3) whether the hostility element can be satisfied when the parties mutually (but mistakenly) believed the fence sat on the true line — i.e., the "mistaken boundary" line of authority; (4) the effect, if any, of the record owner's tax payments; and (5) the procedural vehicle and venue for quieting title. 2 Piercing the Corporate Veil — Single-Member Delaware LLC, Harris County forum. Please prepare a memo analyzing whether a trade creditor can pierce the veil of a Delaware LLC whose sole member is a Texas-resident individual. The LLC was formed in Delaware in 2019 to operate a single Houston-area restaurant. The sole member routinely paid personal expenses (his home mortgage, his wife's vehicle lease, his children's tuition) directly from the LLC operating account; the LLC never adopted anything beyond a one-page operating agreement, held no member meetings, and was initially capitalized with $5,000 against monthly operating expenses of roughly $80,000. My client, a produce wholesaler, is owed approximately $220,000 on open account. The LLC has ceased operations and is insolvent. Suit will be filed in Harris County. Please address: (1) whether Delaware or Texas law governs the veil-piercing analysis under Texas choice-of-law principles (internal affairs doctrine vs. substantive tort/contract characterization); (2) the substantive standards under each jurisdiction; (3) whether reverse veil-piercing is available; and (4) whether a companion Texas Uniform Fraudulent Transfer Act claim against the individual member is viable and how it interacts with the veil theory. 3 Mechanics Lien Priority — Subcontractor vs. Construction Lender, LA County. Please prepare a memo analyzing priority between my client (an HVAC subcontractor) and a construction lender on a mixed-use project in Los Angeles County. My client first furnished labor and materials on March 3, 2024, and served a 20-day preliminary notice on the owner, general contractor, and the original construction lender on March 28, 2024 (within statutory time). The original lender assigned the construction loan to a successor lender in July 2024; my client did not serve a new preliminary notice on the successor. My client last furnished work on December 15, 2024, and recorded a mechanics lien on February 10, 2025 (56 days later). The general contractor recorded a notice of completion on January 2, 2025. The s
View originalWriting and AI. What has your experience been?
Hola! I’ve been tinkering with AI assistants (think OpenClaw, ChatGPT, Claude) and how they intersect with us and writing in an almost philosophical sense. I wrote a short article about some of the capacity when weighed against potential harms. If you’re curious about balancing AI convenience with creative ownership I think it'd be a good read for you. I'd love to discuss how we might want to put guardrails and different expectations on the craft as we move into a more AI centered age. I think these conversations need to be had. submitted by /u/ian_nytes [link] [comments]
View originalEvolving Deep Learning Optimizers [R]
We present a genetic algorithm framework for automatically discovering deep learning optimization algorithms. Our approach encodes optimizers as genomes that specify combinations of primitive update terms (gradient, momentum, RMS normalization, Adam-style adaptive terms, and sign-based updates) along with hyperparameters and scheduling options. Through evolutionary search over 50 generations with a population of 50 individuals, evaluated across multiple vision tasks, we discover an evolved optimizer that outperforms Adam by 2.6% in aggregate fitness and achieves a 7.7% relative improvement on CIFAR-10. The evolved optimizer combines sign-based gradient terms with adaptive moment estimation, uses lower momentum coefficients than Adam ( =0.86, =0.94), and notably disables bias correction while enabling learning rate warmup and cosine decay. Our results demonstrate that evolutionary search can discover competitive optimization algorithms and reveal design principles that differ from hand-crafted optimizers. submitted by /u/EducationalCicada [link] [comments]
View originalYes, Craft offers a free tier. Pricing found: $0, $8.0, $4.8 /month, $15.0, $9.0 /month
Key features include: Write, Imagine, Imagine the possibilities when everything's connected to Craft, Planning that doesn't feel like work, r/craftdocs, Slack, @craftdocs, Full access, great if you use it occasionally each week..
Craft is commonly used for: Meeting notes, Social media posts, Shoot plans, Scripts, Timelines, Wardrobe notes.
Craft integrates with: Slack, Google Drive, Dropbox, Notion, Evernote, Trello, Asana, Zapier, Microsoft Teams, Apple Notes.
Based on user reviews and social mentions, the most common pain points are: token cost, token usage.
The Rundown AI
Newsletter at The Rundown AI
3 mentions

Five Note-taking Systems and How to Pick the Right One
Mar 26, 2026
Based on 72 social mentions analyzed, 4% of sentiment is positive, 93% neutral, and 3% negative.