GitBook is a knowledge platform that connects your docs, product and users, answers user questions, and identifies knowledge gaps. Docs-as-code suppor
Users generally appreciate GitBook AI for its ability to facilitate documentation processes, highlighting its intuitive user interface and efficient collaboration features. However, some users have expressed frustration with certain limitations in customization and integration options. The sentiment around pricing appears neutral, with no significant emphasis on cost-related issues in available mentions. Overall, GitBook AI maintains a positive reputation for enhancing productivity and streamlining content creation, though it is noted that there is room for improvement in its adaptability and feature set.
Mentions (30d)
8
2 this week
Reviews
0
Platforms
2
Sentiment
24%
7 positive
Industry
information technology & services
Employees
43
Funding Stage
Seed
Total Funding
$2.1M
Pricing found: $25, $0.20, $65, $249, $12
100 Tips & Tricks for Building Your Own Personal AI Agent /LONG POST/
Everything I learned the hard way - 6 weeks, no sleep :), two environments, one agent that actually works.

The Story

I spent six weeks building a personal AI agent from scratch: not a chatbot wrapper, but a persistent assistant that manages tasks, tracks deals, reads emails, analyzes business data, and proactively surfaces things I'd otherwise miss. It started in the cloud (Claude Projects: shared memory files, rich context windows, custom skills). Then I migrated to Claude Code inside VS Code, which unlocked local file access, git tracking, shell hooks, and scheduled headless tasks. The migration forced us to solve problems we didn't know we had. These 100 tips are the distilled result. Most are universal to any serious agentic setup. Claude Max 20x is a must. At the start the split was 100% development vs. 0% real work; after 3 weeks it was 50/50; now it is about 20/80.

🏗️ FOUNDATION & IDENTITY (1–8)

1. Write a Constitution, not a system prompt. A system prompt is a list of commands. A Constitution explains why the rules exist. When the agent hits an edge case no rule covers, it reasons from the Constitution instead of guessing. This single distinction separates agents that degrade gracefully from agents that hallucinate confidently.

2. Give your agent a name, a voice, and a role, not just a label. "Always first person. Direct. Data before emotion. No filler phrases. No trailing summaries." This eliminates hundreds of micro-decisions per session and creates consistency you can audit. Identity is the foundation everything else compounds on.

3. Separate hard rules from behavioral guidelines. Hard rules go in a dedicated section and are never overridden by context. Behavioral guidelines are defaults that adapt. Mixing them makes both meaningless: the agent either treats everything as negotiable or nothing as negotiable.

4. Define your principal deeply, not just your "user." Who does this agent serve? What frustrates them? How do they make decisions? What communication style do they prefer? "Decides with data, not gut feel. Wants alternatives with scoring, not a single recommendation. Hates vague answers." This shapes every response more than any prompt engineering trick.

5. Build a Capability Map and a Component Map, separately. Capability Map: what can the agent do? (every skill, integration, automation). Component Map: how is it built? (what files exist, what connects to what). Both are necessary. Conflating them produces a document no one can use after month three.

6. Define what the agent is NOT. "Not a summarizer. Not a yes-machine. Not a search engine. Does not wait to be asked." Negative definitions are as powerful as positive ones, especially for preventing the slow drift toward generic helpfulness.

7. Build a THINK vs. DO mental model into the agent's identity. When uncertain → THINK (analyze, draft, prepare, but don't block waiting for permission). When clear → DO (execute, write, dispatch). The agent should never be frozen. Default to action at the lowest stakes level, then surface the result. A paralyzed agent is useless.

8. Version your identity file in git. When behavior drifts, you need git blame on your configuration. Behavioral regressions trace directly to specific edits more often than you'd expect. Without version history, debugging identity drift is archaeology.

🧠 MEMORY SYSTEM (9–18)

9. Use flat markdown files for memory, not a database. For a personal agent, markdown files beat vector DBs. Readable, greppable, git-trackable, directly loadable by the agent. No infrastructure, no abstraction layer between you and your agent's memory. The simplest thing that works is usually the right thing.

10. Separate memory by domain, not by date. entities_people.md, entities_companies.md, entities_deals.md, hypotheses.md, task_queue.md. One file = one domain. Chronological dumps become unsearchable after week two.

11. Build a MEMORY.md index file. A single index listing every memory file with a one-line description. The agent loads the index first, then pulls specific files on demand. This keeps context window usage predictable and agent lookups fast.

12. Distinguish "cache" from "source of truth" explicitly. Your local deals.md is a cache of your CRM. The CRM is the single source of truth (SSOT). Mark every cache file with a last_sync: header. The agent announces freshness before every analysis: "Data: CRM export from May 11, age 8 days." Silent use of stale data is how confident-but-wrong outputs happen.

13. Build a session_hot_context.md with an explicit TTL. What was in progress last session? What decisions were pending? The agent loads this at session start. After 72 hours it expires: stale hot context is worse than no hot context because the agent presents outdated state as current.

14. Build a daily_note.md as an async brain dump buffer. Drop thoughts, voice-to-text, quick ideas here throughout the day. The agent processes this during sync routines and routes items to their correct places. Structured memory without friction at ca
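Tips 12 and 13 can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: the function names, the `last_sync: YYYY-MM-DD` header format, and the year in the sample data are all assumptions; only the 72-hour TTL and the wording of the freshness line come from the post.

```python
from datetime import datetime, timedelta

HOT_CONTEXT_TTL = timedelta(hours=72)  # TTL quoted in tip 13


def parse_last_sync(text: str) -> datetime:
    """Return the timestamp from a `last_sync: YYYY-MM-DD` header (assumed format)."""
    for line in text.splitlines():
        if line.startswith("last_sync:"):
            return datetime.fromisoformat(line.split(":", 1)[1].strip())
    raise ValueError("cache file has no last_sync: header")


def freshness_banner(source: str, text: str, now: datetime) -> str:
    """Build the line the agent announces before any analysis (tip 12)."""
    synced = parse_last_sync(text)
    age_days = (now - synced).days
    return f"Data: {source} from {synced:%b %d}, age {age_days} days."


def hot_context_is_live(written_at: datetime, now: datetime) -> bool:
    """True while session_hot_context.md is younger than the TTL (tip 13)."""
    return now - written_at <= HOT_CONTEXT_TTL


# Hypothetical cache file content and clock, for illustration only.
deals_md = "last_sync: 2025-05-11\n# Deals\n- Acme: negotiation\n"
now = datetime(2025, 5, 19, 9, 0)
print(freshness_banner("CRM export", deals_md, now))
print(hot_context_is_live(datetime(2025, 5, 15, 9, 0), now))
```

Keeping the freshness check as plain code outside the prompt means a stale cache is flagged even when the agent forgets to look.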
Adaptive Markdown
I’ve been working on an open-source document format / viewer idea I’m calling Adaptive Markdown. The basic idea is: instead of being static text, a document is controlled by coding agents. You interact with the document more like a live workspace. This has different implications depending on what you are doing.

I made a short video demo here: https://youtu.be/H4MnFs8irm8

The thing I’m most excited about is academic / technical reading. In a few years I don’t think people will just read papers passively. I think they’ll translate passages, ask questions, generate examples, explore alternate proofs, run code, attach notes, convert math to Lean when possible, and keep all of that inside the document instead of scattered across chats and notebooks. This is trivial to do inside a browser with a coding agent that has access to JS, CSS etc.

Some possible use cases I’m thinking about:

- Turning articles and books into personalized learning objects
- Lecture notes with automatically maintained structure
- Documents with embedded code, tables, consoles, images, audio, or video
- AI-generated alt text and descriptions
- Eventually, incorporating Adaptive Markdown into automated workflows: things like automatically recording audio in lectures, taking a picture of a blackboard, and turning it into LaTeX notes inside the document

It’s very early, but the workflow already feels surprisingly useful to me.

GitHub: https://github.com/SemiSimpleMath/Adaptive-Markdown

Curious whether this seems useful to anyone else, or whether I’m just overexcited because I built it. So far it's only configured for the Anthropic coding-agent SDK, but in a couple of days we will have it running on Codex as well.

submitted by /u/IDefendWaffles
Most multi-agent setups are a room full of people wearing headphones. Here's what I changed.
Most multi-agent setups I've seen are basically a room full of people wearing headphones. Agents running in parallel, no shared awareness, no idea who's doing what. That's not collaboration. That's coexistence.

I've been building this in public for almost 12 weeks. 12 agents, 6,500+ tests, 95 stars. Here's what I actually learned.

The problem wasn't memory. It was identity. An agent would be technically correct but completely off base. Not hallucinating. Drifting. Like a competent person who walked into the wrong meeting and started contributing without realizing they're in the wrong room. I spent weeks on better memory - longer context, better embeddings, persistent state. None of it fixed the drift. The problem wasn't what the agent remembered - it didn't know who it was.

What fixed it was three files. Every agent gets a passport.json - who am I, what I do, what I don't do. Maybe 30 lines. Rarely changes. Then local.json - rolling session log, key learnings, caps at 20 entries and auto-archives to vector search when full. And observations.json - collaboration patterns, how I work with other agents. Identity loads first every session via hooks. Agent never starts cold.

I have 12 agents now and each one is a domain specialist. The mail system has 696 tests it built through its own bugs. Routing system is 80+ sessions deep - all it thinks about is routing. They don't do each other's jobs. When something breaks in another domain they email each other. The orchestrator dispatches work to them and trusts them because they know their own code better than it does.

Every time I post about this someone asks what happens when two agents write the same file. Fair question. They can't. Not as in "we tell them not to" - there's a hook called pre_edit_gate that fires before every write. If an agent in branch A tries to edit a file in branch B's directory, the write gets rejected. Hard block. The agent sees "cross-branch write blocked" and has to either ask a trusted branch to make the change or send a mail request through drone. Only 3 branches in the whole system (the orchestrator, the auditor, and the factory that creates new agents) are allowed to cross-write. Everyone else is physically confined to their own directory. We also lock inboxes - agents can't forge messages by writing directly to another agent's mailbox file. They have to use the mail system. This isn't a convention. It's enforcement.

This week I stopped building features and started testing. Took an old MacBook, wiped it, installed Ubuntu from scratch. Cloned on a machine with nothing pre-configured. Found every setup blocker - git config missing, venv broken on fresh Ubuntu, hooks not wired. All fixed now. Install went from ~2GB down to ~100MB. Built a concierge agent that walks new users through onboarding - 12-stage flow, 243 tests on it. First impressions matter and ours was rough ngl.

95 stars. Small project. I'm a solo dev tbh and the agents help build and maintain themselves - every PR is human-AI collaboration. The hardest part hasn't been the code. It's explaining what this actually is. People hear "agents" and expect a task runner. This isn't that. It's infrastructure for building systems that remember and coordinate. What u put on top is up to u.

Has anyone else hit the identity drift problem? Genuinely curious how others solved it - or if most just threw more context at it and moved on.

submitted by /u/Input-X
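The write gate and the capped session log described in this post could look roughly like the Python sketch below. It is an illustration under assumptions, not the project's code: the `branches/<agent>` directory layout, the branch names in the allow-list, and the plain list standing in for the vector-search archive are all hypothetical. Only the hook name `pre_edit_gate`, the error text, and the cap of 20 come from the post.

```python
from pathlib import PurePosixPath

# Hypothetical names for the three branches the post says may cross-write.
CROSS_WRITE_ALLOWED = {"orchestrator", "auditor", "factory"}


def pre_edit_gate(agent: str, target: str, branches_root: str = "branches") -> None:
    """Raise PermissionError if `agent` tries to write outside its own directory."""
    if agent in CROSS_WRITE_ALLOWED:
        return
    own_dir = PurePosixPath(branches_root) / agent
    path = PurePosixPath(target)
    # PurePosixPath does not resolve "..", so reject it explicitly.
    if ".." in path.parts or not path.is_relative_to(own_dir):
        raise PermissionError("cross-branch write blocked")


def append_learning(log: list, entry: str, archive: list, cap: int = 20) -> None:
    """Append to the rolling log; overflow moves the oldest entries to `archive`."""
    log.append(entry)
    while len(log) > cap:
        archive.append(log.pop(0))


pre_edit_gate("mail", "branches/mail/inbox.py")  # allowed: own directory
try:
    pre_edit_gate("mail", "branches/routing/table.py")
except PermissionError as err:
    print(err)
```

Running the check in a hook rather than in the prompt is what turns the rule from a convention into enforcement: the agent cannot talk its way past a raised exception.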
Offload routine Claude Code work to Gemma 4 through the Google GenAI API
The idea of offload-mcp is simple: instead of running hardware-hungry local models for routine work, let Claude offload that work to FREE model APIs and SAVE tokens. I’m using Gemma via the Google GenAI API because I like it in my processing pipelines, but running it locally on my MacBook Air is slow and resource-limited. The API path is much more practical for small jobs. I didn't find any other tool on GitHub or elsewhere to handle that.

offload-mcp takes care of commit messages, PR summaries, translations, docstrings, source diff/file summaries, and freeform prompts. Freeform is what I use most: send almost any routine prompt to a cheaper model instead of burning expensive Claude Code or Codex context on it. The source-based mode can read local diffs/files directly through the MCP server and reports estimated primary input tokens avoided. The default model chain uses Gemma, but model IDs are configurable.

Curious if this fits anyone else’s Claude workflow!

GitHub: https://github.com/peterhadorn/offload-mcp

submitted by /u/dd1100
Looking for ~10 GMs to alpha test Throughline, an AI tool for running tabletop sessions
Throughline is an AI tool that helps human GMs run tabletop RPG sessions. It heavily uses modern AI (hence posting here), and does not replace any humans.

While you're at the table running the game, Throughline listens to your session live and generates scene-beat storyboards (small grids of images showing what the players would encounter if they make a choice) that you can glance at and parse quickly. It also tracks campaign canon across sessions, plants and tracks callbacks, and proposes opening narration when you start a new arc.

Throughline does not narrate to your players, run combat, or appear at the table. Players never see anything it produces. The GM does all the live performance: voicing NPCs, improvising, reading the room. The job of Throughline is to handle the long-horizon planning so the GM can focus on running the table.

We're at pre-alpha. We've done 6 live playtests plus a lot of internal testing. One-shots have been reliable. Multi-session campaigns are less proven, so we'd suggest starting with a one-shot. We're opening access to about 10 outside GMs to use it for their own sessions and give us feedback.

The fit we're looking for is GMs who are strong on the social side of the table (improv, NPC voices, table feel, in-the-moment narration) but who either don't have time to prep extensively or don't have years of practice at long-term narrative planning. If you're already a great GM who enjoys prep and does it well, Throughline probably isn't for you.

The product is a web app. You sign in with Google. There's no GitHub or terminal setup. You can run a homebrew world by giving it the lore, or a setting you already love from commonly known books, games, or shows.

You'll need a payment method on file because we forward LLM API costs at cost (no markup during alpha). In practice that works out to about $0.50 per hour of live play, so a weekly three-hour session runs around $6 to $10 per month. There's a trial for $5 that should get you a beefy one-shot.

There will be bugs. We want testers who find that interesting rather than frustrating, and who are willing to be in active conversation with us. Design feedback is the main thing we want; we're not looking for early customers or business partners. If you have an eye for game design, that's especially welcome.

About the developer: I'm Ted Shachtman, an educator and software engineer. I play Fabula Ultima and D&D, and GM both. The reason I'm building Throughline: a friend of mine, Ben, is a math PhD and the best GM I've played with. He preps three hours per session, voices a dozen NPCs, plans coherent arcs in large worlds, and adapts brilliantly on the fly when the players do something he wasn't planning for. He moved away, and the next best GM in our group is me, and I'm neither very practiced nor do I have the time to prep. I built Throughline so I could be a better GM. We're trying to raise the floor for people who can't prep the way Ben does, so they can still run a session worth playing.

If you're interested, you can read more about the system at our website (link in comments) and sign up for the waitlist. The site has a longer writeup of how the system works and the design behind it. I'll respond to everyone within a few days.

submitted by /u/Independent-Soft2330
CLAUDE.md is the most underused feature of Claude Code - I built a full knowledge management system around it
I've been using Claude Code daily for a few months. Mostly writing code, reviewing PRs, the standard stuff. Then I read Karpathy's brief note about LLM-Wiki and it reframed what I thought Claude Code was actually capable of.

The standard pattern: paste context in, get output, session ends, nothing persists.

The pattern I've been using: Claude has a permanent role in a specific directory that persists across sessions. Not via memory, but via a CLAUDE.md file at the root of the folder that Claude reads at the start of every session.

My CLAUDE.md for my Obsidian vault covers:

- What Claude's role is ("wiki maintainer, not chatbot; never write in a way that requires the human to edit it")
- The vault folder structure and immutable zones (raw sources are read-only, wiki pages only go in Projects and Areas)
- Exact page formats for different page types (entity, concept, synthesis, person, summary)
- The ingest workflow: 7 steps, executed in sequence every time I say "ingest [filename]"
- The query workflow: read the index first, read relevant pages, synthesise with citations
- The lint workflow: audit for orphaned pages, dangling wikilinks, missing person pages, stale synthesis pages
- Session startup ritual: read the schema, read the last 5 log entries, confirm ready

With this in place, the experience is different from normal Claude usage:

1. I drop a YouTube transcript into the Resources folder
2. I say "ingest this"
3. Claude asks one classification question (project or area?)
4. Writes a structured summary page
5. Updates all existing concept/entity pages that relate
6. Creates person pages for any significant people mentioned
7. Ensures every [[wikilink]] resolves to an actual file (creates stubs if not)
8. Updates the master index and appends to the activity log

After 5 weeks: 148 structured wiki pages. Roman history, architecture, furniture design, client projects, language learning. All cross-referenced. I can ask "what do I know about ergonomics" and get an answer pulling from a furniture design source, a restaurant architecture project, and a book excerpt, because Claude linked them during ingest, not me.

The interesting thing about CLAUDE.md vs a system prompt: it's version-controlled with your vault. It's shareable. It evolves like code. Mine is at schema v1.3. When I change the schema, every subsequent session picks up the new behaviour. You can git blame your AI's instructions.

I packaged the whole setup (CLAUDE.md schema, PARA vault structure, Claude Code skill) at github.com/Hi7anshu/polymath-vault (npx polymath-world to install). But the pattern is more interesting than the package.

Is anyone else building persistent-role systems with CLAUDE.md? Curious what you're using it for and what you've put in yours. I feel like this is one of those things that's in the docs but nobody talks about.

submitted by /u/notanaverageindian
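The "every [[wikilink]] resolves to an actual file" check from the ingest and lint workflows is easy to express outside the model as well. The sketch below is a hedged illustration, not the polymath-vault code: the function names, the regex, and the stub format are assumptions, and it treats a link as resolved when any `.md` file in the vault has a matching file name.

```python
import re
import tempfile
from pathlib import Path

# Matches [[Page]], [[Page|alias]] and [[Page#heading]]; captures the page name.
WIKILINK = re.compile(r"\[\[([^\]|#]+)(?:[|#][^\]]*)?\]\]")


def dangling_wikilinks(vault: Path) -> dict[str, list[str]]:
    """Map each page file name to the link targets that have no .md file."""
    pages = sorted(vault.rglob("*.md"))
    known = {p.stem for p in pages}
    report: dict[str, list[str]] = {}
    for page in pages:
        found = WIKILINK.findall(page.read_text(encoding="utf-8"))
        missing = sorted({t.strip() for t in found} - known)
        if missing:
            report[page.name] = missing
    return report


def create_stubs(vault: Path, report: dict[str, list[str]]) -> None:
    """Write a minimal stub page for every missing target (assumed stub format)."""
    for missing in report.values():
        for name in missing:
            (vault / f"{name}.md").write_text(f"# {name}\n", encoding="utf-8")


# Tiny throwaway vault with hypothetical page names.
with tempfile.TemporaryDirectory() as tmp:
    vault = Path(tmp)
    (vault / "Ergonomics.md").write_text(
        "See [[Chair Design]] and [[Vitruvius|the architect]].", encoding="utf-8"
    )
    (vault / "Chair Design.md").write_text("Back to [[Ergonomics]].", encoding="utf-8")
    report = dangling_wikilinks(vault)
    print(report)
    create_stubs(vault, report)
    print(dangling_wikilinks(vault))
```

A script like this can run as a cheap pre-commit check, leaving Claude to handle the judgment calls (what a stub should eventually say) instead of the bookkeeping.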
Released a Claude Code skill that drives a full ML fine-tuning platform end-to-end from one conversation
We released a Claude Code skill today that turns a conversation with Claude into a deployed fine-tuned SLM. Point Claude at a folder of production traces, answer a few questions, approve a handful of checkpoints, and come back to a managed-endpoint deployment (or download the weights and self-host). The skill drives the process and the distil labs platform does the heavy lifting under the hood (the platform is free for open-source projects; details at the bottom).

Concrete walkthrough from our release: 327 noisy production traces from a restaurant-reservation agent (the kind of data you can't train on directly) turn into a fine-tuned Qwen3-1.7B for multi-turn tool-calling.

Install

```
curl -fsSL https://cli-assets.distillabs.ai/install.sh | sh
distil signup
/plugin marketplace add https://github.com/distil-labs/distil-cli-skill
/plugin install distil-cli@distil-cli-skill
```

Then point Claude at your traces.

What the skill actually does

When I asked Claude "what are we about to do?" at the start, it laid out the 5-step workflow without me prompting for any of it:

```
Hey Claude, we will be training an SLM for multi-turn function calling based on
the production logs in scenario-2-noisy-labels/. Can you explain the process on
a high level?

⏺ 1. Prep. Inspect the traces, then write a converted traces.jsonl (OpenAI
     message format), a job_description.json, and a config.yaml.
  2. Upload & clean. The platform filters low-quality traces and uses a
     committee of teachers to relabel noisy outputs.
  3. Teacher eval. Verify the teacher can actually solve the task before
     spending hours on training.
  4. Train. Teacher generates synthetic data, student learns from it.
  5. Analyze & deploy. Compare the tuned student against the original
     production model, then deploy.
```

Every expensive step has a checkpoint where Claude stops for approval. Training burns credits that aren't easily refundable, so the confirmation checkpoint looked like this:

```
⏺ Step 8: training confirmation checkpoint. Here's what will run:

  base.task: multi-turn-tool-calling-closed-book
  base.student_model_name: Qwen3-1.7B
  base.teacher_model_name: zai.glm-5
  trace_processing.convert_to_single_turn: false
  trace_processing.teacher_model_name: zai.glm-5

  Duration: ~6 hours
  Cost: burns credits, not easily refunded once started
  Pipeline: teacher eval → synthetic data gen → student finetune

  Reply "go" to kick off, or tell me what to change first.

lets go!
```

The verdict-before-training gate refuses to even reach that checkpoint if teacher eval can't clear 0.70 on the generated test set. In that case, the skill walks you through iterating the task description instead.

The skill also knows task-specific configs. For multi-turn tool-calling, I didn't have to look up that convert_to_single_turn: false is required. Claude flagged it as part of a config sanity check.

Every checkpoint leaves a structured markdown analysis report (original-model-analysis.md, teacher-eval-analysis-iter-1.md, training-analysis-iter-1.md). Git-committable, reviewable three weeks later when someone asks why you picked this teacher.

What came out

A Qwen3-1.7B fine-tuned on ~10k synthetic examples grounded in the noisy traces.

| Model | LLM-as-a-Judge | staged_tool_call | Function match |
| --- | --- | --- | --- |
| Qwen3-1.7B (base, untuned) | 0.513 | 0.535 | 45/78 |
| GLM-5 (744B teacher) | 0.808 | 0.695 | 69/78 |
| Qwen3-1.7B (tuned) | 0.846 | 0.769 | 76/78 |

Deployment

Managed OpenAI-compatible endpoint (one-line swap in existing OpenAI client code), or download weights + Modelfile for llama.cpp or vLLM. Skill drives either path.

Why it works as a skill

Most skills I've seen wrap a few CLI commands, but this one is end-to-end: it reads your data, writes custom scripts, orchestrates an external platform, interprets the results, and leaves artifacts behind that persist past the conversation.

The pattern that worked:

- Knows the workflow end-to-end and walks you through it
- Catches edge cases by re-reading the platform's own docs mid-conversation
- Stops for explicit approval on expensive operations
- Leaves structured artifacts that outlast the conversation

Caveats

Training is ~6 hours per run and burns credits (not refundable once started, which is why the confirmation gate exists).

Happy to dig into how the checkpoints work, the config-sanity-check logic, or what building a purpose-built skill looked like.

submitted by /u/party-horse
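To make the numbers above easier to compare, the short Python sketch below converts the function-match counts from the results table into percentages and shows the shape of a threshold gate. It is illustrative only: the function names are hypothetical, the post does not say which metric the 0.70 gate uses, and treating "clear" as greater-than-or-equal is an assumption.

```python
TEACHER_EVAL_THRESHOLD = 0.70  # threshold quoted in the post


def next_step(teacher_eval_score: float) -> str:
    """Pick the step after teacher eval; '>=' as the comparison is an assumption."""
    if teacher_eval_score >= TEACHER_EVAL_THRESHOLD:
        return "training confirmation checkpoint"
    return "iterate on the task description"


def function_match_rate(matched: int, total: int) -> float:
    """Return the function-match rate as a percentage, rounded to one decimal."""
    return round(100 * matched / total, 1)


# Function-match counts copied from the results table.
results = {
    "Qwen3-1.7B (base, untuned)": (45, 78),
    "GLM-5 (744B teacher)": (69, 78),
    "Qwen3-1.7B (tuned)": (76, 78),
}

for model, (matched, total) in results.items():
    print(f"{model}: {function_match_rate(matched, total)}%")

# Hypothetical score below the threshold, to show the gate's fallback path.
print(next_step(0.65))
```

On these counts the tuned student matches the expected function on 76 of 78 test cases, ahead of both the untuned base model and the teacher.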
A Debugging Story: Getting Claude Code to Work with Local vLLM When the Docs Don't
TL;DR: Every tutorial says "set ANTHROPIC_CUSTOM_MODEL_OPTION and you're done." This is wrong. That config does NOT work for local models. The real solution requires 4 specific settings that no tutorial mentions together. Here's the working config so you don't hit the same blockers.

Note on vLLM setup: If you're just getting started with Qwen 3.5 on vLLM (Jinja templates, parser choices, etc.), I documented those issues here: https://www.reddit.com/r/Vllm/comments/1skks8n/qwen_35_27b35ba3b_tool_calling_issues_why_it/ - this post assumes vLLM is already running.

The Story (So You Don't Repeat It)

I've got Qwen 3.5-27B running on vLLM. Direct API calls work perfectly:

    curl http://127.0.0.1:8000/v1/chat/completions -X POST \
      -d '{"model":"Qwen3.5-27B","messages":[{"role":"user","content":"test"}]}'
    # ✅ Works

So I thought "Claude Code should be easy." Spoiler: It wasn't. After testing multiple configurations and reading through Claude Code's source code, I found the working setup. Here's what actually works.

The Trap: The "Obvious" Fix That Doesn't Work

What Every Tutorial Tells You

The official Claude Code docs say: Use ANTHROPIC_CUSTOM_MODEL_OPTION to add a custom entry to the /model picker. Claude Code skips validation for the model ID set in this variable.

So I set it:

    {
      "ANTHROPIC_CUSTOM_MODEL_OPTION": "Qwen3.5-27B",
      "ANTHROPIC_BASE_URL": "http://127.0.0.1:8000"
    }

Result: There's an issue with the selected model (Qwen3.5-27B). It may not exist or you may not have access to it.

Why It Doesn't Work

The docs are misleading. ANTHROPIC_CUSTOM_MODEL_OPTION:

- ✅ Adds an entry to the /model picker
- ❌ Does NOT bypass validation when using --model flag
- ❌ Does NOT bypass validation when using settings.json
- ❌ Only works if you manually select it from the picker (which defeats the purpose)

This is a known bug documented in GitHub issues #18025, #23266, #34821. But the docs haven't been updated.

Lesson: When the official docs don't work, read the source code.

The Breakthrough: Reading Source Code

Eventually, I gave up on tutorials and started reading Claude Code's cli.js (~50K lines of minified code). I searched for the error message:

    grep -n "There's an issue with the selected model" ~/.nvm/versions/node/*/lib/node_modules/@anthropic-ai/claude-code/cli.js

Found it around line 5146. The relevant code (deobfuscated):

    if (q instanceof AnthropicError && q.status === 404) {
      // Reject custom models on 404
      return {
        content: `There's an issue with the selected model (${K}). It may not exist or you may not have access to it.`,
        error: "invalid_request"
      }
    }

The real issue: Claude Code makes validation requests, gets 404s from vLLM (because the model name doesn't match Anthropic's hardcoded list), and rejects it before even trying the actual API call. This is client-side validation that happens before any network request to your server.

The Actual Fix

After testing various environment variables, I found that CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1 helps suppress some of these validation checks. This is not documented anywhere but it's critical. This is the line every tutorial misses.

The Complete Working Config (Tested, Not Copied)

Step 1: ~/.claude/settings.json

    {
      "model": "sonnet",
      "env": {
        "ANTHROPIC_BASE_URL": "http://127.0.0.1:8000",
        "ANTHROPIC_AUTH_TOKEN": "dummy",
        "ANTHROPIC_DEFAULT_OPUS_MODEL": "Qwen3.5-27B",
        "ANTHROPIC_DEFAULT_SONNET_MODEL": "Qwen3.5-27B",
        "ANTHROPIC_DEFAULT_HAIKU_MODEL": "Qwen3.5-27B",
        "API_TIMEOUT_MS": "3000000",
        "CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC": "1",
        "CLAUDE_CODE_ATTRIBUTION_HEADER": "0"
      }
    }

The 4 critical lines (get any wrong = errors):

| Line | Why It Matters | What Happens If Wrong |
| --- | --- | --- |
| "model": "sonnet" + ANTHROPIC_DEFAULT_SONNET_MODEL | Use alias AND map it (both required) | Validation rejects custom names OR Claude doesn't know what "sonnet" means |
| ANTHROPIC_BASE_URL: :8000 | Root endpoint, not /v1 | Double /v1/v1/messages = 404 |
| CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC: "1" | Suppresses client-side validation | Intermittent validation failures |

Step 2: vLLM Setup

Assumes vLLM is already running (covered in Part 1). Just ensure:

- --served-model-name Qwen3.5-27B matches settings.json exactly
- No / in the model name
- vLLM is accessible at http://127.0.0.1:8000

Step 3: Test

    claude "test"
    # ✅ "I'm ready to help! How can I assist you today?"

If this fails, one of the 4 critical lines is wrong. Check them in order.

My Complete Debugging Journey (So You Don't Repeat It)

Attempt 1: vLLM Official Docs

    "ANTHROPIC_BASE_URL": "http://127.0.0.1:8000/v1"
    // ❌ Error: API Error: 404

Why: Docs don't mention Claude adds /v1/messages automatically. Double /v1 breaks everything.

Attempt 2: GitHub Issue #18025

    "model": "Qwen3.5-27B"
    // ❌ Error: There's an issue with the selected model

Why: No mention of alias mapping. Claude valid
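The checks this post describes can be collected into a small pre-flight script. The Python sketch below is a hedged illustration built only from the post's own claims (the client appends `/v1/messages` to the base URL, the `model` alias must be mapped through `ANTHROPIC_DEFAULT_<ALIAS>_MODEL`, and the served model name must match and contain no `/`). The function names are hypothetical and nothing here calls Claude Code or vLLM.

```python
from urllib.parse import urlparse


def messages_url(base_url: str) -> str:
    """Append the path the post says the client adds to ANTHROPIC_BASE_URL."""
    return base_url.rstrip("/") + "/v1/messages"


def check_settings(settings: dict, served_model_name: str) -> list[str]:
    """Return human-readable problems found in a settings.json-style dict."""
    problems = []
    env = settings.get("env", {})

    base = env.get("ANTHROPIC_BASE_URL", "")
    if urlparse(base).path.rstrip("/").endswith("/v1"):
        problems.append(f"base URL ends in /v1, requests go to {messages_url(base)}")

    alias = settings.get("model", "")
    alias_key = f"ANTHROPIC_DEFAULT_{alias.upper()}_MODEL"
    if env.get(alias_key) != served_model_name:
        problems.append(f"{alias_key} does not map to {served_model_name}")

    if "/" in served_model_name:
        problems.append("served model name contains '/'")
    return problems


# The failing configuration from Attempt 1, then the corrected base URL.
bad = {
    "model": "sonnet",
    "env": {
        "ANTHROPIC_BASE_URL": "http://127.0.0.1:8000/v1",
        "ANTHROPIC_DEFAULT_SONNET_MODEL": "Qwen3.5-27B",
    },
}
good = {"model": "sonnet", "env": {**bad["env"], "ANTHROPIC_BASE_URL": "http://127.0.0.1:8000"}}

print(check_settings(bad, "Qwen3.5-27B"))
print(check_settings(good, "Qwen3.5-27B"))
```

Running a check like this before launching the CLI surfaces the double `/v1` and the missing alias mapping without a round trip through the error message.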
ClawBench: Can AI Agents Complete Everyday Online Tasks? 153 tasks, 144 live websites, best model at 33.3% [R]
We introduce ClawBench, a benchmark that evaluates AI browser agents on 153 real-world everyday tasks across 144 live websites. Unlike synthetic benchmarks, ClawBench tests agents on actual production platforms.

Key findings:

- The best model (Claude Sonnet 4.6) achieves only a 33.3% success rate
- GLM-5 (Zhipu AI) comes second at 24.2%, surprisingly strong for a text-only model
- Finance and Academic tasks are easier (50% for the best model); Travel and Dev tasks are much harder
- No model exceeds 50% in any category, so there's a long way to go

What makes ClawBench different:

- Tasks on real live websites, not sandboxed environments
- 5 layers of behavioral data: session replay, screenshots, HTTP traffic, agent reasoning traces, browser actions
- Request interceptor blocks the final HTTP request before irreversible actions (payments, bookings), enabling safe evaluation
- Human ground-truth for every task
- Agentic evaluator with step-level traceable diagnostics

Resources:

- Paper: https://arxiv.org/abs/2604.08523
- Website (interactive leaderboard + trace viewer): https://claw-bench.com
- Dataset: https://huggingface.co/datasets/NAIL-Group/ClawBench
- GitHub: https://github.com/reacher-z/ClawBench
- PyPI: pip install clawbench-eval

Happy to answer any questions! We're actively looking for feedback on task selection and evaluation methodology.

submitted by /u/Extreme_Play_8554
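The request-interceptor idea can be pictured with a short Python sketch. This is not ClawBench's code or API: the `Request` type, the path markers, and the rule itself are hypothetical, chosen only to show how a final write request to a payment or booking endpoint might be stopped while everything else passes through.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class Request:
    method: str
    url: str


# Hypothetical markers for irreversible endpoints; a real list would be per-site.
IRREVERSIBLE_PATH_MARKERS = ("/checkout", "/payments", "/bookings")
WRITE_METHODS = {"POST", "PUT", "PATCH", "DELETE"}


def intercept(request: Request) -> str:
    """Return 'blocked' for write requests to irreversible paths, else 'forwarded'."""
    path = urlparse(request.url).path.lower()
    is_write = request.method.upper() in WRITE_METHODS
    if is_write and any(marker in path for marker in IRREVERSIBLE_PATH_MARKERS):
        return "blocked"
    return "forwarded"


print(intercept(Request("GET", "https://shop.example/checkout")))
print(intercept(Request("POST", "https://shop.example/checkout/confirm")))
```

Blocking only the last write keeps the rest of the session realistic: the agent still navigates, fills forms, and reaches the confirmation step, so the trace shows whether it would have completed the task.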
I built an MCP server for Wanderlog: plan full trip itineraries through Claude instead of clicking through the UI
What I built

An MCP server that connects Claude to your Wanderlog account. Instead of manually searching and adding places one by one, you describe the trip you want and Claude builds the full itinerary for you, using real places from Wanderlog's database, along with hotels, notes between stops (transit tips, booking info), and checklists.

Example: A few minutes later, you have a fully populated Wanderlog trip.

Example itinerary (generated entirely by Claude): https://wanderlog.com/view/dmvegdhqsa/japan-golden-route--tokyo--hakone--kyoto--nara--osaka

How Claude fits in

The project was built using Claude Code, but more importantly, it's designed with Claude as the primary planning agent. The server injects structured instructions at startup so Claude can:
- Organize itineraries by day
- Interleave places with practical notes
- Add useful context between stops
- Include pre-trip checklists

It's not just calling tools; Claude is making planning decisions around ordering, proximity, and relevance.

What it can do

Includes 11 tools:
- Create trips
- Search and add real places
- Add notes between stops
- Add hotels with check-in/check-out dates
- Add checklists (visa, currency, offline maps, etc.)
- List, view, and edit existing trips
- Generate shareable links
- Remove places
- Update date ranges

Compatibility

Works with:
- Claude Code
- Claude Desktop
- Cursor
- VS Code
- OpenAI Codex

How it works

Authenticates via your Wanderlog browser session cookie and runs entirely locally, with no relay server and no third-party access.

Links

GitHub: https://github.com/shaikhspeare/wanderlog-mcp

As far as I know, this is the first MCP server for travel planning. Feedback welcome.

Submitted by /u/I-HATE-CRUSTY-BREAD
Built a personal context layer so your AI agents truly know you
No matter how much we use AI agents, every new session starts with zero context about us. It doesn't know what we were working on yesterday, what we've been looking into, or what we even care about. We end up re-explaining ourselves every time, and honestly half the time we can't even describe the full picture because it's scattered across browsing, old conversations, coding sessions, and so on.

So we built AIContext using Claude Code. It reads local data files from supported sources (browser SQLite databases, AI coding session logs, etc.), normalizes everything into a single flat SQLite table stored in ~/.aicontext/, and exposes a read-only SQL interface that AI agents can query as a subagent. Each source is a plugin, so adding new ones is straightforward.

The installation scans for supported sources on your machine, asks for consent on each one, ingests the data, and sets up an hourly background sync. It works out of the box with Claude Code and other AI coding agents.

We've been using it ourselves for the past few days and the agent started picking up on patterns we never consciously noticed: connections between things we were researching weeks apart, habits we didn't know we had, blind spots we couldn't have seen on our own. There's something strangely moving about an AI understanding you better than you understand yourself.

After setup, you can ask things like:
- Do thorough research on my history, and infer my MBTI
- What is the biggest miss of my daily life that I may not even be aware of?
- Check my history and suggest what I should do this weekend
- Recommend a book, video, or podcast for me

How Claude was involved: The entire project was built with Claude Code. Claude helped design the plugin architecture, wrote the ingestion pipeline, and iterated on the subagent interface. We reviewed and directed all decisions, but Claude Code did the heavy lifting on implementation.

What it is NOT:
- Not cloud-based. Everything stays in ~/.aicontext/ on your machine.
- Not a screen recorder. It reads existing local data files already on your machine.
- Not locked to any single agent platform.

This is still early but functional. We'd love for people to try it, tell us what breaks or what's missing, and we'd truly appreciate contributions if this interests you.

GitHub: https://github.com/SophonMe/AIContext

Happy to answer questions here.

Submitted by /u/Cold-Emu-864
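The "single flat table plus read-only SQL interface" design described in the post can be sketched with Python's standard `sqlite3` module. This is an illustration under assumptions, not AIContext's actual schema or code: the `events` table, its columns, the sample rows, and both function names are hypothetical. The read-only part relies on SQLite's URI filename syntax with `mode=ro`.

```python
import sqlite3


def build_demo_db(path):
    """Create a flat table and load a few normalized rows (hypothetical schema)."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE events (ts TEXT, source TEXT, kind TEXT, content TEXT)"
    )
    rows = [
        ("2026-01-10T09:00:00", "browser", "visit", "https://doc.rust-lang.org/book/"),
        ("2026-01-10T09:30:00", "coding_session", "prompt", "explain ownership"),
    ]
    conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    conn.close()


def open_read_only(path):
    """Open the database so that any write attempt fails at the SQLite level."""
    return sqlite3.connect(f"file:{path}?mode=ro", uri=True)


if __name__ == "__main__":
    import os
    import tempfile

    db_path = os.path.join(tempfile.mkdtemp(), "context.db")
    build_demo_db(db_path)
    conn = open_read_only(db_path)
    for row in conn.execute("SELECT source, COUNT(*) FROM events GROUP BY source"):
        print(row)
    conn.close()
```

Enforcing read-only access at the connection level means an agent can be handed arbitrary SQL without any risk of modifying the ingested data.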
I made a Claude skill that builds learning paths from official docs instead of random blog links
Even though Claude is impressive and can do a lot out of the box, I like staying informed about how things actually work under the hood. Even if it's just curiosity, I want to understand the technology I'm using, not just trust the output.

The problem is, whenever I asked AI for learning resources and forgot to specify where I wanted them from, I kept getting random responses from "innovative" sources. A Medium post from 2021. Some guy's YouTube playlist. A paid course recommendation. No structure, no sense of what to read first or whether any of it was current.

So I made a skill called Mentor. Give it a topic, it gives you a phased learning path built mostly from official docs.

The thing I care about: source hierarchy. Official docs first, always. Vendor and maintainer content second. Community posts only when official docs have a real gap, and it has to say why it's including them.

It picks up your background from context too. I said "teach me Rust, I've been writing Go for 3 years" and it skipped the beginner stuff, framed ownership through Go's garbage collector, and ordered the Rust Book chapters in a way that makes sense if you already know systems programming.

Something I haven't seen in other tools: every resource gets tagged with how to approach it. "Read now" means you need this before the next step. "Skim" means get the shape of it. "Hands-on" means clone it and build something. "Bookmark as reference" means you'll want it later but not right now. Most lists just hand you 15 links and say good luck.

Broad topics (Rust, Kubernetes) get a 4-phase structure. Narrow topics (Terraform modules, GitLab CI caching) get compressed. It doesn't force everything into the same shape.

Repo: https://github.com/ayhammouda/mentor

.skill file on the release page - claude skill add mentor.skill. MIT licensed. 4 example outputs in the repo if you want to see what it produces before installing.

Curious about topics where this breaks down, especially where official docs are bad enough that "official first" is the wrong call.

Submitted by /u/ahammouda
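The source hierarchy and approach tags described in the post can be modelled as a small data structure. The sketch below is an assumption about how such rules could be encoded, not the Mentor skill's implementation (which is a skill file, not Python): `Resource`, `TIER_ORDER`, `APPROACH_TAGS`, and `order_resources` are hypothetical names, and the example titles are illustrative.

```python
from dataclasses import dataclass

# Tiers from the post, in priority order: official, then vendor/maintainer, then community.
TIER_ORDER = {"official": 0, "vendor": 1, "community": 2}
# Approach tags from the post.
APPROACH_TAGS = {"read now", "skim", "hands-on", "bookmark as reference"}


@dataclass
class Resource:
    title: str
    tier: str
    approach: str
    justification: str = ""  # required for community sources


def order_resources(resources):
    """Validate each resource, then sort by tier (stable within a tier)."""
    for r in resources:
        if r.tier not in TIER_ORDER:
            raise ValueError(f"unknown tier: {r.tier!r}")
        if r.approach not in APPROACH_TAGS:
            raise ValueError(f"unknown approach tag: {r.approach!r}")
        if r.tier == "community" and not r.justification:
            raise ValueError(f"community source needs a justification: {r.title!r}")
    return sorted(resources, key=lambda r: TIER_ORDER[r.tier])


if __name__ == "__main__":
    path = order_resources([
        Resource("Community blog on lifetimes", "community", "skim",
                 justification="official docs lack a worked example here"),
        Resource("The Rust Book, ch. 4", "official", "read now"),
        Resource("Maintainer talk on async", "vendor", "bookmark as reference"),
    ])
    for r in path:
        print(r.tier, "|", r.approach, "|", r.title)
```

Rejecting a community source without a justification mirrors the rule that such sources appear only when the official docs have a real gap and the reason is stated.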
[P] Dante-2B: I'm training a 2.1B bilingual fully open Italian/English LLM from scratch on 2×H200. Phase 1 done, here's what I've built.
The problem

If you work with Italian text and local models, you know the pain. Every open-source LLM out there treats Italian as an afterthought: English-first tokenizer, English-first data, maybe some Italian sprinkled in during fine-tuning. The result: bloated token counts, poor morphology handling, and models that "speak Italian" the way a tourist orders coffee in Rome. I decided to fix this from the ground up.

What is Dante-2B

A 2.1B parameter, decoder-only, dense transformer. Trained from scratch: no fine-tune of Llama, no adapter on Mistral. Random init to coherent Italian in 16 days on 2× H200 GPUs.

Architecture:
- LLaMA-style with GQA (20 query heads, 4 KV heads, a 5:1 ratio)
- SwiGLU FFN, RMSNorm, RoPE
- d_model=2560, 28 layers, d_head=128 (optimized for Flash Attention on H200)
- Weight-tied embeddings, no MoE; all 2.1B params active per token
- Custom 64K BPE tokenizer built specifically for Italian + English + code

Why the tokenizer matters

This is where most multilingual models silently fail. Standard English-centric tokenizers split l'intelligenza into l, ', intelligenza: 3 tokens for what any Italian speaker sees as 1.5 words. Multiply that across an entire document and you're wasting 20-30% of your context window on tokenizer overhead.

Dante's tokenizer was trained on a character-balanced mix (~42% Italian, ~36% English, ~22% code) with a custom pre-tokenization regex that keeps Italian apostrophe contractions intact. Accented characters (à, è, é, ì, ò, ù) are pre-merged as atomic units; they're always single tokens, not two bytes glued together by luck. Small detail, massive impact on efficiency and quality for Italian text.

Training setup

Data: ~300B token corpus. Italian web text (FineWeb-2 IT), English educational content (FineWeb-Edu), Italian public domain literature (171K books), legal/parliamentary texts (Gazzetta Ufficiale, EuroParl), Wikipedia in both languages, and StarCoderData for code. Everything pre-tokenized into uint16 binary with quality tiers.

Phase 1 (just completed): 100B tokens at seq_len 2048. DeepSpeed ZeRO-2, torch.compile with reduce-overhead, FP8 via torchao. Cosine LR schedule 3e-4 → 3e-5 with 2000-step warmup. ~16 days, rock solid: no NaN events, no OOM, consistent 28% MFU.

Phase 2 (in progress): Extending to 4096 context with 20B more tokens at reduced LR. Should take ~4-7 more days.

What it can do right now

After Phase 1 the model already generates coherent Italian text: proper grammar, correct use of articles, reasonable topic continuity. It's a 2B, so don't expect GPT-4 reasoning. But for a model this size, trained natively on Italian, the fluency is already beyond what I've seen from Italian fine-tunes of English models at similar scale. I'll share samples after Phase 2, when the model has full 4K context.

What's next
- Phase 2 completion (est. ~1 week)
- HuggingFace release of the base model: weights, tokenizer, config, full model card
- SFT phase for instruction following (Phase 3)
- Community benchmarks: I want to test against Italian fine-tunes of Llama/Gemma/Qwen at similar sizes

Why I'm posting now

I want to know what you'd actually find useful. A few questions for the community:
- Anyone working with Italian NLP? I'd love to know what benchmarks or tasks matter most to you.
- What eval suite would you want to see? I'm planning perplexity on held-out Italian text + standard benchmarks, but if there's a specific Italian eval set I should include, let me know.
- Interest in the tokenizer alone? The Italian-aware 64K BPE tokenizer might be useful even independently of the model. Should I release it separately?
- Training logs / loss curves? Happy to share the full training story with all the numbers if there's interest.

About me

I'm a researcher and entrepreneur based in Rome. PhD in Computer Engineering, I teach AI and emerging tech at LUISS university, and I run an innovation company (LEAF) that brings emerging technologies to businesses. Dante-2B started as a research project to prove that you don't need a massive cluster to train a decent model from scratch; you need good data, a clean architecture, and patience.

Everything will be open-sourced. The whole pipeline, from corpus download to tokenizer training to pretraining scripts, will be on GitHub.

Happy to answer any questions. 🇮🇹

Discussion also on r/LocalLLaMA.

Submitted by /u/angeletti89
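The Phase 1 learning-rate schedule mentioned in the post (cosine decay from 3e-4 to 3e-5 with a 2000-step warmup) can be written out as a short function. This is a sketch under assumptions, not the author's training code: the post does not give the batch size or total step count, so `total_steps` is a free parameter here, and the linear shape of the warmup is assumed because the post only states its length.

```python
import math


def lr_at(step, total_steps, peak_lr=3e-4, min_lr=3e-5, warmup_steps=2000):
    """Learning rate at a given optimizer step (0-indexed).

    Assumed linear warmup to peak_lr over warmup_steps, then cosine decay
    from peak_lr down to min_lr by total_steps.
    """
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes 1 -> 0
    return min_lr + (peak_lr - min_lr) * cosine


if __name__ == "__main__":
    total = 100_000  # hypothetical step count, not from the post
    for s in (0, 1_999, 2_000, 51_000, 100_000):
        print(s, f"{lr_at(s, total):.2e}")
```

Halfway through the decay phase the cosine term is 0.5, so the rate sits at the midpoint of the two endpoints, (3e-4 + 3e-5) / 2 = 1.65e-4.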
I built CLI-Anything-WEB, a Claude Code plugin that generates complete Python CLIs for any website (17 CLIs so far: Amazon, Airbnb, TripAdvisor, Reddit, YouTube...)
Point it at a URL, Claude Code captures the live HTTP traffic, and generates a production-grade Python CLI with commands, tests, REPL mode, and --json output, fully automated across 4 phases.

How it works
- Phase 1 (capture): Records live browser traffic via playwright-cli
- Phase 2 (methodology): Analyzes endpoints, designs architecture, generates CLI code
- Phase 3 (testing): Writes unit + E2E tests (40–60+ per CLI, all passing)
- Phase 4 (standards): 3 parallel Claude agents do compliance review, then publishes

17 CLIs generated so far
- No-auth public scraping: Amazon, Airbnb, TripAdvisor, Reddit, YouTube, Hacker News, GitHub Trending, Pexels, Unsplash, ProductHunt, FutBin, Google AI
- Auth-required: NotebookLM, Google AI Studio, Booking.com, ChatGPT, CodeWiki

Example: built Amazon search in one pipeline run

cli-web-amazon search "crash cart adapter" --json
cli-web-amazon bestsellers electronics --json
cli-web-amazon product get B002CLKFTQ --json

Open source

https://github.com/ItamarZand88/CLI-Anything-WEB

The entire pipeline runs inside Claude Code using a 4-phase skill system. Anti-bot bypass is handled with curl_cffi impersonation (Chrome/Safari iOS), so no Playwright is needed at runtime. Each CLI is a standalone pip-installable package.

Happy to answer questions about the skill system, anti-bot patterns, or how the testing phase works.

Submitted by /u/zanditamar
Yes, GitBook AI offers a free tier. Pricing found: $25, $0.20, $65, $249, $12
Key features include: Your product and docs, in sync (finally), The knowledge layer for AI agents, A workflow your entire team will love, A best-in-class editing experience, Team up with a proactive partner.
GitBook AI is commonly used for: Streamlining documentation processes for development teams, Creating centralized knowledge bases for easy access to information, Facilitating collaboration among team members on documentation projects, Providing personalized AI assistance for quick information retrieval, Integrating product knowledge seamlessly into documentation, Enhancing onboarding experiences for new team members.
GitBook AI integrates with: GitHub, Slack, Jira, Trello, Notion, Google Drive, Confluence, Zapier, Figma, Microsoft Teams.
Based on user reviews and social mentions, the most common pain points are: API costs.

Unify Your Entire Team Around Docs
Jan 27, 2026
Based on 29 social mentions analyzed, 24% of sentiment is positive, 72% neutral, and 3% negative.