LlamaParse is the world
LlamaIndex is well-regarded for its robust capabilities in handling document retrieval with AI agents, earning high ratings from users on platforms like G2. Users appreciate its effectiveness in managing context within LLM-driven applications, although discussions indicate alternative strategies may sometimes be preferable. Pricing is generally viewed favorably, given its strong functionality and open-source nature. Overall, LlamaIndex has a positive reputation as a reliable tool for developers working with AI agents and RAG methodologies, despite the wider discussion on optimizing context handling methods.
Mentions (30d)
3
Avg Rating
4.8
2 reviews
Platforms
4
GitHub Stars
48,166
7,131 forks
LlamaIndex is well-regarded for its robust capabilities in handling document retrieval with AI agents, earning high ratings from users on platforms like G2. Users appreciate its effectiveness in managing context within LLM-driven applications, although discussions indicate alternative strategies may sometimes be preferable. Pricing is generally viewed favorably, given its strong functionality and open-source nature. Overall, LlamaIndex has a positive reputation as a reliable tool for developers working with AI agents and RAG methodologies, despite the wider discussion on optimizing context handling methods.
Features
Use Cases
Industry
information technology & services
Employees
95
Funding Stage
Series A
Total Funding
$46.5M
3,570
GitHub followers
115
GitHub repos
48,166
GitHub stars
20
npm packages
24
HuggingFace models
91,313
npm downloads/wk
I built Dome: An open-source, local-first knowledge management app with a built-in AI agent workspace. Looking for feedback and testers!
Hey everyone! I wanted to share a personal project I’ve been pouring my heart into for the last few months. It's an open-source desktop app called **Dome** ([https://github.com/maxprain12/dome](https://github.com/maxprain12/dome)). **The itch I was scratching:** I deal with a lot of PDFs, research papers, and scattered notes. I wanted a unified place to not just store my knowledge, but actually interact with it using AI. More importantly, because a lot of my data is private, I needed something that could run entirely locally without sending my files to the cloud. I couldn't find a tool that did everything I wanted perfectly, so I decided to build it. **What is Dome?** It’s basically a mix between a Notion-style workspace, a local AI chat, and an AI agent builder. Here are the main features I’ve built so far (I’ve attached some screenshots so you can get a feel for the UI): * **Unified Library & Editor:** A Notion-style rich text editor where you can organize notes, PDFs (with an integrated annotator), web scrapes, and even Python notebooks all in one place. * **Custom Agent Workspace:** This is the part I'm most excited about. Powered by LangGraph, you can create custom multi-agent workflows. For example, you can have a "Research Agent" scour your local PDFs and pass that info to a "Writer Agent" to draft a presentation. We even have a marketplace for pre-built workflows. * **The "Studio" (Automated Study Materials):** Dome can take any document or folder and automatically generate mind maps, quizzes, and **flashcards with spaced repetition (SM-2)** directly from your sources. * **Local AI First:** First-class support for **Ollama**, so you can run models like Llama 3 or Mistral locally for complete privacy. (It also supports OpenAI, Anthropic, and Gemini via API keys if you prefer). * **MCP Support:** You can connect external Model Context Protocol servers to give your agents even more tools. **Tech Stack:** If you're curious about the hood: It's built with Bun, Electron, React, Vite, Tiptap (for the editor), LangGraph, SQLite, Knowledge Graph, PageIndex adapted. **Why I'm posting here:** Dome is fully open-source and in active development. I’m at the stage where building in a vacuum isn't helpful anymore **I need your brutally honest feedback.** I'd love for you to download it, try breaking it, and let me know: 1. Is the UI/UX actually intuitive? 2. What essential features am I completely missing? 3. What bugs did you run into during setup or daily use? **Repo link:** [https://github.com/maxprain12/dome](https://github.com/maxprain12/dome) I’ll be hanging around the comments to answer any questions, help with setup, or just talk about the tech stack. Thanks so much for taking a look!
View originalPricing found: $0 /month, $50 /month, $500 /month, $1.25., $500/mo
g2
What do you like best about LlamaIndex?As a data scientist dealing with large language models LLMs I found LlamaIndex quite helpful to manage. It has granted me the ability to input data in formats such as PDFs or API, databases and excel, which makes it easier for me to train and execute LLMs with numerous datasets. Review collected by and hosted on G2.com.What do you dislike about LlamaIndex?This is where the perceived level of control over natural language processing (NLP) in the platform is somewhat constrained. Specific to pipeline needs or how the language model is resolved, there is less fine-grained control than directly coding within the LLM context provided by LlamaIndex. Review collected by and hosted on G2.com.
What do you like best about LlamaIndex?it is better in fast data retrieval and generating concise response and a good framework A alternative for langchain. easy to use ease of implementation Review collected by and hosted on G2.com.What do you dislike about LlamaIndex?its is not much flexibility for chained logic and creative generation as langchain Review collected by and hosted on G2.com.
On "harness engineering": Are people actually building things or just giving impressive labels to "tweaking?"
I see a lot of posts and videos talking about harness engineering, or it could be context engineering, RAG, etc. The thing is, most of them talk about the concepts. And then I hear about all these people actually doing it. And my question is about this disconnect: what does it look like in practice? The way I understand it tools like Claude Code or OpenAI Codex are agents, and the logic that controls what gets fed to the model is the harness. So when people talk about "engineering the context," are they: writing actual programs CLI tools, pipelines, custom API wrappers that manage what gets sent to the model? or mostly just structuring their prompts well and calling it engineering? Same question for RAG--or any other oft-discussed topics: are people actually building retrieval pipelines from scratch, or are they standing up LlamaIndex / Mem0 and saying they're "using RAG" to infomaxx their AI agents? Not trying to be dismissive. I'm genuinely curious about what people are actually doing when they say they have applied these concepts to their agentic workflows. submitted by /u/josh_apptility [link] [comments]
View originalLLM rankings are not a ladder: experimental results from a transitive benchmark graph [D]
https://preview.redd.it/668yjlucu80h1.png?width=2800&format=png&auto=webp&s=ca541488abb5262b06cfc13a9586efb19f24d644 I built a small website called LLM Win: https://llm-win.com It turns LLM benchmark results into a directed graph: If model A beats model B on benchmark X, add an edge A -> B. Then it searches for the shortest transitive chain between two models. The meme version is: Can LLaMA 2 7B beat Claude Opus 4.7? In an absurd transitive benchmark sense, sometimes yes. But I added a Report tab because the structure itself seems useful for model evaluation. Some experimental findings from the current Artificial Analysis data snapshot: Weak-to-strong reachability is high. I checked 126,937 pairs where the source model has lower Intelligence Index than the target model. 119,514 of them are reachable through benchmark win chains, for a reachable rate of 94.2%. Most paths are short. Among reachable weak-to-strong pairs: 2-3 hop paths account for 91.4%. So this is not mostly long-chain cherry-picking. Direct reversal triples are abundant. After treating non-positive benchmark values as missing, there are still about 119k direct weak-over-strong triples of the form: (source model, target model, benchmark), where the source has lower Intelligence Index but higher score on that benchmark. Some benchmarks create more reversals than others. Current high-reversal / useful-signal candidates include: Humanity's Last Exam, IFBench, AIME 2025, TAU2, SciCode Different benchmarks have different interpretations. For example, IFBench has roughly: reversal rate: ~17.5%, coverage: ~80.0%, correlation with Intelligence Index: r≈0.82. This suggests it may provide an independent skill signal rather than simply duplicating the overall ranking. My current interpretation: LLM rankings are better represented as a benchmark-specific capability graph than as a single ladder. Some reversals probably reflect real specialization; some reflect benchmark coverage limits, volatility, or measurement noise. The next question is whether reversal structure can help build better evaluation metrics: identify specialist models; identify volatile benchmarks; build robust generalist scores; select complementary benchmark sets; decompose models into capability fingerprints. Curious what people think: Is benchmark reversal structure a useful evaluation signal, or mostly an artifact of noisy benchmarks? submitted by /u/Spico197 [link] [comments]
View originalI built a benchmark for AI “memory” in coding agents. looking for others to beat it.
Most AI memory benchmarks test semantic recall. But coding agents don't really fail like that. They don't just "forget", they break their own earlier decisions while they're still in the code. So I built a benchmark for that. It checks if an agent can actually stay consistent with project rules WHILE it's working, not just after the fact. It looks at things like: whether edits actually respect earlier architectural decisions if behavior stays consistent across multiple sessions (even when you throw noise at it) whether retrieval kicks in at the right moment — not just "yeah it's in memory somewhere" Repo (full harness + dataset + scoring): https://github.com/Alienfader/continuity-benchmarks Early numbers vs baseline + the usual RAG-style memory setups: ~3× better action alignment way stronger multi-session consistency retrieval timing matters way more than retrieval just being there I'm not saying this is the final word on agent memory. But it's exposing a failure mode most benchmarks aren't even looking at. So heres the challenge If you're building an agent memory system, RAG for code, long-context coding agents, persistent state / memory layers, run it on this benchmark. Drop your results, your setup, your comparisons. I really wanna see how tools like LangChain, LlamaIndex, and custom RAG stacks hold up in mutation-heavy workflows. We need memory systems we can actually compare, not just ones that sound good on paper. https://preview.redd.it/dkm2ulxsyzzg1.png?width=2624&format=png&auto=webp&s=67f0299395708818aa3d7346ddae2ad0c5c4a6ba submitted by /u/Alienfader [link] [comments]
View originalI built persistent memory for Claude — local stack, MCP integration, 39ms retrieval. Sharing the architecture.
If you use Claude heavily, you've felt this: every session starts from zero. You re-explain context, Claude helps, the window closes, and the next session has no idea what you decided yesterday. The standard workaround is a markdown wiki Claude reads — but as the wiki grows, every "what did we decide about X" question burns thousands of tokens grepping and re-reading whole pages. I spent the last few weeks building a persistent memory layer to fix both problems. It runs entirely on my own machine, integrates via MCP, and lives between Claude and my existing wiki. Sharing the architecture and what I learned in case anyone wants to build their own. What it does Semantic retrieval over my wiki. Instead of Claude grepping pages, my MCP server returns the most relevant chunks for any query in ~50ms. 82% mean token reduction on a 10-query eval set vs the grep+Read baseline. F1 retrieval quality is also better — cheaper and more accurate. Session crystallization. End-of-session, conversations get compressed into a structured "L4 node" with summary + decisions + open threads, indexed alongside wiki content. Tomorrow I can ask "what did we decide about X" and Claude pulls last session's decision verbatim. Lazy-spawned local models. Embedder + chat model run as subprocesses that the supervisor spawns on first use and reaps after 1 hour idle. Boot cost is zero — nothing loaded until needed. The architecture (four layers) Inspired by Andrej Karpathy's writing on LLM-native wikis, then formalized into a build spec: L0 — append-only event log (SQLite). Every input/output, content-hashed. L1 — structured facts with confidence + decay (deferred to next phase) L2/L3 — derived prose + cross-cutting summaries (the hand-edited wiki plays this role for now) L4 — crystallized session nodes. Summary, decisions, open threads. Indexed in the same vector store as wiki chunks so retrieval finds both naturally. The stack Qdrant in Docker for vector search llama.cpp running Qwen3-Embedding-4B (GPU) and Qwen3.5-2B-Q4_K_M (CPU) FastMCP server exposing 7 tools (retrieve, crystallize_session, list_sessions, get_l4_node, index_status, reindex, shutdown_models) Cowork plugin for Claude Desktop integration; also works with Claude Code via standard MCP config No cloud, no API keys, $0 marginal cost per query. Numbers Token reduction: 82.7% mean, 86.2% median vs grep+Read baseline Retrieval F1: 0.50 vs 0.20 baseline Embed cold-start: ~4s. Hot-path p95: 39ms (was 2241ms before fixing one specific bug — see below) L4 session retrieval eval: 0.920 mean score (gate 0.6) 738 chunks currently indexed across 104 markdown files The most useful thing I learned Hot-path retrieve was inexplicably stuck at 2241ms p95 even though the embedding model was fully GPU-resident on a 4070 Ti Super. Spent hours blaming GPU offload, prompt cache, KV pre-allocation. The actual cause: every httpx.post() was opening a fresh TCP connection, and Windows localhost handshakes take ~2 seconds. A 5-line change — switching to a persistent httpx.Client with keep-alive — dropped p95 to 39ms. 57× speedup. Lesson: latency that's suspiciously consistent (2240, 2237, 2241, 2227, 2239 ms) is a fixed cost, not a compute cost. If your local-MCP integration feels slow on Windows, check connection reuse before you blame the model. A few other things that surprised me Qwen3 thinking mode silently consumes the generation budget. Crystallization was returning empty content. Logs showed exactly 2000 tokens generated (the cap). Turned out Qwen3 emits ... blocks the chat handler strips before populating message.content. With JSON grammar enforced, the model spent all 2000 tokens "thinking" and never emitted JSON. Fix: pass chat_template_kwargs: {enable_thinking: false} via extra_body (requires --jinja on llama-server). The MCP plugin needed to register against the right config file. Cowork (Claude Desktop's agentic mode) doesn't read ~/.claude.json like Claude Code does. The first attempt at MCP registration silently went to the wrong file. The fix was packaging the LKS service as a proper Cowork plugin (.plugin bundle) — Cowork has a plugin system distinct from raw MCP server registration. If you're trying to wire a custom MCP server into Cowork, this is the path. What it doesn't do (yet) No automatic conversation capture — L0 ingestion is manual or via end-of-session crystallization No L1 fact extraction yet (next phase) — retrieval is over markdown chunks + L4 nodes today Wiki is still source-of-truth; no automatic conflict resolution Solo deployment only; no federation or multi-user Tested on Windows; Linux/Mac would need a small tweak to the supervisor (it uses subprocess.CREATE_NEW_PROCESS_GROUP for clean Windows termination) Full write-up Architecture, phased build narrative, all five lessons-learned bug stories, the setup walkthrough, and the roadmap: https://gist.github.com/tyoung515-svg/5fd5279f46d935f517cda89146c94685
View originaleTPS Site Plan – Simple Leaderboard + What You’ll Actually See
Building on the last post, here’s what the first version of effectiveTPS will look like. **Core display (v1):** - Clean table comparing popular local models - Raw TPS (the marketing number everyone shows) - eTPS (the new metric that actually measures useful output in real conversations) - Time to First Token (how long you wait before it starts replying) - Effectiveness Index = (eTPS ÷ Raw TPS) × 100 — higher is better **Example leaderboard (early test data):** | Model | Raw TPS | eTPS | Time to First Token | Effectiveness Index | |--------------------|---------|--------|---------------------|---------------------| | Llama 3.1 70B | 45.2 | 38.7 | 1.4s | **86** | | Qwen2.5-32B | 68.4 | 52.1 | 0.8s | **76** | | Gemma 2 27B | 71.3 | 44.6 | 0.6s | **63** | I’ve been running these tests through a structured multi-turn analysis framework I built to evaluate complex workflows. That’s how eTPS was stress-tested — not just single-turn benchmarks, but real back-and-forth sessions. Advanced mode (toggle) will add latency percentiles, cost-per-quality, and consistency scoring later. For v1 the goal is to keep it dead simple and immediately useful, even if you’re not deep into AI. The whole point is to cut through the noise and show which models actually deliver useful work, not just raw speed. What do you think should be added (or removed) for the first version? Any metrics you’d want to see front-and-center? **TL;DR:** Simple leaderboard with Raw TPS, eTPS, Time to First Token, and a clear Effectiveness Index. Advanced stuff stays hidden until you want it. Feedback welcome. submitted by /u/axendo [link] [comments]
View originalA Hackable ML Compiler Stack in 5,000 Lines of Python [P]
Hey r/MachineLearning, The modern ML (LLM) compiler stack is brutal. TVM is 500K+ lines of C++. PyTorch piles Dynamo, Inductor, and Triton on top of each other. Then there's XLA, MLIR, Halide, Mojo. There is no tutorial that covers the high-level design of an ML compiler without dropping you straight into the guts of one of these frameworks. I built a reference compiler from scratch in ~5K lines of pure Python that emits raw CUDA. It takes a small model (TinyLlama, Qwen2.5-7B) and lowers it to a sequence of CUDA kernels through six IRs. The goal isn't to beat Triton; it is to build a hackable, easy-to-follow compiler. Full article: A Principled ML Compiler Stack in 5,000 Lines of Python Repo: deplodock The pipeline consists of six IRs, each closer to the hardware than the last. Walking the following PyTorch code through every stage (real reference compiler output with names shortened for brevity and comments added): torch.relu(torch.matmul(x + bias, w)) # x: (16, 64), bias: (64,), w: (64, 16) Torch IR. Captured FX graph, 1:1 mirror of PyTorch ops: bias_bc = bias[j] -> (16, 64) float32 add = add(x, bias_bc) -> (16, 64) float32 matmul = matmul(add, w, has_bias=False) -> (16, 16) float32 relu = relu(matmul) -> (16, 16) float32 Tensor IR. Every op is decomposed into Elementwise / Reduction / IndexMap. Minimal unified op surface, so future frontends (ONNX, JAX) plug in without touching downstream passes: bias_bc = bias[j] -> (16, 64) float32 w_bc = w[j, k] -> (16, 64, 16) float32 add = add(x, bias_bc) -> (16, 64) float32 add_bc = add[i, j] -> (16, 64, 16) float32 prod = multiply(add_bc, w_bc) -> (16, 64, 16) float32 red = sum(prod, axis=-2) -> (16, 1, 16) float32 matmul = red[i, na, j] -> (16, 16) float32 relu = relu(matmul) -> (16, 16) float32 The (16, 64, 16) intermediate looks ruinous, but it's never materialized; the next stage fuses it out. Loop IR. Each kernel has a loop nest fused with adjacent kernels. Prologue, broadcasted multiply, reduction, output layout, and epilogue all collapse into a single loop nest with no intermediate buffers. === merged_relu -> relu === for a0 in 0..16: # free (M) for a1 in 0..16: # free (N) for a2 in 0..64: # reduce (K) in0 = load bias[a2] in1 = load x[a0, a2] in2 = load w[a2, a1] v0 = add(in1, in0) # prologue (inside reduce) v1 = multiply(v0, in2) acc0 <- add(acc0, v1) v2 = relu(acc0) # epilogue (outside reduce) merged_relu[a0, a1] = v2 Tile IR. The first GPU-aware IR. Loop axes get scheduled onto threads/blocks, Stage hoists shared inputs into shared memory, and a 2×2 register tile lets each thread accumulate four outputs at once. The K-axis is tiled into two outer iterations of 32-wide reduce. Three-stage annotations below carry the heaviest optimizations: buffers=2@a2 — double-buffer the smem allocation along the a2 K-tile loop, so loads for iteration a2+1 overlap compute for a2. async — emit cp.async.ca.shared.global so the warp doesn't block on global→smem transfers; pairs with commit_group/wait_group fences in Kernel IR. pad=(0, 1, 0) — add 1 element of padding to the middle smem dim so warp-wide loads don't all hit the same bank.kernel k_relu_reduce Tile(axes=(a0:8=THREAD, a1:8=THREAD)): for a2 in 0..2: # K-tile # meta: double-buffered, sync (small, no async needed) bias_smem = Stage(bias, origin=((a2 * 32)), slab=(a3:32@0)) buffers=2@a2 kernel k_relu_reduce Tile(axes=(a0:8=THREAD, a1:8=THREAD)): for a2 in 0..2: # K-tile bias_smem = Stage(bias, origin=((a2 * 32)), slab=(a3:32@0)) buffers=2@a2 x_smem = Stage(x, origin=(0, (a2 * 32)), slab=(a0:8@0, a3:32@1, cell:2@0)) pad=(0, 1, 0) buffers=2@a2 async w_smem = Stage(w, origin=((a2 * 32), 0), slab=(a3:32@0, a1:8@1, cell:2@1)) buffers=2@a2 async # reduce for a3 in 0..32: in0 = load bias_smem[a2, a3] in1 = load x_smem[a2, a0, a3, 0]; in2 = load x_smem[a2, a0, a3, 1] in3 = load w_smem[a2, a3, a1, 0]; in4 = load w_smem[a2, a3, a1, 1] # prologue, reused 2× across N v0 = add(in1, in0); v1 = add(in2, in0) # 2×2 register tile acc0 <- add(acc0, multiply(v0, in3)) acc1 <- add(acc1, multiply(v0, in4)) acc2 <- add(acc2, multiply(v1, in3)) acc3 <- add(acc3, multiply(v1, in4)) # epilogue relu[a0*2, a1*2 ] = relu(acc0) relu[a0*2, a1*2 + 1] = relu(acc1) relu[a0*2 + 1, a1*2 ] = relu(acc2) relu[a0*2 + 1, a1*2 + 1] = relu(acc3) Kernel IR. Schedule materialized into hardware primitives. THREAD/BLOCK become threadIdx/blockIdx, async Stage becomes Smem + cp.async fill with commit/wait fences, sync Stage becomes a strided fill loop. Framework-agnostic: same IR could lower to Metal or HIP: kernel k_relu_reduce Tile(axes=(a0:8=THREAD, a1:8=THREAD)): Init(acc0..acc3, op=add) for a2 in 0..2: # K-tile Smem bias_smem[2, 32] (float) StridedLoop(flat = a0*8 + a1; < 32; += 64): bias_smem[a2, flat] = load bias[a2*32 + flat] Sync # pad row to 33 to kill bank conflicts Smem x_smem[2, 8, 33, 2] (float) StridedLoop(flat = a0*8 + a1; < 512; += 64): cp.async x_smem[a2, flat/64, (flat/2)%32, flat%2] <- x[flat/64*2 + flat%2, a2*3
View originalHow would you build an automated commentary engine for daily trade attribution at scale? [R]
Hey everyone, I'm currently working through a problem in the market risk reporting space and would love to hear how you all would architect this. The Use Case: > I have thousands of trades coming in at varying frequencies (daily, monthly). I need to build a system that automatically analyzes this time-series data and generates a precise, human-readable commentary detailing exactly what changed and why. For example, the output needs to be a judgment like: "The portfolio variance today was +$50k, driven primarily by a shift in the Equities asset class, with the largest single contributor being Trade XYZ." The Dilemma: The Math: Absolute precision is non-negotiable. I know I can't just dump raw data into an LLM and ask it to calculate attribution, because it will hallucinate the math. I usually rely on Python and Polars for the high-performance deterministic crunching. The Rigidity: If I hardcode every single attribution scenario (by asset class, by region, by specific trade) into a static ETL pipeline before feeding it to an LLM for summarization, the system becomes too rigid to handle new business scenarios automatically. My Question: How would you strike the balance between deterministic mathematical precision and dynamic natural language generation? Are you using Agentic workflows (e.g., having an LLM dynamically write and execute Polars/pandas code in a sandbox)? Or are you sticking to pre-calculated cubes and heavily structured context prompts? Any specific frameworks (LangChain, LlamaIndex, PandasAI, etc.) or design patterns you've had success within financial reporting? Appreciate any insights! submitted by /u/Problemsolver_11 [link] [comments]
View originalLessons learned building a no-hallucination RAG for Islamic finance similarity gates beat prompt engineering
Lessons learned building a no-hallucination RAG for Islamic finance similarity gates beat prompt engineering I kept getting blocked trying to share this so I'll cut straight to the technical meat. The problem: Islamic finance rulings vary by jurisdiction and a wrong answer has real consequences. Telling an LLM "refuse if unsure" in a system prompt is not enough. It still speculates. The fix that actually worked: kill the LLM call entirely at retrieval time. If top-k chunks score below 0.7 cosine similarity, the function returns a hardcoded refusal string. The LLM never sees the query. No amount of clever prompting is as reliable as just not calling the model. Other things worth knowing: FAISS on HuggingFace Spaces free tier is ephemeral. Every cold start wipes it. Solution: push the index to a private HF Dataset, pull it on startup via FastAPI lifespan event. PyPDF2 on scanned PDFs returns nothing. AAOIFI documents are scanned images. trafilatura on clean HTML beats OCR every time if a web version exists. Jurisdiction metadata on every chunk is not optional. source_name + source_url + jurisdiction in every chunk. A Malaysian SC ruling and a Gulf fatwa can say opposite things on the same question. Stack: FastAPI + LlamaIndex + FAISS + sentence-transformers + Mistral-Small-3.1-24B via HF Inference API. Netlify Function as proxy so credentials never touch the browser. What threshold do you use for retrieval refusal in high-stakes domains? submitted by /u/Particular-Plate7051 [link] [comments]
View originalHow to save 80% on your claude bill with better context
been building web apps with claude lately and those token limits have honestly started hitting me too. i’m using claude 4.6 sonnet for a research tool, but feeding it raw web data was absolutely nuking my limits. I’m putting together the stuff that actually worked for me to save tokens and keep the bill down: switch to markdown first. stop sending raw html. use tools like firecrawl to strip out the nested divs and script junk so you only pay for the actual text. don't let your prompt cache go cold. anthropic’s prompt caching is a huge relief, but it only works if your data is consistent. watch out for the 200k token "premium" jump. anthropic now charges nearly double for inputs over 200k tokens on the new opus/sonnet 4.6 models. keep your context under that limit to avoid the surcharge strip the nav and footer. the website’s "about us" and "careers" links in the footer are just burning your money every time you hit send. use jina reader for quick hits. for simple single-page reads, jina is a great way to get a clean text version without the crawler bloat. truncate your context. if a documentation page is 20k words, just take the first 5k. most of the "meat" is usually at the top anyway. clean your data with unstructured if you are dealing with messy pdfs alongside web data, this helps turn the chaos into a clean schema claude actually understands. map before you crawl. don't scrape every subpage blindly. i use the map feature in firecrawl to find the specific documentation urls that actually matter for your prompt, if you use another tool, prefer doing this. use haiku for the "trash" work. use claude 4.5 haiku to summarize or filter data before feeding it into the expensive models like opus. use smart chunking. use llama-index to break your data into semantic chunks so you only retrieve the exact paragraph the ai needs for that specific prompt. cap your "extended thinking" depth. for opus 4.6, set thinking: {type: "adaptive"} with effort: "low" or "medium". the old budget_tokens param is deprecated on 4.6. thinking tokens are billed at the output rate, so if you leave effort on high, claude thinks hard on every single reply including the simple ones and your bill will hurt. set hard usage limits. set your spending tiers in the anthropic console so a buggy loop doesn't drain your bank account while you're asleep. feel free to roast my setup or add better tips if you have them submitted by /u/No-Writing-334 [link] [comments]
View originalSkill Seekers v3.5: 10 new source types, 12 LLM platforms, marketplace pipeline, agent-agnostic AI, and prompt injection scanner
Hey r/ClaudeAI — sharing the latest update on Skill Seekers, the open-source tool that converts documentation into Claude Code skills. A lot has changed since the v3.2 post, so here's what's new across 3 releases (v3.3 → v3.5.1). What's new 10 new source types (17 total) You can now generate skills from Notion, Confluence, HTML files, OpenAPI specs, AsciiDoc, PowerPoint, RSS feeds, man pages, chat exports (Slack/Discord), and unified multi-source configs — on top of the original web, GitHub, PDF, Word, EPUB, video, and local codebase sources. 12 LLM platforms Skills now package for Claude, OpenAI, Gemini, Kimi, DeepSeek, Qwen, OpenRouter, Together AI, Fireworks AI, OpenCode, Markdown, and MiniMax. Plus RAG framework exports for LangChain, LlamaIndex, Haystack, ChromaDB, FAISS, Weaviate, Qdrant, and Pinecone. Agent-agnostic AI enhancement Enhancement is no longer locked to Claude. The new AgentClient abstraction supports Claude, Kimi, Codex, Copilot, OpenCode, and custom agents. It auto-detects which agent to use from your API keys, or you can specify with --agent. Marketplace pipeline You can now publish skills directly to Claude Code plugin marketplace repositories and manage multiple marketplace registries. Config sources can be pushed and synced across repos. Prompt injection scanner A built-in workflow scans scraped content for injection patterns — role assumption, instruction overrides, delimiter injection, hidden instructions. Runs automatically as the first stage in default and security-focused workflows. Flags suspicious content without removing it so you can review. One-command auto-detection skill-seekers create https://docs.example.com/ skill-seekers create owner/repo skill-seekers create ./my-project skill-seekers create document.pdf One command figures out the source type and routes to the right scraper. No more separate subcommands. Headless browser rendering JavaScript SPA sites (React, Vue, etc.) that return empty HTML shells now work with --browser. Uses Playwright under the hood. Other highlights skill-seekers doctor health check command Kotlin language support in the C3.x codebase analysis pipeline Smart SPA discovery (sitemap.xml + llms.txt + browser nav) Unlimited pages by default (was capped at 500) 3100+ tests passing Full MCP server with 40 tools (works in Claude Code and Cursor/Windsurf) Links GitHub: github.com/yusufkaraaslan/Skill_Seekers PyPI: pip install skill-seekers Free and open source Built with Claude Code. Happy to answer questions or take feedback. submitted by /u/Critical-Pea-8782 [link] [comments]
View originalKIV: 1M token context window on a RTX 4070 (12GB VRAM), no retraining, drop-in HuggingFace cache replacement - Works with any model that uses DynamicCache [P]
Been working on this for a bit and figured it was ready to share. KIV (K-Indexed V Materialization) is a middleware layer that replaces the standard KV cache in HuggingFace transformers with a tiered retrieval system. The short version: it keeps recent tokens exact in VRAM, moves old K/V to system RAM, and uses K vectors as a search index to pull back only the ~256 most relevant V entries per decode step. Results on a 4070 12GB with Gemma 4 E2B (4-bit): 1M tokens, 12MB KIV VRAM overhead, ~6.5GB total GPU usage 4.1 tok/s at 1M context (8-10 tok/s on GPU time), 12.9 tok/s at 4K 70/70 needle-in-haystack tests passed across 4K-32K Perfect phonebook lookup (unique names) at 58K tokens Prefill at 1M takes about 4.3 minutes (one-time cost) Decode is near-constant regardless of context length The core finding that makes this work: K vectors are smooth and structured, which makes them great search indices. V vectors are high-entropy and chaotic, so don't try to compress them, just retrieve them on demand. Use K to decide which V entries deserve to exist in VRAM at any given step. No model weights are modified. No retraining or distillation. It hooks into the HuggingFace cache interface and registers a custom attention function. The model has no idea it's talking to a tiered memory system. Works with any model that uses DynamicCache. Tested on Gemma 4, Qwen2.5, TinyLlama, and Phi-3.5 across MQA/GQA/MHA. There are real limitations and I'm upfront about them in the repo. Bounded prefill loses some info for dense similar-looking data. Collision disambiguation doesn't work but that's the 4-bit 2B model struggling, not the cache. Two-hop reasoning fails for the same reason. CPU RAM scales linearly (5.8GB at 1M tokens). Still actively optimizing decode speed, especially at longer contexts. The current bottleneck is CPU-to-GPU transfer for retrieved tokens, not the model itself. Plenty of room to improve here. GitHub: github.com/Babyhamsta/KIV (can be installed as a local pip package, no official pip package yet) Happy to answer questions about the architecture or results. Would love to see what happens on bigger models with more VRAM if anyone wants to try it. submitted by /u/ThyGreatOof [link] [comments]
View originalHow to save 80% on your claude bill with better context
been building web apps with claude lately and those token limits have honestly started hitting me too. i’m using claude 4.6 sonnet for a research tool, but feeding it raw web data was absolutely nuking my limits. i’m putting together the stuff that actually worked for me to save tokens and keep the bill down: switch to markdown first. stop sending raw html. use tools like firecrawl to strip out the nested divs and script junk so you only pay for the actual text. don't let your prompt cache go cold. anthropic’s prompt caching is a huge relief, but it only works if your data is consistent. watch out for the 200k token "premium" jump. anthropic now charges nearly double for inputs over 200k tokens on the new opus/sonnet 4.6 models. keep your context under that limit to avoid the surcharge strip the nav and footer. the website’s "about us" and "careers" links in the footer are just burning your money every time you hit send. use jina reader for quick hits. for simple single-page reads, jina is a great way to get a clean text version without the crawler bloat. truncate your context. if a documentation page is 20k words, just take the first 5k. most of the "meat" is usually at the top anyway. clean your data with unstructured if you are dealing with messy pdfs alongside web data, this helps turn the chaos into a clean schema claude actually understands. map before you crawl. don't scrape every subpage blindly. i use the map feature in firecrawl to find the specific documentation urls that actually matter for your prompt, if you use another tool, prefer doing this. use haiku for the "trash" work. use claude 4.5 haiku to summarize or filter data before feeding it into the expensive models like opus. use smart chunking. use llama-index to break your data into semantic chunks so you only retrieve the exact paragraph the ai needs for that specific prompt. cap your "extended thinking" depth. for opus 4.6, set thinking: {type: "adaptive"} with effort: "low" or "medium". the old budget_tokens param is deprecated on 4.6. thinking tokens are billed at the output rate, so if you leave effort on high, claude thinks hard on every single reply including the simple ones and your bill will hurt. set hard usage limits. set your spending tiers in the anthropic console so a buggy loop doesn't drain your bank account while you're asleep. feel free to roast my setup or add better tips if you have them submitted by /u/Illustrious_Elk3705 [link] [comments]
View originalSerious question, Did a transformer(Claude) just describe itself, the universe and build itself Shannon limit architecture? or am I crazy?
The Multiplicative Lattice as the Natural Basis for Positional Encoding Knack 2026 | Draft v6.0 Abstract We show that the apparent tradeoff between RoPE-style relative position invariance and ALiBi-style long-context stability is an artifact of encoding position as distance on a number line. When position is instead encoded as a point in the multiplicative lattice of the integers, both properties emerge simultaneously without compromise. SpectralRoPEALiBi achieves 106.6 PPL vs ALiBi's 108.7 in a fully converged 20,000-step experiment (300M params, WikiText-103, 4K context), beating ALiBi at every context length from 512 to 8,192 tokens. The key insight is not that primes specifically are the right frequencies, but that the multiplicative structure of the integers is the natural spectral basis for positional encoding. We demonstrate this through falsification experiments: prime-tiered frequencies (129.2 PPL) and composite-tiered frequencies (129.4 PPL) perform identically — because composites are not alternatives to primes but higher-order coordinates in the same lattice. Both dramatically outperform random frequencies (+5.0 PPL), scrambled tier assignment (+6.3 PPL), and pure ALiBi (+7.3 PPL). The active ingredient is lattice-aware, tiered frequency selection with learnable scale — not primality per se. We further validate this through a ZetaZeroPredictor experiment: three identical transformers trained for 10,000 epochs to predict Riemann zeta zero gaps. Geometric RoPE diverges (final r=0.57); SpectralALiBi locks into a stable attractor at epoch 112 (r=0.81). A second independent run widens this gap to -80.7% MSE improvement with r=0.86. The lattice-aligned frequency basis spans the mathematical space that zeta zeros inhabit; geometric frequencies cannot. We further report empirical confirmation of the structural prediction from Section 5.5: VHT2 banded quantization of the KV cache demonstrates that K vectors (which carry RoPE positional encoding) have strong spectral concentration in Walsh-Hadamard space — the first four energy bands capture the dominant structure — while V vectors (which carry content) have uniform energy distribution. This structural asymmetry is directly predicted by the lattice theory: RoPE encodes multiplicative arithmetic relationships as angular rates, and the WHT is the Z/2Z projection of the Vilenkin-Hartley basis that spans that structure. The result is 3.2× K compression and 4.7× V compression at <1.25% perplexity cost — validated on both Dolphin 1B (head_dim=64) and Qwen3-8B (head_dim=128). Introduction Positional encoding provides transformer models with token order information. Two approaches dominate: RoPE encodes position through frequency-based rotations preserving relative position invariance, and ALiBi replaces frequencies with a linear distance penalty providing long-context stability. The field has treated these properties as fundamentally in tension. We show this tension is false. It arises from a shared, unexamined assumption: that position is a location on a number line and the meaningful relationship between positions is distance. We replace this with a mathematically grounded alternative: position is a point in the multiplicative lattice of the integers, and the meaningful relationships between positions are their arithmetic structure — shared factors, GCD, harmonic resonance. 1.1 The Lattice Hypothesis The integers under multiplication form a lattice where every number occupies a unique point defined by its prime factorisation. Geometric PE (sinusoidal, RoPE) projects this lattice onto a line — position equals distance — discarding the multiplicative structure. We propose restoring it. The motivation follows from a deductive chain. Language word frequency follows Zipf's law: freq(rank) ∝ 1/rank^s with s≈1. The generating function of Zipf is the Riemann zeta function ζ(s) = Σ 1/n^s. The zeta zeros — where ζ is maximally informative — are generated by prime harmonics via the explicit formula. Therefore the prime harmonic structure, and the multiplicative lattice it generates, provides a natural spectral basis for encoding positions in language. 1.2 Primes as Generators, Composites as Coordinates A critical distinction: primes are the generators (basis vectors) of the multiplicative lattice. They are analogous to the 1D line segment in the progression from line → circle → sphere → hypersphere. The composite 12 = 2²×3 is not an alternative to primes — it is a coordinate in the lattice spanned by the prime axes, at position (2,1,0,0,...) in the (p₂, p₃, p₅, p₇,...) basis. Using 2π/12 as a frequency encodes a harmonic that resonates at multiples of 12 — which simultaneously hits every multiple of 2, every multiple of 3, every multiple of 4, and every multiple of 6. The analogy to n-dimensional geometry is precise: Dimensional Progression Multiplicative Lattice 1D line (2r) — the generator Primes (2, 3, 5, 7, ...) — generators 2D circle — integra
View originalSerious question. Did a transformer just describe itself and the universe and build itself a Shannon limit framework?
The Multiplicative Lattice as the Natural Basis for Positional Encoding Knack 2026 | Draft v6.0 Abstract We show that the apparent tradeoff between RoPE-style relative position invariance and ALiBi-style long-context stability is an artifact of encoding position as distance on a number line. When position is instead encoded as a point in the multiplicative lattice of the integers, both properties emerge simultaneously without compromise. SpectralRoPEALiBi achieves 106.6 PPL vs ALiBi's 108.7 in a fully converged 20,000-step experiment (300M params, WikiText-103, 4K context), beating ALiBi at every context length from 512 to 8,192 tokens. The key insight is not that primes specifically are the right frequencies, but that the multiplicative structure of the integers is the natural spectral basis for positional encoding. We demonstrate this through falsification experiments: prime-tiered frequencies (129.2 PPL) and composite-tiered frequencies (129.4 PPL) perform identically — because composites are not alternatives to primes but higher-order coordinates in the same lattice. Both dramatically outperform random frequencies (+5.0 PPL), scrambled tier assignment (+6.3 PPL), and pure ALiBi (+7.3 PPL). The active ingredient is lattice-aware, tiered frequency selection with learnable scale — not primality per se. We further validate this through a ZetaZeroPredictor experiment: three identical transformers trained for 10,000 epochs to predict Riemann zeta zero gaps. Geometric RoPE diverges (final r=0.57); SpectralALiBi locks into a stable attractor at epoch 112 (r=0.81). A second independent run widens this gap to -80.7% MSE improvement with r=0.86. The lattice-aligned frequency basis spans the mathematical space that zeta zeros inhabit; geometric frequencies cannot. We further report empirical confirmation of the structural prediction from Section 5.5: VHT2 banded quantization of the KV cache demonstrates that K vectors (which carry RoPE positional encoding) have strong spectral concentration in Walsh-Hadamard space — the first four energy bands capture the dominant structure — while V vectors (which carry content) have uniform energy distribution. This structural asymmetry is directly predicted by the lattice theory: RoPE encodes multiplicative arithmetic relationships as angular rates, and the WHT is the Z/2Z projection of the Vilenkin-Hartley basis that spans that structure. The result is 3.2× K compression and 4.7× V compression at <1.25% perplexity cost — validated on both Dolphin 1B (head_dim=64) and Qwen3-8B (head_dim=128). Introduction Positional encoding provides transformer models with token order information. Two approaches dominate: RoPE encodes position through frequency-based rotations preserving relative position invariance, and ALiBi replaces frequencies with a linear distance penalty providing long-context stability. The field has treated these properties as fundamentally in tension. We show this tension is false. It arises from a shared, unexamined assumption: that position is a location on a number line and the meaningful relationship between positions is distance. We replace this with a mathematically grounded alternative: position is a point in the multiplicative lattice of the integers, and the meaningful relationships between positions are their arithmetic structure — shared factors, GCD, harmonic resonance. 1.1 The Lattice Hypothesis The integers under multiplication form a lattice where every number occupies a unique point defined by its prime factorisation. Geometric PE (sinusoidal, RoPE) projects this lattice onto a line — position equals distance — discarding the multiplicative structure. We propose restoring it. The motivation follows from a deductive chain. Language word frequency follows Zipf's law: freq(rank) ∝ 1/ranks with s≈1. The generating function of Zipf is the Riemann zeta function ζ(s) = Σ 1/ns. The zeta zeros — where ζ is maximally informative — are generated by prime harmonics via the explicit formula. Therefore the prime harmonic structure, and the multiplicative lattice it generates, provides a natural spectral basis for encoding positions in language. 1.2 Primes as Generators, Composites as Coordinates A critical distinction: primes are the generators (basis vectors) of the multiplicative lattice. They are analogous to the 1D line segment in the progression from line → circle → sphere → hypersphere. The composite 12 = 2²×3 is not an alternative to primes — it is a coordinate in the lattice spanned by the prime axes, at position (2,1,0,0,...) in the (p₂, p₃, p₅, p₇,...) basis. Using 2π/12 as a frequency encodes a harmonic that resonates at multiples of 12 — which simultaneously hits every multiple of 2, every multiple of 3, every multiple of 4, and every multiple of 6. The analogy to n-dimensional geometry is precise: Dimensional Progression Multiplicative Lattice 1D line (2r) — the generator Primes (2, 3, 5, 7, ...) — generators 2D circle — integral of l
View originalhow to save 80% on your claude bill with better context
been building web apps with claude lately and those token limits have honestly started hitting me too. i'm using claude 4.6 sonnet for a research tool, but feeding it raw web data was absolutely nuking my limits. i'm putting together the stuff that actually worked for me to save tokens and keep the bill down: switch to markdown first. stop sending raw html. use tools like firecrawl to strip out the nested divs and script junk so you only pay for the actual text. don't let your prompt cache go cold. anthropic's prompt caching is a huge relief, but it only works if your data is consistent. watch out for the 200k token "premium" jump. anthropic now charges nearly double for inputs over 200k tokens on the new opus/sonnet 4.6 models. keep your context under that limit to avoid the surcharge strip the nav and footer. the website's "about us" and "careers" links in the footer are just burning your money every time you hit send. use jina reader for quick hits. for simple single-page reads, jina is a great way to get a clean text version without the crawler bloat. truncate your context. if a documentation page is 20k words, just take the first 5k. most of the "meat" is usually at the top anyway. clean your data with unstructured.io. if you are dealing with messy pdfs alongside web data, this helps turn the chaos into a clean schema claude actually understands. map before you crawl. don't scrape every subpage blindly. i use the map feature in firecrawl to find the specific documentation urls that actually matter for your prompt, if you use another tool, prefer doing this. use haiku for the "trash" work. use claude 4.5 haiku to summarize or filter data before feeding it into the expensive models like opus. use smart chunking. use llama-index to break your data into semantic chunks so you only retrieve the exact paragraph the ai needs for that specific prompt. cap your "extended thinking" depth. for opus 4.6, set thinking: {type: "adaptive"} with effort: "low" or "medium". the old budget_tokens param is deprecated on 4.6. thinking tokens are billed at the output rate, so if you leave effort on high, claude thinks hard on every single reply including the simple ones and your bill will hurt. set hard usage limits. set your spending tiers in the anthropic console so a buggy loop doesn't drain your bank account while you're asleep. feel free to roast my setup or add better tips if you have thembeen building web apps with claude lately and those token limits have honestly started hitting me too. i'm using claude 4.6 sonnet for a research tool, but feeding it raw web data was absolutely nuking my limits. submitted by /u/Grouchy_Subject_2777 [link] [comments]
View originalRepository Audit Available
Deep analysis of run-llama/llama_index — architecture, costs, security, dependencies & more
Yes, LlamaIndex offers a free tier. Pricing found: $0 /month, $50 /month, $500 /month, $1.25., $500/mo
LlamaIndex has an average rating of 4.8 out of 5 stars based on 2 reviews from G2, Capterra, and TrustRadius.
Key features include: Solutions, Products, Resources, Company, Weekly newsletter.
LlamaIndex is commonly used for: How leading teams use document intelligence.
LlamaIndex integrates with: OpenAI, AWS Lambda, Google Cloud Storage, Microsoft Azure, Slack, Zapier, Trello, Notion, Salesforce, Jira.
LlamaIndex has a public GitHub repository with 48,166 stars.
Aparna Dhinakaran
CEO at Arize AI
1 mention

LiteParse: Local Document Parsing for AI Agents
Mar 19, 2026
Based on user reviews and social mentions, the most common pain points are: LLM costs, cost tracking.
Based on 30 social mentions analyzed, 20% of sentiment is positive, 73% neutral, and 7% negative.