Haystack Review — 4.7★ from 20 Reviews | Pricing & Alternatives | Payloop

Haystack

framework | tiered

Build custom AI agents and RAG applications with smart context engineering, powered by open AI orchestration.

Haystack is highly rated by users and frequently receives perfect scores for functionality and ease of use in G2 reviews. Its main strengths include handling large context windows and seamless integration with HuggingFace models. Complaints are few, though some users mention occasional output quirks with fixed-size token chunking. Reviewers do not discuss pricing explicitly, but the predominantly high ratings suggest users find it valuable.

Mentions (30d)

2

Avg Rating

4.7

20 reviews

Platforms

2

GitHub Stars

24,827

2,715 forks

8 integrations | 10 features | Venture (Round not Specified)

Voices Discussing Haystack

Aparna Dhinakaran

CEO at Arize AI

3 mentions

James Briggs

Staff Developer Advocate at Pinecone

1 mention

Matt Shumer

CEO at HyperWrite / OthersideAI

1 mention

Latest Videos

Deepset Team

Feb 13, 2026

AI With Purpose: Solving What Matters

Nov 7, 2025



Features & Use Cases

Features

Private, secure engineering support
Best practices templates deployment guides
Access to flexible services
Flexible pricing based on company size
Visual, code-aligned pipeline design
Data, retrieval, and testing workflows
Secure access controls and auditability
Scalable cloud or on-prem deployment
Build Transparent, Context Engineered AI Systems
Integrate Freely with Your AI Stack

Use Cases

Operate at Enterprise Scale
Visual, code-aligned pipeline design
Data, retrieval, and testing workflows
Secure access controls and auditability
Scalable cloud or on-prem deployment

Company Intel

Industry

information technology & services

Employees

85

Funding Stage

Venture (Round not Specified)

Total Funding

$45.6M

Social Reach

1,276

GitHub followers

Developer Ecosystem

71

GitHub repos

24,827

GitHub stars

20

npm packages

5

HuggingFace models

Mentions by Platform

youtube

Haystack AI


Pricing

tiered

Review Ratings

g2

4.7 (20 reviews)

Recent Reviews

Brett B.

3/19/2026

What do you like best about Insites?

I like the customizability and support from the Insites team the most. They provide live support at all times, and if anything doesn't scan correctly for a report, they investigate and address it immediately. The engineering and product roadmap team is great, constantly releasing new scans based on customer feedback. My favorite is the ChatGPT scan for small businesses, which shows their AI optimization status. The reports are designed well, making it easy for both prospects and sellers to understand the takeaways at a glance. Their integrations with tools like Salesforce, Duda Website, and APIs are top-notch. Our sales team loves it because it takes a 10-20 minute research phase and distills it into a single robust report, making our sales reps appear as experts faster.

Review collected by and hosted on G2.com.

What do you dislike about Insites?

I don't want to emphasize this in my review because it's nothing but positive marks from me, but they could make their white labeling settings a little more centralized and easy to set up. We needed a lot of support to get the white label settings fully completed because there are so many of them in different places in the platform. This is very minor and only a factor for a large organization such as ours.

AJ R.

2/13/2026

What do you like best about Insites?

I like that Insites is really helpful for initial sales scans, SEO optimizations, and reporting. It allows me to show off our product versus competitors during the sales phase and quickly resolve issues for clients during the service phase. I love their fantastic team, who are always available to ensure the tool is as effective as possible. The depth of the scans is very helpful, and the integrations with other platforms make resolving issues stress-free. Setting it up was very simple, and they helped tailor it to fit our exact needs.

What do you dislike about Insites?

N/A

Zack A.

1/9/2026

What do you like best about Insites?

The process of using Insites has been streamlined to be intuitive and efficient. You don't need an extensive marketing or tech background to work through submitting a report or viewing and making sense of the report when it finishes. All of the scores they give on a website are based on evidence they pull straight from the site, and it is often linked directly to the issue to make solving it easy.

What do you dislike about Insites?

While it's not something that can necessarily be helped, reports do take quite a long time to process, and we heard frustration from clients thinking that it hasn't worked.

Michael W.

9/18/2025

What do you like best about Insites?

The interface is extremely user-friendly. Great sales tool and even better in fulfillment/account management.

What do you dislike about Insites?

We love Insites and probably wouldn't change anything about it! It would be great if there were a social media auditing tool as well.

Anisul O.

9/12/2025

What do you like best about Insites?

What I like most is how quickly I can get meaningful insights without needing to spend hours digging around. The dashboard is clear, and the flow feels intuitive, so even team members who aren’t technical can use it with confidence. The reports are detailed but easy to explain to clients, which saves me a lot of back and forth. I also have to mention the support team: they’re genuinely responsive and friendly, which makes a huge difference when you’re on tight deadlines.

What do you dislike about Insites?

There isn’t anything major that stands out as a downside. If I had to nitpick, I’d say I’d love to see more flexibility in customizing reports to match different client needs. The current setup works well enough, but having that extra bit of tailoring would make the platform even stronger.

Vedat .

9/10/2025

What do you like best about Insites?

The analysis quality is truly impressive and provides deep, actionable insights. The platform is extremely user-friendly, well-structured, and intuitive. What really stands out is the smooth onboarding process and the outstanding customer support; they are always fast and helpful. I also highly appreciate the white-label functionality, which is perfect for agencies and resellers. Overall, Insites is a well-designed, reliable, and professional solution that meets all expectations.

What do you dislike about Insites?

Honestly, there is nothing major to complain about. Everything worked as expected. If I had to mention something, perhaps the customization options for certain reports could be expanded in the future, but that’s more of a suggestion than a downside.

Verified User in Marketing and Advertising

9/9/2025

What do you like best about Insites?

Insites is an agile and effective SEO auditing tool. In remarkably little time, it analyzes a business's online presence (including local SEO, PPC, and metrics like Core Web Vitals) and delivers a visual report that is clear and easily customizable with your brand. The support they have given us as an agency has been excellent, as have the possibilities for adapting its look and feel to our brand.

What do you dislike about Insites?

It is not something negative, but we lacked depth in integrations for the mass generation of reports.

Verified User in Marketing and Advertising

8/6/2025

What do you like best about Insites?

We were able to get started with Insites very quickly to power our marketing audit as part of our sales process. The team has been great to work with and provides us with ongoing support to help us unlock the full value of the tools. The product itself is very robust and meets our needs, allowing us to have a scalable process for adding value to both prospects and clients. We will continue to work with Insites for a long time.

What do you dislike about Insites?

I don’t love the additional credits that are consumed for some of the more valuable features. I understand why they do it, I just wish it were a little more predictable.

Verified User in Online Media

8/5/2025

What do you like best about Insites?

- Analyze the websites quickly and easily
- Service for customers

What do you dislike about Insites?

- Unfortunately, the Insites app apparently does not work on the iPad

Jon R.

6/3/2025

What do you like best about Insites?

I have used countless tools over the years, but Insites.com stands out by a mile. It’s become one of the most effective sales tools in our arsenal, helping us not just find leads, but close them. The platform delivers comprehensive, easy-to-understand website audits that deliver huge value to prospects. Instead of abstract promises, we’re able to show real insights about a prospect’s online presence, highlighting issues they didn’t even know existed. This has opened doors, built trust quickly, and sparked far more productive conversations. The competitive analysis in the Insites reports always gets prospects’ attention and gets them asking for more information, thereby streamlining the sales process. The reports are detailed but digestible, which makes them perfect for business owners who aren’t technically minded. We’ve used them to frame proposals, win over sceptical prospects, and justify our SEO and web design recommendations with confidence. We also use it to monitor our success in optimising customer websites and to identify areas of improvement that we need to focus on.

What do you dislike about Insites?

The audit designer takes some time to configure to be optimal and it would be good if there were some additional detail on each of the possible reporting lines.

Mention Activity (Last 12 Weeks)

Platform Distribution

Sentiment Overview

Positive: 13% (2)

Neutral: 88% (14)

Negative: 0% (0)

Common Pain Points

API bill (1)

Recent Mentions

youtube

Haystack AI


reddit | @[unknown] | 6/19/2026

Anthropic and the era of Psychohistory

I've spent the last two years developing with Claude, 5 days a week, nearly 18 hours a day. Startup life has been rough. I've spent a lot of time with Claude and all the other frontier models, and I don't know about everyone else, but what the world got to preview for those few days... Fable 5 going public. Well, it was nothing short of incredible. It was a model that seemed as if it had all the world's knowledge at the tip of its fingers, and could simply reach out and pluck out what was needed. Needles in the haystack, ripe for the taking.

Having used the model, and having followed Anthropic's PR strategy (whether you believe it's a PR stunt or truly due to aligned morals), it's been an interesting series of events to see unfold, and I can't stop thinking that someone is playing a game of chess here, and just maybe, coming out well on top.

I don't put a lot of energy into speculation, but as I was walking to bed last night at 4am, a thought occurred to me. What if Anthropic, Dario and the team have done what Isaac Asimov put to paper all those years ago... that they've successfully developed a model capable of psychohistory, the mathematical formula to predict the future behaviour of the masses? That all this build-up, and the consequence of the US government, are all pieces on the board falling into place and lining Claude and Anthropic up for their ultimate checkmate, whatever that may be.

Maybe just a fun ponderance, but in the world we currently live in, with technology advancing at the pace it is, who's to say otherwise when the lines of fact and fiction begin to blur. What are your thoughts?

submitted by /u/TheDougMe

reddit | @[unknown] | 6/7/2026

Benchmarking Claude 3 against OpenAI's latest models - the context window difference is real.

I have been heavily invested in the OpenAI ecosystem but recently tested Claude's 200k context window for parsing massive documentation on CUDA kernels and TurboQuant compression. The recall accuracy on needle-in-a-haystack tasks for deep logic is genuinely impressive compared to current GPT builds. I have a few remaining promo slots on my workspace if any devs here want to benchmark it themselves for free. Grab one here: https://claude.ai/referral/awkne9penA?s=cowork&v=apps

submitted by /u/DarkstarBinary

reddit | @[unknown] | 5/31/2026

Llama Surgery: Continuous Sparsification of Pre-Trained Language Models via Differentiable Ultrametric Topology Injection

Sequel to: Learning to Skip Blocks: Self-Discovered Ultrametric Routing for Hardware-Accelerated Sparse Attention

Abstract

We present Llama Surgery, a method for injecting learned block-sparse attention topologies into pre-trained dense language models without retraining from scratch, distillation, or post-hoc pruning. Starting from a frozen Llama 3.1 8B, we surgically replace each attention layer with a Dynamic Topology Router that maps token embeddings onto the branches of a Bruhat-Tits p-adic tree via factorized Gumbel-Softmax routing. A Deterministic Collapse Initialization to achieve a Continuous Logit Homotopy guarantees that at step 0 the injected topology mask is identically dense, preserving the pre-trained manifold exactly. Over training, temperature annealing polarizes the soft routing assignments into hard binary masks, and a Switch Transformer-style load-balancing loss prevents routing collapse.

We identify and resolve two critical failure modes: (1) gradient collapse through discrete masking operations, solved by a Straight-Through Estimator bridge that decouples the hard forward mask from the soft backward gradient; and (2) Attention Sink instability, where hard-masking the initial token causes softmax entropy collapse and syntactic degeneration, solved by permanently anchoring Token 0 in the visibility set.

The resulting architecture is validated on Llama 3.1 8B fine-tuned on WikiText-2, achieving stable convergence and producing coherent, mathematically sophisticated text while maintaining dynamic block-sparse routing across all 32 transformer layers. A controlled semantic clustering experiment on TinyLlama-1.1B demonstrates that the router learns to assign tokens from distinct semantic domains (mathematics, natural language, code) to separate branches of the Bruhat-Tits tree using only the standard language modeling loss, with no explicit clustering objective.

A Needle-In-A-Haystack (NIAH) retrieval experiment on TinyLlama-1.1B reveals that the router spontaneously organizes the context window into an ultrametric cophenetic hierarchy: the needle is isolated at maximum topological distance from the haystack (d_p = 6.88), and the ultrametric triangle inequality d(x,z) ≤ max(d(x,y), d(y,z)) is satisfied. Averaging over 32 attention heads yields a forest ensemble of distinct per-head ultrametric trees rather than a single global hierarchy.

We further identify and resolve three critical float16 numerical failure modes (Gumbel-Softmax overflow, attention score overflow, and cumulative product backward instability), the last of which we solve via a novel cumprod→cummin substitution that exploits the binary structure of hard Gumbel-Softmax outputs. A custom Triton forward kernel with Attention Sink and Local Window support, pipelined for Ampere and Hopper architectures (num_warps=4, num_stages=3), executes the block-sparse prefill phase at O(N) theoretical complexity. To our knowledge, this is the first demonstration of differentiable ultrametric topology injection into a production-scale pre-trained LLM.

https://github.com/sneed-and-feed/adelic-spectral-zeta/blob/main/papers/llama_surgery.md

submitted by /u/LooseSwing88
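Two of the mathematical claims in the abstract above can be checked on toy data. What follows is a minimal sketch, not code from the linked repository: it uses the textbook p-adic distance on integers as a stand-in for the router's learned tree distance, and every helper name (`p_adic_distance`, `is_ultrametric`, `cumprod`, `cummin`) is an illustrative assumption. It tests the inequality d(x,z) ≤ max(d(x,y), d(y,z)) and shows why a cumulative product over a 0/1 mask can be swapped for a cumulative minimum.

```python
from fractions import Fraction
from itertools import product


def p_adic_distance(x: int, y: int, p: int) -> Fraction:
    """Standard p-adic distance |x - y|_p = p ** -v, where p**v exactly divides x - y."""
    if x == y:
        return Fraction(0)
    diff, valuation = abs(x - y), 0
    while diff % p == 0:
        diff //= p
        valuation += 1
    return Fraction(1, p ** valuation)


def is_ultrametric(points, dist) -> bool:
    """Check d(x, z) <= max(d(x, y), d(y, z)) for every triple of points."""
    return all(
        dist(x, z) <= max(dist(x, y), dist(y, z))
        for x, y, z in product(points, repeat=3)
    )


def cumprod(xs):
    """Running product of a sequence."""
    out, acc = [], 1
    for x in xs:
        acc *= x
        out.append(acc)
    return out


def cummin(xs):
    """Running minimum of a sequence."""
    out = []
    for x in xs:
        out.append(x if not out else min(out[-1], x))
    return out


points = list(range(30))
two_adic_ok = is_ultrametric(points, lambda a, b: p_adic_distance(a, b, 2))
euclidean_ok = is_ultrametric(points, lambda a, b: abs(a - b))

# Toy hard (0/1) mask, standing in for hard Gumbel-Softmax outputs.
hard_mask = [1, 1, 0, 1, 1]
same_result = cumprod(hard_mask) == cummin(hard_mask)
```

For values restricted to 0 and 1, both running operations stay at 1 until the first 0 and remain 0 afterwards, which is why the two agree on hard masks. The ordinary distance |a - b| fails the ultrametric check, so the inequality is a real constraint and not something every metric satisfies.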

reddit | @[unknown] | 5/18/2026

The token-inflation posts are right. The thing that cut my Claude Code usage most was behavioral, not a tool.

Spent last week actually measuring where my Claude Code tokens go instead of just complaining about the May changes. The complaints are fair. But most of my burn was self-inflicted, and fixing that bought back more headroom than switching models would have.

What actually worked, biggest win first:

1. `/clear` between unrelated tasks. A stale 200k-token context riding along for a one-line fix was my single most expensive habit.
2. Make it plan before it touches files. One planning pass, then execute. Cheaper and better than explore-edit-explore in a loop.
3. Stop letting it re-read files it just touched. If it just edited a file it does not need to reopen it to "verify." Say so once in your rules.
4. Search with a subagent, not the main thread. Grep-and-read across a repo dumps the whole haystack into your main context permanently. A subagent returns just the answer.
5. Kill always-on and `-p` loops you are not watching. Background agents burning tokens while you sleep are most of the horror-story bills here.

None of this needed a new subscription, a wrapper, or an MCP server. It was discipline I was too lazy to apply while the limits felt infinite. To be clear, none of this fixes the actual price hikes. It just stops you burning extra on top of them.

What is the one habit that cut your usage most? Looking for the non-obvious ones, not "use a smaller model."

submitted by /u/meliwat

reddit | @[unknown] | 5/9/2026

I got tired of the API bills for 100k+ context windows, so I built a persistent O(1) semantic memory state engine to compress history

Hey everyone,

The entire industry right now is cheering for massive 1M+ context windows, but I think it's fundamentally the wrong approach. "Just add more RAM" is a trap. Stuffing 100k+ tokens of raw conversation history into a prompt doesn't just burn your API budget; it actually degrades the model's reasoning through the "lost in the middle" effect.

I got tired of my AI agents drowning in their own chat histories, so I built an application-layer semantic memory engine called Semvec. The core shift is moving from an O(n) linear history to an O(1) constant-cost semantic state. But compressing chat history is just the baseline. When you treat memory as a fixed-size state vector, it unlocks entirely new architectures for agents that standard RAG or context-stuffing simply can't do:

Persistent Coding Agents (MCP Integration)

We built an MCP server for Claude Code and Cursor. Instead of dumping 5 whole files into the context window for a refactor, Semvec tracks the architectural invariants and past error patterns across different sessions. It gives your coding agent a persistent "Second Brain": if it messed up a database schema in session 2, it remembers the "anti-resonance" rule in session 35 so it doesn't make the same mistake.

Multi-Agent Swarms (Cortex)

If you run multiple agents (like an Analyst and a Critic), they shouldn't have to read each other's 10,000-token transcripts to collaborate. With the Cortex module, agents exchange compressed StateVectorPackets and use a ConsensusEngine to merge their perspectives mathematically, sharing a global state with zero overhead.

Enterprise Auditability & GDPR (Compliance Pack)

If you run AI memory in production, you need to prove exactly what state the LLM acted on, and you need to be able to legally delete it. The compliance pack handles this via an append-only event store for deterministic replay, HMAC request signing, and GDPR Art. 17 "Right to be Forgotten" workflows with signed deletion certificates.

The Benchmark Data:

True Constant Cost: We ran a 50,000-turn stress test. While standard baseline history exploded past 75,000 tokens, Semvec's footprint stayed flat at around 550-625 tokens per turn.

Quality goes UP: Because we strip out the noise and feed the LLM a highly concentrated "essence" of the context, blind A/B LLM-judge scores on LongBench-v2 actually increased for both small models (Llama 3.1-8B) and massive ones (gpt-oss-120B).

A quick note on privacy & tracking: When I was initially designing the commercial licensing side, I experimented with an anti-abuse telemetry script to prevent automated clone-training. This was a terrible approach that compromised the local-first nature of the tool. I have completely ripped it out in v0.5.1, and all versions containing it are yanked. Semvec for community users is now 100% air-gapped and local, with zero background tracking.

The core engine is proprietary/patent-pending to bootstrap the project, but you can pip install the Python SDK and the MCP Server right now for free via the built-in community license. I'd love to hear your thoughts on the O(1) memory architecture vs. Prompt Caching, and if you think bounded semantic states are the future of long-running agents.

Docs & Architecture: https://semvec-docs.pages.dev/
PyPI: https://pypi.org/project/semvec/

submitted by /u/scheitelpunk1337

reddit | @[unknown] | 4/12/2026

Skill Seekers v3.5: 10 new source types, 12 LLM platforms, marketplace pipeline, agent-agnostic AI, and prompt injection scanner

Hey r/ClaudeAI, sharing the latest update on Skill Seekers, the open-source tool that converts documentation into Claude Code skills. A lot has changed since the v3.2 post, so here's what's new across 3 releases (v3.3 → v3.5.1).

What's new

10 new source types (17 total)

You can now generate skills from Notion, Confluence, HTML files, OpenAPI specs, AsciiDoc, PowerPoint, RSS feeds, man pages, chat exports (Slack/Discord), and unified multi-source configs, on top of the original web, GitHub, PDF, Word, EPUB, video, and local codebase sources.

12 LLM platforms

Skills now package for Claude, OpenAI, Gemini, Kimi, DeepSeek, Qwen, OpenRouter, Together AI, Fireworks AI, OpenCode, Markdown, and MiniMax. Plus RAG framework exports for LangChain, LlamaIndex, Haystack, ChromaDB, FAISS, Weaviate, Qdrant, and Pinecone.

Agent-agnostic AI enhancement

Enhancement is no longer locked to Claude. The new AgentClient abstraction supports Claude, Kimi, Codex, Copilot, OpenCode, and custom agents. It auto-detects which agent to use from your API keys, or you can specify with --agent.

Marketplace pipeline

You can now publish skills directly to Claude Code plugin marketplace repositories and manage multiple marketplace registries. Config sources can be pushed and synced across repos.

Prompt injection scanner

A built-in workflow scans scraped content for injection patterns: role assumption, instruction overrides, delimiter injection, hidden instructions. Runs automatically as the first stage in default and security-focused workflows. Flags suspicious content without removing it so you can review.

One-command auto-detection

    skill-seekers create https://docs.example.com/
    skill-seekers create owner/repo
    skill-seekers create ./my-project
    skill-seekers create document.pdf

One command figures out the source type and routes to the right scraper. No more separate subcommands.

Headless browser rendering

JavaScript SPA sites (React, Vue, etc.) that return empty HTML shells now work with --browser. Uses Playwright under the hood.

Other highlights

- skill-seekers doctor health check command
- Kotlin language support in the C3.x codebase analysis pipeline
- Smart SPA discovery (sitemap.xml + llms.txt + browser nav)
- Unlimited pages by default (was capped at 500)
- 3100+ tests passing
- Full MCP server with 40 tools (works in Claude Code and Cursor/Windsurf)

Links

- GitHub: github.com/yusufkaraaslan/Skill_Seekers
- PyPI: pip install skill-seekers
- Free and open source

Built with Claude Code. Happy to answer questions or take feedback.

submitted by /u/Critical-Pea-8782

reddit | @[unknown] | 4/12/2026

KIV: 1M token context window on a RTX 4070 (12GB VRAM), no retraining, drop-in HuggingFace cache replacement - Works with any model that uses DynamicCache [P]

Been working on this for a bit and figured it was ready to share. KIV (K-Indexed V Materialization) is a middleware layer that replaces the standard KV cache in HuggingFace transformers with a tiered retrieval system. The short version: it keeps recent tokens exact in VRAM, moves old K/V to system RAM, and uses K vectors as a search index to pull back only the ~256 most relevant V entries per decode step.

Results on a 4070 12GB with Gemma 4 E2B (4-bit):

- 1M tokens, 12MB KIV VRAM overhead, ~6.5GB total GPU usage
- 4.1 tok/s at 1M context (8-10 tok/s on GPU time), 12.9 tok/s at 4K
- 70/70 needle-in-haystack tests passed across 4K-32K
- Perfect phonebook lookup (unique names) at 58K tokens
- Prefill at 1M takes about 4.3 minutes (one-time cost)
- Decode is near-constant regardless of context length

The core finding that makes this work: K vectors are smooth and structured, which makes them great search indices. V vectors are high-entropy and chaotic, so don't try to compress them, just retrieve them on demand. Use K to decide which V entries deserve to exist in VRAM at any given step.

No model weights are modified. No retraining or distillation. It hooks into the HuggingFace cache interface and registers a custom attention function. The model has no idea it's talking to a tiered memory system. Works with any model that uses DynamicCache. Tested on Gemma 4, Qwen2.5, TinyLlama, and Phi-3.5 across MQA/GQA/MHA.

There are real limitations and I'm upfront about them in the repo. Bounded prefill loses some info for dense similar-looking data. Collision disambiguation doesn't work, but that's the 4-bit 2B model struggling, not the cache. Two-hop reasoning fails for the same reason. CPU RAM scales linearly (5.8GB at 1M tokens).

Still actively optimizing decode speed, especially at longer contexts. The current bottleneck is CPU-to-GPU transfer for retrieved tokens, not the model itself. Plenty of room to improve here.

GitHub: github.com/Babyhamsta/KIV (can be installed as a local pip package, no official pip package yet)

Happy to answer questions about the architecture or results. Would love to see what happens on bigger models with more VRAM if anyone wants to try it.

submitted by /u/ThyGreatOof
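The retrieval idea described in this post (keep a recent window exact, score older keys against the query, and materialize only the best-matching values) can be illustrated with a toy decode step. This is a minimal pure-Python sketch under stated assumptions, not KIV's implementation or the HuggingFace cache API: the function name `sparse_attention_step`, the `top_k` and `recent` parameters, and the unscaled dot-product scoring are all hypothetical simplifications.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def sparse_attention_step(query, keys, values, top_k=2, recent=1):
    """One toy decode step that attends to a subset of cached positions.

    The last `recent` positions are always kept. Among the older positions,
    only the `top_k` with the highest query-key scores are retrieved.
    Returns the selected indices and the attention output vector.
    """
    n = len(keys)
    scores = [dot(query, k) for k in keys]  # K acts as the search index

    recent_ids = set(range(max(0, n - recent), n))
    older_ids = [i for i in range(n) if i not in recent_ids]
    retrieved = sorted(older_ids, key=lambda i: scores[i], reverse=True)[:top_k]
    selected = sorted(recent_ids | set(retrieved))

    # Softmax over the selected positions only (shifted for stability).
    peak = max(scores[i] for i in selected)
    weights = [math.exp(scores[i] - peak) for i in selected]
    total = sum(weights)

    dim = len(values[0])
    output = [
        sum(w * values[i][d] for w, i in zip(weights, selected)) / total
        for d in range(dim)
    ]
    return selected, output


keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
query = [1.0, 0.0]

selected, sparse_out = sparse_attention_step(query, keys, values, top_k=2, recent=1)
all_ids, dense_out = sparse_attention_step(query, keys, values, top_k=len(keys), recent=1)
```

When `top_k` covers every older position, the step reduces to ordinary dense attention over the whole cache, which gives a simple sanity check. In this sketch only the values at the selected indices are ever read, which is the property that would let unselected V entries stay in slower memory.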

reddit | @[unknown] | 4/7/2026

[D] MemPalace claims 100% on LoCoMo and a "perfect score on LongMemEval." Its own BENCHMARKS.md documents why neither is meaningful.

A new open-source memory project called MemPalace launched yesterday claiming "100% on LoCoMo" and "the first perfect score ever recorded on LongMemEval. 500/500 questions, every category at 100%." The launch tweet went viral reaching over 1.5 million views while the repository picked up over 7,000 GitHub stars in less than 24 hours.

The interesting thing is not that the headline numbers are inflated. The interesting thing is that the project's own BENCHMARKS.md file documents this in detail, while the launch tweet strips these caveats. Some of the failure modes line up with the methodology disputes the field has been arguing about for over a year (Zep vs Mem0, Letta's "Filesystem All You Need" reproducibility post, etc.).

1. The LoCoMo 100% is a top_k bypass.

The runner uses top_k=50. LoCoMo's ten conversations have 19, 19, 32, 29, 29, 28, 31, 30, 25, and 30 sessions respectively. Every conversation has fewer than 50 sessions, so top_k=50 retrieves the entire conversation as the candidate pool every time. The Sonnet rerank then does reading comprehension over all sessions. BENCHMARKS.md says this verbatim:

> The LoCoMo 100% result with top-k=50 has a structural issue: each of the 10 conversations has 19–32 sessions, but top-k=50 exceeds that count. This means the ground-truth session is always in the candidate pool regardless of the embedding model's ranking. The Sonnet rerank is essentially doing reading comprehension over all sessions - the embedding retrieval step is bypassed entirely.

The honest LoCoMo numbers in the same file are 60.3% R@10 with no rerank and 88.9% R@10 with hybrid scoring and no LLM. Those are real and unremarkable. A 100% is also independently impossible on the published version of LoCoMo, since roughly 6.4% of the answer key contains hallucinated facts, wrong dates, and speaker attribution errors that any honest system will disagree with.

2. The LongMemEval "perfect score" is a metric category error.

Published LongMemEval is end-to-end QA: retrieve from a haystack of prior chat sessions, generate an answer, GPT-4 judge marks it correct. Every score on the published leaderboard is the percentage of generated answers judged correct. The MemPalace LongMemEval runner does retrieval only. For each of the 500 questions it builds one document per session by concatenating only the user turns (assistant turns are not indexed at all), embeds with default ChromaDB embeddings (all-MiniLM-L6-v2), returns the top five sessions by cosine distance, and checks set membership against the gold session IDs. It computes both recall_any@5 and recall_all@5, and the project reports the softer one. It never generates an answer. It never invokes a judge.

None of the LongMemEval numbers in this repository - not the 100%, not the 98.4% "held-out", not the 96.6% raw baseline - are LongMemEval scores in the sense the published leaderboard means. They are recall_any@5 retrieval numbers on the same dataset, which is a substantially easier task. Calling any of them a "perfect score on LongMemEval" is a metric category error.

3. The 100% itself is teaching to the test.

The hybrid v4 mode that produces the 100% was built by inspecting the three remaining wrong answers in their dev set and writing targeted code for each one: a quoted-phrase boost for a question containing a specific phrase in single quotes, a person-name boost for a question about someone named Rachel, and "I still remember" / "when I was in high school" patterns for a question about a high school reunion. Three patches for three specific questions. BENCHMARKS.md, line 461, verbatim:

> This is teaching to the test. The fixes were designed around the exact failure cases, not discovered by analyzing general failure patterns.

4. Marketed features that don't exist in the code.

The launch post lists "contradiction detection catches wrong names, wrong pronouns, wrong ages before you ever see them" as a feature. mempalace/knowledge_graph.py contains zero occurrences of "contradict". The only deduplication logic is an exact-match check on (subject, predicate, object) triples that blocks identical triples from being added twice. Conflicting facts about the same subject can accumulate indefinitely.

5. "30x lossless compression" is measurably lossy in the project's own benchmarks.

The compression module mempalace/dialect.py truncates sentences at 55 characters, filters by keyword frequency, and provides a decode() function that splits the compressed string into a header dictionary without reconstructing the original text. There is no round-trip. The same BENCHMARKS.md reports results_raw_full500.jsonl at 96.6% R@5 and results_aaak_full500.jsonl at 84.2% R@5, a 12.4 percentage point drop on the same dataset and the same metric, run by the project itself. Lossless compression cannot cause a measured quality drop.

Why this matters for the benchmark conversation.

The field needs benchmarks where judge reliability is adversarially validated, an
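The first two points of this post rest on how retrieval recall is computed, and a small example makes the argument concrete. The following is a minimal sketch and not MemPalace's runner: the function names are hypothetical, the session counts are the ones quoted in the post, and the rankings are made-up toy data. It shows that a k larger than every candidate pool guarantees a hit regardless of ranking quality, and that recall_any@k can be perfect where recall_all@k is not.

```python
# Session counts per LoCoMo conversation, as listed in the post.
LOCOMO_SESSION_COUNTS = [19, 19, 32, 29, 29, 28, 31, 30, 25, 30]


def recall_any_at_k(ranked_ids, gold_ids, k):
    """1.0 if at least one gold session appears in the top k, else 0.0."""
    return 1.0 if set(ranked_ids[:k]) & set(gold_ids) else 0.0


def recall_all_at_k(ranked_ids, gold_ids, k):
    """1.0 only if every gold session appears in the top k, else 0.0."""
    return 1.0 if set(gold_ids) <= set(ranked_ids[:k]) else 0.0


def retrieval_is_bypassed(session_counts, k):
    """True when k covers every candidate pool, so the ranking cannot matter."""
    return all(count <= k for count in session_counts)


# Worst possible ranking (toy data): the single gold session is last of 32.
worst_ranking = list(range(32))
gold = [31]
bypassed_score = recall_any_at_k(worst_ranking, gold, k=50)
honest_score = recall_any_at_k(worst_ranking, gold, k=10)

# Two gold sessions (toy data), only one of which makes the top 5.
ranking = ["s1", "s2", "s3", "s4", "s5", "s6"]
two_gold = ["s1", "s6"]
any_score = recall_any_at_k(ranking, two_gold, k=5)
all_score = recall_all_at_k(ranking, two_gold, k=5)
```

With k=50 the worst ranking still scores 1.0 because the whole pool is returned, while the same ranking scores 0.0 at k=10. In the second case `any_score` is 1.0 and `all_score` is 0.0, which is the gap between the softer and stricter metrics the post describes.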

reddit@[unknown]3/30/2026

ostk – a single Rust binary that coordinates AI agents via filesystem and saves tokens

I've been building something entirely with Claude Code: launching agent teams, recursively improving, and proving the value. I'd call it an operating system for AI agents. Some may debate that. Read more: https://ostk.ai

In February, I started developing fcp-drawio, which I called "file-context protocol," a way to represent complex draw.io diagrams for LLMs: it lets them express their intent for what they want to diagram, not how to write XML to do so. I continued exploring and found a pattern that exploded into an invisible coordination layer between humans and agents. Agents run in the kernel's loop. The human approves, denies, or redirects, and every decision is logged. The agents see tools; they don't see the governance.

On March 5th, I started a big push to unify all of the concepts I'd put together. The numbers show the trajectory in savings:

One Rust binary. Agentfiles define model, tools, and budgets. Pin files restrict execution scope. No vendor lock-in: switch models mid-conversation, hand work between them. The kernel coordinates through the filesystem, inside your git repo. Agents connect via socket daemon. Approvals route to the operator. Audit trail captures every tool call and decision. Inference is becoming a commodity; what matters is which model solves it correctly for less.

Bench results at needle-bench.cc: 26 models, 34 real-world debugging problems, each run blind in a Docker container with one prompt: "find the needle." Same prompt, same tools, with and without the kernel. 793 paired runs. Bare: 36% solve rate. Kernel: 69%. +33 percentage points. 22 of 26 models improved. The kernel took models scoring 0-9% bare (Gemini Flash, qwen-plus, devstral, deepseek-chat) and pushed them to 25-89%. Models that already solved everything (Opus, DeepSeek R1, Grok 4.1) used 61-81% fewer tokens doing it. One model regressed.

The results suggest something I didn't expect when I started building this: the coordination layer matters more than the model. A $0.001/run Gemini Flash with the kernel outperforms a $0.03/run GPT-4o without it. The cheapest correct answer wins, and the kernel makes cheap answers correct more often.

    curl -fsSL https://ostk.ai/install | sh
    ostk init
    ostk boot

Free and open now. The vision is a composable, distributed OS, and it'll take more than me to build it right.

submitted by /u/scotty2012
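The headline benchmark figure in the post is an absolute gain in percentage points, which is easy to confuse with a relative improvement. A minimal sketch of the arithmetic, using only the aggregate solve rates quoted in the post (the helper names are hypothetical):

```python
def percentage_point_gain(before, after):
    """Absolute change between two rates, in percentage points."""
    return round((after - before) * 100, 1)


def relative_gain(before, after):
    """Ratio of the new rate to the old rate."""
    return round(after / before, 2)


bare_rate = 0.36    # aggregate solve rate without the kernel, from the post
kernel_rate = 0.69  # aggregate solve rate with the kernel, from the post

pp = percentage_point_gain(bare_rate, kernel_rate)  # 33.0 percentage points
ratio = relative_gain(bare_rate, kernel_rate)       # 1.92, i.e. about 92% more solves
```

The same pair of rates therefore reads as "+33 points" or "about 1.9x", depending on which convention is used. The post reports the former.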

reddit@[unknown]3/30/2026

Tokens sometimes output in chunks of 8?

Usually (after thinking) Claude's responses are streamed fluidly, a token at a time. But sometimes, particularly when I ask a very technical question, responses will arrive in fixed-size chunks of 8 tokens at a time, one block after another. Whatever controls this "chunking" behavior takes effect immediately after my prompt. Responses never start fluid and then turn to chunks; it's either fluid throughout or the first block of tokens is a chunk. Chunks are always the same size (afaict) of 8 tokens or so. The past two prompts that elicited "chunked" responses were on 1) David Chalmers's views on the philosophy of qualia, and 2) attending multiple needles in haystacks in hierarchical attention schemes, so relatively esoteric topics. But these subjects don't reliably reproduce the behavior, so they could be red herrings. Anyone else experience this kind of chunking? It's not a bother, but it makes me curious about what sort of model routing behind the scenes might cause it. submitted by /u/MadGenderScientist

reddit@[unknown]3/19/2026

[R] Doc-to-LoRA: Learning to Instantly Internalize Contexts from Sakana AI

This is a cool paper! Creating LoRAs from docs on the fly using a hypernetwork. "Long input sequences are central to in-context learning, document understanding, and multi-step reasoning of Large Language Models (LLMs). However, the quadratic attention cost of Transformers makes inference memory-intensive and slow. While context distillation (CD) can transfer information into model parameters, per-prompt distillation is impractical due to training costs and latency. To address these limitations, we propose Doc-to-LoRA (D2L), a lightweight hypernetwork that meta-learns to perform approximate CD within a single forward pass. Given an unseen prompt, D2L generates a LoRA adapter for a target LLM, enabling subsequent queries to be answered without re-consuming the original context, reducing latency and KV-cache memory consumption during inference of the target LLM. On a long-context needle-in-a-haystack task, D2L successfully learns to map contexts into adapters that store the needle information, achieving near-perfect zero-shot accuracy at sequence lengths exceeding the target LLM's native context window by more than 4x. On real-world QA datasets with limited compute, D2L outperforms standard CD while significantly reducing peak memory consumption and update latency. We envision that D2L can facilitate rapid adaptation of LLMs, opening up the possibility of frequent knowledge updates and personalized chat behavior." https://arxiv.org/abs/2602.15902 submitted by /u/Happysedits
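For readers unfamiliar with what a LoRA adapter does to a model, the standard low-rank update is W' = W + (alpha / r) * B A, where B and A are small matrices of rank r. The sketch below applies that update to a toy weight matrix in plain Python. It is not the paper's code: in D2L the adapter matrices would be produced by the hypernetwork from the document, whereas here they are hand-written placeholder values, and all names are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def apply_lora(w, lora_a, lora_b, alpha, rank):
    """Return W + (alpha / rank) * (B @ A), the standard LoRA weight update."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# Toy 2x2 base weight and a rank-1 adapter (placeholder values, not learned).
w = [[1.0, 0.0], [0.0, 1.0]]
lora_b = [[1.0], [2.0]]   # shape: d_out x rank
lora_a = [[0.5, -0.5]]    # shape: rank x d_in

adapted = apply_lora(w, lora_a, lora_b, alpha=1.0, rank=1)
```

Because the adapter stores only B and A, its size grows with the rank rather than with the length of the original context, which is consistent with the memory savings the abstract describes.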

Integrations

OpenAI, Anthropic, Mistral, Hugging Face, Weaviate, Pinecone, Elasticsearch, Kubernetes

Categories

AI/ML, DevOps, Developer Tools

Repository Audit Available

Deep analysis of deepset-ai/haystack — architecture, costs, security, dependencies & more

Frequently Asked Questions

How much does Haystack cost?

Haystack uses a tiered pricing model. Visit their website for current pricing details.

What do users think of Haystack?

Haystack has an average rating of 4.7 out of 5 stars based on 20 reviews from G2, Capterra, and TrustRadius.

What are the main features of Haystack?

Key features include: private, secure engineering support; best-practice templates and deployment guides; access to flexible services; flexible pricing based on company size; visual, code-aligned pipeline design; data, retrieval, and testing workflows; secure access controls and auditability; scalable cloud or on-prem deployment.

What is Haystack used for?

Haystack is commonly used for: operating at enterprise scale; visual, code-aligned pipeline design; data, retrieval, and testing workflows; secure access controls and auditability; scalable cloud or on-prem deployment.

What does Haystack integrate with?

Haystack integrates with: OpenAI, Anthropic, Mistral, Hugging Face, Weaviate, Pinecone, Elasticsearch, Kubernetes.