LLM inference in C/C++. Contribute to ggml-org/llama.cpp development by creating an account on GitHub.
"Llama.cpp" is praised for its efficient performance and ease of use, which makes it a popular choice among developers. However, some users express frustrations with occasional bugs and a perceived lack of comprehensive documentation. The sentiment around pricing indicates satisfaction, as users feel the tool offers good value for its capabilities. Overall, "llama.cpp" enjoys a strong reputation in the developer community, bolstered by its active contributions and support.
Mentions (30d)
5
Reviews
0
Platforms
3
GitHub Stars
101,000
16,272 forks
"Llama.cpp" is praised for its efficient performance and ease of use, which makes it a popular choice among developers. However, some users express frustrations with occasional bugs and a perceived lack of comprehensive documentation. The sentiment around pricing indicates satisfaction, as users feel the tool offers good value for its capabilities. Overall, "llama.cpp" enjoys a strong reputation in the developer community, bolstered by its active contributions and support.
Features
Use Cases
Industry
information technology & services
Employees
6,200
Funding Stage
Other
Total Funding
$7.9B
101,000
GitHub stars
20
npm packages
4
HuggingFace models
Brazil, Indonesia, Japan, Germany, and India fueled a massive surge in 2025, adding nearly 36 million new developers to GitHub. 🌏 India alone added 5.2 million. 🇮🇳
Brazil, Indonesia, Japan, Germany, and India fueled a massive surge in 2025, adding nearly 36 million new developers to GitHub. 🌏 India alone added 5.2 million. 🇮🇳
View originalhttps://t.co/yGiqw0xbji
https://t.co/yGiqw0xbji
View originalStart work on your computer, continue your local session anywhere. 📲 Remote control for GitHub Copilot CLI and @code sessions is now generally available. https://t.co/wwSEBd5lqL https://t.co/Yc5R6tB
Start work on your computer, continue your local session anywhere. 📲 Remote control for GitHub Copilot CLI and @code sessions is now generally available. https://t.co/wwSEBd5lqL https://t.co/Yc5R6tBfBl
View originalYou don't have to level up to contribute to open source. You level up by contributing to open source. Not sure how to get started? Check out our latest GitHub for Beginners episode. https://t.co/Jyze
You don't have to level up to contribute to open source. You level up by contributing to open source. Not sure how to get started? Check out our latest GitHub for Beginners episode. https://t.co/Jyze45KoHo https://t.co/DCqAFACo35
View originalInteractive and non-interactive: these are the two main modes in Copilot CLI. 💻 Our beginner series breaks down the difference, plus how and when to use each one. 💡👇 https://t.co/gZ7GetcgTo
Interactive and non-interactive: these are the two main modes in Copilot CLI. 💻 Our beginner series breaks down the difference, plus how and when to use each one. 💡👇 https://t.co/gZ7GetcgTo
View originalSome open source projects don't just survive. They flat-out refuse to bite the dust. ⚔️ We looked at 10 roguelikes still going strong years (sometimes decades) after launch. Here's what their maintai
Some open source projects don't just survive. They flat-out refuse to bite the dust. ⚔️ We looked at 10 roguelikes still going strong years (sometimes decades) after launch. Here's what their maintainers and communities can teach the rest of open source about longevity. 💡
View originalNeed help picking the right emoji (like we did for this post)? 🤔 @cassidoo made an emoji list generator with Copilot CLI. Learn how she did it and pick up tools and tricks for your next project. 👇
Need help picking the right emoji (like we did for this post)? 🤔 @cassidoo made an emoji list generator with Copilot CLI. Learn how she did it and pick up tools and tricks for your next project. 👇 https://t.co/13xwmu6tE9 https://t.co/pCy8PGfUIE
View originalCooking up something new 🧑🍳 Join the waitlist for early access to technical preview of the GitHub Copilot app 👇 https://t.co/ODODKdvzOA https://t.co/1h7AJPAhiH
Cooking up something new 🧑🍳 Join the waitlist for early access to technical preview of the GitHub Copilot app 👇 https://t.co/ODODKdvzOA https://t.co/1h7AJPAhiH
View originalNew to open source? Learn how to find a good first issue, open a pull request, and make your first contribution with GitHub for Beginners. 👇 https://t.co/PNRb746zCh
New to open source? Learn how to find a good first issue, open a pull request, and make your first contribution with GitHub for Beginners. 👇 https://t.co/PNRb746zCh
View originalRT @cinnamon_msft: GitHub Copilot CLI now has a statusline feature! Here's how to set it up with Oh My Posh ❤️🔥 https://t.co/DpNR8Bjt7G
RT @cinnamon_msft: GitHub Copilot CLI now has a statusline feature! Here's how to set it up with Oh My Posh ❤️🔥 https://t.co/DpNR8Bjt7G
View originalFind out what vulnerabilities are lurking in your code. 👀 GitHub's new Code Security Risk Assessment scans your organization's code and delivers a vulnerability dashboard broken down by severity, la
Find out what vulnerabilities are lurking in your code. 👀 GitHub's new Code Security Risk Assessment scans your organization's code and delivers a vulnerability dashboard broken down by severity, language, and repo. No config, no commitment. Run your free assessment now.
View originalNew to GitHub Copilot CLI? Our beginner series makes it easy to get started. Bring agentic AI right to your terminal and speed up your workflow. 💻✨ Get the tutorial here. 👇 https://t.co/bNLnpdgTxr
New to GitHub Copilot CLI? Our beginner series makes it easy to get started. Bring agentic AI right to your terminal and speed up your workflow. 💻✨ Get the tutorial here. 👇 https://t.co/bNLnpdgTxr
View originalHugging Face co-founder says Qwen 3.6 27B running on airplane mode is close to latest Opus in Claude Code
I've been using AI Desktop 98 heavily to run local llms like qwen on my iPhone. submitted by /u/ImaginaryRea1ity [link] [comments]
View originalTanStack now has TanStack AI. 👀 Here's what to expect from this new, fully open-source toolkit. ▶️ https://t.co/AjmutvBYve
TanStack now has TanStack AI. 👀 Here's what to expect from this new, fully open-source toolkit. ▶️ https://t.co/AjmutvBYve
View originalOf course GitHub will be at Microsoft Build. 🎉 Dive into real code, real systems, and real workflows with the teams building and scaling AI. Join us for exclusive events like: • Lots of GitHub sessi
Of course GitHub will be at Microsoft Build. 🎉 Dive into real code, real systems, and real workflows with the teams building and scaling AI. Join us for exclusive events like: • Lots of GitHub sessions • GitHub Social Club • OpenClaw meetup at GitHub HQ Not registered for https://t.co/SRz9hfizRr
View originalI built persistent memory for Claude — local stack, MCP integration, 39ms retrieval. Sharing the architecture.
If you use Claude heavily, you've felt this: every session starts from zero. You re-explain context, Claude helps, the window closes, and the next session has no idea what you decided yesterday. The standard workaround is a markdown wiki Claude reads — but as the wiki grows, every "what did we decide about X" question burns thousands of tokens grepping and re-reading whole pages. I spent the last few weeks building a persistent memory layer to fix both problems. It runs entirely on my own machine, integrates via MCP, and lives between Claude and my existing wiki. Sharing the architecture and what I learned in case anyone wants to build their own. What it does Semantic retrieval over my wiki. Instead of Claude grepping pages, my MCP server returns the most relevant chunks for any query in ~50ms. 82% mean token reduction on a 10-query eval set vs the grep+Read baseline. F1 retrieval quality is also better — cheaper and more accurate. Session crystallization. End-of-session, conversations get compressed into a structured "L4 node" with summary + decisions + open threads, indexed alongside wiki content. Tomorrow I can ask "what did we decide about X" and Claude pulls last session's decision verbatim. Lazy-spawned local models. Embedder + chat model run as subprocesses that the supervisor spawns on first use and reaps after 1 hour idle. Boot cost is zero — nothing loaded until needed. The architecture (four layers) Inspired by Andrej Karpathy's writing on LLM-native wikis, then formalized into a build spec: L0 — append-only event log (SQLite). Every input/output, content-hashed. L1 — structured facts with confidence + decay (deferred to next phase) L2/L3 — derived prose + cross-cutting summaries (the hand-edited wiki plays this role for now) L4 — crystallized session nodes. Summary, decisions, open threads. Indexed in the same vector store as wiki chunks so retrieval finds both naturally. The stack Qdrant in Docker for vector search llama.cpp running Qwen3-Embedding-4B (GPU) and Qwen3.5-2B-Q4_K_M (CPU) FastMCP server exposing 7 tools (retrieve, crystallize_session, list_sessions, get_l4_node, index_status, reindex, shutdown_models) Cowork plugin for Claude Desktop integration; also works with Claude Code via standard MCP config No cloud, no API keys, $0 marginal cost per query. Numbers Token reduction: 82.7% mean, 86.2% median vs grep+Read baseline Retrieval F1: 0.50 vs 0.20 baseline Embed cold-start: ~4s. Hot-path p95: 39ms (was 2241ms before fixing one specific bug — see below) L4 session retrieval eval: 0.920 mean score (gate 0.6) 738 chunks currently indexed across 104 markdown files The most useful thing I learned Hot-path retrieve was inexplicably stuck at 2241ms p95 even though the embedding model was fully GPU-resident on a 4070 Ti Super. Spent hours blaming GPU offload, prompt cache, KV pre-allocation. The actual cause: every httpx.post() was opening a fresh TCP connection, and Windows localhost handshakes take ~2 seconds. A 5-line change — switching to a persistent httpx.Client with keep-alive — dropped p95 to 39ms. 57× speedup. Lesson: latency that's suspiciously consistent (2240, 2237, 2241, 2227, 2239 ms) is a fixed cost, not a compute cost. If your local-MCP integration feels slow on Windows, check connection reuse before you blame the model. A few other things that surprised me Qwen3 thinking mode silently consumes the generation budget. Crystallization was returning empty content. Logs showed exactly 2000 tokens generated (the cap). Turned out Qwen3 emits ... blocks the chat handler strips before populating message.content. With JSON grammar enforced, the model spent all 2000 tokens "thinking" and never emitted JSON. Fix: pass chat_template_kwargs: {enable_thinking: false} via extra_body (requires --jinja on llama-server). The MCP plugin needed to register against the right config file. Cowork (Claude Desktop's agentic mode) doesn't read ~/.claude.json like Claude Code does. The first attempt at MCP registration silently went to the wrong file. The fix was packaging the LKS service as a proper Cowork plugin (.plugin bundle) — Cowork has a plugin system distinct from raw MCP server registration. If you're trying to wire a custom MCP server into Cowork, this is the path. What it doesn't do (yet) No automatic conversation capture — L0 ingestion is manual or via end-of-session crystallization No L1 fact extraction yet (next phase) — retrieval is over markdown chunks + L4 nodes today Wiki is still source-of-truth; no automatic conflict resolution Solo deployment only; no federation or multi-user Tested on Windows; Linux/Mac would need a small tweak to the supervisor (it uses subprocess.CREATE_NEW_PROCESS_GROUP for clean Windows termination) Full write-up Architecture, phased build narrative, all five lessons-learned bug stories, the setup walkthrough, and the roadmap: https://gist.github.com/tyoung515-svg/5fd5279f46d935f517cda89146c94685
View originalRepository Audit Available
Deep analysis of ggerganov/llama.cpp — architecture, costs, security, dependencies & more
llama.cpp uses a subscription + tiered pricing model. Visit their website for current pricing details.
Key features include: Plain C/C++ implementation without any dependencies, Apple silicon is a first-class citizen - optimized via ARM NEON, Accelerate and Metal frameworks, AVX, AVX2, AVX512 and AMX support for x86 architectures, RVV, ZVFH, ZFH, ZICBOP and ZIHINTPAUSE support for RISC-V architectures, 1.5-bit, 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, and 8-bit integer quantization for faster inference and reduced memory use, Custom CUDA kernels for running LLMs on NVIDIA GPUs (support for AMD GPUs via HIP and Moore Threads GPUs via MUSA), Vulkan and SYCL backend support, CPU+GPU hybrid inference to partially accelerate models larger than the total VRAM capacity.
llama.cpp is commonly used for: Real-time language translation for applications, Chatbot development for customer service, Content generation for blogs and articles, Sentiment analysis for social media monitoring, Code generation and assistance for developers, Personalized recommendations in e-commerce.
llama.cpp integrates with: TensorFlow for model training, PyTorch for deep learning frameworks, Hugging Face Transformers for model access, Docker for containerization, Kubernetes for orchestration, Flask for web application deployment, FastAPI for building APIs, Streamlit for interactive data applications, Unity for game development, OpenAI API for enhanced functionalities.
Sentdex
Creator at Python & AI YouTube
3 mentions
llama.cpp has a public GitHub repository with 101,000 stars.
Based on user reviews and social mentions, the most common pain points are: down, breaking.
Based on 99 social mentions analyzed, 11% of sentiment is positive, 89% neutral, and 0% negative.