Code Llama, which is built on top of Llama 2, is free for research and commercial use.
There are no direct reviews or mentions for "CodeLlama" present in the provided text, making it difficult to determine user sentiment specifically for this software. The social mentions largely highlight advancements and products related to Meta's AI technologies and collaborations, indicating an ecosystem of innovative AI applications, but provide no explicit feedback or critiques about CodeLlama. As such, potential users should seek specific reviews or more focused discussions about CodeLlama to get an accurate understanding of its strengths, complaints, pricing perceptions, and reputation.
Mentions (30d)
18
Reviews
0
Platforms
4
GitHub Stars
16,334
1,937 forks
There are no direct reviews or mentions for "CodeLlama" present in the provided text, making it difficult to determine user sentiment specifically for this software. The social mentions largely highlight advancements and products related to Meta's AI technologies and collaborations, indicating an ecosystem of innovative AI applications, but provide no explicit feedback or critiques about CodeLlama. As such, potential users should seek specific reviews or more focused discussions about CodeLlama to get an accurate understanding of its strengths, complaints, pricing perceptions, and reputation.
Features
Use Cases
Industry
information technology & services
Employees
77,000
10,559
GitHub followers
12
GitHub repos
16,334
GitHub stars
20
npm packages
40
HuggingFace models
Imagine controlling your devices with a subtle hand or finger gesture. Our cutting-edge research turns intent and muscle signals into seamless computer control. This breakthrough wrist technology is r
Imagine controlling your devices with a subtle hand or finger gesture. Our cutting-edge research turns intent and muscle signals into seamless computer control. This breakthrough wrist technology is redefining how we interact with computers—intuitive, precise, and ready for the https://t.co/2dXERZYqkY
View originalClaude Code has 240+ models via NVIDIA NIM gateway
TIL Claude Code has 240+ models via NVIDIA NIM gateway — Nemotron-3 120B for agentic coding is surprisingly good So I was messing around with /model in Claude Code today and noticed something most people probably don't know about — after the standard Claude models (Opus, Sonnet, Haiku), there's a whole NVIDIA NIM gateway section with +239 additional models you can switch to mid-session. Some of the models I spotted: nvidia/nemotron-3-super-120b-a12b (with and without thinking mode) 01-ai/yi-large abacusai/dracarys-llama-3.1-70b-instruct ...and hundreds more I've been running the Nemotron thinking variant for multi-file refactoring and it's genuinely solid. It reasons through changes before touching your code — exactly what you want for agentic tasks. Latency is higher than Claude obviously, but if you're burning through Opus credits on long sessions this is worth experimenting with. How to try it: Open any Claude Code session Run /model Scroll past the four standard Claude options — NIM models appear below Hit d to set one as your session default, or pass --model at launch Anyone else been routing Claude Code through NIM? Curious what models people have had luck with — especially for Python or Rust codegen. submitted by /u/shadowBladeO4 [link] [comments]
View originalOn "harness engineering": Are people actually building things or just giving impressive labels to "tweaking?"
I see a lot of posts and videos talking about harness engineering, or it could be context engineering, RAG, etc. The thing is, most of them talk about the concepts. And then I hear about all these people actually doing it. And my question is about this disconnect: what does it look like in practice? The way I understand it tools like Claude Code or OpenAI Codex are agents, and the logic that controls what gets fed to the model is the harness. So when people talk about "engineering the context," are they: writing actual programs CLI tools, pipelines, custom API wrappers that manage what gets sent to the model? or mostly just structuring their prompts well and calling it engineering? Same question for RAG--or any other oft-discussed topics: are people actually building retrieval pipelines from scratch, or are they standing up LlamaIndex / Mem0 and saying they're "using RAG" to infomaxx their AI agents? Not trying to be dismissive. I'm genuinely curious about what people are actually doing when they say they have applied these concepts to their agentic workflows. submitted by /u/josh_apptility [link] [comments]
View originalHugging Face co-founder says Qwen 3.6 27B running on airplane mode is close to latest Opus in Claude Code
I've been using AI Desktop 98 heavily to run local llms like qwen on my iPhone. submitted by /u/ImaginaryRea1ity [link] [comments]
View originalLLM rankings are not a ladder: experimental results from a transitive benchmark graph [D]
https://preview.redd.it/668yjlucu80h1.png?width=2800&format=png&auto=webp&s=ca541488abb5262b06cfc13a9586efb19f24d644 I built a small website called LLM Win: https://llm-win.com It turns LLM benchmark results into a directed graph: If model A beats model B on benchmark X, add an edge A -> B. Then it searches for the shortest transitive chain between two models. The meme version is: Can LLaMA 2 7B beat Claude Opus 4.7? In an absurd transitive benchmark sense, sometimes yes. But I added a Report tab because the structure itself seems useful for model evaluation. Some experimental findings from the current Artificial Analysis data snapshot: Weak-to-strong reachability is high. I checked 126,937 pairs where the source model has lower Intelligence Index than the target model. 119,514 of them are reachable through benchmark win chains, for a reachable rate of 94.2%. Most paths are short. Among reachable weak-to-strong pairs: 2-3 hop paths account for 91.4%. So this is not mostly long-chain cherry-picking. Direct reversal triples are abundant. After treating non-positive benchmark values as missing, there are still about 119k direct weak-over-strong triples of the form: (source model, target model, benchmark), where the source has lower Intelligence Index but higher score on that benchmark. Some benchmarks create more reversals than others. Current high-reversal / useful-signal candidates include: Humanity's Last Exam, IFBench, AIME 2025, TAU2, SciCode Different benchmarks have different interpretations. For example, IFBench has roughly: reversal rate: ~17.5%, coverage: ~80.0%, correlation with Intelligence Index: r≈0.82. This suggests it may provide an independent skill signal rather than simply duplicating the overall ranking. My current interpretation: LLM rankings are better represented as a benchmark-specific capability graph than as a single ladder. Some reversals probably reflect real specialization; some reflect benchmark coverage limits, volatility, or measurement noise. The next question is whether reversal structure can help build better evaluation metrics: identify specialist models; identify volatile benchmarks; build robust generalist scores; select complementary benchmark sets; decompose models into capability fingerprints. Curious what people think: Is benchmark reversal structure a useful evaluation signal, or mostly an artifact of noisy benchmarks? submitted by /u/Spico197 [link] [comments]
View originalI got tired of the API bills for 100k+ context windows, so I built a persistent O(1) semantic memory state engine to compress history
Hey everyone, The entire industry right now is cheering for massive 1M+ context windows, but I think it's fundamentally the wrong approach. "Just add more RAM" is a trap. Stuffing 100k+ tokens of raw conversation history into a prompt doesn't just burn your API budget; it actually degrades the model's reasoning through the "lost in the middle" effect. I got tired of my AI agents drowning in their own chat histories, so I built an application-layer semantic memory engine called Semvec. The core shift is moving from an O(n) linear history to an O(1) constant-cost semantic state. But compressing chat history is just the baseline. When you treat memory as a fixed-size state vector, it unlocks entirely new architectures for agents that standard RAG or context-stuffing simply can't do: Persistent Coding Agents (MCP Integration) We built an MCP server for Claude Code and Cursor. Instead of dumping 5 whole files into the context window for a refactor, Semvec tracks the architectural invariants and past error patterns across different sessions. It gives your coding agent a persistent "Second Brain"—if it messed up a database schema in session 2, it remembers the "anti-resonance" rule in session 35 so it doesn't make the same mistake. Multi-Agent Swarms (Cortex) If you run multiple agents (like an Analyst and a Critic), they shouldn't have to read each other's 10,000-token transcripts to collaborate. With the Cortex module, agents exchange compressed StateVectorPackets and use a ConsensusEngine to merge their perspectives mathematically, sharing a global state with zero overhead. Enterprise Auditability & GDPR (Compliance Pack) If you run AI memory in production, you need to prove exactly what state the LLM acted on, and you need to be able to legally delete it. The compliance pack handles this via an append-only event store for deterministic replay, HMAC request signing, and GDPR Art. 17 "Right to be Forgotten" workflows with signed deletion certificates. The Benchmark Data: True Constant Cost: We ran a 50,000-turn stress test. While standard baseline history exploded past 75,000+ tokens, Semvec's footprint stayed flat at around ~550-625 tokens per turn. Quality goes UP: Because we strip out the noise and feed the LLM a highly concentrated "essence" of the context, blind A/B LLM-judge scores on LongBench-v2 actually increased for both small models (Llama 3.1-8B) and massive ones (gpt-oss-120B). A quick note on privacy & tracking: When I was initially designing the commercial licensing side, I experimented with an anti-abuse telemetry script to prevent automated clone-training. This was a terrible approach that compromised the local-first nature of the tool. I have completely ripped it out in v0.5.1, all versions containing it are yanked. Semvec for community users is now 100% air-gapped, local, with zero background tracking. The core engine is proprietary/patent-pending to bootstrap the project, but you can pip install the Python SDK and the MCP Server right now for free via the built-in community license. I'd love to hear your thoughts on the O(1) memory architecture vs. Prompt Caching, and if you think bounded semantic states are the future of long-running agents. Docs & Architecture: https://semvec-docs.pages.dev/ PyPI: https://pypi.org/project/semvec/ submitted by /u/scheitelpunk1337 [link] [comments]
View originalI built a benchmark for AI “memory” in coding agents. looking for others to beat it.
Most AI memory benchmarks test semantic recall. But coding agents don't really fail like that. They don't just "forget", they break their own earlier decisions while they're still in the code. So I built a benchmark for that. It checks if an agent can actually stay consistent with project rules WHILE it's working, not just after the fact. It looks at things like: whether edits actually respect earlier architectural decisions if behavior stays consistent across multiple sessions (even when you throw noise at it) whether retrieval kicks in at the right moment — not just "yeah it's in memory somewhere" Repo (full harness + dataset + scoring): https://github.com/Alienfader/continuity-benchmarks Early numbers vs baseline + the usual RAG-style memory setups: ~3× better action alignment way stronger multi-session consistency retrieval timing matters way more than retrieval just being there I'm not saying this is the final word on agent memory. But it's exposing a failure mode most benchmarks aren't even looking at. So heres the challenge If you're building an agent memory system, RAG for code, long-context coding agents, persistent state / memory layers, run it on this benchmark. Drop your results, your setup, your comparisons. I really wanna see how tools like LangChain, LlamaIndex, and custom RAG stacks hold up in mutation-heavy workflows. We need memory systems we can actually compare, not just ones that sound good on paper. https://preview.redd.it/dkm2ulxsyzzg1.png?width=2624&format=png&auto=webp&s=67f0299395708818aa3d7346ddae2ad0c5c4a6ba submitted by /u/Alienfader [link] [comments]
View originalI built persistent memory for Claude — local stack, MCP integration, 39ms retrieval. Sharing the architecture.
If you use Claude heavily, you've felt this: every session starts from zero. You re-explain context, Claude helps, the window closes, and the next session has no idea what you decided yesterday. The standard workaround is a markdown wiki Claude reads — but as the wiki grows, every "what did we decide about X" question burns thousands of tokens grepping and re-reading whole pages. I spent the last few weeks building a persistent memory layer to fix both problems. It runs entirely on my own machine, integrates via MCP, and lives between Claude and my existing wiki. Sharing the architecture and what I learned in case anyone wants to build their own. What it does Semantic retrieval over my wiki. Instead of Claude grepping pages, my MCP server returns the most relevant chunks for any query in ~50ms. 82% mean token reduction on a 10-query eval set vs the grep+Read baseline. F1 retrieval quality is also better — cheaper and more accurate. Session crystallization. End-of-session, conversations get compressed into a structured "L4 node" with summary + decisions + open threads, indexed alongside wiki content. Tomorrow I can ask "what did we decide about X" and Claude pulls last session's decision verbatim. Lazy-spawned local models. Embedder + chat model run as subprocesses that the supervisor spawns on first use and reaps after 1 hour idle. Boot cost is zero — nothing loaded until needed. The architecture (four layers) Inspired by Andrej Karpathy's writing on LLM-native wikis, then formalized into a build spec: L0 — append-only event log (SQLite). Every input/output, content-hashed. L1 — structured facts with confidence + decay (deferred to next phase) L2/L3 — derived prose + cross-cutting summaries (the hand-edited wiki plays this role for now) L4 — crystallized session nodes. Summary, decisions, open threads. Indexed in the same vector store as wiki chunks so retrieval finds both naturally. The stack Qdrant in Docker for vector search llama.cpp running Qwen3-Embedding-4B (GPU) and Qwen3.5-2B-Q4_K_M (CPU) FastMCP server exposing 7 tools (retrieve, crystallize_session, list_sessions, get_l4_node, index_status, reindex, shutdown_models) Cowork plugin for Claude Desktop integration; also works with Claude Code via standard MCP config No cloud, no API keys, $0 marginal cost per query. Numbers Token reduction: 82.7% mean, 86.2% median vs grep+Read baseline Retrieval F1: 0.50 vs 0.20 baseline Embed cold-start: ~4s. Hot-path p95: 39ms (was 2241ms before fixing one specific bug — see below) L4 session retrieval eval: 0.920 mean score (gate 0.6) 738 chunks currently indexed across 104 markdown files The most useful thing I learned Hot-path retrieve was inexplicably stuck at 2241ms p95 even though the embedding model was fully GPU-resident on a 4070 Ti Super. Spent hours blaming GPU offload, prompt cache, KV pre-allocation. The actual cause: every httpx.post() was opening a fresh TCP connection, and Windows localhost handshakes take ~2 seconds. A 5-line change — switching to a persistent httpx.Client with keep-alive — dropped p95 to 39ms. 57× speedup. Lesson: latency that's suspiciously consistent (2240, 2237, 2241, 2227, 2239 ms) is a fixed cost, not a compute cost. If your local-MCP integration feels slow on Windows, check connection reuse before you blame the model. A few other things that surprised me Qwen3 thinking mode silently consumes the generation budget. Crystallization was returning empty content. Logs showed exactly 2000 tokens generated (the cap). Turned out Qwen3 emits ... blocks the chat handler strips before populating message.content. With JSON grammar enforced, the model spent all 2000 tokens "thinking" and never emitted JSON. Fix: pass chat_template_kwargs: {enable_thinking: false} via extra_body (requires --jinja on llama-server). The MCP plugin needed to register against the right config file. Cowork (Claude Desktop's agentic mode) doesn't read ~/.claude.json like Claude Code does. The first attempt at MCP registration silently went to the wrong file. The fix was packaging the LKS service as a proper Cowork plugin (.plugin bundle) — Cowork has a plugin system distinct from raw MCP server registration. If you're trying to wire a custom MCP server into Cowork, this is the path. What it doesn't do (yet) No automatic conversation capture — L0 ingestion is manual or via end-of-session crystallization No L1 fact extraction yet (next phase) — retrieval is over markdown chunks + L4 nodes today Wiki is still source-of-truth; no automatic conflict resolution Solo deployment only; no federation or multi-user Tested on Windows; Linux/Mac would need a small tweak to the supervisor (it uses subprocess.CREATE_NEW_PROCESS_GROUP for clean Windows termination) Full write-up Architecture, phased build narrative, all five lessons-learned bug stories, the setup walkthrough, and the roadmap: https://gist.github.com/tyoung515-svg/5fd5279f46d935f517cda89146c94685
View originalThe Anthropic-xAI compute deal isn't really about Claude limits
Everyone's reading the Anthropic-xAI announcement as "Claude Code limits doubled, nice." That's the surface. The underlying news is the 300MW / 220k GPU commitment from a competitor's stack, and that signals a few things worth thinking through. Three reads that aren't getting enough air time: Anthropic signed a compute deal with a competitor's CEO. That's not normal. Either the GPU situation is tighter than the public framing suggests, or the relationship between "frontier labs compete on models, share on compute" is becoming structural. Probably both. Inference providers without their own silicon story just got a clearer ceiling. If frontier labs are stacking 220k+ GPU deals to keep up, the price floor on flagship-class inference doesn't fall as fast as the open-weight floor does. The gap between "open weights on commodity GPUs" and "frontier on dedicated capacity" stays wide. The cottage industry of routing layers and per-call sidecars built around frontier-lab capacity constraints just had its addressable problem reshaped. When labs solve their own capacity by buying from each other, half of the "I'll route around the cap" pitch loses its sharpest edge. The remaining case is price arbitrage, not availability. What I'm watching for the next 30 days: - Whether other labs announce similar compute deals (Google with someone, OpenAI with anyone besides Microsoft) - Whether AMD MI3xx volume actually shows up in inference benchmarks the way the slides claim, or stays a 2027 story - Whether the price floor on Llama / DeepSeek / Kimi inference keeps falling, or stabilizes now that one of the loudest price-pressure players got absorbed into a different conversation entirely The thing I'm least sure about: does this make multi-provider routing more or less valuable. The "I'll route to whoever has capacity" pitch was strongest when caps were biting. If frontier capacity loosens via cross-lab deals, the case for routing is weaker on availability and stronger on price. Different optimization, same tooling. (For what it's worth, the 5h-window doubling is real on my end today, but I'm more curious about whether other labs respond in kind than whether my own caps held.) Curious how others are reading the compute side of this. Anyone seeing similar moves stack up across labs in your data? submitted by /u/Fresh-Resolution182 [link] [comments]
View originalI built vivkemind – an open-source, local‑first terminal AI coding agent with full AWS Bedrock support
wanted a terminal AI coding agent that doesn't lock me into one model provider. So I forked Qwen Code and added full support for every model available in AWS Bedrock. The result is vivkemind. What vivkemind does: - Runs entirely on your machine, in your terminal. - Uses your own AWS credentials to connect to Bedrock — no third‑party proxy. - Supports all Bedrock models you have access to: Claude, Llama, DeepSeek, Qwen, Mistral, MiniMax, and 90+ more. - Works as an agent: reads your codebase, edits files, runs commands, handles multi‑step tasks. - Tracks token usage and estimates cost for every model call, right in the session stats. - Is fully open source — fork it, add your own tools, wire up new providers, whatever you need. Installation: git clone https://github.com/Lnxtanx/vivekmind-cli.git cd vivekmind-cli npm install && npm run build && npm link export AWS_ACCESS_KEY_ID=... AWS_SECRET_ACCESS_KEY=... AWS_REGION=... vivekmind Then configure your settings.json with the Bedrock models you want and start coding. Why I built it: Most CLI agents lock you into a single company’s API or require you to pay for a subscription on top of your own AI usage. With Bedrock, you already pay AWS for the models you use. vivkemind just gives you a proper terminal agent on top, with no extra costs and no walled gardens. If you're tired of being locked in and want full control over your AI coding workflow, give it a try. Feedback and contributions are welcome. GitHub: https://github.com/Lnxtanx/vivekmind-cli.git submitted by /u/Vivek-Kumar-yadav [link] [comments]
View originalhalf-deployed AI projects haunt my github
Got 47 repos that start with 'just playing with Claude' or 'testing Llama 4 on'. Every single one dead after three commits. Like you get this spark, right? Midnight scrolling leads to some random implementation of retrieval-augmented generation for your personal notes. Brain goes full steam. You're already planning the deployment pipeline while pip installing transformers. Then day two hits. The model's hallucinating your grocery lists into poetry (weirdly beautiful but useless). Your GPU's crying. And suddenly you remember you have actual work that pays actual money. But here's the thing that gets me. These aren't just abandoned experiments, they're digital ghosts of pure optimism. Each one represents that exact moment when everything seemed possible, when you thought you'd crack the code this time, when the future felt close enough to touch. Now I scroll past them looking for that one functional script I actually need. Graveyard of good intentions, all named some variation of 'ai-helper-v2-final-actually-final'. Anyone else got a git log that reads like a museum of broken dreams? submitted by /u/NefariousnessLow9273 [link] [comments]
View originalclaudely: launch Claude Code against Local LLM provider like LM Studio / Ollama / llama.cpp without trashing your real claude config
Plenty of CLI coding agents will talk to a local LLM, but the catch is the ecosystem. Skills, slash commands, MCP servers, plugins, hooks: all the interesting tooling has been built specifically for Claude Code, and parity on every other agent is patchy at best. Trying to reuse a Claude-shaped workflow on a different agent quickly turns into "rewrite all the plugins" or "do without." claudely skips that fight. You keep Claude Code as the client (and its whole plugin / skill / MCP ecosystem with it), and just point it at a model running on your own hardware. Pick a provider, claudely spawns `claude` with the right base URL, auth, and cache fix wired up for that one session. Your shell and the regular `claude` command stay untouched, so you can flip between local and the real Anthropic API without thinking about it. It also quietly fixes a prompt-cache bug that otherwise tanks local-model speed by ~90%, and handles the per-provider env-var differences for you. Works with LM Studio, Ollama, llama.cpp, or any Anthropic-compatible endpoint (point it at a litellm or claude-code-router proxy for OpenAI-protocol backends like vLLM). npm i -g claudely claudely # LM Studio, picker over your downloaded models claudely -p ollama -m gpt-oss:20b # Ollama, skip the picker claudely -p llamacpp # whichever GGUF llama-server is serving MIT, Node 20+, unaffiliated community helper. Built with Claude Code's help, fittingly. Feedback welcome. Repo: https://github.com/mforce/claudely NPM: https://www.npmjs.com/package/claudely submitted by /u/mforce22 [link] [comments]
View originalLLM proxy that lets Claude Code talk to any model
I built rosetta-llm — an open-source multi-format LLM proxy that acts as a drop-in Claude Code gateway. Works as a Claude Code LLM gateway — set `ANTHROPIC_BASE_URL` and all configured models appear in `/model` picker Translates between formats — Anthropic Messages ↔ OpenAI Chat ↔ OpenAI Responses at the wire level Thinking blocks round-trip correctly — this is the hard part and why I built this Provider routing — `openai/gpt-5.4`, `anthropic/claude-opus-4-7`, `groq/llama-4` all through one endpoint Streaming on everything — passthrough fast path + cross-format translation with proper SSE handling The thinking-block problem Most proxies lose reasoning continuity. LiteLLM has had open PRs for thinking block handling for a long time — some dating back months — and they're still not merged. Without proper round-tripping, prompt caching breaks across turns and Claude Code loses context. Rosetta encodes encrypted reasoning into Anthropic's `signature` field and decodes it back — so multi-turn agentic workflows keep their prompt-cache hits. Zero-setup Hugging Face Space Literally a two-line Dockerfile: FROM ghcr.io/lokesh-chimakurthi/rosetta-llm:latest COPY --chown=app:app config.json /app/config.json Add config.json file and above Dockerfile into a HF Space (Docker SDK) and it's running. No clone, no build, no venv. The GHCR image has everything baked in. Make your HF space private and add api keys in hf space secrets. Check readme in github Also works with # No install — ephemeral uvx rosetta-llm # Persistent install uv tool install rosetta-llm rosetta-llm --config ~/.rosetta-llm/config.json # Docker docker run -p 7860:7860 \ -v ~/.rosetta-llm/config.json:/app/config.json \ ghcr.io/lokesh-chimakurthi/rosetta-llm:main Why another proxy? I looked at existing solutions: LiteLLM — thinking block round-trip PRs going nowhere, too many abstractions OpenRouter — great but closed-source, no self-hosting Direct passthrough proxies — don't translate between formats Nothing gave me lossless cross-format translation with proper reasoning fidelity. Links GitHub: https://github.com/Lokesh-Chimakurthi/rosetta-llm PyPI: https://pypi.org/project/rosetta-llm/ Contributions welcome I built this for myself and it works for my use cases. But there's a lot more it could do — better multimodal handling, embeddings support, rate limiting, an admin UI. If any of this sounds interesting, PRs are absolutely welcome. Happy to answer questions in the comments. submitted by /u/DataNebula [link] [comments]
View originalI built a router that automatically sends your AI tasks to the most appropriate model to handle them at low cost - 9,200 tasks in, $21 saved at $0.14 actual cost
The observation that started this: most of what people use AI for every day - summarising, drafting, classifying, extracting etc doesn't actually require a frontier model. Any competent 8-70B model handles those just as well. But most people run everything through Claude or ChatGPT out of habit. I built Followloop (followloop.app) to solve this automatically. It classifies each task by complexity and routes it: - Simple tasks → Cerebras Llama (2000 TPS, 1M tokens/day free), Groq, Gemini Flash - Moderate tasks → Groq 70B, SambaNova - Complex tasks → Claude Haiku as fallback The dashboard shows your actual cost alongside what you'd have paid running everything on Claude Sonnet. I've been running it on my own AI workflow for two weeks: 9,200 tasks routed, $21.24 saved, $0.1360 actual cost. About 157× cheaper per token than Sonnet on average. Works with any AI setup via MCP (Model Context Protocol) - Claude Desktop, Cursor, Claude Code, or anything MCP-compatible. Also has a library of 1,300+ safety-screened MCP servers as a bonus feature. $5/month at followloop.app submitted by /u/QueefLatinahOG [link] [comments]
View originalA Hackable ML Compiler Stack in 5,000 Lines of Python [P]
Hey r/MachineLearning, The modern ML (LLM) compiler stack is brutal. TVM is 500K+ lines of C++. PyTorch piles Dynamo, Inductor, and Triton on top of each other. Then there's XLA, MLIR, Halide, Mojo. There is no tutorial that covers the high-level design of an ML compiler without dropping you straight into the guts of one of these frameworks. I built a reference compiler from scratch in ~5K lines of pure Python that emits raw CUDA. It takes a small model (TinyLlama, Qwen2.5-7B) and lowers it to a sequence of CUDA kernels through six IRs. The goal isn't to beat Triton; it is to build a hackable, easy-to-follow compiler. Full article: A Principled ML Compiler Stack in 5,000 Lines of Python Repo: deplodock The pipeline consists of six IRs, each closer to the hardware than the last. Walking the following PyTorch code through every stage (real reference compiler output with names shortened for brevity and comments added): torch.relu(torch.matmul(x + bias, w)) # x: (16, 64), bias: (64,), w: (64, 16) Torch IR. Captured FX graph, 1:1 mirror of PyTorch ops: bias_bc = bias[j] -> (16, 64) float32 add = add(x, bias_bc) -> (16, 64) float32 matmul = matmul(add, w, has_bias=False) -> (16, 16) float32 relu = relu(matmul) -> (16, 16) float32 Tensor IR. Every op is decomposed into Elementwise / Reduction / IndexMap. Minimal unified op surface, so future frontends (ONNX, JAX) plug in without touching downstream passes: bias_bc = bias[j] -> (16, 64) float32 w_bc = w[j, k] -> (16, 64, 16) float32 add = add(x, bias_bc) -> (16, 64) float32 add_bc = add[i, j] -> (16, 64, 16) float32 prod = multiply(add_bc, w_bc) -> (16, 64, 16) float32 red = sum(prod, axis=-2) -> (16, 1, 16) float32 matmul = red[i, na, j] -> (16, 16) float32 relu = relu(matmul) -> (16, 16) float32 The (16, 64, 16) intermediate looks ruinous, but it's never materialized; the next stage fuses it out. Loop IR. Each kernel has a loop nest fused with adjacent kernels. Prologue, broadcasted multiply, reduction, output layout, and epilogue all collapse into a single loop nest with no intermediate buffers. === merged_relu -> relu === for a0 in 0..16: # free (M) for a1 in 0..16: # free (N) for a2 in 0..64: # reduce (K) in0 = load bias[a2] in1 = load x[a0, a2] in2 = load w[a2, a1] v0 = add(in1, in0) # prologue (inside reduce) v1 = multiply(v0, in2) acc0 <- add(acc0, v1) v2 = relu(acc0) # epilogue (outside reduce) merged_relu[a0, a1] = v2 Tile IR. The first GPU-aware IR. Loop axes get scheduled onto threads/blocks, Stage hoists shared inputs into shared memory, and a 2×2 register tile lets each thread accumulate four outputs at once. The K-axis is tiled into two outer iterations of 32-wide reduce. Three-stage annotations below carry the heaviest optimizations: buffers=2@a2 — double-buffer the smem allocation along the a2 K-tile loop, so loads for iteration a2+1 overlap compute for a2. async — emit cp.async.ca.shared.global so the warp doesn't block on global→smem transfers; pairs with commit_group/wait_group fences in Kernel IR. pad=(0, 1, 0) — add 1 element of padding to the middle smem dim so warp-wide loads don't all hit the same bank.kernel k_relu_reduce Tile(axes=(a0:8=THREAD, a1:8=THREAD)): for a2 in 0..2: # K-tile # meta: double-buffered, sync (small, no async needed) bias_smem = Stage(bias, origin=((a2 * 32)), slab=(a3:32@0)) buffers=2@a2 kernel k_relu_reduce Tile(axes=(a0:8=THREAD, a1:8=THREAD)): for a2 in 0..2: # K-tile bias_smem = Stage(bias, origin=((a2 * 32)), slab=(a3:32@0)) buffers=2@a2 x_smem = Stage(x, origin=(0, (a2 * 32)), slab=(a0:8@0, a3:32@1, cell:2@0)) pad=(0, 1, 0) buffers=2@a2 async w_smem = Stage(w, origin=((a2 * 32), 0), slab=(a3:32@0, a1:8@1, cell:2@1)) buffers=2@a2 async # reduce for a3 in 0..32: in0 = load bias_smem[a2, a3] in1 = load x_smem[a2, a0, a3, 0]; in2 = load x_smem[a2, a0, a3, 1] in3 = load w_smem[a2, a3, a1, 0]; in4 = load w_smem[a2, a3, a1, 1] # prologue, reused 2× across N v0 = add(in1, in0); v1 = add(in2, in0) # 2×2 register tile acc0 <- add(acc0, multiply(v0, in3)) acc1 <- add(acc1, multiply(v0, in4)) acc2 <- add(acc2, multiply(v1, in3)) acc3 <- add(acc3, multiply(v1, in4)) # epilogue relu[a0*2, a1*2 ] = relu(acc0) relu[a0*2, a1*2 + 1] = relu(acc1) relu[a0*2 + 1, a1*2 ] = relu(acc2) relu[a0*2 + 1, a1*2 + 1] = relu(acc3) Kernel IR. Schedule materialized into hardware primitives. THREAD/BLOCK become threadIdx/blockIdx, async Stage becomes Smem + cp.async fill with commit/wait fences, sync Stage becomes a strided fill loop. Framework-agnostic: same IR could lower to Metal or HIP: kernel k_relu_reduce Tile(axes=(a0:8=THREAD, a1:8=THREAD)): Init(acc0..acc3, op=add) for a2 in 0..2: # K-tile Smem bias_smem[2, 32] (float) StridedLoop(flat = a0*8 + a1; < 32; += 64): bias_smem[a2, flat] = load bias[a2*32 + flat] Sync # pad row to 33 to kill bank conflicts Smem x_smem[2, 8, 33, 2] (float) StridedLoop(flat = a0*8 + a1; < 512; += 64): cp.async x_smem[a2, flat/64, (flat/2)%32, flat%2] <- x[flat/64*2 + flat%2, a2*3
View originalBuilt a prompt injection proxy that beats OpenAI Moderation and LlamaGuard — see it block attacks live
Built Arc Gate — sits in front of any OpenAI-compatible endpoint and blocks prompt injection before it reaches your model. Try it here — no signup, no code, no setup: https://web-production-6e47f.up.railway.app/try Type any prompt and see if it gets blocked or passes. The examples on the page show the difference. The main detection layer is a behavioral SVM on sentence-transformer embeddings — catches semantic intent, not just pattern matches. Phrase matching is just the fast first pass. Four layers total. Benchmarked on 40 OOD prompts (indirect, roleplay, hypothetical framings — the hard stuff): • Arc Gate: Recall 0.90, F1 0.947 • OpenAI Moderation: Recall 0.75, F1 0.86 • LlamaGuard 3 8B: Recall 0.55, F1 0.71 Zero false positives on benign prompts including security discussions and safe roleplay. Block latency 329ms. One URL change to integrate into your own project: base_url=“https://web-production-6e47f.up.railway.app/v1” GitHub: github.com/9hannahnine-jpg/arc-gate — star if useful. submitted by /u/Turbulent-Tap6723 [link] [comments]
View originalRepository Audit Available
Deep analysis of meta-llama/codellama — architecture, costs, security, dependencies & more
CodeLlama uses a tiered pricing model. Visit their website for current pricing details.
Key features include: We are releasing Code Llama 70B, the largest and best-performing model in the Code Llama family, CodeLlama - 70B, the foundational code model;, CodeLlama - 70B - Python, 70B specialized for Python;, and Code Llama - 70B - Instruct 70B, which is fine-tuned for understanding natural language instructions., Code Llama is a state-of-the-art LLM capable of generating code, and natural language about code, from both code and natural language prompts., Code Llama is free for research and commercial use., Code Llama, the foundational code model;, Codel Llama - Python specialized for Python;.
CodeLlama is commonly used for: Automating code generation for web applications, Assisting developers with code completion, Generating documentation from code comments, Translating code from one programming language to another, Creating unit tests from existing code, Debugging code by suggesting fixes.
CodeLlama integrates with: GitHub Copilot, Visual Studio Code, Jupyter Notebooks, Slack for team collaboration, Trello for project management, Asana for task tracking, Zapier for workflow automation, AWS Lambda for serverless applications, Google Cloud Functions, Docker for containerization.
CodeLlama has a public GitHub repository with 16,334 stars.
Based on user reviews and social mentions, the most common pain points are: down, API bill, token usage, token cost.
Based on 72 social mentions analyzed, 0% of sentiment is positive, 100% neutral, and 0% negative.