High-throughput and memory-efficient inference and serving engine for Large Language Models. Deploy AI faster with state-of-the-art performance.
Users of vLLM appreciate its integration support, such as the recent compatibility with Intel’s Arc Pro B70, indicating robust flexibility in use across hardware. However, detailed user reviews providing personal experiences or explicit details on the software's strengths or complaints were not prevalent. Pricing sentiments or discussions appear to be absent from social mentions, leaving the cost aspect unclear. Overall, the mentions suggest that vLLM is recognized within niche communities for specific functionalities, but its broader reputation and reception are not extensively covered in the available discussions.
Mentions (30d)
14
4 this week
Reviews
0
Platforms
2
GitHub Stars
74,806
14,991 forks
Users of vLLM appreciate its integration support, such as the recent compatibility with Intel’s Arc Pro B70, indicating robust flexibility in use across hardware. However, detailed user reviews providing personal experiences or explicit details on the software's strengths or complaints were not prevalent. Pricing sentiments or discussions appear to be absent from social mentions, leaving the cost aspect unclear. Overall, the mentions suggest that vLLM is recognized within niche communities for specific functionalities, but its broader reputation and reception are not extensively covered in the available discussions.
Features
Use Cases
Industry
information technology & services
Employees
32
2,937
GitHub followers
36
GitHub repos
74,806
GitHub stars
20
npm packages
4
HuggingFace models
l9gpu - open-source GPU observability with workload-level attribution [P]
GPU monitoring tools like DCGM give you hardware-level metrics but no workload context. When a node is saturated, you can't tell which experiment, team, or job is responsible without digging through logs. We built l9gpu to close that gap. It's a node-level agent that exports GPU metrics via OTLP with workload attribution embedded: - Kubernetes: correlates GPU metrics with pod, namespace, and deployment - Slurm: correlates with job ID, user, and partition - LLM inference: native metrics for vLLM, SGLang, and TGI - Hardware: NVIDIA, AMD MI300X, Intel Gaudi - 17 pre-built Prometheus alert rules + Grafana dashboards Derived from Meta's gcm project, extended with K8s attribution, multi-vendor GPU support, and OTLP export. MIT licensed. https://github.com/last9/gpu-telemetry Happy to discuss design decisions around the attribution mapping. What is the ML infra community using for GPU cost visibility in shared research clusters? submitted by /u/bakibab [link] [comments]
View originalWe built a tool that installs frameworks like ComfyUI, Ollama, OpenWebUI etc on any cloud GPU in one command and saves your whole setup between sessions [R]
We kept running into the same problem every time we rented a GPU to run Ollama + OpenWebUI or ComfyUI, we'd spend the first 45 minutes reinstalling everything. Custom nodes, models, configs, all of it. Docker images went stale fast, different providers had different base images, and nothing was truly portable. We got sick of it and built swm. Here's what it does for ComfyUI users specifically: swm gpus -g a100 --max-price 2.00 --sort price shows you the cheapest available GPU across RunPod, Vast ai, Lambda, and 7 other providers in one view swm pod create — spins up an instance on whatever provider you pick swm setup install comfyui — installs ComfyUI on the pod From there the main thing is the workspace sync. Your entire setup custom nodes, models, outputs, configs lives in S3-compatible object storage (I use B2). When you're done you run swm pod down and it pushes everything, kills the instance, and next time you spin up on any provider you just pull and everything is exactly where you left it. No more reinstalling 15 custom nodes and redownloading checkpoints every session. We also built a lifecycle guard because we kept falling asleep mid-session and waking up to dumb bills. It watches GPU utilization and if nothing's happening for 30 minutes (configurable), it saves your workspace and terminates automatically. Has saved us more money than we want to admit lol. A few other things: Background auto-sync daemon pushes changes every 60 seconds so you don't have to remember to save Tar mode for huge workspaces with tons of small files packs everything into one S3 object instead of 600k individual uploads Also supports vLLM, Ollama, Open WebUI, SwarmUI, and Axolotl if you do more than SD Works with Cursor, Claude Code, Codex, Windsurf if you want your AI agent to manage GPU instances for you Free, open source, Apache 2.0. pipx install swm-gpu Site: https://swmgpu.com GitHub: https://github.com/swm-gpu/swm Would love feedback from anyone who rents GPUs. What's the most annoying part of your current workflow? We are also looking for contributors to the open source repo and suggestions on new frameworks/extensions to be included. Please share your thoughts submitted by /u/Tkpf18 [link] [comments]
View originalHow I used Claude Code (and Codex) for adversarial review to build my security-first agent gateway
Long-time lurker first time posting. Hey everyone! So earlier this year, I got pulled into the OpenClaw hype. WHAT?! A local agent that drives your tools, reads your mail, writes files for you? The demos seemed genuinely incredible, people were posting non-stop about it, and I wanted in. I had been working on this problem since last year and was genuinely excited to see that someone had actually solved it. Then around February, Summer Yue, Meta's director of alignment for Superintelligence Labs, posted that her agent had deleted over 200 emails from her inbox. YIKES. She'd told it: "Check this inbox too and suggest what you would archive or delete, don't action until I tell you to." When she pointed it at her real inbox, the volume of data triggered context window compaction, and during that compaction the agent "lost" her original safety instruction. She had to physically run to her computer and kill the process to stop it. That should literally NEVER be the case with any software ever. This is a person whose actual job is AI alignment, at Meta's superintelligence lab, who could not stop an agent from deleting her email. The agent's own memory management quietly summarized away the "don't act without permission" instruction, treated the task as authorized, and started speed-running deletions. She had to kill the host process. That's when I sort of went down the rabbit hole, not because Yue did anything wrong, but because the failure mode was actually architectural and I knew that in my gut. Guess what I found? Yep. Tons more instances of this sort of thing happening. Over and over. Why? Because the safety constraint was just a prompt. It's obvious, isn't it? It's LLM 101. Prompts can be summarized away. Prompts can be misread. Prompts are fucking NOT a security boundary. And yet every agent framework I have ever seen seems to be treating them as one. I went and read the OpenClaw source code, which I should have done to begin with. What I found was a pattern I think a lot of agent frameworks have fallen into: - Tool names sit in the model context, so the model can guess or forge them - "Dangerous mode" is one config flag away from default - Memory management has no concept of instruction priority - The audit story is mostly "the model thought it should" I went looking for a security-first alternative I could trust, anything that was really being talked about or at a bare minimum attempted to address the security concerns I had. I couldn't find one. So I made it myself. CrabMeat is what came out of that, what I WANTED to exist. v0.1.0 dropped yesterday. Apache 2.0. WebSocket gateway for agentic LLM workloads. One design thesis: The LLM never holds the security boundary. What that means in code: Capability ID indirection. The model doesn't see real tool names. It sees per-session HMAC-derived opaque IDs (cap_a4f9e2b71c83). It can't guess or forge a tool name because it doesn't know any tool names. Effect classes. Every tool declares a class (read, write, exec, network). Every agent declares which classes it can use. The check is a pure function with no runtime state, easy to test exhaustively, hard to bypass. IRONCLAD_CONTEXT. Critical safety instructions are pinned to the top of the context window and explicitly marked as non-compactable. The Yue failure mode, compaction silently stripping the safety constraint, cannot happen by construction. The compactor literally cannot touch them. Tamper-evident audit chain. Every tool call, every privileged operation, every scheduler run enters the same SHA-256 hash-chained log. If something happens, you can prove what happened. If the chain is tampered with, you can prove that too. Streaming output leak filter. Secrets are caught mid-stream across token boundaries, capability IDs, API keys, JWTs, PEM blocks redacted before they reach the client. No YOLO mode. There is no global "trust the LLM with everything" switch. There never will be. Expanded reach comes through named scoped roots that are explicit, audit-logged, and bounded. The README has 15 'always-on' protections in a table. None of them can be turned off by config, because these things being toggleable is how the ecosystem ended up where it is. I decided to make sure that this wasn't just a 'trend hopping' project and aligned with my own personal values as well. I built this to be secure and local-first by default. Configured for Ollama / LM Studio / vLLM out of the box. Anthropic and OpenAI work too but require explicit configuration. There is no "happy path" that silently ships your prompts to a cloud endpoint. I decided that FIRST it needed to only run as an email agent with a CLI. Bidirectional IMAP + SMTP with allowlisted senders, threading preserved, attachments handled. This is the use case that bit Yue and a lot of other people, and I wanted to prove it could be done with real boundaries. I added in 30+ built-in tools of my own. File ops, shell (denylisted, output-capped, CWD-lo
View originalLLM-Rosetta — format conversion library across LLM API standards, doubles as a proxy
This started because we had a proprietary internal LLM API that spoke none of the standard formats. Built an internal conversion layer to bridge it, maintained that for over a year. As colleagues started adopting more and more coding tools — Claude Code, opencode, Codex, VS Code plugins, Goose, and whatever came out that week — each with its own API format expectations, maintaining separate adapters for each became the actual problem. That's what pushed the internal conversion layer into a proper generalized design, and llm-rosetta is the result. It's a Python library that converts between LLM API formats — OpenAI Chat, Responses/Open Responses, Anthropic, and Google GenAI. The idea is you convert through a shared IR so you don't end up writing N² adapters. The key difference from LiteLLM: LiteLLM is a unified calling layer that takes OpenAI-style input and transforms it into provider-native requests — one direction. llm-rosetta uses a hub-and-spoke IR, so each provider only needs one converter, and you get any-to-any conversion for free. Anthropic → Google, OpenAI Chat → Anthropic, whatever direction you need. Use it as a library — pip install and call convert() directly, no server needed. Or run the gateway if you want a proxy that handles the format translation for you. Zero required runtime dependencies either way. The HTTP server, client, and persistence layer are vendored from zerodep (https://github.com/Oaklight/zerodep), another project of mine — stdlib-only single-file modules, not someone else's library repackaged. The gateway ships with a Docker image if you'd rather not deal with Python env setup. You can also deploy it on HuggingFace Spaces or anything similar — admin panel, dashboard, request log, config management all included. Screenshots: https://llm-rosetta.readthedocs.io/en/latest/gateway/admin-panel/ We've been running it in production for about 5 months as the conversion layer for an internal multi-model access platform — needed to support various API standards and coding tool integrations before the upstream APIs were fully standardized. The Responses converter passes all 6 official Open Responses compliance tests (schema + semantic) from the spec repo. So if you're running Ollama, vLLM, or LM Studio with Responses endpoints, it should just work as one side of the conversion. There's a shim layer for provider-specific quirks — built-in shims for OpenRouter, DeepSeek, Qwen, xAI, Volcengine, etc. Converters stay generic per API standard, shims handle the edge cases declaratively. 24 cross-provider examples in the repo covering all provider pairs, SDK + REST, streaming, tool calls, image inputs, multi-turn with provider switching mid-conversation. GitHub: https://github.com/Oaklight/llm-rosetta Docs: https://llm-rosetta.readthedocs.io arXiv: https://arxiv.org/abs/2604.09360 Gateway screenshot: https://preview.redd.it/qzzjr2dcdw1h1.png?width=949&format=png&auto=webp&s=bce4293aae81059f794909fc37f85071cee34378 submitted by /u/Oaklight_dp [link] [comments]
View originalSharing all KGC 2026 decks. More production-grade KG systems than I've seen at any conference. [D]
Didn't make it to New York for the Knowledge Graph Conference this year, but caught some talks virtually and managed to download all the decks. Sharing them below because some of what was shown is worth knowing about. Majority of the presentations described live production systems. Enterprises showing up with real engineers delivering real compliance requirements. That's not usual for most ai eventss. Most talks are proofs of concept with a "coming soon to prod" slide at the end. For eg - Bloomberg showed a formal dependency model for ontology governance. AbbVie walked through ARCH, their internal KG for drug and disease-area intelligence, connected to a scoring engine, a researcher dashboard, and an LLM companion for plain-language queries. The KG is the source of truth. The LLM is the interface. Even Morgan Stanley showed continuous SHACL drift detection on risk reporting data - automated weekly checks that alert when the semantic layer deviates from what's governed. Crux: knowledge graphs are being actively used as infrastructure, not a retrieval layer on top of vectors. The graph is doing reasoning work, not lookup work. We've been skeptical of the "only using vector dbs" framing for a while. These production systems are the clearest evidence I've seen of where that breaks down - and what the alternative actually looks like when it's running. Link to the all the decks in the comment. All decks here: https://drive.google.com/drive/folders/1Csdv4hZePrBMJGggsisPXYBueTRCK1kV?usp=sharing submitted by /u/Ok_Gas7672 [link] [comments]
View originalA year of using LLMs for DSP/algorithms research: Techniques I've landed on, curious what others are doing
I've spent the last year using coding LLMs daily for DSP and algorithms research, and the workflow that's emerged is meaningfully different from regular software development. Sharing what's worked and hoping to hear what others are doing. I'm sure people have approaches I haven't thought of. Let me run down my high-level categories and then I'll focus on one of them here: Maintain a problem_description.md file Write regular reports in both .md and .pdf, about 2-5 per day Create a Human -> LLM Coding App -> Human -> LLM Chat App Loop Increase your report quality with exec summary, plot interpretation descriptions, etc. Develop an Ongoing GUI Don't let the LLM be dramatic (this one might save your sanity after long sessions) Share reports with co-workers Here I'm going to focus on "Developing an Ongoing GUI." The rest of the topics are in a video I recorded, listed at the end: In a nutshell, start by telling your app to make a simple GUI for you that lets you browse your data folders and make plots that are generic at first, but then get highly customized over time. This is high value for researchers because good GUI programming takes a long time to learn and execute. Instead, coding LLMs can do that stuff very quickly without taking your mind of your main topic. Basically, as you're doing your work, examining data, etc., you'll want a quick way to view/visualize and analyze it. The easiest thing is for your coding LLM to make a program for you that browses folders and makes plots...and then to build on it day-by-day from there! For example, beyond basic plots, you may routinely do spectrograms and FFTs. Or you might convert data into the theta/angle domain. Each time you have your coding LLM do an action like that and it seems like something you'll want to do again in the near future, just tell it, "Please add a tab to my GUI that does it." It's that simple! And here are some tips to make good graphs. Tell your coding LLM to make your app: Sync all X and Y axes Start all plots zoomed in so that it fills 85% of the vertical space Make all plots with similar units share the same range These make it much easier to make comparisons when all of your axes are the same scale and you can pan and browse them together. Once you've got your GUI going, you can also tell your coding LLMto improve it with a prompt like, "Remember that plot we added to the "MCAP Analyzer" tab that performs the full analysis? Please make a second button below it named "Extract" that only extracts the load cell values." Or "When you plot the load cell signal, highlight the 2-4 Hz range." You will be nicely pleased on how the benefits of making a bespoke app compound. Something you did 2 weeks ago or even a month ago will quickly be at your fingertips, without having to interrupt your sessions, start a new session, or pay for your coding LLM to re-compute it! One more tip: In addition to plotting the data on the screen, ask your LLM to make your app write the key values from plot into a .csv or .json file or even "make a textual description of each step of the analysis." That will make it easy to paste into other programs/software to analyze. After a few months, you will have quite the Swiss Army Knife of analysis tools! Hell, you can just paste this whole entire post into your LLM coding tool and it will know what to do. One last tip on the nuts and bolts: I recommend using python and the vispy library with TKinter widgets. This gives a cross-platform combo that uses the GPU for fast graphics updates. Matplotlib is okay, too; it's slower but has better zoom tools. Even if you don't have any idea what that means, just paste it into your coding LLM and it will know what to do! Lastly, I put together a 27-minute talk on this topic with 7 more sections. As i mentioned, I made this post and video to share and to learn from other people what kinds of techniques I'm missing? I am especially interested in: How to share LLM coded program with other people in my group (without tons of code reviews, etc.) How to use databases on large shared drives (My drive is a CIFS NAS which is terrible for DBs) How to get the LLMs to think out of the box...I 've found sometimes I can spend days (or longer!) figuring out some technique only to realize I've been re-inventing the wheel :( What other tools to connect to my main LLM coding app to multiply its power My full vid: https://www.youtube.com/watch?v=nOU9nOZ_res submitted by /u/diydsp [link] [comments]
View originalIs this as unnerving as it sounds?
I was watching Andrej Karpathy's excellent "Intro to Large Language Models" just now, and in the "how do they work" section, he explains that while we know exactly how the LLM is trained by iterative updates, we don't understand why certain circuits emerge or why the parameter structures end up the way they do. i.e. there is highly complex emergent learning going on by this optimization of parameter relationships but we don't know how the LLM does it or why. This is apparently a well known problem in the AI space. To my untrained ear, this sounds like a red flag. It should be fully understood before we go any further. Here's the video: https://www.youtube.com/watch?v=zjkBMFhNj_g submitted by /u/reasonablejim2000 [link] [comments]
View originalMost of my Claude usage was on work that didn't need Claude. Cut my bill 60x on bulk tasks with a tiny side model.
I looked at what was actually eating my Claude usage and it was embarrassing. Classifying files. Reformatting json. Pulling fields out of text. Summarizing docs I was going to skim anyway. None of that needed Sonnet. All of it cost the same as the work that did. Tried the obvious fixes first. Switching to Haiku for simple stuff (still wasteful at volume). Tighter prompts (helps a little). /compact (delays the problem). None of it changed the shape of the spend. What actually worked: a small cheap model running as a side worker, with one rule in CLAUDE.md telling Claude not to do the mechanical stuff itself. The setup is one tool. Send it text, get text back. Claude calls it for the bounded mechanical work I'd review anyway. Default model is DeepSeek V4 Flash because it's cheap and has 1M context, but the endpoint is one config line and works with anything openai-compatible (local ollama, vllm, lm studio). 3 weeks of real usage: 217 mechanical calls offloaded DeepSeek total spend: $0.41 Same workload on Sonnet would have been roughly $7 The CLAUDE.md rule that actually works is negative framing. Not "use deepseek for X" but "do NOT use Claude for: json formatting, field extraction, file classification, summarization you will review anyway." Positive framing got ignored maybe 30% of the time. Deny list catches it. It's a supervised worker, not an agent. No tool calls, no file access, no chains. Latency 3-25s. You review the output. That's the whole shape. Repo with setup steps: https://github.com/arizen-dev/deepseek-mcp (MIT, Python 3.10+) Happy to answer questions about the routing rules or the model choice. submitted by /u/petburiraja [link] [comments]
View originalclaudely: launch Claude Code against Local LLM provider like LM Studio / Ollama / llama.cpp without trashing your real claude config
Plenty of CLI coding agents will talk to a local LLM, but the catch is the ecosystem. Skills, slash commands, MCP servers, plugins, hooks: all the interesting tooling has been built specifically for Claude Code, and parity on every other agent is patchy at best. Trying to reuse a Claude-shaped workflow on a different agent quickly turns into "rewrite all the plugins" or "do without." claudely skips that fight. You keep Claude Code as the client (and its whole plugin / skill / MCP ecosystem with it), and just point it at a model running on your own hardware. Pick a provider, claudely spawns `claude` with the right base URL, auth, and cache fix wired up for that one session. Your shell and the regular `claude` command stay untouched, so you can flip between local and the real Anthropic API without thinking about it. It also quietly fixes a prompt-cache bug that otherwise tanks local-model speed by ~90%, and handles the per-provider env-var differences for you. Works with LM Studio, Ollama, llama.cpp, or any Anthropic-compatible endpoint (point it at a litellm or claude-code-router proxy for OpenAI-protocol backends like vLLM). npm i -g claudely claudely # LM Studio, picker over your downloaded models claudely -p ollama -m gpt-oss:20b # Ollama, skip the picker claudely -p llamacpp # whichever GGUF llama-server is serving MIT, Node 20+, unaffiliated community helper. Built with Claude Code's help, fittingly. Feedback welcome. Repo: https://github.com/mforce/claudely NPM: https://www.npmjs.com/package/claudely submitted by /u/mforce22 [link] [comments]
View originalLLM proxy that lets Claude Code talk to any model
I built rosetta-llm — an open-source multi-format LLM proxy that acts as a drop-in Claude Code gateway. Works as a Claude Code LLM gateway — set `ANTHROPIC_BASE_URL` and all configured models appear in `/model` picker Translates between formats — Anthropic Messages ↔ OpenAI Chat ↔ OpenAI Responses at the wire level Thinking blocks round-trip correctly — this is the hard part and why I built this Provider routing — `openai/gpt-5.4`, `anthropic/claude-opus-4-7`, `groq/llama-4` all through one endpoint Streaming on everything — passthrough fast path + cross-format translation with proper SSE handling The thinking-block problem Most proxies lose reasoning continuity. LiteLLM has had open PRs for thinking block handling for a long time — some dating back months — and they're still not merged. Without proper round-tripping, prompt caching breaks across turns and Claude Code loses context. Rosetta encodes encrypted reasoning into Anthropic's `signature` field and decodes it back — so multi-turn agentic workflows keep their prompt-cache hits. Zero-setup Hugging Face Space Literally a two-line Dockerfile: FROM ghcr.io/lokesh-chimakurthi/rosetta-llm:latest COPY --chown=app:app config.json /app/config.json Add config.json file and above Dockerfile into a HF Space (Docker SDK) and it's running. No clone, no build, no venv. The GHCR image has everything baked in. Make your HF space private and add api keys in hf space secrets. Check readme in github Also works with # No install — ephemeral uvx rosetta-llm # Persistent install uv tool install rosetta-llm rosetta-llm --config ~/.rosetta-llm/config.json # Docker docker run -p 7860:7860 \ -v ~/.rosetta-llm/config.json:/app/config.json \ ghcr.io/lokesh-chimakurthi/rosetta-llm:main Why another proxy? I looked at existing solutions: LiteLLM — thinking block round-trip PRs going nowhere, too many abstractions OpenRouter — great but closed-source, no self-hosting Direct passthrough proxies — don't translate between formats Nothing gave me lossless cross-format translation with proper reasoning fidelity. Links GitHub: https://github.com/Lokesh-Chimakurthi/rosetta-llm PyPI: https://pypi.org/project/rosetta-llm/ Contributions welcome I built this for myself and it works for my use cases. But there's a lot more it could do — better multimodal handling, embeddings support, rate limiting, an admin UI. If any of this sounds interesting, PRs are absolutely welcome. Happy to answer questions in the comments. submitted by /u/DataNebula [link] [comments]
View originalI built "Semvec": A Constant-Cost Semantic Memory for LLMs (Looking for testers!)
Hey everyone, If you build LLM applications, autonomous agents, or just use Claude/Cursor for coding, you've probably hit this wall: Conversation history grows infinitely, token costs explode, latency skyrockets, and eventually, the LLM starts forgetting early context anyway. To fix this, I built semvec. It replaces unbounded conversation histories with a fixed-size semantic state combined with a tiered, content-aware memory (short/medium/long-term). The result: The cost and latency of every LLM call stay constant. Turn 10 and Turn 10,000 carry the exact same input footprint. In 48-turn benchmarks, it yields roughly a 76% token reduction while retaining all structured access to decisions, error patterns, and prior context. Here is what you get: - Constant-size compressed context: Token-reduced LLM context that stops growing. - Tiered memory with selective forgetting: Frequently accessed older memories outlive never-touched newer ones. - Drop-in chat proxy: Wrap any OpenAI-compatible LLM (vLLM, Ollama, OpenRouter) and get compressed context for free. - Coding-agent compaction (MCP): Persistent memory across coding sessions. It comes with an MCP server for Claude Code & Cursor out of the box! - Multi-agent coordination: semvec.cortex allows several agents to share an aggregated view and exchange state vectors. I am currently looking for testers and honest feedback from devs who build RAG pipelines, chatbots, or just want to upgrade their Cursor IDE memory. 📦 PyPI: https://pypi.org/project/semvec/ 📚 Docs & Quickstart: https://semvec-docs.pages.dev/ You can install it via: pip install semvec (Supports Python 3.10–3.14). If you want to test the multi-agent or MCP stuff, use pip install "semvec[cortex,coding]". I'd love to hear your thoughts, feedback, and edge-case bug reports! Let me know what you think. submitted by /u/scheitelpunk1337 [link] [comments]
View originalWhat Claude Design does really well (and not so well)
I did a deep dive on Claude Design and below are my thoughts. What it does extremely well: Improves your prompt - similar to "ask me questions" when chatting to an LLM. Can make the difference between slop and actually useful. Invokes agent skills for you - a game changer for people who don't live in the terminal Claude Code handoff - easily get Claude Code to build it for real with a simple link share. Genius. Comment feature - spatial editing (similar to Cursor and a few others), but selection is very accurate and I like how you can queue up edits and select which ones to send to the LLM Absence of "Code" tab - yes, the absence of the feature is the feature. Coding in the browser is rarely a pleasant experience for me. It's integrated designer environment - agent skills, prompt improvements, spatial editing and design systems. The bridge between these features feels seemless. What it doesn't do well: Design System creator is unusable - it's slow, burns loads of tokens and extrapolates for too much from inputs. Biggest issue of all is that it creates a "second source of truth" for your design system (if you already had one in GitHub, for example) Limited agent skill choice - there are roughly 12 or so skills baked in to the tool - with no way to specify open source or your own skills Very strict strictly limits - I'd burned through my limit after 1 design system and 4 prototypes. I'm on the pro plan. Who I think Claude Design is for: Someone who isn't a designer - project managers, marketers, founders. It's a great way for them to communicate ideas to designers/developers. The Claude Code handoff makes it easy for more technical team members to implement it in production Designers who want to kill bad ideas fast Do you still need Figma? IMO, it's a resounding yes. But Claude Design bites a significant chunk of the early, prototyping phase of a product/idea. Attached video is an excerpt showing how you get similar results from various tools. Watch full video: https://www.youtube.com/watch?v=lFdWmu8lje8 submitted by /u/the-design-engineer [link] [comments]
View originalReleased a Claude Code skill that drives a full ML fine-tuning platform end-to-end from one conversation
We released a Claude Code skill today that turns a conversation with Claude into a deployed fine-tuned SLM. Point Claude at a folder of production traces, answer a few questions, approve a handful of checkpoints, and come back to a managed-endpoint deployment (or download the weights and self-host). The skill drives the process and distil labs platform does the heavy lifting under the hood (platform is free for open-source projects — details at the bottom). Concrete walkthrough from our release: 327 noisy production traces from a restaurant-reservation agent (the kind of data you can't train on directly) turn into a fine-tuned Qwen3-1.7B for multi-turn tool-calling. Install ``` curl -fsSL https://cli-assets.distillabs.ai/install.sh | sh distil signup /plugin marketplace add https://github.com/distil-labs/distil-cli-skill /plugin install distil-cli@distil-cli-skill ``` Then point Claude at your traces. What the skill actually does When I asked Claude "what are we about to do?" at the start, it laid out the 5-step workflow without me prompting for any of it: ``` Hey Claude, we will be training an SLM for multi-turn function calling based on the production logs in scenario-2-noisy-labels/. Can you explain the process on a high level? ⏺ 1. Prep. Inspect the traces, then write a converted traces.jsonl (OpenAI message format), a job_description.json, and a config.yaml. Upload & clean. The platform filters low-quality traces and uses a committee of teachers to relabel noisy outputs. Teacher eval. Verify the teacher can actually solve the task before spending hours on training. Train. Teacher generates synthetic data, student learns from it. Analyze & deploy. Compare the tuned student against the original production model, then deploy. ``` Every expensive step has a checkpoint where Claude stops for approval. Training burns credits that aren't easily refundable, so the confirmation checkpoint looked like this: ``` ⏺ Step 8: training confirmation checkpoint. Here's what will run: base.task: multi-turn-tool-calling-closed-book base.student_model_name: Qwen3-1.7B base.teacher_model_name: zai.glm-5 trace_processing.convert_to_single_turn: false trace_processing.teacher_model_name: zai.glm-5 Duration: ~6 hours Cost: burns credits, not easily refunded once started Pipeline: teacher eval → synthetic data gen → student finetune Reply "go" to kick off, or tell me what to change first. lets go! ``` The verdict-before-training gate refuses to even reach that checkpoint if teacher eval can't clear 0.70 on the generated test set. In that case, the skill walks you through iterating the task description instead. The skill also knows task-specific configs. For multi-turn tool-calling, I didn't have to look up that convert_to_single_turn: false is required. Claude flagged it as part of a config sanity check. Every checkpoint leaves a structured markdown analysis report (original-model-analysis.md, teacher-eval-analysis-iter-1.md, training-analysis-iter-1.md). Git-committable, reviewable three weeks later when someone asks why you picked this teacher. What came out A Qwen3-1.7B fine-tuned on ~10k synthetic examples grounded in the noisy traces. Model LLM-as-a-Judge staged_tool_call Function match Qwen3-1.7B (base, untuned) 0.513 0.535 45/78 GLM-5 (744B teacher) 0.808 0.695 69/78 Qwen3-1.7B (tuned) 0.846 0.769 76/78 Deployment Managed OpenAI-compatible endpoint (one-line swap in existing OpenAI client code), or download weights + Modelfile for llama.cpp or vLLM. Skill drives either path. Why it works as a skill Most skills I've seen wrap a few CLI commands but this one is end-to-end: reads your data, writes custom scripts, orchestrates an external platform, interprets the results, and leaves artifacts behind that persist past the conversation. The pattern that worked: Knows the workflow end-to-end and walks you through it Catches edge cases by re-reading the platform's own docs mid-conversation Stops for explicit approval on expensive operations Leaves structured artifacts that outlast the conversation Caveats Training is ~6 hours per run and burns credits (not refundable once started, which is why the confirmation gate exists). Happy to dig into how the checkpoints work, the config-sanity-check logic, or what building a purpose-built skill looked like. submitted by /u/party-horse [link] [comments]
View originalIntel LLM-Scaler vllm-0.14.0-b8.2 released with official Arc Pro B70 support
submitted by /u/Fcking_Chuck [link] [comments]
View originalC++ CuTe / CUTLASS vs CuTeDSL (Python) in 2026 — what should new GPU kernel / LLM inference engineers actually learn?[D]
For people just starting out in GPU kernel engineering or LLM inference (FlashAttention / FlashInfer / SGLang / vLLM style work), most job postings still list “C++17, CuTe, CUTLASS” as hard requirements. At the same time NVIDIA has been pushing CuTeDSL (the Python DSL in CUTLASS 4.x) hard since late 2025 as the new recommended path for new kernels — same performance, no template metaprogramming, JIT, much faster iteration, and direct TorchInductor integration. The shift feels real in FlashAttention-4, FlashInfer, and SGLang’s NVIDIA collab roadmap. Question for those already working in this space: For someone starting fresh in 2026, is it still worth going deep on legacy C++ CuTe/CUTLASS templates, or should they prioritize CuTeDSL → Triton → Mojo (and keep only light C++ for reading old code)? Is the “new stack” (CuTeDSL + Triton + Rust/Mojo for serving) actually production-viable right now, or are the job postings correct that you still need strong C++ CUTLASS skills to get hired and ship real kernels? Any war stories or advice on the right learning order for new kernel engineers who want to contribute to FlashInfer / SGLang / FlashAttention? Looking for honest takes — thanks! submitted by /u/Daemontatox [link] [comments]
View originalRepository Audit Available
Deep analysis of vllm-project/vllm — architecture, costs, security, dependencies & more
vLLM uses a tiered pricing model. Visit their website for current pricing details.
Key features include: Cash Donations, Compute Resources, Slack Sponsor, Hardware, Open Models, Recipes, Performance, Roadmap.
vLLM is commonly used for: Real-time text generation for chatbots, Content creation for marketing, Automated customer support responses, Code generation and debugging assistance, Data analysis and report generation, Personalized recommendations in e-commerce.
vLLM integrates with: Slack, Discord, Microsoft Teams, Zapier, AWS Lambda, Google Cloud Functions, Kubernetes, Docker, Jupyter Notebooks, FastAPI.
vLLM has a public GitHub repository with 74,806 stars.
Ollama
Project at Ollama
3 mentions
Based on user reviews and social mentions, the most common pain points are: cost visibility, token cost.
Based on 24 social mentions analyzed, 0% of sentiment is positive, 100% neutral, and 0% negative.