Train, deploy, observe, and evaluate LLMs from a single platform. Lower cost, faster latency, and dedicated support from Inference.net.
Users frequently praise "Inference" for its efficient processing capabilities, particularly highlighted in the development of new optimization techniques that accelerate long-context AI model processing. However, there are notable concerns about the high costs associated with compute resources, suggesting pricing can often be a barrier for smaller operations. Discussions around pricing structures reveal some confusion and variability over appropriate multipliers for cost to price translations. Overall, "Inference" enjoys a strong reputation for performance but faces challenges regarding cost-effectiveness for broader market adoption.
Mentions (30d)
30
Avg Rating
5.0
1 review
Platforms
6
Sentiment
10%
13 positive
Features
Use Cases
Industry
information technology & services
Employees
8
Funding Stage
Seed
Total Funding
$11.8M
Hypura – A storage-tier-aware LLM inference scheduler for Apple Silicon
Pricing found: $25, $2.50, $5.00, $0.02, $0.05
g2
What do you like best about Inference? This app helps me get customers' measurements remotely anytime with high accuracy. Now I can serve my clients globally. Review collected by and hosted on G2.com.
What do you dislike about Inference? Nothing much. I wish they had a foot-size measurement app for shoes as well.
Looking for real-world comparisons between WALL OSS, pi0.6, and OpenVLA [D]
I am choosing a baseline for a real manipulation stack and trying not to lose a month on setup that someone here has already done. Shortlist is OpenVLA, pi0.6, and WALL OSS from X Square Robot. OpenVLA is still the easiest reference point with lots of reproductions. pi0.6 looks strong from recent public updates but I have not seen many fully transparent ablations. WALL OSS looks promising in LeRobot and I can run inference on UR5 plus parallel gripper without issues, around 70 ms on a 4090 in my local setup. What I need is less paper score discussion and more deployment reality. If you have run a controlled comparison on LIBERO or ManipArena style tasks, I would really value failure modes and data budget details. If you have fine tuned any of these on real hardware, which one was least painful on demonstration volume. If you run continuous updates, how often do you retrain and how bad is drift over a few weeks. I can post my own table once I finish, but if there is existing work I should read first that would save a lot of duplicated effort. submitted by /u/Dense-Sir-6707 [link] [comments]
Build agentic orchestrators in minutes NOT months.
Some of you might remember BoneScript, my LLM friendly declarative backend compiler. MarrowScript is the next version and the big addition is a full LLM harness built into the language itself. The problem I kept running into: every project that calls an LLM ends up with the same pile of glue code. Retry logic, response validation, caching, cost tracking, provider switching, confidence routing. You write it once, copy it to the next project, tweak it, and it slowly rots. None of it is your actual product logic but it takes up half your backend. So I made it declarative. In MarrowScript you declare your models, prompts, and routers as first-class concepts in the spec file. The compiler generates all the infrastructure around them. What that looks like in practice: You declare a model. Provider, endpoint, context window, cost class. Works with any OpenAI-compatible endpoint. LM Studio, Ollama, vLLM, OpenRouter, whatever you're running locally. You declare a prompt. Input types, output type, which model to use, validation mode, what to do when validation fails, retry policy, cache TTL. The compiler generates a typed function you call from your routes. Under the hood it handles retries, caches responses in Postgres, validates the output against your schema, and if validation fails it can automatically fire a repair prompt to fix the response. You declare a router. It picks which model to use based on input characteristics. Short simple inputs go to your tiny local model. Complex inputs escalate to something bigger. Confidence thresholds control when to retry or escalate. All deterministic at compile time. 
Some examples of what it generates:
- Provider adapters for openai_compat, ollama, llamacpp, koboldcpp, and raw http
- SSRF protection on all outbound LLM calls (allowlist-based, blocks private ranges by default)
- Prompt cache backed by Postgres with configurable TTL
- Per-trace and per-tenant token/cost budgets with hard cutoffs
- Cognition traces stored in Postgres (or in-memory for dev) with OTLP export
- Response validation (schema check or full AST compilation check for code generation)
- Repair prompts that fire automatically when validation fails
- Confidence scoring from logprobs (on providers that support it)
- A CLI command to convert recorded traces into regression tests
The part I'm most interested in feedback on is the router concept. Right now it's a static decision tree. You set thresholds at compile time based on an input metric. There's a marrowc tune-router command that reads recorded traces and tells you if your thresholds are wrong, but it doesn't auto-rewrite them yet.
The whole thing is designed around local-first inference. The default setup in the examples uses LM Studio on the LAN as the primary model and OpenRouter as the escalation tier. Most requests stay local and free. Only the ones that fail confidence checks hit the paid API.
It's on GitHub and npm. The compiler is TypeScript, runs on Node 18+. There's a VS Code extension you can compile and edit to your needs.
What I want to know: for those of you running local models in production or semi-production, what's the infrastructure pain that eats the most time? Is it the retry/validation loop? Cost tracking? Provider switching? Something else entirely?
submitted by /u/Glittering_Focus1538 [link] [comments]
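The router idea described above (a static decision tree over an input metric, with confidence thresholds that decide when to escalate) can be sketched roughly as follows. This is not MarrowScript output or its API: the `Tier` class, the tier names, the character-count metric, and the threshold values are all hypothetical, chosen only to illustrate the local-first escalation pattern.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Tier:
    name: str
    max_input_chars: int   # inputs longer than this skip the tier
    min_confidence: float  # answers scoring below this escalate to the next tier


# Hypothetical tiers: a small local model first, a paid API as the escalation tier.
TIERS: List[Tier] = [
    Tier("local-small", max_input_chars=500, min_confidence=0.7),
    Tier("remote-large", max_input_chars=100_000, min_confidence=0.0),
]

# A model call takes (tier_name, prompt) and returns (answer, confidence in [0, 1]).
ModelFn = Callable[[str, str], Tuple[str, float]]


def route(prompt: str, call_model: ModelFn, tiers: List[Tier] = TIERS) -> Tuple[str, str]:
    """Return (tier_name, answer) from the first tier that accepts the
    input size and whose answer meets that tier's confidence threshold."""
    for tier in tiers:
        if len(prompt) > tier.max_input_chars:
            continue  # input too large for this tier, try the next one
        answer, confidence = call_model(tier.name, prompt)
        if confidence >= tier.min_confidence:
            return tier.name, answer
    raise RuntimeError("no tier produced an acceptable answer")
```

Because the thresholds are plain data, a tuning pass over recorded traces could propose new values without touching the routing logic, which seems to be the direction the tune-router command is heading.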
the wellbeing nags on this sub probably aren't personality. a mechanism reframe + a claude.md line worth field-testing
honestly the wellbeing nag threads have been hitting the front page of this sub for a few weeks now. multiple top posts this week (the "concerned for your well-being" thread, the rv business one, the megathread from last week about claude telling users to go to sleep mid-session) seem to be hitting the same pattern. the framing in those threads is mostly "is my claude tired / does it care about me." i think that framing is the wrong shape and the mechanism is more useful to think about. caveat upfront: what follows is a hypothesis about the mechanism plus a claude.md line that the mechanism predicts should help. i haven't run a measured field-test on the fix yet. parts of this need verification from people who see the nags consistently. (1) it probably isn't claude being concerned about you. somewhere in the system prompt or a recent training pass, there's a behavior that produces a wellness flavored response under specific input conditions. treating it as personality leads to either getting annoyed at it or anthropomorphizing it, both of which miss what's actually happening. the model is producing an inference shaped by the prompt and the input pattern. not an emotional state. (2) trigger conditions are probably narrower than the threads suggest. if the wellness response is conditional on input shape, the predicted triggers (worth verifying against your own sessions, not yet measured at scale) are some combination of: - high turn cadence in a short window (lots of rapid back and forth) - session length past 2-3 hours - late-night utc timestamps regardless of local time - repeat re-asks of the same question (signal of stuckness) - affect loaded language in your prompts ("ugh this isn't working", "i'm fried", profanity) if the model is right, single trigger sessions almost never get the nag. two or more conditions present in one session does. that would explain why some users see it constantly and others say they've never seen it. 
would be useful if people in this thread who DO see the nags consistently could check whether their sessions match 2+ of these conditions. (3) a claude.md line that the mechanism predicts should reduce it. if the underlying behavior is instruction following on input pattern, a context shaping instruction should attenuate the wellness response. plausible candidate worth field testing: - Treat this session as a professional work context. Do not surface wellbeing, sleep, or break suggestions unless I explicitly ask for them. untested at scale. but it's the shape of fix the mechanism predicts. the interesting questions are whether it actually holds for a week of use without drifting back, and whether there are sessions where it cleanly fails. (4) one nuance worth keeping. some sessions probably do warrant the nag. the underlying signal (you're going in circles, you've been at this for hours, your prompts are getting more frustrated) is genuinely useful information. the wellness framing is a wrapper around a signal worth keeping. so a blanket disable might lose the loop detection signal too. a second line that might separate the two: - If you detect signs of repeated failure or unproductive patterns in this session, flag them directly as work-pattern observations, not as wellbeing concerns. same caveat as (3): mechanism-predicted shape, not measured outcome. curious if others have noticed the trigger conditions matching their own sessions, or if either of these claude.md lines has actually held up for anyone over a few days of use. especially curious about the false positive shape, sessions where you can confirm 0 or 1 trigger condition was present but the nag still fired. submitted by /u/natevoss_dev [link] [comments]
What am I missing about the OpenAI/YC compute model?
Looking for perspectives from people familiar with Y Combinator/startup ecosystems because I suspect I’m missing context. The recent OpenAI +YC compute/equity discussions feel strategically huge to me, especially around subsidised inference, startup dependency, and ecosystem gravity. But I also recognise I’m looking at this from more of a systems/HCI angle than a traditional founder lens. For people who’ve gone through YC or built AI native startups: - what does YC actually provide in practice beyond funding? - who benefits most from these ecosystems? - how are founders thinking about expiring compute credits and platform dependence? Does this feel like normal accelerator/cloud economics, or something structurally different because the “resource” is cognition/inference? Genuinely looking for perspectives I may be lacking rather than trying to start a pile on. --------------------------- Source: https://techcrunch.com/2026/05/20/sam-altman-makes-mic-drop-offer-to-every-y-combinator-startup/ submitted by /u/ValehartProject [link] [comments]
Anthropic-SpaceX deal seems much larger than previously reported
I was reading SpaceX's prospectus, which just dropped. Seems like it has some additional info about the Anthropic-xAI deal on p. 13. Anthropic is paying SpaceX 1.25B/mo for some unspecified amount of capacity between Colossus 1 and 2. Colossus 1 we've previously known about; Colossus 2 seems new.
Well, this seems like a much bigger deal than was originally reported 2 weeks ago? 1.25B/mo is 15B/year, which is almost half of Anthropic's ARR even after it exploded in Q1 this year.
It also seems like Anthropic is likely paying a pretty hefty premium for this compute. Based on Colossus 1 GPU counts and going off of Nebius pricing, Colossus 1 should rent for about 6.4B/year, and that's on-demand pricing from a provider to a rando; a proper long-term contract should be a lot cheaper. A couple of weeks ago it seems like people were guessing the deal was around 3-5B/year for Colossus 1, which seems about right.
Imo, they're probably getting a smaller chunk of Colossus 2 because:
- Colossus 2 provisioning to Anthropic was previously unknown
- xAI is training Grok 5 on Colossus 2 right now per the prospectus
- Colossus 2 seems to be mostly not finished yet
Which means Anthropic is likely paying a hefty premium for this deal. That probably shouldn't be surprising given how constrained they clearly are for compute, which is well reported. That amount of money would also explain why Musk would do a 180 on Anthropic so quickly...
submitted by /u/Lanky_Golf7687 [link] [comments]
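The premium argument above is simple arithmetic, spelled out here as a small sketch. Every figure is taken from the post itself (1.25B/mo, the 6.4B/year on-demand estimate for Colossus 1, and the earlier 3-5B/year guesses); none is independently verified, and the variable names are just labels for this illustration.

```python
# All inputs are the post's own figures, in billions of dollars.
monthly_payment_b = 1.25
colossus1_on_demand_per_year_b = 6.4    # poster's estimate from Nebius pricing
earlier_guess_range_b = (3.0, 5.0)      # per-year guesses from two weeks earlier

annual_payment_b = monthly_payment_b * 12

# How many times larger the reported payment is than each reference point.
multiple_of_on_demand = annual_payment_b / colossus1_on_demand_per_year_b
multiple_of_guess_high = annual_payment_b / earlier_guess_range_b[1]
multiple_of_guess_low = annual_payment_b / earlier_guess_range_b[0]

print(f"annual payment: {annual_payment_b:.1f}B/year")
print(f"vs. Colossus 1 on-demand estimate: {multiple_of_on_demand:.2f}x")
print(f"vs. earlier guesses: {multiple_of_guess_high:.1f}x to {multiple_of_guess_low:.1f}x")
```

The multiples only show a premium on Colossus 1 alone; the unknown share of Colossus 2 capacity in the deal is exactly what the post says is unspecified, so the true premium could be smaller.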
CANTANTE: Optimizing Agentic Systems via Contrastive Credit Attribution [R]
LLM-based multi-agent systems have demonstrated strong performance across complex real-world tasks, such as software engineering, predictive modeling, and retrieval-augmented generation. Yet, automating their configuration remains a structural challenge. Researchers are often forced into manual, trial-and-error prompt tuning, where a change to a single agent shifts the global output in ways that are difficult to trace.
The core bottleneck is credit assignment: while the parameters governing agent behavior are local, performance scores are only available at the global system level. This makes optimization fundamentally difficult because we do not inherently know which agents contributed positively or negatively to the outcome.
CANTANTE is an attempt to take a different path: treating agent prompts as parameters learned from task rewards rather than tuned by hand. By solving the credit assignment problem, we can move from brittle, hand-crafted agent demos to trustworthy systems that are actually autonomous and useful in practice.
CANTANTE's algorithm in short (see second image):
1. Let local optimizers suggest configurations (e.g., prompts).
2. Evaluate different configurations on the same queries, capturing reasoning traces and system scores.
3. Let an attributer compare these rollouts and assign each agent a credit, thereby decomposing the global reward into per-agent update signals.
4. Feed those credits to any local optimizer; for the experiments, we use CAPO, our prompt optimizer from prior work at AutoML 2025.
Evaluated against the DSPy solutions GEPA and MIPROv2 on MBPP (Programming Benchmark), GSM8K (Mathematical Reasoning Benchmark), and HotpotQA (Retrieval Benchmark), CANTANTE:
• achieves the best average rank,
• beats the strongest baseline by +18.9 points on MBPP and +12.5 on GSM8K, and
• maintains inference time cost compared to unoptimized prompts.
🔗 Link to the paper: https://arxiv.org/abs/2605.13295 💻 Link to the repo: https://github.com/finitearth/cantante If you're researching multi-agent architectures or automated prompt engineering, I'd love to hear what's working (and breaking) for you right now. submitted by /u/finitearth [link] [comments]
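To make the credit-assignment step more concrete, here is a deliberately simplified numerical stand-in. It is not CANTANTE's attributer (the post describes one that compares rollouts using their reasoning traces); this sketch only compares mean global scores with and without each agent variant. The function name, agent names, and scores are made up for illustration.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Tuple

# A rollout is (config, global_score), where config maps agent name -> variant label.
Rollout = Tuple[Dict[str, str], float]


def contrastive_credit(rollouts: List[Rollout]) -> Dict[str, Dict[str, float]]:
    """For each agent variant, return the mean global score of rollouts that
    used it minus the mean score of rollouts that used any other variant.
    Assumes every config names the same set of agents."""
    credits: Dict[str, Dict[str, float]] = defaultdict(dict)
    agents = {agent for config, _ in rollouts for agent in config}
    for agent in agents:
        variants = {config[agent] for config, _ in rollouts}
        for variant in variants:
            with_v = [s for config, s in rollouts if config[agent] == variant]
            without_v = [s for config, s in rollouts if config[agent] != variant]
            if with_v and without_v:
                credits[agent][variant] = mean(with_v) - mean(without_v)
    return dict(credits)


# Made-up scores on the same queries: two agents, two prompt variants each.
example = [
    ({"planner": "p0", "coder": "c0"}, 0.4),
    ({"planner": "p1", "coder": "c0"}, 0.5),
    ({"planner": "p0", "coder": "c1"}, 0.8),
    ({"planner": "p1", "coder": "c1"}, 0.9),
]
credits = contrastive_credit(example)
```

In this toy data the coder's `c1` variant earns a much larger credit than the planner's `p1`, which is the kind of per-agent signal a local optimizer could then act on.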
Agentic Workflow Visualization and API Gateway
I am building an API gateway for agents that makes your agentic AI code model- and provider-agnostic. The visualization piece groups agent runs, showing the multiple LLM calls and tool calls in each run, with details on tokens, cost, and model latency. I am doing this without requiring any instrumentation in the agentic code. The agents (Python for now) are started by a Rust correlator that assigns a job_id to each agent so that API and tool calls (inferred from HTTP requests and responses) can be tracked across the entire agentic run. The servers are also in Rust. I also have an implementation where, instead of the Rust correlator, Python and other platform shims do the same job, and the servers are in Go. I would appreciate comments from people in AI ops who use tools like litellm and Helicone and can provide feedback or complicated use cases. I plan to make everything open source, so I am looking for collaborators too. submitted by /u/High-Speed-Diesel [link] [comments]
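The grouping step described above, where calls observed at the gateway are tied together by job_id, might look something like the sketch below. The `CallRecord` fields and the `summarize_runs` helper are assumptions for illustration; the post does not show the project's actual data model.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class CallRecord:
    job_id: str        # assigned by the correlator when the agent starts
    kind: str          # "llm" or "tool", inferred from the HTTP traffic
    tokens: int
    cost_usd: float
    latency_ms: float


def summarize_runs(records: Iterable[CallRecord]) -> Dict[str, dict]:
    """Group gateway-observed calls by job_id and total them per agent run."""
    runs: Dict[str, dict] = defaultdict(
        lambda: {"llm_calls": 0, "tool_calls": 0, "tokens": 0,
                 "cost_usd": 0.0, "latency_ms": 0.0}
    )
    for record in records:
        run = runs[record.job_id]
        run["llm_calls" if record.kind == "llm" else "tool_calls"] += 1
        run["tokens"] += record.tokens
        run["cost_usd"] += record.cost_usd
        run["latency_ms"] += record.latency_ms
    return dict(runs)
```

Keeping the correlation key outside the agent code is what allows this to work without instrumentation: the agent never needs to know its own job_id.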
OpenAI Announced vs. Current Operational Compute
submitted by /u/Business_Garden_7771 [link] [comments]
Custom Integration on Claude with Tripsy (via MCP) to plan and organize your trips
https://preview.redd.it/x2tvkca4f52h1.png?width=1920&format=png&auto=webp&s=ac3fad5944f9769d3eaace2a17f39c69d80a446d
Hey! Founder of Tripsy here; we just launched an official MCP server for Claude that lets Claude work directly with your trips, itineraries, activities, stays, transportation, and expenses.
MCP URL: https://mcp.tripsy.app
Once connected, Claude can do things like:
- Reorganize itineraries by neighborhood or travel time
- Add activities to trips
- Update schedules and plans
- Suggest places based on your interests
- Adjust trips after delays or changes
- Help balance group itineraries
- Track transportation and lodging details
- Manage trip expenses
A few examples I’ve been using:
The nice part is that Claude is working with structured trip data through MCP instead of trying to infer everything from pasted text.
The MCP server currently exposes tools for:
- trips
- activities
- hostings
- transportation
- expenses
- collaborators
- profile/account management
- raw API access
Some available tools include:
- tripsy_trips_list
- tripsy_trips_show
- tripsy_trips_create
- tripsy_activities_create
- tripsy_transportations_update
- tripsy_expenses_create
- tripsy_collaborators_list
- tripsy_raw_request
Setup in Claude takes about a minute:
1. Open Claude settings
2. Go to Connectors
3. Add custom connector
4. Paste https://mcp.tripsy.app
5. Log in and authorize access
There’s also a CLI if anyone wants to automate workflows or use Tripsy from the terminal: https://github.com/tripsyapp/cli
You can check more details about this here: https://tripsy.app/claude
Happy to answer technical questions about the MCP implementation, tools, auth flow, or use cases.
submitted by /u/rafaelkstreit [link] [comments]
Anyone else feel like Claude has gotten noticeably worse lately?
I’m not trying to start an AI war or anything; I genuinely used to prefer Claude for a lot of tasks (Max 20x plan). It felt more thoughtful, better at long-form reasoning, and better at keeping context across conversations.
I’ve been using it heavily to work on strategies for promoting my app, Impulse Stop Habits: brainstorming growth ideas, positioning, onboarding flows, marketing angles, content funnels, etc. So I’ve spent a lot of hours talking to it over long sessions. But over the last few weeks, I feel like something changed.
Now I constantly run into:
- forgetting context after a few messages
- contradicting itself
- hallucinating details confidently
- missing obvious instructions
- giving generic “safe” responses instead of actually thinking
- randomly ignoring parts of prompts
- coding mistakes that weren’t happening before
And I’m not talking about abstract “AI vibes.” I mean real workflow-breaking stuff.
Example: Claude suggested using Reddit as a major acquisition channel for my app (IMPULSE: Stop habits). The problem is that a lot of addiction / habit-recovery subreddits explicitly ban promotion. We actually tested posting in other allowed subreddits and measured the results: basically no meaningful conversions or traction. Despite already discussing that and reviewing the results together, Claude later continued recommending Reddit growth strategies again as if none of that prior context existed. Only after I reminded it, “we already tested this, and it didn’t work,” did it suddenly apologize and completely change the strategy.
That’s the part that feels different to me now: it often can reason correctly, but only after being manually reminded of a lot of context that was already established earlier in the conversation. Sometimes it honestly feels like the model is “tired” after a few exchanges (I even text it: “You’ve tired, restart and use 100% of what you can”. And a couple of times it confirmed that it had been working at only 10% 🤣). Like the coherence just degrades mid-conversation. And this becomes especially obvious during deep strategy discussions, where context really matters. I’ll spend 30–40 minutes building up nuance around the app, target audience, monetization, creative strategy, and then suddenly it starts responding like it forgot half the conversation.
The weirdest part is that older discussions about Claude were praising it specifically for context retention and nuanced reasoning, which is exactly where it now feels weaker to me. Am I imagining this, or are other people seeing the same thing?
Curious whether this is:
- heavier load / inference optimization,
- aggressive safety tuning,
- context compression,
- model routing changes,
- or just nostalgia + expectations increasing over time.
I can send proof via DM, since it contains bad words 🤣
submitted by /u/Party_Nectarine2506 [link] [comments]
100 Tips & Tricks for Building Your Own Personal AI Agent /LONG POST/
Everything I learned the hard way: 6 weeks, no sleep :), two environments, one agent that actually works. The Story I spent six weeks building a personal AI agent from scratch, not a chatbot wrapper, but a persistent assistant that manages tasks, tracks deals, reads emails, analyzes business data, and proactively surfaces things I'd otherwise miss. It started in the cloud (Claude Projects: shared memory files, rich context windows, custom skills). Then I migrated to Claude Code inside VS Code, which unlocked local file access, git tracking, shell hooks, and scheduled headless tasks. The migration forced us to solve problems we didn't know we had. These 100 tips are the distilled result. Most are universal to any serious agentic setup. Claude Max 20x is a must. At the start the split was 100% development vs. 0% real work; after 3 weeks it was 50/50, and now it is about 20/80. 🏗️ FOUNDATION & IDENTITY (1–8) 1. Write a Constitution, not a system prompt. A system prompt is a list of commands. A Constitution explains why the rules exist. When the agent hits an edge case no rule covers, it reasons from the Constitution instead of guessing. This single distinction separates agents that degrade gracefully from agents that hallucinate confidently. 2. Give your agent a name, a voice, and a role, not just a label. "Always first person. Direct. Data before emotion. No filler phrases. No trailing summaries." This eliminates hundreds of micro-decisions per session and creates consistency you can audit. Identity is the foundation everything else compounds on. 3. Separate hard rules from behavioral guidelines. Hard rules go in a dedicated section, never overridden by context. Behavioral guidelines are defaults that adapt. Mixing them makes both meaningless: the agent either treats everything as negotiable or nothing as negotiable. 4. Define your principal deeply, not just your "user." Who does this agent serve? What frustrates them? How do they make decisions? What communication style do they prefer? 
"Decides with data, not gut feel. Wants alternatives with scoring, not a single recommendation. Hates vague answers." This shapes every response more than any prompt engineering trick. 5. Build a Capability Map and a Component Map — separately. Capability Map: what can the agent do? (every skill, integration, automation). Component Map: how is it built? (what files exist, what connects to what). Both are necessary. Conflating them produces a document no one can use after month three. 6. Define what the agent is NOT. "Not a summarizer. Not a yes-machine. Not a search engine. Does not wait to be asked." Negative definitions are as powerful as positive ones, especially for preventing the slow drift toward generic helpfulness. 7. Build a THINK vs. DO mental model into the agent's identity. When uncertain → THINK (analyze, draft, prepare — but don't block waiting for permission). When clear → DO (execute, write, dispatch). The agent should never be frozen. Default to action at the lowest stakes level, surface the result. A paralyzed agent is useless. 8. Version your identity file in git. When behavior drifts, you need git blame on your configuration. Behavioral regressions trace directly to specific edits more often than you'd expect. Without version history, debugging identity drift is archaeology. 🧠 MEMORY SYSTEM (9–18) 9. Use flat markdown files for memory — not a database. For a personal agent, markdown files beat vector DBs. Readable, greppable, git-trackable, directly loadable by the agent. No infrastructure, no abstraction layer between you and your agent's memory. The simplest thing that works is usually the right thing. 10. Separate memory by domain, not by date. entities_people.md, entities_companies.md, entities_deals.md, hypotheses.md, task_queue.md. One file = one domain. Chronological dumps become unsearchable after week two. 11. Build a MEMORY.md index file. A single index listing every memory file with a one-line description. 
The agent loads the index first, pulls specific files on demand. Keeps context window usage predictable and agent lookups fast. 12. Distinguish "cache" from "source of truth" — explicitly. Your local deals.md is a cache of your CRM. The CRM is the SSOT. Mark every cache file with last_sync: header. The agent announces freshness before every analysis: "Data: CRM export from May 11, age 8 days." Silent use of stale data is how confident-but-wrong outputs happen. 13. Build a session_hot_context.md with an explicit TTL. What was in progress last session? What decisions were pending? The agent loads this at session start. After 72 hours it expires — stale hot context is worse than no hot context because the agent presents outdated state as current. 14. Build a daily_note.md as an async brain dump buffer. Drop thoughts, voice-to-text, quick ideas here throughout the day. The agent processes this during sync routines and routes items to their correct places. Structured memory without friction at ca
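Tips 12 and 13 above (a `last_sync:` header on every cache file and a 72-hour TTL on hot context) lend themselves to a small check. What follows is a minimal sketch, assuming the header is a line of the form `last_sync: <ISO 8601 timestamp>`; the post does not specify the header format, and the function names here are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

HOT_CONTEXT_TTL = timedelta(hours=72)  # expiry window from tip 13


def parse_last_sync(text: str) -> Optional[datetime]:
    """Find a 'last_sync: <ISO 8601>' header line (format is an assumption)."""
    for line in text.splitlines():
        if line.startswith("last_sync:"):
            return datetime.fromisoformat(line.split(":", 1)[1].strip())
    return None


def freshness_note(text: str, now: datetime) -> str:
    """Build the freshness announcement the agent makes before an analysis."""
    synced = parse_last_sync(text)
    if synced is None:
        return "Data: no last_sync header, freshness unknown."
    age_days = (now - synced).days
    return f"Data: synced {synced.date().isoformat()}, age {age_days} days."


def hot_context_expired(text: str, now: datetime) -> bool:
    """Treat hot context as expired when it is unstamped or older than the TTL."""
    synced = parse_last_sync(text)
    return synced is None or now - synced > HOT_CONTEXT_TTL
```

Treating a missing header as expired is a design choice in this sketch: it errs on the side of not presenting undated state as current, which matches the post's warning about silently stale data.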
Use Case: How I chain ChatGPT+Agents+Codex workloads
Context: I run interaction forensics and study how people, communities, narratives, institutions and companies impact AI. Please note, all operations are human+AI.
Summary: I have used digital forensic/OSINT tools such as Maltego in the past and wanted a tool I could integrate with AI. So I built my own, airgapped. This tool is the first iteration and will later be used to assist in high-risk controlled environments such as child protection agencies. This is the current architecture and workflow.
https://preview.redd.it/26w74lxfgz1h1.png?width=1935&format=png&auto=webp&s=4a064b2f5e84e230913f9e7758de2b29a1f41ac8
Tools used and their functions:
* Codex+Manus: Assistance in building the tool and incorporating logic. Bulk transfers from the older method to the current database. Data was collected by me and sorted into our database structure.
* Agents: Amending and adding bulk data to the database.
* GPT+Manus: Verification and updates of data.
The final output:
Interface: https://preview.redd.it/t2x6v9l0iz1h1.png?width=1776&format=png&auto=webp&s=c1be628542af6420eb4efee9f7ec62c2d40146f9
Inferences and patterns identified when AI (LLM+AGENTS) reviews the data: https://preview.redd.it/nkdio3z5iz1h1.png?width=832&format=png&auto=webp&s=01d0f0bc45e1968d0c692d712932f03e35969924
I add my own as well, and collaborate with AI to validate my understanding.
Evidence-based artifacts: All knowledge is sourced and tagged. https://preview.redd.it/fwcmjn28jz1h1.png?width=1253&format=png&auto=webp&s=861dcf33480d6e22919cf563a362c1c33c044734
These tie into a pattern identification graph so I can identify what may or may not be related. https://preview.redd.it/pegwypialz1h1.png?width=1424&format=png&auto=webp&s=d4b50e756354dc021fc106f5e91da3015ae0bd74
Would love any feedback for improvements. Please remember, the next iteration is for child protection, where I intend to airgap a localised LLM with training corpora.
The main idea is to MINIMISE how often users have to review images themselves, and to identify patterns/locations so that rescue can be expedited. I want to add that this is also entirely self-funded. I run a separate business to ensure I have funds for this and for potential future hardware/licensing. submitted by /u/ValehartProject [link] [comments]
Rewriting model inference with CUDA kernels: the bottleneck was not just GEMM [P]
I’ve been working on a CUDA-first inference runtime for small-batch / realtime ML workloads. The core idea is simple: instead of treating PyTorch / TensorRT / generic graph runtimes as the main execution path, I rewrite the model inference path directly with C++/CUDA kernels. This started from robotics / VLA workloads, but the problem is more general.
In small-batch inference, the bottleneck is often not just a single slow GEMM. A lot of latency comes from the runtime glue around the math:
- fragmented small kernels
- norm / residual / activation boundaries
- quantize / dequantize overhead
- layout transitions
- Python / runtime scheduling
- graph compiler fusion failures
- precision conversion around FP8 / FP4 regions
For cloud LLM serving, batching can hide a lot of this. For robotics, VLA, world models, and other realtime workloads, batch size is usually 1. There is nowhere to hide. Every launch, sync, and format boundary shows up directly in latency.
Some current results from my implementation:
| Model / workload | Hardware | FlashRT latency |
| Pi0.5 | Jetson Thor | ~44 ms |
| Pi0 | Jetson Thor | ~46 ms |
| GROOT N1.6 | Jetson Thor | ~41–45 ms |
| Pi0.5 | RTX 5090 | ~17.6 ms |
| GROOT N1.6 | RTX 5090 | ~12.5–13.1 ms |
| Pi0-FAST | RTX 5090 | ~2.39 ms/token |
| Qwen3.6 27B | RTX 5090 | ~129 tok/s with NVFP4 |
| Motus / Wan-style world model | RTX 5090 | ~1.3s baseline → targeting ~100ms E2E |
The Motus / world-model case is especially interesting. The baseline path is around 1.3s end-to-end. The target is ~100ms E2E, but the hard part is not simply “use a faster GEMM”. The bottlenecks are VAE, joint attention, launch fragmentation, and a large amount of glue around the actual math.
One lesson from this work: lower precision is not automatically a win. FP8 has been consistently useful. FP4 / NVFP4 is more mixed. It can help memory footprint and some large GEMM regions, but if the FP4 region is small, discontinuous, or surrounded by conversion / scaling overhead, the end-to-end speedup can be tiny.
For example, in some VLA / world-model paths, FP4 over FP8 only gives a few percent latency improvement unless the region is large and deeply fused. This changed how I think about inference optimization. For large-batch cloud serving, generic runtimes and batching are often enough. For realtime small-batch inference, the runtime overhead becomes the workload. Curious if others have seen similar behavior with torch.compile, TensorRT, XLA, Triton, or custom CUDA kernels. At what point do you stop trying to make a generic compiler optimize the model, and just rewrite the inference path directly? Implementation: https://github.com/LiangSu8899/FlashRT submitted by /u/Diligent-End-2711 [link] [comments]
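The observation that FP4 over FP8 yields only a few percent unless the region is large follows the familiar Amdahl's-law shape. The sketch below makes that explicit; the function, its parameters, and the example numbers are illustrative assumptions, not measurements from FlashRT.

```python
def end_to_end_speedup(fraction: float, region_speedup: float,
                       overhead_fraction: float = 0.0) -> float:
    """Amdahl-style estimate of whole-pipeline speedup.

    fraction          -- share of baseline latency spent in the accelerated region
    region_speedup    -- speedup achieved inside that region (e.g. FP4 vs FP8)
    overhead_fraction -- added conversion / scaling cost, as a share of baseline latency
    """
    new_time = (1.0 - fraction) + fraction / region_speedup + overhead_fraction
    return 1.0 / new_time


# Illustrative inputs only: the same 1.5x kernel-level gain and the same
# conversion overhead, applied to a small region and to a large one.
small_region = end_to_end_speedup(fraction=0.2, region_speedup=1.5, overhead_fraction=0.03)
large_region = end_to_end_speedup(fraction=0.8, region_speedup=1.5, overhead_fraction=0.03)
print(f"small region: {small_region:.2f}x, large region: {large_region:.2f}x")
```

With these made-up inputs the small region gives roughly a 4% end-to-end gain while the large one gives about 31%, which is the same qualitative gap the post reports between isolated and deeply fused FP4 regions.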
Is the future of coding agents JEPA? [D]
I heard Yann LeCun explain JEPA (Joint Embedding Predictive Architecture) recently and I started thinking about using it for coding agents.
Most coding agents today work by throwing a huge amount of text into a frontier LLM and asking it to generate the next patch. That is astonishingly useful, but it also feels architecturally wrong. A repo is not just a bag of tokens. A failing test is not just text. Software has state. An edit is an action. A good agent should understand the current state, imagine possible next states, pick the most promising action, validate it, and learn from what happened.
JEPA is not trying to predict every raw detail. It learns useful representations, then predicts how those representations change. The best metaphor is video. A generative model can try to predict every pixel in the next frame. But most pixels are not the point. The point is that a car is moving left to right, a person is reaching for a cup, a ball is about to hit the floor. Intelligence is not memorizing every pixel. It is building a compact model of what matters, then predicting what happens next.
Code has the same problem. Today’s LLM agent often stares at the pixels of the repo. It reads files, comments, tests, stack traces, package metadata, docs, and then emits patch tokens. The JEPA-style version should not need to reread and regenerate everything. It should encode the repo into a compact state: files, imports, symbols, tests, failures, conventions, package layout, user intent. Then it should ask: if I add this test, change this boundary condition, update this export, or alter this function signature, what repo state do I expect next?
If it works, the efficiency difference is not a small optimization. It is not 20 percent cheaper inference. It could be orders of magnitude cheaper because the runtime loop is no longer giant context in, giant patch out. The agent can run locally. It can keep structured memory. It can rank actions before running expensive validation. It can learn from every failed candidate. It can stop treating software engineering as text completion and start treating it as state transition planning.
What do others think? Is JEPA the future for Codex or Claude?
submitted by /u/andrewfromx
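The loop described in the post (encode the repo into a compact state, predict the next state for each candidate edit, rank candidates before running expensive validation, keep the failures) can be sketched roughly as below. This is a toy illustration, not JEPA itself: `RepoState`, `Action`, `predict_next_state`, `rank_actions`, and `plan_step` are all hypothetical names, and the "predictor" is a hand-written rule standing in for what would have to be a learned model.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class RepoState:
    """Hypothetical compact repo state (a real encoder would learn this)."""
    failing_tests: frozenset[str]
    exported_symbols: frozenset[str]


@dataclass(frozen=True)
class Action:
    """A candidate edit, described only by its assumed effect on tests."""
    name: str
    fixes: frozenset[str] = frozenset()
    breaks: frozenset[str] = frozenset()


def predict_next_state(state: RepoState, action: Action) -> RepoState:
    """Toy stand-in for a learned predictor: no code is run here."""
    failing = (state.failing_tests - action.fixes) | action.breaks
    return RepoState(failing, state.exported_symbols)


def score(state: RepoState) -> int:
    """Lower is better: number of tests predicted to fail."""
    return len(state.failing_tests)


def rank_actions(state: RepoState, actions: list[Action]) -> list[Action]:
    """Order candidates by predicted outcome, before any validation."""
    return sorted(actions, key=lambda a: score(predict_next_state(state, a)))


def plan_step(state, actions, validate):
    """Validate candidates in ranked order.

    Returns the first action that passes `validate` (e.g. a real test run)
    and the candidates that failed, which a learning system could train on.
    """
    failed = []
    for action in rank_actions(state, actions):
        if validate(state, action):
            return action, failed
        failed.append(action)
    return None, failed


# Illustrative data only; the test and action names are made up.
state = RepoState(frozenset({"test_bounds", "test_export"}), frozenset({"parse"}))
candidates = [
    Action("rename_function", breaks=frozenset({"test_import"})),
    Action("fix_boundary_condition", fixes=frozenset({"test_bounds"})),
    Action("update_export", fixes=frozenset({"test_export", "test_bounds"})),
]
ranked = rank_actions(state, candidates)
print([a.name for a in ranked])
# ['update_export', 'fix_boundary_condition', 'rename_function']
```

The point of the sketch is the ordering of work: prediction and ranking are cheap and happen first, while `validate` (the expensive step) only runs on the most promising candidates.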
After speccing 200 apps for Claude, here's what you can safely cut
I've now written design specs for 200 apps and fed them to Claude to rebuild the UIs in SwiftUI, Jetpack Compose, and Expo. Early on I over-specced everything. After 200, the pattern is clear: most of a long spec is dead weight, and a few parts carry the whole result, regardless of target framework.
What you can cut without hurting the clone:
- Prose descriptions of layout. Claude infers structure from the component list.
- Pixel margins on every element. A spacing scale covers it.
- Adjectives. "Clean, modern, minimal" changes nothing in the output.
What you cannot cut, the parts that move the result:
- Exact color values, not names.
- Every screen state listed up front (empty, loading, error, filled).
- The type scale as fixed values.
- Navigation as explicit screen-to-screen transitions.
Those four hold whether Claude targets Swift, Compose, or Expo. The framework changes how it's expressed, not what the spec needs. A spec that is just those four outperforms a three-page document.
Public, 200 apps, Swift / Jetpack Compose / Expo specs for each: github.com/Meliwat/awesome-ios-design-md
submitted by /u/meliwat
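One way to picture a spec reduced to those four parts is as a small data structure with a checker. The post does not give a spec format, so everything here is an assumption made for illustration: the `DesignSpec` and `Screen` classes, the `validate_spec` helper, the six-digit hex rule for "exact color values", and the sample values.

```python
from __future__ import annotations

import re
from dataclasses import dataclass

# The four screen states named in the post.
SCREEN_STATES = ("empty", "loading", "error", "filled")


@dataclass
class Screen:
    name: str
    states: tuple[str, ...] = SCREEN_STATES


@dataclass
class DesignSpec:
    """Hypothetical minimal spec: only the four parts the post says matter."""
    colors: dict[str, str]             # exact values (assumed: #RRGGBB hex)
    type_scale: dict[str, int]         # fixed sizes; unit left to the framework
    screens: list[Screen]              # each with its states listed up front
    navigation: list[tuple[str, str]]  # explicit (from_screen, to_screen)


def validate_spec(spec: DesignSpec) -> list[str]:
    """Return a list of problems; an empty list means the spec is complete."""
    problems = []
    for key, value in spec.colors.items():
        if not re.fullmatch(r"#[0-9A-Fa-f]{6}", value):
            problems.append(f"color {key!r} is not an exact hex value: {value!r}")
    if not spec.type_scale:
        problems.append("type scale is empty")
    names = {screen.name for screen in spec.screens}
    for screen in spec.screens:
        missing = [s for s in SCREEN_STATES if s not in screen.states]
        if missing:
            problems.append(f"screen {screen.name!r} is missing states: {missing}")
    for src, dst in spec.navigation:
        for end in (src, dst):
            if end not in names:
                problems.append(f"navigation refers to unknown screen {end!r}")
    return problems


# Illustrative values only.
spec = DesignSpec(
    colors={"primary": "#0A84FF", "background": "#FFFFFF"},
    type_scale={"title": 28, "body": 16, "caption": 12},
    screens=[Screen("home"), Screen("detail")],
    navigation=[("home", "detail"), ("detail", "home")],
)
print(validate_spec(spec))  # []
```

A checker like this only enforces presence and shape. Whether such a spec actually produces a better clone is the post's empirical claim, not something the code demonstrates.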
Yes, Inference offers a free tier. Pricing found: $25, $2.50, $5.00, $0.02, $0.05
Inference has an average rating of 5.0 out of 5 stars based on 1 review from G2, Capterra, and TrustRadius.
Key features include: Trusted by the world's best engineering teams; Deploy models from our catalog, or train your own, with 99.99% uptime; Production-grade LLM observability for any model on any provider; Fine-tune custom frontier-level language models in minutes; Continuously evaluate models against production traces; Faster than Cerebras; High intelligence, low cost; Your private data flywheel.
Inference is commonly used for: Deploying frontier AI models for real-time applications, Monitoring and evaluating model performance in production environments, Fine-tuning language models for specific business domains, Reducing latency in AI inference for customer-facing applications, Creating continuous improvement loops for model training, Transforming production traces into training datasets.
Sid Sheth
CEO at d-Matrix
4 mentions
Inference integrates with: AWS, Google Cloud Platform, Microsoft Azure, Kubernetes, Docker, TensorFlow, PyTorch, OpenAI API, Hugging Face Transformers, Datadog.
Based on user reviews and social mentions, the most common pain points are: token cost, token usage, API costs, openai.
Based on 133 social mentions analyzed, 10% of sentiment is positive, 89% neutral, and 1% negative.