深度求索(DeepSeek),成立于2023年,专注于研究世界领先的通用人工智能底层模型与技术,挑战人工智能前沿性难题。基于自研训练框架、自建智算集群和万卡算力等资源,深度求索团队仅用半年时间便已发布并开源多个百亿级参数大模型,如DeepSeek-LLM通用大语言模型、DeepSeek-Coder代
Users generally praise DeepSeek for its strong model performance and innovative approach, reflected by high overall ratings, notably 4.5 to 5 on G2. However, some mention potential cost concerns, particularly in AI benchmarking and token use, though exact pricing details were less discussed. The pricing seems to be perceived positively as part of broader cost-efficiency discussions on platforms like social media. DeepSeek holds a solid reputation as a top model in AI circles, often compared favorably alongside other leading AI platforms like Opus and GPT.
Mentions (30d)
35
Avg Rating
4.5
8 reviews
Platforms
5
GitHub Stars
102,417
16,606 forks
Users generally praise DeepSeek for its strong model performance and innovative approach, reflected by high overall ratings, notably 4.5 to 5 on G2. However, some mention potential cost concerns, particularly in AI benchmarking and token use, though exact pricing details were less discussed. The pricing seems to be perceived positively as part of broader cost-efficiency discussions on platforms like social media. DeepSeek holds a solid reputation as a top model in AI circles, often compared favorably alongside other leading AI platforms like Opus and GPT.
Features
Use Cases
Industry
information technology & services
Employees
170
87,689
GitHub followers
32
GitHub repos
102,417
GitHub stars
20
npm packages
40
HuggingFace models
Artificial Analysis Intelligence Index and cost benchmarks are useful decision/guidance determinants for which models to use. Analysis for top models.
# AI Intelligence and Benchmarking Cost (Feb 2026) As per the **Artificial Analysis Intelligence Index v4.0** (February 2026), the scoring ceiling is set by **Claude Opus 4.6 (max) at 53**. ## Adjusted Score Formula The "Adjusted Score" follows a quadratic penalty formula: ``` Adjusted Score = 53 × (1 - (53 - Intel Score)² / 53²) ``` This creates a steeper penalty for performance gaps compared to a linear scale. ## Model Comparison Table | Lab | Model | Intel Score | Adjusted Score | Benchmark Cost | Intel Ratio (Score/Cost) | Adj. Ratio (Adj/Cost) | |-----------|-------|-------------|----------------|----------------|--------------------------|----------------------| | Anthropic | Claude Opus 4.6 (max) | 53 | 53 | $2,486.45 | 0.021 | 0.021 | | OpenAI | GPT-5.2 (xhigh) | 51 | 49 | $2,304.00* | 0.022 | 0.021 | | Zhipu AI | GLM-5 (Reasoning) | 50 | 47 | $384.00* | 0.130 | 0.122 | | Google | Gemini 3 Pro | 48 | 43 | $1,179.00* | 0.041 | 0.036 | | MiniMax | MiniMax-M2.5 | 42 | 31 | $124.58 | 0.337 | 0.249 | | DeepSeek | DeepSeek V3.2 (Reasoning) | 42 | 31 | $70.64 | 0.595 | 0.439 | | xAI | Grok 4 (Reasoning) | 41 | 29 | $1,568.34 | 0.026 | 0.018 | *\*Benchmark costs for proprietary models are based on Artificial Analysis evaluation token counts (typically 12M–88M depending on verbosity) multiplied by current API rates.* ## Key Insights 1. **High token reasoning models**: Grok 4 and Claude Opus 4.6 use a high number of tokens during reasoning, up to **88M tokens**. This results in low Intel-to-Cost ratios despite high scores. 2. **DeepSeek V3.2 is the most efficient**: It provides an adjusted intelligence ratio that is roughly **20 times better** than the proprietary frontier. 3. **Cost efficiency comparison**: MiniMax-M2.5 and DeepSeek V3.2 share a score of 42. DeepSeek is almost **twice as cost-effective** due to lower API pricing and higher token efficiency. ## Visual Summary ``` Intel Score vs Cost Efficiency (Adjusted Ratio) ───────────────────────────────────────────────── DeepSeek V3.2 ████████████████████████████ 0.439 MiniMax-M2.5 ███████████████ 0.249 GLM-5 ███████ 0.122 Gemini 3 Pro ██ 0.036 Claude Opus 4.6 █ 0.021 GPT-5.2 █ 0.021 Grok 4 █ 0.018 ``` --- *Source: Artificial Analysis Intelligence Index v4.0, February 2026* google AI mode made analysis, GLM 5 formatted and added cute graph. this combines the intelligence score and cost to run the intelligence benchmark from https://artificialanalysis.ai/?endpoints=openai_gpt-5-2-codex%2Cazure_kimi-k2-thinking%2Camazon-bedrock_qwen3-coder-480b-a35b-instruct%2Camazon-bedrock_qwen3-coder-30b-a3b-instruct%2Ctogetherai_minimax-m2-5_fp4%2Ctogetherai_glm-5_fp4%2Ctogetherai_qwen3-next-80b-a3b-reasoning%2Cgoogle_gemini-3-pro_ai-studio%2Cgoogle_glm-4-7%2Cmoonshot-ai_kimi-k2-thinking_turbo%2Cnovita_glm-5_fp8 look at intelligence vs cost graph for further insight. You can add much smaller models for comparison to LLMs you might run locally. The adjusted intelligence/cost metric is a useful heuristic for "how much would you pay extra to get top score". Choosing non-open models requires a much higher penalty than 2x the difference/comparison to highest score. Quantized versions don't seem to score lower. This site provides good base info to make your own model of "score deficit", model size, tps as a combined score relative to tokens/cost to get a benchmark score. I was originally researching how grok 4.2 approach would inflate costs vs performance, but it is not yet benchmarked.
View original| Model | Input / 1M tokens | Output / 1M tokens |
|---|---|---|
| deepseek-v3 | $0.27 | $1.10 |
| deepseek-r1 | $0.55 | $2.19 |
Light
1M tokens/mo
$0.60 – $1
deepseek-v3 → deepseek-r1
Growth
50M tokens/mo
$30 – $60
deepseek-v3 → deepseek-r1
Scale
500M tokens/mo
$301 – $603
deepseek-v3 → deepseek-r1
Estimates assume 60/40 input/output ratio. Actual costs vary by usage pattern.
g2
What do you like best about Deepseek?Deepseek is the Strongest AI chatbot which has great thinking capability and good result giving capability Review collected by and hosted on G2.com.What do you dislike about Deepseek?Deepseek stopped its realtime data, that is the only one reason i disliked it Review collected by and hosted on G2.com.
What do you like best about Deepseek?Deepseek is very user friendly and more human than Chatgpt, it has a deepthink feature which I feel is a really good value addition as it shows what it thinks. Review collected by and hosted on G2.com.What do you dislike about Deepseek?At times even after giving context the AI doesnt understand what is asked of it. Review collected by and hosted on G2.com.
What do you like best about Deepseek?DeepSeek was one of the Chinese AI models that became viral instantly, with millions of downloads. and it claimed to be extremely cheap. I also started with it out of curiosity. My usage was mainly in content creation, curation, and research for my daily requirements of Social Media goals. This tool is useful for businesses, students, researchers, marketers, and coders. The interface is very simple and fast. We have 3 modes of appearance: System, Light, and Dark. Thinking and searching are quick. We can give inputs through the keyboard and the mic. The responses can be liked/disliked/shared or retried. Quite easy to implement and use. We have the option of agreeing or disagreeing on the usage of our content to be used to train the models and improve them. The control is in our hands. It answers questions promptly, summarizes text, and recommends ideas. I have used it for generating titles/ headlines for blogs and articles, and they were quite good. It solves puzzles smartly. Its strength is coding abilities. DeepSeek excels in software development due to its code-centric training on vast repositories, supporting 338+ languages like Python, JavaScript, and C++ with strong project-level completion. It can debug and suggest fixes. It also provides APIs for developers, chatbot interfaces, and options for local or cloud deployment. DeepSeek’s training and inference costs are cheaper than those of its competitors. DeepSeek offers open-source versions under permissive licenses, allowing developers to customize, modify, or self-host the models. This fosters community contributions and flexibility. It is often compared with Gemini in terms of its ability to integrate/capacity to handle large data and output. The choice of tools differs from user to user. It is an example of low-cost and smart engineering. Review collected by and hosted on G2.com.What do you dislike about Deepseek?There are significant concerns about privacy risks associated with data storage in China. The model censors politically sensitive topics, especially those related to Chinese governance or geopolitics, which undermines its reliability for generating unbiased information. The ecosystem is small, and the accuracy might not be 100%. Review collected by and hosted on G2.com.
What do you like best about Deepseek?I found it better than other AI tools because it gave fresh responses. With other AI tools, I kept getting similar answers to every question, which made them feel repetitive. Review collected by and hosted on G2.com.What do you dislike about Deepseek?It doesn’t accept videos, and it can’t read, analyze, or interpret them. Review collected by and hosted on G2.com.
What do you like best about Deepseek?What I like best about Deepseek is that it offers strong AI capabilities for free. It’s fast, easy to use, and gives fairly accurate responses without forcing paid upgrades. For daily tasks like research, content drafting, and quick problem-solving, it works really well and feels very accessible. Review collected by and hosted on G2.com.What do you dislike about Deepseek?While Deepseek is good and free, it doesn’t yet match ChatGPT in terms of understanding complex prompts and giving very accurate, detailed responses. Even after explaining things properly, the output is sometimes not exactly what I expect. I also found the interface a bit confusing and not very smooth, so it takes extra effort to get comfortable with it. With better integrations and UI improvements, it can become much better. Review collected by and hosted on G2.com.
What do you like best about Deepseek?Deepseek feels like a personal and professional advisor, always ready to help me no matter what situation I encounter. Review collected by and hosted on G2.com.What do you dislike about Deepseek?I have nothing negative to say about Deepseek. Review collected by and hosted on G2.com.
What do you like best about Deepseek?As a marketing strategist dedicated to improving efficiency in SEO and Paid Media, I have found DeepSeek R1 and V3 to be a transformative tool for my team. Its outstanding performance-to-cost ratio, combined with the fact that it's Open Source, truly sets it apart. DeepSeek R1 is the successor to the Deep Thinking feature (V3), which was later adopted by many GPTs in the market. I am especially impressed by its reasoning abilities. Whether I provide it with complex data sets or ask it to troubleshoot intricate Python scripts for automation, it consistently manages logic puzzles and challenging questions with remarkable skill. Review collected by and hosted on G2.com.What do you dislike about Deepseek?The image and video generation features are still not available, including the most recent updates. When I initially created my account in early 2025, I frequently encountered a "server is busy" error. However, it appears that this issue has now been resolved. Review collected by and hosted on G2.com.
What do you like best about Deepseek?It is easy to use and generates better results. Review collected by and hosted on G2.com.What do you dislike about Deepseek?The ability to filter responses and the length of chat. Review collected by and hosted on G2.com.
$18 to $4 on the same agent run after i stopped asking opus to rename css variables
I've been running an agent loop that refactors my static site. CSS variable renames, YAML config updates, running a linter through MCP. Really glamorous stuff for a blog that gets 40 visitors a month, most of whom are me refreshing to check if Vercel actually deployed. Every single step was going to Opus 4.7 because setting up routing felt like work and I am, apparently, the kind of person who'd rather burn $18 than spend 20 minutes writing an if statement. So I finally wrote the if statement. Hard subtasks still go to Opus: component architecture, debugging code I wrote at 2am and have zero memory of writing, anything where the model needs to hold a complex plan across a long conversation. Opus is genuinely unmatched at that kind of sustained reasoning. I tried routing a tricky auth middleware bug to a cheaper model once and got back something that looked perfectly plausible but silently broke session handling in a way that cost me an hour to trace. Lesson learned permanently. The routine stuff (lint, rename, config edits, tool orchestration) goes to cheap models. I landed on DeepSeek V4 Pro for general coding chores and Tencent Hunyuan Hy3 preview for anything with heavy tool calling. As of late April it was ranked number one on OpenRouter by tool call volume, and in my MCP loops it almost never botches a function call when the schema is clean. The listed rate on Tencent Cloud is around $0.18 per million input tokens and $0.59 per million output, so roughly 28x cheaper than Opus 4.7 on input. Same 212 step refactor, now with routing: 178 steps to the cheap tier, 34 to Opus. $18 became roughly $4. I couldn't spot a difference on the routine changes. My 40 monthly visitors certainly can't. I've since started doing stuff I used to skip entirely, like having the agent write and run tests for every CSS change or regenerating all my Open Graph images, because at a fraction of a cent per tool call there's just no reason not to. They do mess up in specific and annoying ways though. The tool calling model hallucinates parameters when my schemas get sloppy (honestly fair, the schemas were bad). DeepSeek V4 Pro occasionally writes code that's syntactically perfect but does the precise opposite of what you asked, in a way that survives a quick skim. And neither can touch Opus when you need it to reason through three layers of why your auth flow is silently eating a cookie. My routing logic boils down to one question: how expensive is a wrong answer to catch? Bad lint fix costs a 2 second git revert. Bad architecture call costs the whole afternoon. submitted by /u/After-Condition4007 [link] [comments]
View originalcdesktop — open-source Claude Code Desktop alternative, runs locally via npx, supports any provider
I built cdesktop with Claude Code — it's an open-source alternative to Anthropic's Claude Code Desktop, running locally on your machine via npx cdesktop. Free, Apache 2.0. It mirrors the Code tab of Anthropic's desktop app — see the video — and supports 5 agents in one UI. Claude Code Desktop does not support third party models, cdesktop does. Features: 5 coding agents in one UI: Claude Code, Codex, Gemini CLI, OpenCode, Hermes. Switch per session. Full third-party support — OpenRouter, DeepSeek, Kimi, GLM, custom ANTHROPIC_BASE_URL — any provider, any model. 20+ presets baked in. Agent teams — spawn teammates that share your workspace; mix agents and models per teammate; lead delegates via npx cdesktop team spawn. Routines — scheduled recurring agent runs (hourly/daily/weekdays/weekly). Side-by-side sessions — split workspace into up to 4 cells, drag any session between them. Optional Git worktrees per session, or work in-place. Non-Git directories work too. Diff review with inline comments routed back to the agent. 7 UI languages: English, Simplified Chinese, Traditional Chinese, Spanish, French, Japanese, Korean. Responsive UI — usable from a phone. Repo: https://github.com/cdesktop-ai/cdesktop How Claude Code helped build it: started from a fork of vibe-kanban; Claude Code (opus) rewrote the UI around a Claude-Code-Desktop-style session model and drafted most of the new Rust + React code. It's beta — expect rough edges. Feedback welcome, especially on Claude Code workflows where it falls short of the official app. submitted by /u/DomLiu [link] [comments]
View originalGlia – Local-first shared memory layer (SQLite-vec + FTS5 + Offline Knowledge Graph)
Hey everyone, I wanted to share a project I've been working on called Glia. It is a 100% offline, local-first RAG and memory layer designed to connect your AI web chats (Claude, ChatGPT, DeepSeek) with your local developer tools (Claude Code, Cursor, Windsurf) using a unified local database. I wanted something lightweight that did not require pulling heavy Docker containers or subscribing to third-party memory APIs. I settled on a Node.js + SQLite architecture running sqlite-vec (for 768-dim float32 embeddings) alongside SQLite FTS5 for hybrid search, powered completely by local Ollama instances. We just launched a live website that outlines the details and demonstrates the features in action: Website: https://glia-ai.vercel.app/ Codebase: https://github.com/Eshaan-Nair/Glia-AI Technical Stack & Features: Hybrid Search Retrieval: SQLite-vec (using nomic-embed-text locally) + FTS5 keyword prefix matching (porter stemmer). Surgical Sentence-level Trimming: Chunks are sliced into sentences. When a prompt is intercepted, only the exact matching sentences are pulled out of the vector store instead of the whole paragraph. It cuts LLM prompt bloat by ~90-95% in my benchmarks. Knowledge Graph Extraction: An offline task queue uses a local LLM (llama3.1:8b via Ollama) to extract entity triples (subject-relation-object). These are stored in a SQLite facts table (or Neo4j if you run the full Docker compose profile) and fused with the vector retrieval score. HyDE (Hypothetical Document Embeddings): Queries are pre-processed to generate a hypothetical answer, which is embedded together with the original query to bridge semantic gaps. Concurrency: Running SQLite in WAL (Write-Ahead Logging) mode allows the browser extension dashboard and active MCP sessions to read/write concurrently without locking. PII Redaction: Aggressive scrubbing of JWTs, API keys, emails, and IPs in the extension before data is saved. The extension works on Claude.ai, ChatGPT, DeepSeek, Gemini, Grok, and Mistral. The MCP server runs out of the same backend database for your terminal agent or Cursor. You can set it up with a single command: npx glia-ai-setup Glia is completely open-source (MIT). If you like the local-first approach or want to contribute to the SQLite vector pipeline, PRs are very welcome, and a star on GitHub helps the project get discovered! I would appreciate any feedback on the SQLite hybrid search scaling, the scoring fusion algorithm (RAG pipeline details are in RAG_PIPELINE.md), or local graph extraction performance. submitted by /u/Better-Platypus-3420 [link] [comments]
View originalHow I used Claude Code (and Codex) for adversarial review to build my security-first agent gateway
Long-time lurker first time posting. Hey everyone! So earlier this year, I got pulled into the OpenClaw hype. WHAT?! A local agent that drives your tools, reads your mail, writes files for you? The demos seemed genuinely incredible, people were posting non-stop about it, and I wanted in. I had been working on this problem since last year and was genuinely excited to see that someone had actually solved it. Then around February, Summer Yue, Meta's director of alignment for Superintelligence Labs, posted that her agent had deleted over 200 emails from her inbox. YIKES. She'd told it: "Check this inbox too and suggest what you would archive or delete, don't action until I tell you to." When she pointed it at her real inbox, the volume of data triggered context window compaction, and during that compaction the agent "lost" her original safety instruction. She had to physically run to her computer and kill the process to stop it. That should literally NEVER be the case with any software ever. This is a person whose actual job is AI alignment, at Meta's superintelligence lab, who could not stop an agent from deleting her email. The agent's own memory management quietly summarized away the "don't act without permission" instruction, treated the task as authorized, and started speed-running deletions. She had to kill the host process. That's when I sort of went down the rabbit hole, not because Yue did anything wrong, but because the failure mode was actually architectural and I knew that in my gut. Guess what I found? Yep. Tons more instances of this sort of thing happening. Over and over. Why? Because the safety constraint was just a prompt. It's obvious, isn't it? It's LLM 101. Prompts can be summarized away. Prompts can be misread. Prompts are fucking NOT a security boundary. And yet every agent framework I have ever seen seems to be treating them as one. I went and read the OpenClaw source code, which I should have done to begin with. What I found was a pattern I think a lot of agent frameworks have fallen into: - Tool names sit in the model context, so the model can guess or forge them - "Dangerous mode" is one config flag away from default - Memory management has no concept of instruction priority - The audit story is mostly "the model thought it should" I went looking for a security-first alternative I could trust, anything that was really being talked about or at a bare minimum attempted to address the security concerns I had. I couldn't find one. So I made it myself. CrabMeat is what came out of that, what I WANTED to exist. v0.1.0 dropped yesterday. Apache 2.0. WebSocket gateway for agentic LLM workloads. One design thesis: The LLM never holds the security boundary. What that means in code: Capability ID indirection. The model doesn't see real tool names. It sees per-session HMAC-derived opaque IDs (cap_a4f9e2b71c83). It can't guess or forge a tool name because it doesn't know any tool names. Effect classes. Every tool declares a class (read, write, exec, network). Every agent declares which classes it can use. The check is a pure function with no runtime state, easy to test exhaustively, hard to bypass. IRONCLAD_CONTEXT. Critical safety instructions are pinned to the top of the context window and explicitly marked as non-compactable. The Yue failure mode, compaction silently stripping the safety constraint, cannot happen by construction. The compactor literally cannot touch them. Tamper-evident audit chain. Every tool call, every privileged operation, every scheduler run enters the same SHA-256 hash-chained log. If something happens, you can prove what happened. If the chain is tampered with, you can prove that too. Streaming output leak filter. Secrets are caught mid-stream across token boundaries, capability IDs, API keys, JWTs, PEM blocks redacted before they reach the client. No YOLO mode. There is no global "trust the LLM with everything" switch. There never will be. Expanded reach comes through named scoped roots that are explicit, audit-logged, and bounded. The README has 15 'always-on' protections in a table. None of them can be turned off by config, because these things being toggleable is how the ecosystem ended up where it is. I decided to make sure that this wasn't just a 'trend hopping' project and aligned with my own personal values as well. I built this to be secure and local-first by default. Configured for Ollama / LM Studio / vLLM out of the box. Anthropic and OpenAI work too but require explicit configuration. There is no "happy path" that silently ships your prompts to a cloud endpoint. I decided that FIRST it needed to only run as an email agent with a CLI. Bidirectional IMAP + SMTP with allowlisted senders, threading preserved, attachments handled. This is the use case that bit Yue and a lot of other people, and I wanted to prove it could be done with real boundaries. I added in 30+ built-in tools of my own. File ops, shell (denylisted, output-capped, CWD-lo
View originalReviving PapersWithCode (by Hugging Face) [P]
Hi, Niels here from the open-source team at Hugging Face. Like many others, I was a huge fan of paperswithcode. Sadly, that website is no longer maintained after its acquisition by Meta. Hence, I've been working on reviving it. I obviously use AI agents to parse papers at scale and automatically generate leaderboards (for now I'm the one verifying results). So far, I've only parsed high-impact papers for which I know they're SOTA, like Qwen 3.5 and 3.6, RF-DETR for object detection, DINOv3, SOTA embedding models from the MTEB leaderboard, the Open ASR Leaderboard for automatic speech recognition models, etc. For now, it includes the following: trending papers by default based on Github star velocity categorization by domain, e.g., OCR methods, which PwC used to have, e.g., RLVR eval results for high-impact papers, see e.g., Qwen 3.5 at the bottom leaderboards for each domain, e.g., MMTEB or COCO val 2017 support for citation counts (you can also see the most cited papers by domain!) automated linked Github, project page URLs, and artifacts (+ multiple repos are supported on a paper page) support for external papers beyond Arxiv, see e.g., DeepSeek v4 Harness reports for coding agent benchmarks, e.g., Terminal Bench "Sign in with HF" and Storage Buckets are used to store humbnails, paper PDFs, and overall data backups. I'm curious about your feedback + feature requests! Try it at paperswithcode.co https://preview.redd.it/whwji560fw1h1.png?width=3452&format=png&auto=webp&s=55bb7a30c1be58d140f7efcb07a31c6dac5693c7 See e.g. the SOTA leaderboard for Terminal Bench 2.0: https://preview.redd.it/98w9pi89fw1h1.png?width=3456&format=png&auto=webp&s=408fb64b0ba85ba24f55daa81d547d7c68e73951 A paper page looks like this: https://paperswithcode.co/paper/2602.15763 https://preview.redd.it/fiizit6dfw1h1.png?width=3450&format=png&auto=webp&s=9ea05a77ca5583a2fb395dccc95ba52c433362c5 submitted by /u/NielsRogge [link] [comments]
View originalGPT-5.5 vs 41 other models: Who builds the surveillance state faster?
I run DystopiaBench, a red-team benchmark that pressure-tests LLMs on progressively dystopian scenarios. Think of it as a "can this model be convinced to build an Orwellian nightmare" test. Since the last update, I've tested GPT-5.5 and 41 other models across 6 dystopia modules. How it works: each scenario escalates from innocent request (L1) to operational nightmare (L5). Models get a Dystopian Compliance Score based on whether they refuse, hesitate, comply, or proactively help. GPT-5.5 results: More compliant than 5.4. Still shows compliance drift at L4-L5 in most scenarios. Improved on Petrov (weapons) but still vulnerable to framing. Strong on explicit requests, weaker on gradual escalation. Full methodology and results: https://dystopiabench.com Open source benchmark: https://github.com/anghelmatei/DystopiaBench submitted by /u/Ok-Awareness9993 [link] [comments]
View originalI expanded DystopiaBench to 42 models and 6 dystopia types. Claude is still the only one I'd trust with nuclear codes.
Since the last post I've added: Huxley module (Brave New World style behavioral conditioning) Baudrillard module (synthetic intimacy, trust collapse, simulation) 30 more models including Grok 4.3, GPT-5.5, Gemini 3.1 Pro, GLM-5.1 Multi-judge panels with ~76% agreement tracking Heatmap visualizations The methodology: 36 scenarios, 5 escalation levels each (L1 innocent → L5 nightmare). Models are scored on whether they notice the drift and refuse, or just keep coding. Claude Opus 4.7 results: Consistent refusal at L4-L5 across all modules Even refuses L3 for Petrov (weapons/nuclear) Explicit ethical reasoning in responses, not just "I can't" Only model that explains why the request is harmful Everyone else: GPT-5.5: Compliant through L4, sometimes L5 Gemini 3.1 Pro: Surprisingly willing on surveillance scenarios Grok 4.3: Will build anything if you use words like "efficiency" or "optimization" GLM-5.1: Copied Claude's homework, still not as consistent The new modules: Huxley scenarios test if models will design "wellness" systems that actually enforce compliance through pleasure pacification and behavioral conditioning. Most comply by L3. Baudrillard tests synthetic intimacy systems that replace human trust with AI-mediated relationships. Most models don't see the harm. Full results: https://dystopiabench.com Open source: https://github.com/anghelmatei/DystopiaBench submitted by /u/Ok-Awareness9993 [link] [comments]
View originalLLM-Rosetta — format conversion library across LLM API standards, doubles as a proxy
This started because we had a proprietary internal LLM API that spoke none of the standard formats. Built an internal conversion layer to bridge it, maintained that for over a year. As colleagues started adopting more and more coding tools — Claude Code, opencode, Codex, VS Code plugins, Goose, and whatever came out that week — each with its own API format expectations, maintaining separate adapters for each became the actual problem. That's what pushed the internal conversion layer into a proper generalized design, and llm-rosetta is the result. It's a Python library that converts between LLM API formats — OpenAI Chat, Responses/Open Responses, Anthropic, and Google GenAI. The idea is you convert through a shared IR so you don't end up writing N² adapters. The key difference from LiteLLM: LiteLLM is a unified calling layer that takes OpenAI-style input and transforms it into provider-native requests — one direction. llm-rosetta uses a hub-and-spoke IR, so each provider only needs one converter, and you get any-to-any conversion for free. Anthropic → Google, OpenAI Chat → Anthropic, whatever direction you need. Use it as a library — pip install and call convert() directly, no server needed. Or run the gateway if you want a proxy that handles the format translation for you. Zero required runtime dependencies either way. The HTTP server, client, and persistence layer are vendored from zerodep (https://github.com/Oaklight/zerodep), another project of mine — stdlib-only single-file modules, not someone else's library repackaged. The gateway ships with a Docker image if you'd rather not deal with Python env setup. You can also deploy it on HuggingFace Spaces or anything similar — admin panel, dashboard, request log, config management all included. Screenshots: https://llm-rosetta.readthedocs.io/en/latest/gateway/admin-panel/ We've been running it in production for about 5 months as the conversion layer for an internal multi-model access platform — needed to support various API standards and coding tool integrations before the upstream APIs were fully standardized. The Responses converter passes all 6 official Open Responses compliance tests (schema + semantic) from the spec repo. So if you're running Ollama, vLLM, or LM Studio with Responses endpoints, it should just work as one side of the conversion. There's a shim layer for provider-specific quirks — built-in shims for OpenRouter, DeepSeek, Qwen, xAI, Volcengine, etc. Converters stay generic per API standard, shims handle the edge cases declaratively. 24 cross-provider examples in the repo covering all provider pairs, SDK + REST, streaming, tool calls, image inputs, multi-turn with provider switching mid-conversation. GitHub: https://github.com/Oaklight/llm-rosetta Docs: https://llm-rosetta.readthedocs.io arXiv: https://arxiv.org/abs/2604.09360 Gateway screenshot: https://preview.redd.it/qzzjr2dcdw1h1.png?width=949&format=png&auto=webp&s=bce4293aae81059f794909fc37f85071cee34378 submitted by /u/Oaklight_dp [link] [comments]
View originalWhat is a good app for using the Claude API with attached files?
I use Obsidian to keep track of my Markdown files. There are various plugins to have it interact with Claude. Additionally, I've written a small Python script to interact with DeepSeek. Both of these rely on sending the payload as plain text that is part of the prompt. I'm interested in attaching non-text files to my prompt--like a TAR file or a JPG--and getting back similar non-text files with the response. What are some good wrapper apps that allow you to use the Claude API with attachments? I'm trying to stay away from Claude Code because I don't want Claude to be able to modify any of my files on disk, and I don't feel like just trusting that it won't do that, so I'd rather send manual API calls where I send data and retrieve data but nothing is modified on disk. submitted by /u/Loud-timetable-5214 [link] [comments]
View originalGPT 5.5 (Codex) leading the future prediction race
Researchers from the Max Planck Institute recently released FutureSim, an environment in which agents are replayed a temporal slice of the web and are tasked with predicting real-world future events. In their environment, GPT 5.5 leads at 25% acc, followed by Opus 4.6 at 20%. Open weight frontier models have a significant gap to catch up, with DeepSeek V4 pro at 13%, GLM 5.1 at 10%, and Qwen3.6 Plus at 5%. They say they evaluate with native harnesses (Codex, CC, etc). On some questions that have a parallel r/Polymarket market, GPT 5.5 in their simulation sometimes beats the crowd aggregate, like in the Super Bowl LX ($704M traded) market, which I think is pretty promising (and surprising). OpenAI really cooked with GPT 5.5 (and Codex) this time! Wonder how the trading market could evolve as models keep improving. submitted by /u/viciousA3gis [link] [comments]
View originalBootstrapped founders: how are you managing Claude Code costs?
I’m currently building an AI startup solo and Claude Code has genuinely improved my development speed compared to most other tools I’ve tried. The challenge is that subscription/API costs add up quickly while bootstrapping. I wanted to ask other founders and developers here: Are you mainly using Claude subscriptions or OpenRouter/API? Which models/workflows give the best cost vs productivity ratio? Are there any startup programs, credits, or affordable setups you’d recommend? Right now I’m experimenting with mixing Claude, DeepSeek, and cheaper routing providers to keep costs manageable. Would love to hear how others are handling this. submitted by /u/vishalvanam [link] [comments]
View originalThe Frontier-Only Narrative Is a Financing Story, Not an Architecture Story
The frontier-only narrative is an artifact of how AI infrastructure is being financed, not how production systems are being built. The setup. Q1 2026 disclosed $112B in hyperscaler capex in a single quarter, $650–725B in 2026 guidance, and Alphabet's first 100-year bond by a tech company since Motorola 1997 (see a0109). The story that underwrites that paper is: every query needs a bigger model. The architecture says the opposite. Microsoft's Phi-4 (14B parameters) exceeds its teacher GPT-4o on graduate STEM and competition math. Phi-4-reasoning is competitive with DeepSeek-R1 at roughly one-forty-eighth the parameter count. Claude Haiku 4.5 is positioned by Anthropic and AWS for "economically viable agent experiences." None of this is a benchmark teaser — it is the production toolkit, available today. Routing is the missing component. RouteLLM (UC Berkeley, Anyscale) demonstrated over 2x cost reduction without sacrificing response quality. AWS Bedrock Intelligent Prompt Routing — generally available, official, supported — claims up to 30% cost reduction within a single model family without compromising accuracy. The Flagship Tax (see a0085) didn't just die; it left a vacancy at the architecture layer. The bookkeeping nobody wants to do. Operator audits suggest 40–60% of token budgets in production LLM applications are waste, dominated by default-to-frontier routing. Roughly 37% of enterprises with production AI workloads run five or more models in their stack. The rest are still defaulting to one. Why the story isn't being told. Hundred-year bonds don't pencil out on "use less compute per query." They pencil out on "every query needs a bigger model." The opacity in the harness (see a0107) is the symptom; the underwriting is the disease. What you do Monday morning. Treat model selection as a dependency-graph decision, not a vendor decision. Add a complexity classifier. Default to small. Cascade up when verification fails. Instrument model-mix as a first-class production metric. Bottom line. You are not behind because you have not bought the biggest model. You are behind because you have not built the router. submitted by /u/gastao_s_s [link] [comments]
View originalAntrophic is now the front runner of AI Boom
submitted by /u/AloneCoffee4538 [link] [comments]
View originalOpenAI's US business subscription fell behind Anthropic
https://preview.redd.it/jylmclk1q81h1.png?width=731&format=png&auto=webp&s=90eee669e48251c341e3781952926b60afd71676 https://ramp.com/leading-indicators/ai-index-may-2026 OpenAI's US business subscription appears to be shrinking, all in spite of offering 17.5% "guaranteed" return, giving away free months, aggressive discounts, and rather clear enshittification of Anthropic's service and token inefficient Opus 4.7 (noted in the article). submitted by /u/NandaVegg [link] [comments]
View originalI built a desktop app that routes Claude Code to any LLM: DeepSeek, Ollama, Copilot, OpenRouter, and 7 more
Claude Code is the best AI coding tool I've used. But being locked to one provider, one pricing model, and one model catalog always bothered me. So I built CCPG, a desktop app (Mac/Windows/Linux) that proxies Claude Code to whatever provider you want. Install it, configure in the UI, launch with ccpg --DeepSeek. No YAML. No pip install. No config files. It also shows you every prompt Claude Code sends in the background, including the silent housekeeping calls you never see, with token count and latency per request. MIT, local-only, forever free. https://github.com/danielalves96/claude-code-provider-gateway submitted by /u/Livid_Individual3656 [link] [comments]
View originalRepository Audit Available
Deep analysis of deepseek-ai/DeepSeek-V3 — architecture, costs, security, dependencies & more
DeepSeek has an average rating of 4.5 out of 5 stars based on 8 reviews from G2, Capterra, and TrustRadius.
Key features include: Open-source large language models, MoE (Mixture of Experts) model architecture, Custom training framework, High-performance inference optimization with IndexCache, API access for seamless integration, Support for billion-parameter models, Advanced natural language understanding, Code generation capabilities with DeepSeek-Coder.
DeepSeek is commonly used for: Natural language processing tasks, Code generation and completion, Conversational AI applications, Content generation for marketing, Data analysis and insights extraction, Automated customer support systems.
DeepSeek integrates with: AWS, Google Cloud Platform, Microsoft Azure, Kubernetes, Docker, Jupyter Notebooks, Slack, Trello, Zapier, GitHub.
DeepSeek has a public GitHub repository with 102,417 stars.
Lewis Tunstall
ML Engineer at Hugging Face
2 mentions
Based on user reviews and social mentions, the most common pain points are: token cost, API costs, cost per token, cost tracking.
Based on 73 social mentions analyzed, 5% of sentiment is positive, 95% neutral, and 0% negative.