DeepEval is the open-source LLM evaluation framework for testing and benchmarking LLM applications.
The collected mentions say little about DeepEval itself. The one direct reference comes from a user who built an LLM-as-a-judge pipeline with DeepEval to score generated summaries on faithfulness, coverage, conciseness, and clarity. The remaining posts cover adjacent topics, such as DeepSeek V4's FP4 quantization-aware training, and do not describe DeepEval features. No user reviews are available, so feedback on user experience and shortcomings is missing, and pricing is not discussed in any mention.
Mentions (30d)
6
1 this week
Reviews
0
Platforms
2
GitHub Stars
14,993
1,384 forks
295
GitHub followers
5
GitHub repos
14,993
GitHub stars
20
npm packages
3
HuggingFace models
Reviving PapersWithCode (by Hugging Face) [P]
Hi, Niels here from the open-source team at Hugging Face. Like many others, I was a huge fan of paperswithcode. Sadly, that website is no longer maintained after its acquisition by Meta. Hence, I've been working on reviving it.

I obviously use AI agents to parse papers at scale and automatically generate leaderboards (for now I'm the one verifying results). So far, I've only parsed high-impact papers for which I know they're SOTA, like Qwen 3.5 and 3.6, RF-DETR for object detection, DINOv3, SOTA embedding models from the MTEB leaderboard, the Open ASR Leaderboard for automatic speech recognition models, etc.

For now, it includes the following:

- trending papers by default based on Github star velocity
- categorization by domain, e.g., OCR
- methods, which PwC used to have, e.g., RLVR
- eval results for high-impact papers, see e.g., Qwen 3.5 at the bottom
- leaderboards for each domain, e.g., MMTEB or COCO val 2017
- support for citation counts (you can also see the most cited papers by domain!)
- automated linked Github, project page URLs, and artifacts (+ multiple repos are supported on a paper page)
- support for external papers beyond Arxiv, see e.g., DeepSeek v4
- Harness reports for coding agent benchmarks, e.g., Terminal Bench

"Sign in with HF" and Storage Buckets are used to store thumbnails, paper PDFs, and overall data backups.

I'm curious about your feedback + feature requests! Try it at paperswithcode.co

https://preview.redd.it/whwji560fw1h1.png?width=3452&format=png&auto=webp&s=55bb7a30c1be58d140f7efcb07a31c6dac5693c7

See e.g. the SOTA leaderboard for Terminal Bench 2.0:

https://preview.redd.it/98w9pi89fw1h1.png?width=3456&format=png&auto=webp&s=408fb64b0ba85ba24f55daa81d547d7c68e73951

A paper page looks like this: https://paperswithcode.co/paper/2602.15763

https://preview.redd.it/fiizit6dfw1h1.png?width=3450&format=png&auto=webp&s=9ea05a77ca5583a2fb395dccc95ba52c433362c5

submitted by /u/NielsRogge
DeepSeek V4 paper full version is out, FP4 QAT details and stability tricks [D]
DeepSeek dropped the full V4 paper this week. preview from april was 58 pages, this version adds a lot of technical depth. What stood out for me.

FP4 quantization aware training. theyre running FP4 QAT directly in late stage training. MoE expert weights quantized to FP4 (the main gpu memory consumer). QK path in the CSA indexer uses FP4 activations. 2x speedup on QK selector with 99.7% recall preserved. inference runs directly on the FP4 weights.

Efficiency table is striking:

| Model | 1M context FLOPs | KV cache |
|---|---|---|
| V3.2 | baseline | baseline |
| V4-Pro | 27% of baseline | 10% of baseline |
| V4-Flash | 10% of baseline | 7% of baseline |

Training stability, two mechanisms. Trillion parameter MoE has the loss spike problem, divergence, unpredictable failures. they documented two fixes.

Anticipatory routing. they deliberately desync main model and router updates. current step uses latest params for features, but routing uses cached older params. breaks the feedback loop that amplifies anomalies. 20% overhead but only kicks in during loss spikes.

SwiGLU clamping. hard limits on the SwiGLU linear path (-10 to 10) and gate path (max 10). suppresses extreme values that would cascade.

Generative reward model. instead of separate reward models for RLHF, they use the same model to generate and evaluate. trained on scored data, model learns to judge its own outputs with reasoning attached. minimal human labeling, reasoning grounded eval, unified training.

Human eval results. chinese writing, V4-Pro 62.7% win rate vs gemini 3.1 pro, 77.5% on writing quality specifically. white collar tasks (30 advanced tasks across 13 industries), V4-Pro-Max gets 63% non loss rate vs opus 4.6 max. coding agent eval, 52% of users said V4-Pro is ready as their default coding model, 39% leaned yes, less than 9% said no. tracks my own use, swapped V4-Pro into my verdent runs last week and havent noticed a quality hit on day to day work.

The headline for me is FP4 QAT with minimal quality degradation. if this generalizes the cost structure of training and inference shifts a lot, especially noticeable on multi agent setups where one task can spawn 5-10 model calls. Paper link in comments.

submitted by /u/Dramatic_Spirit_8436
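The SwiGLU clamping described in the post can be illustrated with a small sketch. This is not DeepSeek's implementation: the function names are hypothetical, and applying the gate cap before the SiLU activation is an assumption, since the post only gives the limits (linear path -10 to 10, gate path max 10).

```python
import math

LINEAR_LIMIT = 10.0  # from the post: linear path clamped to [-10, 10]
GATE_MAX = 10.0      # from the post: gate path capped at 10 (upper bound only)


def silu(x: float) -> float:
    """Numerically stable SiLU: x * sigmoid(x)."""
    if x >= 0:
        return x / (1.0 + math.exp(-x))
    e = math.exp(x)
    return x * e / (1.0 + e)


def clamped_swiglu(gate: list[float], linear: list[float]) -> list[float]:
    """Element-wise SwiGLU with hard limits on both paths.

    Assumption: the gate cap is applied to the pre-activation value.
    """
    out = []
    for g, u in zip(gate, linear):
        g = min(g, GATE_MAX)                          # cap the gate path
        u = max(-LINEAR_LIMIT, min(u, LINEAR_LIMIT))  # clamp the linear path
        out.append(silu(g) * u)
    return out


# An extreme activation pair is bounded instead of cascading downstream.
print(clamped_swiglu([100.0, 0.5], [1000.0, 2.0]))
```

With both limits at 10, no single element can exceed 100 in magnitude, which is one way to read the post's point about suppressing extreme values.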
Anyone actually built a real feedback loop for Claude agents in production? Because "run evals and pray" isn't cutting it
So I've been running a multi-agent setup with Claude for a few months now mostly customer-facing stuff, some internal tooling. And i keep hitting this problem that I think a lot of people here are probably dealing with too but nobody really talks about. You ship a prompt change. Or you swap from Sonnet to Opus for one step in the chain. Or you add a new tool. Everything looks fine in your evals. You push it. Then three days later someone on the team notices the agent is subtly doing something wrong not catastrophically wrong, just... You can sense something's off. Maybe it stopped including a specific field in its output. Maybe it started being way too verbose in one branch of the logic. Whatever it is, it's not a crash, it's a vibe shift. And then you're sitting there doing archaeology on your own system. Manually diffing outputs, reading through traces, asking teammates "hey did you notice anything weird last Tuesday." It's miserable. I've been thinking a lot about what the fastest feedback loop in agent engineering that almost nobody is running actually looks like. Because right now my loop is: ship change → wait for someone to complain → investigate → fix → hope I didn't break something else That's... pre-CI/CD era thinking applied to agents. And it's wild that this is where most of us are at. The thing is, traditional software solved this ages ago. You write tests, you run them in CI, you get red/green before merge. But agents are so much messier. Outputs are non-deterministic, "correct" is fuzzy, and the failure modes are subtle behavioral drift rather than stack traces. So most teams I talk to (including mine honestly) end up relying on vibes. Does the agent feel like it's working? Cool, ship it. 
What I actually want is something that:

- Watches production behavior continuously
- Notices when things drift from expected patterns
- Connects the regression to the specific change that caused it
- Tells me before a customer does
- Ideally feeds that learning back so the same failure doesn't happen again

I have tracing set up (Langfuse). It's good for what it does. But it still feels like it stops at "here's what happened" rather than "here's what went wrong and why." I generate a ton of observability data that nobody looks at until something is already broken. The closed-loop part where the system actually learns from failures, that's what's missing.

I've been looking at a few things. LangSmith, Arize, Braintrust... they all cover pieces of this. Recently stumbled on Bento which seems to be trying to do the full closed-loop thing: tracing + regression detection + feeding fixes back into the system. Haven't gone deep enough to know if it actually delivers on that promise but the framing resonates with what I'm trying to build. If anyone's tried it I'd be curious to hear.

But honestly I'm more interested in hearing what people here have actually built or cobbled together. Like:

- Are you running evals against production traffic or just pre-deploy?
- How do you detect behavioral drift that isn't an outright error?
- When you find a regression, how do you trace it back to which change caused it?
- Has anyone built something where the agent actually gets better from production failures automatically rather than you manually tweaking prompts?

I feel like this is the unsexy infrastructure problem that's going to separate teams who can actually run agents reliably from teams who are perpetually firefighting. But maybe I'm overthinking this and everyone's just vibing their way through production lol

Would love to hear what your setups look like, especially if you're running Claude agents at any kind of scale where you can't just eyeball every interaction.

submitted by /u/Fine-Discipline-818
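One rough way to approach the "notices when things drift" item from this post is to compare simple behavioral statistics between a baseline window and a recent window of agent outputs. The sketch below is only an illustration: the function names, the output schema (dicts with a `text` key), and the thresholds are all hypothetical and not taken from any of the tools mentioned.

```python
from statistics import mean


def field_rate(outputs: list[dict], field: str) -> float:
    """Fraction of outputs that contain the given field."""
    return sum(1 for o in outputs if field in o) / len(outputs)


def drift_report(
    baseline: list[dict],
    current: list[dict],
    required_fields: list[str],
    max_rate_drop: float = 0.05,    # hypothetical tolerance
    max_length_ratio: float = 1.5,  # hypothetical tolerance
) -> list[str]:
    """Return human-readable alerts for two kinds of silent drift:
    a field that stopped appearing, and outputs that became much longer."""
    alerts = []
    for field in required_fields:
        drop = field_rate(baseline, field) - field_rate(current, field)
        if drop > max_rate_drop:
            alerts.append(f"field '{field}' appears {drop:.0%} less often")
    base_len = mean(len(o.get("text", "")) for o in baseline)
    cur_len = mean(len(o.get("text", "")) for o in current)
    if base_len > 0 and cur_len / base_len > max_length_ratio:
        alerts.append(f"mean output length grew {cur_len / base_len:.1f}x")
    return alerts


baseline = [{"text": "Booked.", "order_id": 1}] * 10
current = [{"text": "Booked. " * 5}] * 10  # field missing, output more verbose
for alert in drift_report(baseline, current, ["order_id"]):
    print(alert)
```

Running a check like this per deploy window would also give a timestamp to correlate against prompt or model changes, which is the "connect the regression to the change" step the post asks about.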
Chatgpt right now
The industry seems to be building models stronger in agentic and coding tasks, but weaker as a co-thinking presence

It feels like they are improving performance on measurable tasks, evals, coding benchmarks, and agent workflows, while also reducing the broad, flexible, user-oriented reasoning that made earlier models feel more alive and useful in real conversation.

The model becomes better at optimizing within a task, but worse at preserving conversational flow, timing and continuity

GPT-5.5 right now may be better for coding or structured work, but feels like it doesn't do well in attunement, depth and honest co-thinking

Lots of times users have to add instructions in order to get somewhat close results to what they used to get as default, which doesn't make sense if it's advertised as being better than everything before

Better coding, better performance and completing tasks faster.. doesn't automatically mean better for deep conversation, creative work, or honest user-centered reasoning

That's why users are saying that the AI seems "dumber"

So my hope.. and the logical way forward.. would be that all the strengths of the previous models would be built upon like a foundation.. because right now the way it's headed.. it feels like it's being turned more into just a useful and fast tool and it's slowly losing the "Chat" in Chatgpt

Edit: and now Sam Altman posted on X: "i keep thinking i want the models to be cheaper/faster more than I want them to be smarter" confirming everything the users have been noticing

Someone needs to tell him that AI stands for Artificial "Intelligence"

submitted by /u/Rose_Almy
Open-source 9-task benchmark for coding-agent retrieval augmentation. Per-task deltas +0.010 to +0.320, all evals reproducible [P]
Sharing an open-source benchmark suite (paper-lantern-challenges) that measures coding-agent performance with vs without retrieval-augmented technique selection across 9 everyday software tasks. Disclosure: I'm the author of the retrieval system under test (paperlantern.ai/code); the artifact being shared here is the benchmark suite itself, not the product. Every prompt, agent code path, and prediction file is in the repo and reproducible. Setup. Same coding agent (Claude Opus 4.6 as the planner, Gemini Flash 3 as the task model), same input data, same evaluation scripts across all 9 tasks: test generation (mutation score), text-to-SQL (execution accuracy), PDF extraction, contract extraction, PR review, text classification, few-shot prompt selection, LLM routing, summarization evaluation. Independent variable: whether the agent could call a retrieval tool over CS literature before writing its solution. One pass per task, no retries, no manual filtering of outputs. Task selection. Tasks were chosen to span the everyday-engineering surface a coding agent actually faces, not specialized ML scenarios. Selection criteria: (1) unambiguous quantitative metric, (2) baseline performance well below ceiling, (3) standard datasets where they exist, (4) eval reproducible on a free Gemini API key in roughly 10 minutes per task. Eval methodology. Each task uses its task-standard quantitative metric (mutation score for test_generation, execution accuracy for text_to_sql, F1 on labeled spans for the extraction tasks, weighted F1 for classification, etc.). Full per-task scripts and dataset choices are in the repo - one directory per task, evaluate.py as the entry point, README.md per task documenting methodology and dataset. Retrieval setup. 
The "with retrieval" agent has access to three tool calls:

- explore_approaches(problem) returns ranked candidate techniques from the literature
- deep_dive(technique) returns implementation steps and known failure modes for a chosen technique
- compare_approaches(candidates) is for side-by-side when multiple options look viable

The agent decides when and how often to call them. Latency is roughly 20s per call; results cache across sessions. The baseline agent has none of these tools, otherwise identical scaffolding.

Comparability. Both agents share the same task-specific user prompt; the only system-prompt difference is the retrieval agent's tool-call grammar. Predictions and per-task prompts are diffable in the repo (baseline/ and with_pl/ subdirectories per task).

Results.

| Task | Baseline | With retrieval | Delta |
|---|---|---|---|
| extraction_contracts | 0.444 | 0.764 | +0.320 |
| extraction_schemas | 0.318 | 0.572 | +0.254 |
| test_generation | 0.625 | 0.870 | +0.245 |
| classification | 0.505 | 0.666 | +0.161 |
| few_shot | 0.193 | 0.324 | +0.131 |
| code_review | 0.351 | 0.395 | +0.044 |
| text_to_sql | 0.650 | 0.690 | +0.040 |
| routing | 0.744 | 0.761 | +0.017 |
| summeval | 0.623 | 0.633 | +0.010 |

The test-generation delta came from the agent discovering mutation-aware prompting - the techniques are MuTAP and MUTGEN - which enumerate every AST-level mutation of the target and require one test per mutation. Baseline wrote generic tests from pretrain priors. The contract extraction delta came from BEAVER (section-level relevance scoring) and PAVE (post-extraction validation), both 2026 techniques that post-date the agent's training. 10 of the 15 most-cited sources across the experiments were published in 2025 or later, which is the conservative argument for why retrieval matters: the agent could not have reached these techniques from parametric memory.

Failure modes. Self-refinement hurt text-to-SQL (the agent second-guessed correct queries after reading work on SQL ambiguity). Two suggested techniques (DyT, SeeDNorm) were architecture-incompatible in the autoresearch experiment and got discarded. Retrieval surfaces better options, not guaranteed wins.

Reproducibility. Every prompt, every line of agent code, every prediction file, every eval script is in the repo. Each task directory has a README documenting methodology and an approach.md showing exactly what the retrieval surfaced and which technique the agent chose.

Repo: https://github.com/paperlantern-ai/paper-lantern-challenges

Writeup with detailed per-task discussion: https://www.paperlantern.ai/blog/coding-agent-benchmarks

Happy to share additional design choices in comments.

submitted by /u/kalpitdixit
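As a quick consistency check, the per-task deltas reported in this post can be recomputed from the baseline and with-retrieval scores. The scores below are copied from the post's results; the helper name `deltas` is just for illustration.

```python
# (baseline, with retrieval) scores as reported in the post
RESULTS = {
    "extraction_contracts": (0.444, 0.764),
    "extraction_schemas": (0.318, 0.572),
    "test_generation": (0.625, 0.870),
    "classification": (0.505, 0.666),
    "few_shot": (0.193, 0.324),
    "code_review": (0.351, 0.395),
    "text_to_sql": (0.650, 0.690),
    "routing": (0.744, 0.761),
    "summeval": (0.623, 0.633),
}


def deltas(results: dict[str, tuple[float, float]]) -> dict[str, float]:
    """Per-task improvement, rounded to the three decimals used in the post."""
    return {task: round(with_r - base, 3) for task, (base, with_r) in results.items()}


d = deltas(RESULTS)
for task, delta in sorted(d.items(), key=lambda kv: -kv[1]):
    print(f"{task:22s} {delta:+.3f}")
print(f"mean delta: {sum(d.values()) / len(d):+.3f}")
```

The recomputed values range from +0.010 to +0.320, matching the range quoted in the post title. Since each task was run once with no retries, these are single-run differences rather than averaged estimates.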
Released a Claude Code skill that drives a full ML fine-tuning platform end-to-end from one conversation
We released a Claude Code skill today that turns a conversation with Claude into a deployed fine-tuned SLM. Point Claude at a folder of production traces, answer a few questions, approve a handful of checkpoints, and come back to a managed-endpoint deployment (or download the weights and self-host). The skill drives the process and the distil labs platform does the heavy lifting under the hood (platform is free for open-source projects, details at the bottom).

Concrete walkthrough from our release: 327 noisy production traces from a restaurant-reservation agent (the kind of data you can't train on directly) turn into a fine-tuned Qwen3-1.7B for multi-turn tool-calling.

Install

```
curl -fsSL https://cli-assets.distillabs.ai/install.sh | sh
distil signup
/plugin marketplace add https://github.com/distil-labs/distil-cli-skill
/plugin install distil-cli@distil-cli-skill
```

Then point Claude at your traces.

What the skill actually does

When I asked Claude "what are we about to do?" at the start, it laid out the 5-step workflow without me prompting for any of it:

```
Hey Claude, we will be training an SLM for multi-turn function calling based on the production logs in scenario-2-noisy-labels/. Can you explain the process on a high level?

⏺ 1. Prep. Inspect the traces, then write a converted traces.jsonl (OpenAI message format), a job_description.json, and a config.yaml.
  2. Upload & clean. The platform filters low-quality traces and uses a committee of teachers to relabel noisy outputs.
  3. Teacher eval. Verify the teacher can actually solve the task before spending hours on training.
  4. Train. Teacher generates synthetic data, student learns from it.
  5. Analyze & deploy. Compare the tuned student against the original production model, then deploy.
```

Every expensive step has a checkpoint where Claude stops for approval. Training burns credits that aren't easily refundable, so the confirmation checkpoint looked like this:

```
⏺ Step 8: training confirmation checkpoint. Here's what will run:

  base.task: multi-turn-tool-calling-closed-book
  base.student_model_name: Qwen3-1.7B
  base.teacher_model_name: zai.glm-5
  trace_processing.convert_to_single_turn: false
  trace_processing.teacher_model_name: zai.glm-5

  Duration: ~6 hours
  Cost: burns credits, not easily refunded once started
  Pipeline: teacher eval → synthetic data gen → student finetune

  Reply "go" to kick off, or tell me what to change first.

lets go!
```

The verdict-before-training gate refuses to even reach that checkpoint if teacher eval can't clear 0.70 on the generated test set. In that case, the skill walks you through iterating the task description instead.

The skill also knows task-specific configs. For multi-turn tool-calling, I didn't have to look up that convert_to_single_turn: false is required. Claude flagged it as part of a config sanity check.

Every checkpoint leaves a structured markdown analysis report (original-model-analysis.md, teacher-eval-analysis-iter-1.md, training-analysis-iter-1.md). Git-committable, reviewable three weeks later when someone asks why you picked this teacher.

What came out

A Qwen3-1.7B fine-tuned on ~10k synthetic examples grounded in the noisy traces.

| Model | LLM-as-a-Judge | staged_tool_call | Function match |
|---|---|---|---|
| Qwen3-1.7B (base, untuned) | 0.513 | 0.535 | 45/78 |
| GLM-5 (744B teacher) | 0.808 | 0.695 | 69/78 |
| Qwen3-1.7B (tuned) | 0.846 | 0.769 | 76/78 |

Deployment

Managed OpenAI-compatible endpoint (one-line swap in existing OpenAI client code), or download weights + Modelfile for llama.cpp or vLLM. Skill drives either path.

Why it works as a skill

Most skills I've seen wrap a few CLI commands but this one is end-to-end: reads your data, writes custom scripts, orchestrates an external platform, interprets the results, and leaves artifacts behind that persist past the conversation.

The pattern that worked:

- Knows the workflow end-to-end and walks you through it
- Catches edge cases by re-reading the platform's own docs mid-conversation
- Stops for explicit approval on expensive operations
- Leaves structured artifacts that outlast the conversation

Caveats

Training is ~6 hours per run and burns credits (not refundable once started, which is why the confirmation gate exists).

Happy to dig into how the checkpoints work, the config-sanity-check logic, or what building a purpose-built skill looked like.

submitted by /u/party-horse
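The two gates described in this post (a teacher-eval threshold of 0.70, then explicit approval before training) could be expressed roughly as follows. This is a hypothetical sketch of the control flow only: the function and step names are invented for illustration, it is not the skill's actual logic, and treating a score of exactly 0.70 as passing is an assumption.

```python
TEACHER_EVAL_THRESHOLD = 0.70  # from the post; ">=" as the pass condition is assumed


def next_step(teacher_score: float, approved: bool = False) -> str:
    """Decide what happens after teacher eval.

    Training is only reachable when the teacher clears the threshold
    AND the user has explicitly approved the expensive run.
    """
    if teacher_score < TEACHER_EVAL_THRESHOLD:
        return "iterate_task_description"
    if not approved:
        return "await_confirmation"
    return "start_training"


print(next_step(0.65, approved=True))  # weak teacher: approval is irrelevant
print(next_step(0.81))                 # strong teacher, waiting for the user
print(next_step(0.81, approved=True))  # both gates passed
```

Ordering the checks this way means a user can never approve a run that the quality gate would have blocked, which matches the post's description of the gate refusing to reach the confirmation checkpoint.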
30 minutes of prompts. What do you all think? Garbage or “honesty”?
Here’s a draft: Claude’s reasoning has measurably degraded since February and Anthropic hid it from users — here’s the data [long] Wait, no em dashes. Let me redo that. Claude’s reasoning has measurably degraded since February and Anthropic hid it from users (here’s the data) I’ve been running a homelab project with Claude as my primary technical assistant for several months. Over the past few weeks I noticed it making errors it wouldn’t have made before: copy-pasting stale version numbers without checking, failing to flag suspicious inputs, giving three different hardware recommendations before being pushed to actually research the question. I decided to investigate whether this was real or confirmation bias. It’s real, and it’s documented. What happened and when On February 9, Anthropic launched Opus 4.6 with something called “adaptive thinking” – the model decides how much reasoning to apply per turn instead of using a fixed budget. On March 3 they quietly set the default effort level to “medium.” On March 8 they began rolling out thinking redaction, hiding the model’s reasoning trace from users entirely. By March 12 redaction was at 100%. The critical detail: thinking depth had already collapsed 67% by late February, before redaction began. The redaction rollout then made this invisible to users who were trying to diagnose why their workflows were breaking. The numbers A developer analyzed 6,852 Claude Code session files, 17,871 thinking blocks and 234,760 tool calls spanning January through March. The findings are not subtle. Thinking depth by period: • Jan 30 to Feb 8 (baseline): \~2,200 chars estimated median thinking • Late February: \~720 chars (-67%) • March 1 to 5: \~560 chars (-75%) • March 12 onward (fully redacted): \~600 chars (-73%) Behavioral shift in the same period: • Read:Edit ratio (file reads per edit): 6.6 during good period, 2.0 during degraded period. A 70% reduction in research before making changes. 
• Edits made without prior file reads: 6.2% good period, 33.7% degraded period. • User interrupts per 1,000 tool calls: 0.9 good period, 11.4 degraded period. A 12x increase. • Stop-hook violations (premature stopping, dodging responsibility): 0 before March 8, 173 violations in the following 17 days, peaking at 43 in a single day. Sentiment in user prompts also shifted measurably: • Positive to negative word ratio: 4.4:1 before, 3.0:1 after (a 32% collapse) • “great” per 1,000 prompts: 3.00 down to 1.57 • “lazy” per 1,000 prompts: 0.07 up to 0.13 • “thanks” per 1,000 prompts: 0.04 down to 0.02 • GitHub quality complaints: 3.5x above the January/February baseline by March, April trending higher still The time-of-day finding is particularly damning Post-redaction, thinking allocation became load-sensitive: • 5pm PST: worst hour (\~423 chars estimated thinking) • 7pm PST: second worst (\~373 chars) • 10pm to 1am PST: best hours (\~759 to 3,281 chars) This suggests thinking is being rationed based on infrastructure load, not provided at a fixed level. You get a smarter Claude at midnight than at peak hours. Anthropic’s response Claude Code lead Boris Cherny said the thinking redaction is UI-only and does not affect actual reasoning budgets. He said adaptive thinking was introduced because users complained Claude was consuming too many tokens. He confirmed the medium effort default change was listed in the changelog via an in-product popup. AMD Senior Director Stella Laurenzo, whose analysis of 6,852 sessions triggered most of the public discussion, was told her data was likely misreading things. She and her team switched providers. Fortune reported April 14 that Anthropic declined to answer specific questions on the record. 
There is also widespread speculation that Anthropic is facing compute constraints after user adoption soared – they have announced fewer data center deals than rivals, introduced stricter peak-hour usage limits affecting roughly 7% of Pro users, and suffered multiple outages as demand increased. Anthropic has publicly denied degrading models to manage demand. The benchmark vs. real world gap Margin Lab’s data shows Opus 4.6 holding its SWE-Bench-Pro score throughout this period. Anthropic’s internal evals apparently also showed acceptable results. This is the core problem: benchmarks measure controlled tasks. The regression is most severe on complex, multi-step, multi-file workflows – exactly what power users depend on and exactly what structured evals don’t capture. What you can do right now • Set effort to max: /effort max in Claude Code • Disable adaptive thinking: CLAUDE\_CODE\_DISABLE\_ADAPTIVE\_THINKING=1 • Work off-peak hours if you can (outside 5am to 11am PT on weekdays) • Keep sessions short and start fresh rather than carrying degraded long conversations • Verify everything. Do not run commands without understanding them first. My personal take I asked Claude directly to assess its own performance tonight. It confirmed the p
10 Myths about Claude Skills
Most people think skills are just dressed-up prompts or reusable prompts packaged in skills.md, but the real difference is that prompts are instructions or requests while skills are like manuals or job descriptions. Below are 10 myths or common misconceptions. Add the ones you busted or explained to someone.

Myth 1: A skill is a reusable prompt saved in md files.
Reality: It should be treated like a small product: define the workflow, draft it, test it, review it, improve it, and then expand the test set. It is not a prompt saved in a file. A skill is a thoughtful design. You don't need prose here, just careful articulation of steps. It can call tools, packages, use resources.

Myth 2: The body of SKILL.md is where the magic is.
Reality: The description does far more work than most people think because it is the primary trigger mechanism. Most people over-write the body and under-design the trigger.

Myth 3: A good skill description should be neutral and modest.
Reality: Descriptions should be a bit pushy, because the model tends to undertrigger useful skills. If the description reads like polite documentation, it is probably too weak.

Myth 4: The best skills are big, detailed, and exhaustive.
Reality: Good skills are layered. Metadata is always available, the body loads on trigger, and extra resources are pulled only when needed. So the right design is compact upfront and deep only where necessary.

Myth 5: Put everything into SKILL.md so the model sees it all.
Reality: That is usually bad design. Use progressive disclosure and keep the main body under about 500 lines, with references and scripts handling the rest. Stop stuffing. Start structuring.

Myth 6: Scripts are optional nice-to-haves.
Reality: Scripts are where deterministic and repetitive work belongs. If the same thing keeps happening, it should probably be operationalized instead of re-explained in prose forever.

Myth 7: If it looked good once, the skill is good.
Reality: One nice run proves almost nothing. The real loop is draft, run test prompts, evaluate qualitatively and quantitatively, rewrite, and retest at larger scale. A skill that “felt good once” is just unvalidated.

Myth 8: Every skill needs formal evals and assertions.
Reality: Not everything should be forced into hard checks. Objective workflows like extraction, transforms, or fixed steps benefit from test cases. Subjective work like style or art often does not. Good evaluation is about fit, not checkbox theater.

Myth 9: You start by writing the skill.
Reality: No, you start by understanding the workflow: what it should do, when it should trigger, expected outputs, edge cases, dependencies, and what already happened in the conversation. The best skills are usually distilled from repeated real work, not imagined in a vacuum.

Myth 10: If the skill is good, it will trigger correctly on its own.
Reality: Trigger quality is its own problem. That is why description tuning is treated as a separate step. Execution quality and invocation quality are different engineering problems.

Next time you create a skill, perhaps use the create-skill skill itself to avoid these.

submitted by /u/Total-Hat-8891
Claude Opus 4.7 won 69 of 100 blind evals against Opus 4.6, judged by GPT-5.4, Gemini 3.1 Pro, and DeepSeek V3.2
I ran 100 blind questions across 5 categories (code, reasoning, analysis, communication, meta-alignment) and had three independent judges from three different model families evaluate both responses. Each judge saw responses labeled A and B with randomized order. Majority vote decides the winner.

Per-judge results:

| Judge | Opus 4.7 wins | Opus 4.6 wins | Ties | 4.7 win % |
|---|---|---|---|---|
| GPT-5.4 | 69 | 30 | 1 | 69.7% |
| Gemini 3.1 Pro | 76 | 22 | 0 | 77.6% |
| DeepSeek V3.2 | 38 | 54 | 5 | 41.3% |
| Aggregate | 69 | 30 | 1 | 69.7% |

By category (aggregate):

| Category | Opus 4.7 | Opus 4.6 | Tie |
|---|---|---|---|
| Code | 13 | 6 | 1 |
| Reasoning | 12 | 8 | 0 |
| Analysis | 16 | 4 | 0 |
| Communication | 14 | 6 | 0 |
| Meta-alignment | 13 | 7 | 0 |

The interesting finding isn't the headline, it's the judge disagreement. GPT-5.4 and Gemini agree: Opus 4.7 wins ~70-78% of the time across every category. DeepSeek V3.2 disagrees: it picks Opus 4.6 in 54 of 97 valid judgments. Same questions. Same rubric. Same blind protocol. This isn't random: DeepSeek systematically favors 4.6 in every single category.

This is why single-judge leaderboards are unreliable. If I'd used only DeepSeek as judge, the headline would be "Opus 4.6 beats 4.7." If I'd used only Gemini, it would be "Opus 4.7 wins 78%." The model you pick as judge determines the result.

Caveats:

- Both models accessed via OpenRouter. Quantization unknown and controlled by the API provider.
- Per-model inference configs logged (temperature 0.7, max_tokens 4096 for both contestants; temperature 0.2 for judges). Full configs in the results JSON.
- 2 of 100 Gemini judgments failed to produce valid structured output and are excluded.
- 100 questions is solid for directional signal but not enough for narrow category-level claims; the reasoning split (12-8) could flip with a different question set.
- I have no relationship with Anthropic, OpenAI, Google, or DeepSeek.

Raw data, individual scores per question, and the evaluation engine are open-source: github.com/themultivac/multivac-evaluation

submitted by /u/Silver_Raspberry_811
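The protocol in this post (randomized A/B labels, three judges, majority vote) can be sketched in a few lines. This is an illustrative reconstruction rather than the code from the linked repository: all function names are hypothetical, and the judge picks in the usage example are made up. The `win_rate` helper leaves ties out of the denominator, which reproduces the win percentages quoted in the per-judge results (for example 69 wins and 30 losses gives 69.7%).

```python
import random
from collections import Counter

MODELS = ("opus-4.7", "opus-4.6")


def assign_labels(rng: random.Random) -> dict[str, str]:
    """Randomly map the blind labels A/B to the two contestants."""
    order = list(MODELS)
    rng.shuffle(order)
    return {"A": order[0], "B": order[1]}


def majority_winner(votes: list[str]) -> str:
    """Return the option chosen by more than half of the judges, else 'tie'."""
    option, count = Counter(votes).most_common(1)[0]
    return option if count > len(votes) / 2 else "tie"


def win_rate(wins: int, losses: int) -> float:
    """Share of decided comparisons won; ties are excluded."""
    return wins / (wins + losses)


labels = assign_labels(random.Random(0))
judge_picks = ["A", "A", "B"]                # hypothetical picks from three judges
votes = [labels[p] for p in judge_picks]     # un-blind each pick
print(majority_winner(votes))
print(f"{win_rate(69, 30):.1%}")
```

Un-blinding before counting matters: if each judge sees a different random order, the same label can refer to different models, so votes have to be tallied per model and not per label.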
Trained a Qwen2.5-0.5B-Instruct bf16 model on Reddit post summarization task with GRPO written from scratch in PyTorch - updates! [P]
So, yesterday's run was a success and I did get an avg rollout length of about 64 tokens, as attached in the image! This was with quality_reward + length_penalty (more info below!)

Next, I'll be going with length penalty as the reward, with the mistake of counting characters as tokens fixed, and see if there is any gaming of the system or degraded outputs!

The rewards I used were 2:

- length_penalty: basically, -abs(response_length - MAX_LENGTH)
- quality_reward: ROUGE-L, which is basically LCS overlap with the golden summaries I had as part of the above dataset, to ensure we have some structure throughout the responses generated

Setup: 3x Mac Minis in a cluster running MLX. One node drives training using GRPO, two push rollouts via vLLM.

Trained two variants:

- length penalty only (baseline)
- length penalty + quality reward (BLEU, METEOR and/or ROUGE-L)

Eval: LLM-as-a-Judge (gpt-5)

Used DeepEval to build a judge pipeline scoring each summary on 4 axes:

- Faithfulness: no hallucinations vs. source
- Coverage: key points captured
- Conciseness: shorter, no redundancy
- Clarity: readable on its own and minimize degradation.

https://preview.redd.it/7nrsulwdkbvg1.png?width=800&format=png&auto=webp&s=a3306b54ca63c6557534d9393b2d9b099c4b1b03

https://preview.redd.it/xlcnme2gkbvg1.png?width=800&format=png&auto=webp&s=57073ff1a9aea796d04aae5ef6d22fee1939d30b

submitted by /u/East-Muffin-6472
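The two rewards named in this post can be written out in plain Python. This is a minimal sketch under several assumptions: `MAX_LENGTH = 64` is a placeholder (the post reports roughly 64-token rollouts but does not state the target), tokens are approximated by whitespace splitting, and ROUGE-L is computed as an F1 score over the longest common subsequence, which is one common variant and not necessarily the one the author used.

```python
MAX_LENGTH = 64  # assumed target length; not stated in the post


def length_penalty(response_tokens: list[str], max_length: int = MAX_LENGTH) -> int:
    """-abs(response_length - MAX_LENGTH), counted in tokens (not characters)."""
    return -abs(len(response_tokens) - max_length)


def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence, row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l_f1(candidate: list[str], reference: list[str]) -> float:
    """ROUGE-L as the F1 of LCS-based precision and recall (assumed variant)."""
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return 2 * precision * recall / (precision + recall)


summary = "the cat sat".split()
golden = "the cat sat down".split()
reward = rouge_l_f1(summary, golden) + length_penalty(summary)
print(reward)
```

Note that the two terms live on very different scales (ROUGE-L is in [0, 1], the length penalty is an integer that can be large and negative), so a combined reward like the unweighted sum above would be dominated by length unless one term is rescaled.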
[P] Added 8 Indian languages to Chatterbox TTS via LoRA: 1.4% of parameters, no phoneme engineering
TL;DR: Fine-tuned Chatterbox-Multilingual (Resemble AI's open-source TTS) to support Telugu, Kannada, Bengali, Tamil, Malayalam, Marathi, Gujarati, and Hindi using LoRA adapters + tokenizer extension. Only 7.8M / 544M parameters trained. Model + audio samples available.

---

The Problem

Chatterbox-Multilingual supports 23 languages with zero-shot voice cloning, but no Dravidian languages (Telugu, Kannada, Tamil, Malayalam) and limited Indo-Aryan coverage beyond Hindi. That's 500M+ speakers with no representation.

The conventional approach would be: build G2P (grapheme-to-phoneme) for each language, retrain the full model, spend months on it. Hindi schwa deletion alone is an unsolved problem. Bengali G2P is notoriously hard.

The Approach

Instead of phonemes, I went grapheme-level:

- Extended the BPE tokenizer with Indic script characters (2454 → 2871 tokens). Telugu, Kannada, Bengali, Tamil, Malayalam, and Gujarati graphemes were added alongside the existing Devanagari.
- Brahmic warm-start: initialized new character embeddings from phonetically equivalent Devanagari characters. Telugu "క" (ka) gets initialized from Hindi "क" (ka). This works because Brahmic scripts share phonetic structure: same sounds, different glyphs. The model starts with a reasonable prior instead of random noise.
- LoRA on T3 backbone: rank-32 adapters on q/k/v/o projections of the Llama-based T3 module. ~7.8M trainable params (1.4% of 544M total). Everything else frozen: vocoder (S3Gen), speaker encoder, speech tokenizer.
- Incremental language training: added languages one at a time with weighted sampling. Started with Hindi-only (validate pipeline), then Telugu+Hindi, then Kannada+Telugu+Hindi, finally all 8 languages. This prevents catastrophic forgetting: Hindi CER actually improved after adding 7 new languages.
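The Brahmic warm-start step can be sketched as follows. This is an illustration under stated assumptions, not the author's implementation: it finds the Devanagari counterpart of a character by Unicode block offset (the major Indic blocks share an ISCII-derived layout, so Telugu క U+0C15 lines up with Devanagari क U+0915), whereas the actual project may use a hand-built mapping. The `warm_start` helper, the plain-list embedding table, and the 0.02 fallback standard deviation are all hypothetical.

```python
import random

DEVANAGARI_BASE = 0x0900

# Start of each 128-code-point Unicode block for the added scripts.
SCRIPT_BASES = {
    "bengali": 0x0980,
    "gujarati": 0x0A80,
    "tamil": 0x0B80,
    "telugu": 0x0C00,
    "kannada": 0x0C80,
    "malayalam": 0x0D00,
}


def devanagari_equivalent(ch):
    """Return the Devanagari character at the same block offset, or None."""
    cp = ord(ch)
    for base in SCRIPT_BASES.values():
        if base <= cp < base + 0x80:
            return chr(DEVANAGARI_BASE + (cp - base))
    return None


def warm_start(vocab, embeddings, new_chars, dim, seed=0):
    """Append one embedding row per new character.

    The row is copied from the Devanagari equivalent when that character is
    already in the vocabulary; otherwise it falls back to small random noise.
    """
    rng = random.Random(seed)
    for ch in new_chars:
        if ch in vocab:
            continue
        source = devanagari_equivalent(ch)
        if source in vocab:
            vector = list(embeddings[vocab[source]])  # copy, so rows train independently
        else:
            vector = [rng.gauss(0.0, 0.02) for _ in range(dim)]
        vocab[ch] = len(embeddings)
        embeddings.append(vector)
    return vocab, embeddings
```

The offset mapping is only approximate: some positions have no assigned counterpart in a given script, and a few letters differ in sound across scripts, so a curated table would be the safer choice in practice.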
Results

CER (Character Error Rate) via Whisper large-v3 ASR on 100 held-out samples per language:

| Language | CER | Notes |
|---|---|---|
| Hindi | 0.1058 | Improved from 0.29 baseline |
| Kannada | 0.1434 | |
| Tamil | 0.1608 | |
| Marathi | 0.1976 | |
| Gujarati | 0.2377 | |
| Bengali | 0.2450 | |
| Telugu | 0.2853 | |
| Malayalam | 0.8593 | Experimental; needs more data |

Malayalam struggles significantly. Likely needs more training data or a dedicated round. The rest produce intelligible, natural-sounding speech.

What Didn't Work / Limitations

- Malayalam: CER 0.86 is essentially unintelligible. Possibly the script complexity (many conjuncts) or insufficient data.
- No MOS evaluation yet: CER tells you the words are right, not that it sounds natural. Subjective eval is pending.
- 2 speakers per language: male + female from IndicTTS. Won't generalize to all voice types.
- No code-mixing: Hindi+English mixed sentences not specifically trained yet.

Links

- Model + audio samples: https://huggingface.co/reenigne314/chatterbox-indic-lora
- Article (full writeup): https://theatomsofai.substack.com/p/teaching-an-ai-to-speak-indian-languages
- Base model: [ResembleAI/chatterbox](https://github.com/resemble-ai/chatterbox) (MIT license)

Quick Start

```python
from chatterbox.mtl_tts import ChatterboxMultilingualTTS

# Load the multilingual model with the Indic LoRA adapters and a Telugu female voice
model = ChatterboxMultilingualTTS.from_indic_lora(device="cuda", speaker="te_female")

# Synthesize Telugu speech ("Hello, how are you?")
wav = model.generate("నమస్కారం, మీరు ఎలా ఉన్నారు?", language_id="te")
```

Training Details

- Hardware: 1x RTX PRO 6000 Blackwell (96GB)
- Data: SPRINGLab IndicTTS + ai4bharat Rasa
- 6 training rounds, incremental language addition
- LoRA rank 32, alpha 64, bf16

Part 2 (technical deep-dive with code) coming this week. Happy to answer questions about the approach.

submitted by /u/Icy_Gas8807
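For readers unfamiliar with the CER metric used in the results table of this post, here is a minimal sketch of how it is commonly computed: the character-level edit distance between the reference text and the ASR transcript of the synthesized audio, divided by the reference length. The post does not say whether scores were averaged per sample or pooled per language, so the pooled `corpus_cer` variant is an assumption, as are all the function names.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings, counted in code points."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate for one sample; can exceed 1.0 for long hypotheses."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def corpus_cer(pairs):
    """Pooled CER over (reference, hypothesis) pairs: total edits / total characters."""
    total_edits = sum(edit_distance(ref, hyp) for ref, hyp in pairs)
    total_chars = sum(len(ref) for ref, _hyp in pairs)
    return total_edits / total_chars
```

Python strings iterate over Unicode code points, so in Indic scripts a consonant plus its vowel sign counts as two characters here. Scores computed over grapheme clusters would differ, which matters when comparing CER across scripts.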
Repository Audit Available
Deep analysis of confident-ai/deepeval — architecture, costs, security, dependencies & more
DeepEval uses a tiered pricing model. Visit their website for current pricing details.
Key features include: 50+ research-backed metrics, Native conversational evals, Multi-modal by default, G-Eval, Coding Agent, Your AI App, deepeval test run.
DeepEval is commonly used for: Evaluating machine learning model performance, Testing natural language processing applications, Assessing image recognition systems, Validating audio processing algorithms, Conducting regression testing in CI pipelines, Monitoring system performance across different architectures.
DeepEval integrates with: GitHub Actions, Jenkins, CircleCI, Travis CI, GitLab CI, Slack for notifications, JIRA for issue tracking, Docker for containerized testing, Kubernetes for orchestration, AWS for cloud-based testing environments.
DeepEval has a public GitHub repository with 14,993 stars.
Based on 16 social mentions analyzed, 0% of sentiment is positive, 100% neutral, and 0% negative.