Build invincible apps with Temporal
Users largely praise Temporal for its innovative approach to structuring human-AI interactions, highlighting the absence of autonomous decision-making as a strength. However, there are substantial concerns around its security vulnerabilities, with recent incidents pointing out the exploitation of Temporal Trust Gaps. The sentiment around pricing is not clearly addressed in the mentions. Overall, Temporal has a mixed reputation; it is acknowledged for its unique functionality but criticized for inadequate security measures.
Mentions (30d)
26
Reviews
0
Platforms
3
GitHub Stars
19,256
1,436 forks
Features
Use Cases
Industry
information technology & services
Employees
350
Funding Stage
Series D
Total Funding
$754.5M
2,991
GitHub followers
196
GitHub repos
19,256
GitHub stars
20
npm packages
December 22, 2025
*David Sathuluri is a Research Associate and Dr. Marco Tedesco is a Lamont Research Professor at the Lamont-Doherty Earth Observatory of Columbia University.*

**As climate scientists warn that we are approaching irreversible tipping points in the Earth’s climate system, paradoxically the very technologies being deployed to detect these tipping points – often based on AI – are exacerbating the problem, via acceleration of the associated energy consumption.**

The UK’s much-celebrated £81-million ($109-million) [Forecasting Tipping Points programme](https://www.theguardian.com/environment/2025/feb/18/early-warning-system-for-climate-tipping-points-given-81m-kickstart) involving 27 teams, led by the Advanced Research + Invention Agency (ARIA), represents a contemporary faith in technological salvation – yet it embodies a profound contradiction. The ARIA programme explicitly aims to “harness the laws of physics and artificial intelligence to pick up subtle early warning signs of tipping” through advanced modelling.

We are deploying massive computational infrastructure to warn us of climate collapse while these same systems consume the energy and water resources needed to prevent or mitigate it. We are simultaneously investing in computationally intensive AI systems to monitor whether we will cross irreversible climate tipping points, even as these same AI systems could fuel that transition.

## The computational cost of monitoring

Training a single large language model like GPT-3 consumed approximately 1,287 megawatt-hours of electricity, resulting in 552 metric tons of carbon dioxide – equivalent to driving 123 gasoline-powered cars for a year, according to a recent [study](https://arxiv.org/ftp/arxiv/papers/2104/2104.10350.pdf). GPT-4 required roughly [50 times](https://www.weforum.org/stories/2024/07/generative-ai-energy-emissions/) more electricity. As the computational power needed for AI continues to double approximately every 100 days, the energy footprint of these systems is not static but is exponentially accelerating.

> **[UN adopts first-ever resolution on AI and environment, but omits lifecycle](https://www.climatechangenews.com/2025/12/12/un-adopts-first-ever-resolution-artificial-intelligence-ai-environment-lifecycle-unea/)**

And the environmental consequences of AI models extend far beyond electricity usage. Besides massive amounts of electricity (much of which is still fossil-fuel-based), such systems require advanced cooling that consumes enormous quantities of water, and sophisticated infrastructure that must be manufactured, transported, and deployed globally.

## The water-energy nexus in climate-vulnerable regions

A single data center can consume up to [5 million](https://utulsa.edu/news/data-centers-draining-resources-in-water-stressed-communities/#%3A%7E%3Atext=Unfortunately%2C+many+data+centers+rely+on+water-intensive%2Cto+supply+thousands+of+households+or+farms.) gallons of drinking water per day – sufficient to supply thousands of households or farms. In the Phoenix area of the US alone, more than [58 data centers](https://utulsa.edu/news/data-centers-draining-resources-in-water-stressed-communities/) consume an estimated 170 million gallons of drinking water daily for cooling.

The geographical distribution of this infrastructure matters profoundly as data centers requiring high rates of mechanical cooling are disproportionately located in water-stressed and socioeconomically vulnerable regions, particularly in Asia-Pacific and Africa.

At the same time, we are deploying AI-intensive early warning systems to monitor climate tipping points in regions like Greenland, the Arctic, and the Atlantic circulation system – regions already experiencing catastrophic climate impacts. They represent thresholds that, once crossed, could trigger irreversible changes within decades, scientists have warned.

> **[Nine of our best climate stories from 2025](https://www.climatechangenews.com/2025/12/22/nine-of-our-best-climate-stories-from-2025/)**

Yet computational models and AI-driven early warning systems operate according to different temporal logics. They promise to provide warnings that enable future action, but they consume energy – and therefore contribute to emissions – in the present. This is not merely a technical problem to be solved with renewable energy deployment; it reflects a fundamental misalignment between the urgency of climate tipping points and the gradualist assumptions embedded in technological solutions.

The carbon budget concept reveals that there is a cumulative effect on how emissions impact on temperature rise, with significant lags between atmospheric concentration and temperature impact. Every megawatt-hour consumed by AI systems training on climate models today directly reduces the available carbon budget for tomorrow – including the carbon budget available for the energy transition itself.

## The governance void

The deeper issue is that governance frameworks
Pricing found: $1,000, $100/mo, $500/mo, $30, $6,000
Why Claude Code forgets your stack and how to fix it
Karpathy's "Claude 4 Rules" post points out the biggest pain point for Claude Code: every session starts with a blank slate. The model has no memory of the project's stack, the design decisions you made last week, or the dead ends you already explored.

I ran into the same issue on an 87-file codebase (163,122 tokens). Feeding the same files directly to Claude Code cost roughly 163,000 tokens. After adding the engramx Skill Pack (v4.0.0) the token count dropped to 17,722. That's an 89.1% reduction, or about 6.4 times fewer tokens than reading only the relevant files, and 25, 155 times fewer than scanning the whole repo.

The reduction comes from three things:

- First, engramx builds a bi-temporal knowledge graph from your git history. A git-revert miner automatically captures revert commits during indexing, so you get a curated mistakes corpus without any manual effort.
- Second, bi-temporal mistakes now fire as PreToolUse hooks on Edit, Write, and Bash actions. The model sees the mistake before it retries, so it can avoid repeating it.
- Third, engram init installs six Sentinel hooks by default (PreToolUse on Edit/Write/Bash, PostToolUse, SessionStart, PreCompact). No extra config needed.

I ran the full test suite after installing engramx-skill-pack@0.2.0 from npm. All 1,025 engramx tests and 36 skill-pack tests passed. The package is Apache 2.0, makes zero cloud calls, and stores its graph in a local SQLite file.

Install with `npx engramx@4.0.0`. The repo is on GitHub (https://github.com/NickCirv/engram). The README includes an asciinema demo (https://asciinema.org/a/GjjvPXVyArnivAog). In the last week npm reported 213 downloads, about 30 per day, which suggests a modest but growing user base.

What strategies have you tried to give Claude Code a persistent context, and how did they compare to this approach?

submitted by /u/SearchFlashy9801
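The headline percentage in the post above can be checked directly from the two token counts it quotes. The sketch below is only that arithmetic; the function names are illustrative and not part of engramx. The 6.4x figure in the post uses a different baseline (reading only the relevant files) whose token count is not given, so it is not reproduced here.

```python
# Token counts quoted in the post; everything else here is illustrative.
BASELINE_TOKENS = 163_122   # the 87-file codebase fed directly
WITH_PACK_TOKENS = 17_722   # after adding the Skill Pack


def reduction_pct(before: int, after: int) -> float:
    """Share of tokens saved relative to the baseline, in percent."""
    return (before - after) / before * 100


def shrink_factor(before: int, after: int) -> float:
    """How many times smaller the new count is than the baseline."""
    return before / after


print(f"{reduction_pct(BASELINE_TOKENS, WITH_PACK_TOKENS):.1f}% reduction")   # 89.1% reduction
print(f"{shrink_factor(BASELINE_TOKENS, WITH_PACK_TOKENS):.1f}x fewer tokens")  # 9.2x fewer tokens
```

Against the full 163,122-token baseline the drop works out to roughly 9.2x, which is consistent with the stated 89.1% reduction.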
memv ships an MCP server: OSS memory layer for agents, now usable from any MCP client
memv (OSS, Python) gained an MCP server today. If you're building on Claude Desktop / Code / Cursor, or your own MCP host, you get persistent, structured memory without writing integration code.

```bash
pip install "memvee[mcp]"
memv-mcp --db-url memory.db --llm-model openai:gpt-4o-mini
```

Or mount it inside your own process:

```python
from memv.mcp.server import create_server

server = create_server(
    db_url="memory.db",
    default_user_id="alice",
    embedding_client=my_embedder,
    llm_client=my_llm,
)
server.run(transport="streamable-http")
```

Surface:

- 5 MCP tools: search_memory, add_memory, add_conversation, list_memories, delete_memory
- LLM optional: retrieval/add work LLM-free; only add_conversation extraction needs one
- Per-user isolation at every tool boundary, including delete_memory ownership check
- Concurrent extractions for the same user coalesce onto one task

For context if you haven't seen memv before: predict-calibrate extraction (Nemori-inspired) so we don't store everything, bi-temporal model so contradictions expire instead of overwriting, hybrid retrieval (vector + BM25 + RRF).

Docs: https://vstorm-co.github.io/memv/advanced/mcp-server/

GitHub: https://github.com/vstorm-co/memv

submitted by /u/brgsk
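The phrase "contradictions expire instead of overwriting" can be illustrated with a small store in which a new, conflicting value closes the old fact's validity window rather than deleting it. This is a simplified sketch under stated assumptions, not memv's implementation: the `Fact` and `FactStore` names are hypothetical, and only one time axis (validity) is modelled, whereas a full bi-temporal model would also track when each fact was recorded.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class Fact:
    subject: str
    value: str
    valid_from: datetime
    invalid_at: Optional[datetime] = None  # None means "still valid"


class FactStore:
    """Hypothetical store: contradictions expire old facts instead of overwriting them."""

    def __init__(self) -> None:
        self.facts: List[Fact] = []

    def assert_fact(self, subject: str, value: str, at: datetime) -> None:
        for fact in self.facts:
            if fact.subject == subject and fact.invalid_at is None:
                if fact.value == value:
                    return  # already the current belief, nothing to do
                fact.invalid_at = at  # expire the contradicted fact, keep the row
        self.facts.append(Fact(subject, value, at))

    def current(self, subject: str) -> Optional[str]:
        for fact in self.facts:
            if fact.subject == subject and fact.invalid_at is None:
                return fact.value
        return None

    def as_of(self, subject: str, when: datetime) -> Optional[str]:
        """What was believed about `subject` at time `when`."""
        for fact in self.facts:
            ended = fact.invalid_at is not None and when >= fact.invalid_at
            if fact.subject == subject and fact.valid_from <= when and not ended:
                return fact.value
        return None


store = FactStore()
store.assert_fact("database", "postgres", datetime(2025, 1, 1))
store.assert_fact("database", "sqlite", datetime(2025, 6, 1))
```

After the second call, `store.current("database")` returns the new value, while `store.as_of("database", datetime(2025, 3, 1))` still recovers the earlier one, which is the property that lets stale facts stay queryable without polluting the current view.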
#1 on memory benchmark LongMemEval with Gemini Flash, not Pro [R]
Disclosure: first author. Evaluation of an experimental memory retrieval system against LongMemEval (Wang et al., 2024). Figured the results might be of interest here, particularly the deliberate use of a smaller answering model to isolate retrieval quality from model capability. 96.4% at top-50 with Gemini 3 Flash. Comparative reported scores (all Gemini 3 Pro): Mem0 94.8%, Honcho 92.6%, HydraDB 90.79%, Supermemory 85.2%. Retrieval architecture draws on episodic memory theory (Tulving, 1972), reconstructive recall (Bartlett, 1932), and temporal context models (Howard & Kahana, 2002). Three design choices we think mattered: Query decomposition: parallel retrieval passes targeting distinct information needs. Critical for multi-session questions where no single query surfaces all relevant fragments. Temporal salience scoring: candidates scored on semantic similarity, lexical precision, and temporal salience, reflecting associative and recency factors in human recall (Polyn et al., 2009). Coherence re-ranking: re-ranked for cross-memory coherence and temporal chain resolution before presentation to the answering model. Methodology: forked Mem0's open-source benchmarking script, replaced storage and retrieval with our system, stripped all question-specific prompt templates. Single generic prompt, 500 questions. Category results at top-50: single-session (user) 98.6%, assistant 100%, preferences 96.7%, knowledge update 97.4%, multi-session 94.0%, temporal reasoning 95.5%. Limitations: single benchmark evaluation; architecture details intentionally limited; single model configuration, no ablations; production conditions (adversarial inputs, privacy, contradictory information) not tested. Above ~96% we hit evaluation ceiling effects: ambiguous questions, narrow expected answers, dataset inconsistencies. Some benchmark errors identified, which we reported upstream. 
Paper | Results | Answerer prompt

Curious if others have explored similar cognitive-science-informed retrieval architectures for conversational memory.

submitted by /u/j-m-k-s
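The post describes candidates being scored on semantic similarity, lexical precision, and temporal salience, but deliberately withholds the architecture. The sketch below is therefore only one plausible shape for such a score, not the authors' method: the weights, the exponential half-life decay, and every name in it are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    text: str
    semantic: float   # semantic similarity to the query, assumed in [0, 1]
    lexical: float    # lexical match score, assumed in [0, 1]
    age_days: float   # time since the memory was written


def temporal_salience(age_days: float, half_life_days: float = 30.0) -> float:
    """Assumed recency model: salience halves every `half_life_days`."""
    return 0.5 ** (age_days / half_life_days)


def score(c: Candidate, w_sem: float = 0.5, w_lex: float = 0.3, w_time: float = 0.2) -> float:
    """Weighted sum of the three signals named in the post (weights are hypothetical)."""
    return w_sem * c.semantic + w_lex * c.lexical + w_time * temporal_salience(c.age_days)


def rank(candidates: List[Candidate]) -> List[Candidate]:
    return sorted(candidates, key=score, reverse=True)


old = Candidate("mentioned moving to Berlin", semantic=0.8, lexical=0.5, age_days=300)
new = Candidate("mentioned moving to Lisbon", semantic=0.8, lexical=0.5, age_days=1)
ranked = rank([old, new])
```

With equal semantic and lexical scores, the more recent memory ranks first, which is the kind of behavior a knowledge-update question would need.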
I have figured out a way to run every memory system out there on one platform
But is there an industry need for it? It's something like the VLC media player of memory systems. My team thinks it's hard to make money from it, or hard to sell. What do y'all think? In this system you can fetch like zep for your temporal needs, store like letta if needed, traverse like mempalace or hindsight, etc., all in one place. Thoughts?

submitted by /u/boneMechBoy69420
GPT 5.5 (Codex) leading the future prediction race
Researchers from the Max Planck Institute recently released FutureSim, an environment in which agents are replayed a temporal slice of the web and tasked with predicting real-world future events. In their environment, GPT 5.5 leads at 25% accuracy, followed by Opus 4.6 at 20%. Open-weight frontier models have a significant gap to close, with DeepSeek V4 pro at 13%, GLM 5.1 at 10%, and Qwen3.6 Plus at 5%. They say they evaluate with native harnesses (Codex, CC, etc).

On some questions that have a parallel r/Polymarket market, GPT 5.5 in their simulation sometimes beats the crowd aggregate, as in the Super Bowl LX ($704M traded) market, which I think is pretty promising (and surprising). OpenAI really cooked with GPT 5.5 (and Codex) this time! I wonder how the trading market could evolve as models keep improving.

submitted by /u/viciousA3gis
temporal-mcp: wall-clock awareness for LLMs, with OAuth
One of the small failure modes I keep hitting with agent stacks is that the model has no idea how much time passed between turns. It'll greet you with "good morning" at 11 PM, or pick up a conversation three weeks later as if no time has passed, or compute "today's data" off whatever fragment of context happens to be in scope.

I built a minimal MCP server to fix it. Two tools: temporal_tick and temporal_peek. They return elapsed time since the last turn, day-rollover detection, and a fresh-thread flag, both as a human-readable header and as JSON.

Ways to use:

- Local stdio: pip install temporal-mcp (works with Claude Desktop, Cursor, Cline, Zed, Claude Code)
- Hosted with OAuth (claude.ai / ChatGPT): visit https://temporal-mcp.dev/connect, click "Generate OAuth Credentials", paste into your custom connector. Full OAuth 2.0 with PKCE and refresh tokens, but no signup; the credential pair is the identity. (Verified working in claude.ai)
- Hosted with raw bearer (any client that supports custom headers): Authorization: Bearer YOUR_TOKEN against https://temporal-mcp.dev/mcp. The token gets SHA-256'd; we never see the plaintext.
- Self-host: Cloudflare Workers deploy in workers/ in the repo, free tier covers ~100k req/day.
- Grok/xAI: https://temporal-mcp.dev/mcp/ (Verified working in Grok)

MIT, ~150 lines of stdlib Python on the local side, ~400 lines of TypeScript on the hosted side (engine + OAuth provider), both with tests. Listed in the official MCP Registry. Smithery and Glama submissions in flight.

Curious to hear how folks would use the JSON day_rollover and delta_sec signals. I've been using them for context decay and resume detection, but there are probably more interesting use cases.

Source: github.com/MirrorEthic/temporal-mcp

submitted by /u/MirrorEthic_Anchor
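To make the signals concrete, here is a minimal sketch of how elapsed time, day rollover, and a fresh-thread flag could be derived from two timestamps, plus the SHA-256 step mentioned for bearer tokens. It is not the project's code: only the `delta_sec` and `day_rollover` field names come from the post, while `fresh_thread`, the six-hour threshold, and the function names are assumptions.

```python
import hashlib
from datetime import datetime, timezone
from typing import Optional


def tick(last_turn: Optional[datetime], now: datetime,
         fresh_after_sec: int = 6 * 3600) -> dict:
    """Compare the previous turn's timestamp with the current one."""
    if last_turn is None:
        # No earlier turn recorded: treat this as the start of a thread.
        return {"delta_sec": None, "day_rollover": False, "fresh_thread": True}
    delta = (now - last_turn).total_seconds()
    return {
        "delta_sec": int(delta),
        # Calendar day changed (here in UTC; a real tool may use the user's zone).
        "day_rollover": now.date() != last_turn.date(),
        # Assumed heuristic: a long enough gap counts as a fresh thread.
        "fresh_thread": delta >= fresh_after_sec,
    }


def token_digest(token: str) -> str:
    """Hex SHA-256 of a bearer token, so only the digest needs to be stored."""
    return hashlib.sha256(token.encode("utf-8")).hexdigest()


last = datetime(2026, 1, 1, 23, 0, tzinfo=timezone.utc)
now = datetime(2026, 1, 2, 0, 30, tzinfo=timezone.utc)
signals = tick(last, now)
```

In this example the gap is 90 minutes, so the day has rolled over even though the thread would not be considered fresh under the assumed threshold; that distinction is what makes the two flags useful separately.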
CFS-R: Conditional Field Reconstruction
I evaluated CFS-R on LoCoMo (1,982 questions, same setup as the CFS evaluation), holding cosine and BM25 fixed and varying only the third leg.

| Configuration | NDCG@10 | Recall@10 |
|---------------|---------|-----------|
| baseline cosine top-10 | 0.5123 | 0.6924 |
| rrf(cos, BM25) | 0.5196 | 0.6989 |
| rrf(cos, BM25, MMR tuned) | 0.5330 | 0.7228 |
| rrf(cos, BM25, CFS-long) | 0.5362 | 0.7295 |
| rrf(cos, BM25, CFS-R top50 w3) | 0.5447 | 0.7303 |

- Against tuned MMR: +1.17 pp NDCG@10 (95% CI [+0.66, +1.69], p < 0.001).
- Against CFS-long: +0.85 pp NDCG@10 (95% CI [+0.33, +1.35], p = 0.0006).
- Against baseline cosine: +3.24 pp NDCG@10, +3.79 pp Recall@10.

The sweep wasn't fragile: the top configurations clustered tightly between 0.5441 and 0.5447 NDCG@10, which means the operator is on a stable plateau rather than a single magic hyperparameter.

The category breakdown is where the conceptual difference shows up:

| | single-hop | multi-hop | temporal | open-dom | adversarial |
|---|------------|-----------|----------|----------|-------------|
| tuned MMR | 0.3479 | 0.6377 | 0.2938 | 0.6144 | 0.4705 |
| CFS-long | 0.3615 | 0.6376 | 0.2959 | 0.6157 | 0.4734 |
| CFS-R top50 w3 | 0.3646 | 0.6344 | 0.2948 | 0.6209 | 0.5018 |

The adversarial line is the result that matters: +3.13 pp over tuned MMR, +2.84 pp over CFS-long. If the adversarial problem were only pairwise diversity, MMR should be very hard to beat, but it isn't. That supports the main claim: long-memory retrieval is not just about avoiding similar chunks. It is about reconstructing the evidence behind the query.

Temporal is no longer a glaring weakness either. CFS-long still slightly leads, but CFS-R has closed the gap while keeping the adversarial gains.

https://gist.github.com/M-Garcia22/542a9a38d93aae1b5cf21fc604253718

submitted by /u/mauro8342
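The `rrf(cos, BM25, ...)` notation in the results above refers to reciprocal rank fusion over several ranked lists. A minimal sketch of standard RRF follows; the smoothing constant `k = 60` is the value commonly used in the literature and is an assumption here, since the post does not state which constant was used, and the toy rankings are invented for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def rrf(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists: each list contributes 1 / (k + rank) per document."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Toy rankings standing in for the cosine, BM25, and third-leg retrievers.
cosine_leg = ["a", "b", "c"]
bm25_leg = ["b", "a", "d"]
third_leg = ["b", "c", "a"]
fused = rrf([cosine_leg, bm25_leg, third_leg])
```

Because RRF only looks at ranks, swapping the third leg (MMR, CFS-long, CFS-R) changes the fused order without any score calibration between retrievers, which is what makes the "vary only the third leg" comparison clean.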
How do you reliably override a model's internal temporal bias in production?
I'm building an automated mail generation pipeline using Claude Haiku 4.5 on-premise, but its knowledge cutoff is June 2025. The model needs to handle temporal expressions correctly, such as:

- next Monday
- end of the week
- this month
- 16 May
- 16 May 2026
- 25/05/2026

To deal with this cutoff, I'm injecting a full temporal context block into the system prompt, covering today, yesterday, tomorrow, and so on. I also added few-shot examples and a CoT reasoning step to reinforce the behavior.

**IMPORTANT**: Today is {today_formatted} of {year}. Any date without an explicit year refers to {year}, NEVER to 2025 or any other year.

- You know the exact calendar: number of days per month, days of the week, valid dates
- You correctly interpret relative dates (“this Monday,” “next Thursday,” “next week,” etc.)
- You must CORRECTLY convert all relative dates to absolute dates (e.g., “tomorrow” -> “{tomorrow}”)
- The day and date must ALWAYS match (e.g., do not write “Friday, July 15” if it is a “Tuesday”)
- Today is {today_formatted}
- Yesterday was {yesterday}
- Tomorrow will be {tomorrow}
- Next Monday will be {next_monday}
- Next Tuesday will be {next_tuesday}
- Next Wednesday will be {next_wednesday}
- Next Thursday will be {next_thursday}
- Next Friday will be {next_friday}
- Next Saturday will be {next_saturday}
- Next Sunday will be {next_sunday}
- The end of the current week is {end_of_week_formatted}
- Next week begins on {next_week_start} and ends on {next_week_end}
- The end of the month is {end_of_month_formatted}
- Next month will be {next_month}, which begins on {next_month_start} and ends on {next_month_end}
- This year is {year}. Any date without an explicit year belongs to {year} unless otherwise specified.

It works most of the time, but Haiku still occasionally falls back on its training-time temporal bias, defaulting to 2025, especially on ambiguous formats like 18/05/2026 or dates that predate the current month (this one is not really a big deal).

For example:

- “mail_body”: “Hello, Following up on our conversation on Tuesday, April 28, I am confirming your appointment for 05/18/2026, at 10:30 a.m. with Ms. Chloe Berliat. Thank you in advance for your assistance. Best regards,”
- “user_input”: “I'm confirming the 10:30 a.m. appointment with Ms. Chloe Berliat”
- “suggested_response”: "Hello Mr., I am writing to confirm your appointment scheduled for Sunday, May 18, 2026, at 10:30 a.m. with Ms. Chloe Berliat. Best regards,"

May 18 is a Monday in 2026, but a Sunday in 2025. Even when I set the time context dynamically, about 70% of the time the system defaults to the 2025 calendar. The only way to work around this is to explicitly specify the day in the user_input.

What I've tried:

- Application-side date normalization before injection as a partial mitigation, but I find this brittle given the diversity of date formats users can input.
- Few-shot + CoT
- Explicit prohibition rules on internal temporal reasoning

So I want to know if there is a prompting pattern that more reliably forces the model to treat injected context as ground truth. Any feedback is welcome 😉

submitted by /u/Imaginary-Result-828
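One way to take the weekday out of the model's hands is to compute every day/date pairing in code and check generated text against the calendar afterwards. The sketch below is an illustration of that idea, not the poster's pipeline: it fills a few of the placeholders named in the prompt (`today_formatted`, `tomorrow`, `next_monday`, and so on), and the helper names and the date format are assumptions.

```python
import calendar
from datetime import date, timedelta

FMT = "%A, %B %d, %Y"  # e.g. "Monday, May 18, 2026" (English names under the default locale)
WEEKDAYS = ("monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday")


def label(d: date) -> str:
    """Render a date together with its true weekday."""
    return d.strftime(FMT)


def next_weekday(today: date, weekday: int) -> date:
    """Next occurrence of `weekday` (0 = Monday) strictly after `today`."""
    days_ahead = (weekday - today.weekday() - 1) % 7 + 1
    return today + timedelta(days=days_ahead)


def build_context(today: date) -> dict:
    """Values for some of the prompt placeholders, all derived from the calendar."""
    last_day = calendar.monthrange(today.year, today.month)[1]
    ctx = {
        "today_formatted": label(today),
        "yesterday": label(today - timedelta(days=1)),
        "tomorrow": label(today + timedelta(days=1)),
        "end_of_month_formatted": label(today.replace(day=last_day)),
        "year": str(today.year),
    }
    for index, name in enumerate(WEEKDAYS):
        ctx[f"next_{name}"] = label(next_weekday(today, index))
    return ctx


def weekday_matches(weekday_name: str, d: date) -> bool:
    """Post-generation check: does the weekday written by the model fit the date?"""
    return d.strftime("%A").lower() == weekday_name.lower()


context = build_context(date(2026, 5, 16))
```

With this in place, a response such as "Sunday, May 18, 2026" can be rejected or rewritten deterministically, since `weekday_matches("Sunday", date(2026, 5, 18))` is false while the same check against the 2025 calendar passes, which is exactly the failure described in the post.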
I built an autonomous engineering agent on top of Claude Code. Self-improving routing, cross-session memory, process intelligence, P2P team learning.
Some of you might remember my posts about claude-bootstrap (v3.6 was the last one — cross-agent intelligence). I skipped v4 entirely because v5 shipped days later. What started as an opinionated Claude Code setup has become something fundamentally different. The problem I'm solving: Every AI coding tool today is an amnesiac. When a session ends, everything the agent learned — project conventions, reviewer preferences, codebase idioms — evaporates. The next session starts from scratch. And if you use multiple AI tools across projects, you have zero unified visibility into what's happening. I think the industry is converging on a spectrum: Level 0: Autocomplete (Copilot, TabNine) Level 1: Chat Assistant (ChatGPT, Claude) Level 2: Project-Aware Assistant (Cursor, Continue) Level 3: Task Agent (Devin, Claude Code Agent) Level 4: Autonomous Engineering Platform (Maggy) ← this is what I built The difference at Level 4: multi-model orchestration, self-improvement from every task, process intelligence that learns from CI/reviews/deploys, cross-session memory, and P2P team learning. What Maggy actually does Chat — Session Takeover: Auto-detects all running Claude Code sessions across your projects. Shows session history, prompt counts, duration. You can `--resume` into any session from the dashboard. Right now I have 7 active sessions across 4 projects visible at a glance. Task Triage: Connects to GitHub Issues and Asana. AI-ranks tasks by priority. One-click "Plan" or "Execute" buttons that spawn the right CLI with codebase context pre-injected from an intent code property graph (iCPG). Process Intelligence: This is the part most tools completely ignore. Maggy collects signals from the full SDLC — CI results, PR review comments, CodeRabbit findings, merge patterns, deploy results. It learns which code patterns cause test failures, what reviewers consistently flag, and preemptively fixes issues before they reach reviewers. 
> "Your reviewer always flags missing error handling in API routes. Maggy added it before the PR was created." That's not prompt engineering. That's autonomous process optimization. Cross-Session Memory (Engram): Maggy identifies 7 distinct amnesia pathologies (anterograde, retrograde, temporal, source, interference, context-binding, confabulation). Engram is a three-tier memory system — local (project-specific), portfolio (cross-project patterns), and mesh (team-shared). Knowledge compounds across sessions instead of evaporating. Maggy Mesh — P2P Team Intelligence: Connects Maggy instances across a team. One developer's CI fix becomes the entire team's knowledge — autonomously. Typed memory classes (scores, patterns, policies, gaps) with provenance and quarantine. A new team member gets the benefit of months of collective learning on day one. Multi-Model Routing: Auto-discovers which CLIs you have (Claude, Codex, Kimi, Ollama) by probing `--help` at startup. Routes by complexity score: Blast 1-3 → ollama (free, local) or kimi (cheap) Blast 4-6 → codex (mid-tier) Blast 7-10 → claude (premium, with validator) Security, tests, docs, architecture always go to Claude regardless. The routing rules are YAML and self-update from task outcomes. 5-Level Self-Improvement: This is the core differentiator. Every task teaches Maggy something: | Level | Frequency | What It Does | |-------|-----------|-------------| | L0 — Real-time | Seconds | Catches tool/test failures, switches models mid-task | | L1 — Task | Minutes | Computes reward score, updates model performance | | L2 — Daily | Hours | Catches CI pass rate drops, disables failing models | | L3 — Weekly | Days | Evolves skill files, adjusts workflow steps | | L4 — Monthly | Weeks | Recalibrates reward signals, tunes the improvement process itself | Budget Tracking: Per-provider token spend with daily limits. When Anthropic hits budget, Maggy routes to OpenAI. When that hits budget, it routes to local Qwen. 
Work never stops.

Competitor Intelligence: RSS + Google News daily briefing for your competitive landscape.

The benchmark

Built an Expense Tracker (6 tasks) through two pipelines, Maggy (4 models) vs Claude Code alone:

| Metric | Maggy | Claude Code |
|--------|-------|-------------|
| Success rate | 6/6 (100%) | 6/6 (100%) |
| Quality score | 7.4/10 | 7.8/10 |
| Claude usage | 1/6 tasks (17%) | 6/6 tasks (100%) |
| Security issues found | 7 | 0 |

Claude alone is faster. But Maggy used it for only 1 out of 6 tasks, an 83% reduction in premium compute. And the dedicated security routing caught 7 issues the single pipeline missed entirely.

The question isn't "which tool writes better code today?" It's "which tool writes better code *next month* than it did *this month*?"

Repo: github.com/alinaqi/claude-bootstrap

Maggy is built on Claude Code's infrastructure (skills, hooks, MCP). It extends Claude Code with self-improvement, multi-model routing, process intelligence, and team mesh. If you just want the skills/hooks/TDD se
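The complexity-based routing described in the Maggy post (Blast 1-3 to a local or cheap model, 4-6 to a mid-tier one, 7-10 to Claude, with some categories always pinned to Claude) can be sketched as a small lookup. This is a hedged illustration only: the post says the real rules live in YAML and self-update, and the function name, the `prefer_local` switch, and the category strings here are assumptions.

```python
# Categories the post says always go to Claude, regardless of score.
ALWAYS_CLAUDE = {"security", "tests", "docs", "architecture"}


def route(blast: int, category: str = "general", prefer_local: bool = True) -> str:
    """Pick a CLI for a task from its complexity ("blast") score, 1 to 10."""
    if not 1 <= blast <= 10:
        raise ValueError("blast score must be between 1 and 10")
    if category in ALWAYS_CLAUDE:
        return "claude"
    if blast <= 3:
        # Low-complexity tier: local (free) or cheap hosted model.
        return "ollama" if prefer_local else "kimi"
    if blast <= 6:
        return "codex"
    return "claude"


# Share of tasks sent to the premium model in the benchmark table (1 of 6).
claude_share_pct = round(1 / 6 * 100)
```

The same numbers explain the benchmark row: one task out of six is about 17% Claude usage, so the reduction relative to sending all six tasks to Claude is about 83%.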
V-JEPA 2.1's dense features are partitioned: a robustness study across all four model sizes [R]
I ran a pre-registered robustness study on Meta's V-JEPA 2.1 across all four released model sizes (80M → 2B), a 322-cell sweep. Three findings worth flagging:

1. Dense features are partitioned. M2 (representational drift between clean and perturbed clips, measured as cosine distance on temporal-gradient vectors) predicts downstream task failure on DAVIS for temporal corruption (frame drops r=0.37 [0.30, 0.44], occlusion r=0.35 [0.28, 0.42]). For image-noise corruption, the correlation is statistically indistinguishable from zero (Gaussian r=−0.06, motion blur r=+0.09, low-light r=+0.05; all CIs cross zero). The two perturbation families are statistically separable at 95% confidence (closest CI gap +0.106). Aggregate r=0.16 [0.13, 0.20] is below both the pre-registered ambiguous threshold (0.30) and confirmation threshold (0.50).

2. Bigger is not reliably better. Every Tier 1 perturbation showed non-monotonic robustness. The 2B "gigantic" model is less robust than the 1B "giant" variant on three of the five perturbations. All jumps >5× their pooled CI half-width.

3. V-JEPA 2.1 is meaningfully orientation-sensitive. Horizontal flip preserves all temporal structure but disrupts representations comparably to playing the video backwards (M2 = 0.91 across all models vs. predicted upper bound of 0.30). Not orientation-equivariant out of the box.

Six hypotheses were pre-registered with explicit numerical decision rules. Two confirmed, three refuted, one partially withdrawn during analysis: the M1 component of H2 turned out to be ill-defined under reverse playback (M1 assumes preserved frame ordering, which time-axis perturbations break). Documented and not buried.

Proposed mechanism for the non-monotonic scaling result: hub marginalization in deep ViTs (arXiv:2511.21635). Deeper models can overshoot from "single hub aggregator" to a regime where extra layers scramble information rather than refine it. V-JEPA's dense predictive loss explicitly pushes against single-hub aggregation; if the 2B variant has crossed into the over-communication regime while the distilled 300M retains controlled mixing, the pattern is what hub marginalization predicts.

Code, reproducibility manifest, raw shards: https://github.com/poisson-labs/vjepa-stress

Full writeup: https://poissonlabs.ai/research/vjepa-2-1-robustness

Happy to discuss methodology, the partitioning interpretation, or the hub-marginalization argument. The image-noise side of partitioning (gaussian/motion blur/low-light CIs all crossing zero) is the part I'd most like skeptical eyes on.

submitted by /u/poisson_labs
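The M2 metric is described in the post only as cosine distance on temporal-gradient vectors between clean and perturbed clips. The sketch below shows one reading of that description on toy per-frame feature vectors; the exact definition used in the study may differ, and the function names and the flatten-then-compare choice are assumptions.

```python
import math
from typing import List, Sequence


def temporal_gradient(features: Sequence[Sequence[float]]) -> List[float]:
    """Frame-to-frame feature differences, flattened into one vector."""
    gradient: List[float] = []
    for previous, current in zip(features, features[1:]):
        gradient.extend(c - p for p, c in zip(previous, current))
    return gradient


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


def m2_drift(clean: Sequence[Sequence[float]], perturbed: Sequence[Sequence[float]]) -> float:
    """Assumed form of M2: distance between the clips' temporal gradients."""
    return cosine_distance(temporal_gradient(clean), temporal_gradient(perturbed))


# Toy clip: three frames of 2-dimensional features moving steadily along one axis.
clean_clip = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
reversed_clip = clean_clip[::-1]
```

On this toy input an unperturbed clip has zero drift, while reverse playback flips every gradient and gives the maximum cosine distance of 2, which shows why a gradient-based metric is sensitive to time-axis perturbations in particular.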
I built a persistent memory MCP server for Claude Code (open source, Go, single binary)
Claude Code forgets everything between sessions. Same mistakes, same questions, same conventions re-explained. I built mnemos to fix that.

It's an MCP server that gives Claude Code persistent memory across sessions. On session start, it pushes a ranked context block back into Claude: conventions you've established, corrections you've made before, skills it learned, hot files, recent session summaries. The next session starts already knowing what the last one figured out.

What it does:

- Records corrections as tried / wrong_because / fix. Three corrections on the same topic auto-promote into a reusable skill with When this applies / Avoid / Do sections. No LLM in the loop, just deterministic pattern-mining, so it's reproducible and token-free.
- Bi-temporal store: facts carry valid/invalid timestamps, so "we used to use X, now Y" works without poisoning context with stale info.
- Compaction recovery: when Claude Code compacts mid-session, one tool call restores the goal and key decisions.
- Prompt-injection scanner at the write boundary, since memory stores are a new attack surface (instruction overrides, zero-width unicode, MCP spoofing).
- Retrospective replay: regenerate any past session as markdown with everything learned since layered in, paste it back to Claude, ask "what would I do differently now."

Stack: Single static Go binary, 15 MB. No Python, no Docker, no vector DB, no CGO. SQLite + FTS5 for retrieval, optional cosine similarity if Ollama is running.

Install (free, MIT, no paid tier):

`curl -fsSL https://raw.githubusercontent.com/polyxmedia/mnemos/main/scripts/install.sh | bash`

`mnemos init`

mnemos init auto-wires Claude Code, Claude Desktop, Cursor, Windsurf, and Codex CLI. Restart your agent and the mnemos_* tools show up.

GitHub: https://github.com/polyxmedia/mnemos

Built it because I was tired of re-teaching Claude the same conventions every session. Happy to answer questions.

submitted by /u/snozberryface
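The correction-to-skill promotion rule in the mnemos post (three corrections on the same topic become a skill with "When this applies / Avoid / Do" sections, with no LLM involved) is simple enough to sketch. mnemos itself is written in Go; the Python below is only an illustration of the rule, and the class names, the exact-match topic grouping, and the rendered layout are assumptions.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List

PROMOTE_AFTER = 3  # threshold stated in the post


@dataclass
class Correction:
    topic: str
    tried: str
    wrong_because: str
    fix: str


def render_skill(topic: str, group: List[Correction]) -> str:
    """Deterministically turn a group of corrections into a skill note."""
    avoid = "\n".join(f"- {c.tried} ({c.wrong_because})" for c in group)
    do = "\n".join(f"- {c.fix}" for c in group)
    return f"When this applies: {topic}\nAvoid:\n{avoid}\nDo:\n{do}"


class CorrectionLog:
    """Hypothetical log that promotes repeated corrections into skills."""

    def __init__(self) -> None:
        self._by_topic: Dict[str, List[Correction]] = defaultdict(list)
        self.skills: Dict[str, str] = {}

    def record(self, correction: Correction) -> None:
        group = self._by_topic[correction.topic]
        group.append(correction)
        if len(group) >= PROMOTE_AFTER:
            self.skills[correction.topic] = render_skill(correction.topic, group)


log = CorrectionLog()
log.record(Correction("db migrations", "edited schema by hand", "drifts from migrations", "add a migration file"))
log.record(Correction("db migrations", "ran migrate on prod first", "no rollback tested", "run on staging first"))
promoted_after_two = "db migrations" in log.skills
log.record(Correction("db migrations", "squashed applied migrations", "breaks other checkouts", "only squash unapplied ones"))
```

Because the rendering is pure string assembly, recording the same corrections always yields the same skill text, which matches the reproducibility claim in the post.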
I built a video production pipeline with Claude - Integrates Live2D, Fish Audio, Sadtalker, and tons of other tools.
I've been working on a multi-agent AI pipeline that takes a topic (like "Ada Lovelace" or "The Cold War Space Race") and produces a complete, chapter-structured educational YouTube video, 15–20 minutes long.

Here's what actually happens when you run it: You give it a persona (think: channel identity, tone, visual style) and a topic. From there, a chain of specialized agents handles everything:

- Script agents generate a chapter contract (outline + pacing plan), then write full narration for each chapter with timing built in.
- Asset agents generate matching visuals (images, B-roll) and sound design assets for each scene.
- Render agents (running on a Windows host with GPU) composite everything (narration audio, visuals, transitions, background music) into a finished video file.
- Upload agents push the result directly to YouTube with generated metadata.

The pipeline is split across two environments: script and asset work runs in a Linux dev container (WSL), while rendering runs on the Windows host to access CUDA and video tooling. They talk over HTTP with a lightweight orchestrator coordinating state.

The whole thing is phase-based: every step (W2.1, W4.3, R3.1, etc.) is independently re-runnable, so if your render fails or you want to rewrite chapter 3, you don't start over. Each phase reads and writes typed artifact files (JSON manifests, audio files, image directories) so agents are loosely coupled.

It uses Claude as the core LLM for scripting, with structured prompts per persona to keep the voice consistent across episodes. Still early-stage but already producing watchable content.

Here are the three major technical challenges and how they're solved:

**1. Script Writing via Contract Architecture**

The core problem: how do you keep a 20-minute AI-written script narratively coherent across chapters written in separate LLM calls? The answer is a narrative contract (W2.1.a), a validated JSON blueprint generated before any script text is written. It encodes four types of cross-chapter constraints:

- Threads: story arcs that must open in one chapter and close in another, with a declared payoff type (resolved, tragedy, etc.)
- Entities: named people/places with a forced first-introduction chapter, preventing retroactive mentions
- Facts Required: citations chained with dependencies (fact B can't appear until fact A is established)
- Timeline Anchors: temporal reference points that let non-linear structure (flashback, in-medias-res) stay internally consistent

The contract is generated via an Opus → structural validate → Sonnet review loop (up to 3 rounds). Sonnet checks semantic coherence (no orphan entities, threads actually close), while the structural validator runs a Pydantic parse + temporal constraint check. Chapter writers downstream are bound to the contract: they can't invent threads or drop required facts.

**2. Research via Fanout**

The research pipeline doesn't produce one outline; it produces several competing ones and eliminates losers.

W1.11.a spins up N parallel OutlineAgent instances, each working from the same research package but on different thesis candidates. Each produces a three-level hierarchy: thesis → chapter arguments → scene beats.

W1.12.a runs an independent grounding/revision loop on each branch:

- Grounding reviewer (Sonnet) flags blocking issues (claims contradicting cited facts) vs. polish issues (real facts exist but uncited)
- Revision agent applies fixes without restructuring
- Quality reviewer checks for structural failures (topical chapter lists, collapsed middles, summary endings)

Up to 3 revision rounds per branch, all in parallel.

W1.13.a runs a single judge agent that scores each refined outline on four axes:

| Axis | Weight | What it measures |
| --- | --- | --- |
| Concept Hook | 0.40 | CTR potential; title falsifiability |
| Trap Closure | 0.30 | Protagonist's own logic creates complications (not external events) |
| Opening Momentum | 0.15 | Cold-open quality: concrete moment vs. credentials/definitions |
| Rewatch Anchor | 0.15 | One chapter that inverts the opening assumption sharply enough to quote |

The highest-scoring branch becomes Outline.json. The judge doesn't compare outlines against each other; it scores each independently to avoid anchoring bias.

**3. Outline Creation and Evaluation**

The structural rules for a valid outline are unusually strict, based on observed failure modes. Six structural failure patterns the quality reviewer flags:

- No Narrative Spine: chapters are reorderable (topical list, not argument chain)
- Thesis Not Echoed: chapters cover topics instead of advancing the central claim
- Beats That Are States: "tension builds" instead of "character takes specific action"
- Vibes Chapter: emotionally evocative prose, vague beats
- Collapsed Middle: chapters 3–5 repeat the same narrative move
- Summary Ending: final chapter recaps instead of introducing new consequence

Beat-level rules are similarly precise: each beat must name an actor, action, and datab
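The judge step (W1.13.a) described in the post above amounts to a weighted sum per outline, computed independently for each branch. A minimal sketch, not the author's code: only the four axis names and their weights come from the post, while the helper names `score_outline` and `pick_winner`, the 0-to-1 score scale, and the sample branch scores are assumptions made for illustration.

```python
# Axis weights as listed in the post; everything else here is hypothetical.
WEIGHTS = {
    "concept_hook": 0.40,
    "trap_closure": 0.30,
    "opening_momentum": 0.15,
    "rewatch_anchor": 0.15,
}


def score_outline(axis_scores: dict) -> float:
    """Weighted sum of per-axis scores (assumed to be on a 0-1 scale)."""
    missing = WEIGHTS.keys() - axis_scores.keys()
    if missing:
        raise ValueError(f"missing axis scores: {sorted(missing)}")
    return sum(WEIGHTS[axis] * axis_scores[axis] for axis in WEIGHTS)


def pick_winner(branches: dict) -> str:
    """Score every branch on its own, then return the highest-scoring name."""
    totals = {name: score_outline(scores) for name, scores in branches.items()}
    return max(totals, key=totals.get)


# Made-up judge scores for two outline branches.
branches = {
    "branch_a": {"concept_hook": 0.9, "trap_closure": 0.5,
                 "opening_momentum": 0.6, "rewatch_anchor": 0.4},
    "branch_b": {"concept_hook": 0.6, "trap_closure": 0.9,
                 "opening_momentum": 0.8, "rewatch_anchor": 0.7},
}
winner = pick_winner(branches)
```

Because each branch is scored without reference to the others, the comparison only happens at the final `max`, which matches the post's point about avoiding anchoring bias.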
Your Claude Code agent is always working from stale context. I built it a fix it can rewind, replay, and stay ahead of every edit.
Every long Claude Code session has the same hidden failure mode: the agent is always working from stale context. It re-reads the same 12 files across three sessions to "remind itself" of an interface you already showed it. It refactors getUserById without checking who calls it. It edits a config with no memory of why the previous version was that way.

It's not the context window. The window is fine. There's no persistent, time-aware representation of your codebase for the agent to re-query. So it guesses. And you pay tokens for every re-read.

I built Memtrace to fix exactly this. Two things it does that no other memory tool does:

(1) Always-fresh state. Every edit you make triggers a 42ms incremental snapshot of the changes applied by the coding agent. The agent's memory is never one-session-old. After a refactor it knows the blast radius before you do: every caller, every test, every consumer of the function you just touched. Your agent stops asking "what does getUserById return?" 30 seconds after seeing it.

(2) Rewind and replay. This is the part nobody else has. Your codebase is stored bi-temporally so every change becomes a recallable episode. When the agent debugs a regression, it can replay how the broken function got to its current state. What worked before. What changed when. Which commit introduced the bug. Not "guess from current state"; replay instead.

My architectural bet that makes both possible: zero LLM inference during indexing. Tree-sitter parses your code into an AST, and the AST IS the structural representation. You don't pay an LLM to re-derive what your compiler already knows.

Retrieval is hybrid. Tantivy BM25 for lexical recall (the "find getUserById" query). Jina-code 768-dim embeddings indexed in HNSW for semantic recall (the "find anything that authenticates a user" query). Two ranked lists, fused with Reciprocal Rank Fusion at k=60. One signal alone misses; together they hit.
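The fusion step can be sketched in a few lines. This is a generic illustration of Reciprocal Rank Fusion with k=60, not Memtrace's implementation: the function name `reciprocal_rank_fusion`, the alphabetical tie-break, and the two sample result lists are assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists: each item scores sum(1 / (k + rank)), ranks from 1."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical results from a lexical (BM25) and a semantic (embedding) search.
bm25_hits = ["getUserById", "getUserByEmail", "deleteUser"]
semantic_hits = ["authenticateUser", "getUserById", "verifyToken"]

fused = reciprocal_rank_fusion([bm25_hits, semantic_hits])
```

Only ranks enter the formula, so the BM25 scores and the embedding similarities never need to be put on a common scale. An item that appears in both lists, like `getUserById` here, outranks items that top only one of them.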
The embedding model matters here: Jina-code is trained on code, not generic prose, so the semantic side actually understands "this is an auth handler" instead of pattern-matching on the word "auth."

The bi-temporal layer is what makes rewind possible. Every node and edge carries valid_time AND transaction_time, so "what did this function look like Monday" is a real query, not a git-blame heuristic. It's also what gives the agent the blast radius before a refactor: typed edges (CALLS, IMPORTS, IMPLEMENTS, EXTENDS, CONTAINS, TYPE_REFERENCES, INSTANTIATES) traversed in graph time, not text time.

Speed only matters because freshness has to be cheap. If snapshotting after every edit is expensive, you can't afford to do it on every edit. So the indexing path is bottlenecked by I/O, not LLM tokens.

I built it using Claude Code. Mid-build, Claude Code lost the plot on Memtrace's own architecture and started contradicting decisions from 50 turns earlier. It re-read the same files. It forgot which retrieval weights I'd already tuned. I was experiencing the exact pain I was building Memtrace to solve, while building Memtrace. When the beta binary was ready, I pointed it at Memtrace's own codebase. The session-loss stopped. The blind refactor suggestions stopped.

It's free, but the binary currently requires an approval key, just so you are warned. Not gatekeeping. Not marketing. The indexer keeps tripping on patterns I didn't anticipate: mixed pnpm/npm lockfiles, Rust proc-macros, Python TYPE_CHECKING blocks. Every one of these came from real beta users in the last two weeks, not from my test corpus. When that happens I want to ship you a fix in 24 hours, not lose you to a flaky first impression. So I'm pacing approvals to my own feedback bandwidth, not your patience. I'd rather have 500 users for whom this is magic than 50,000 for whom it's broken. I'm trying to keep approval under 24h, but capping at 50 per week right now.
The benchmark harness is fully open and runnable without the key, if you want to verify the numbers before committing to the queue.

Repo + waitlist: github.com/syncable-dev/memtrace-public

Two questions:

- When Claude Code "loses the plot" on YOUR codebase, what specifically does it forget that hurts most? I'm collecting these for the next benchmark.
- What would you actually want to REWIND in your codebase if you could? Function history, dependency evolution, decision archaeology. Which is the killer one in your day?

submitted by /u/WEEZIEDEEZIE
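The bi-temporal idea in the Memtrace post above (every record carrying both a valid time and a transaction time) can be illustrated with a small as-of lookup. This is a simplified sketch under stated assumptions, not Memtrace's storage model: the `NodeVersion` fields, the `as_of` helper, and the sample history are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class NodeVersion:
    symbol: str
    body: str
    valid_from: datetime                      # when this version became true of the code
    valid_to: Optional[datetime]              # None means still valid
    recorded_at: datetime                     # transaction time: when the index learned it
    superseded_at: Optional[datetime] = None  # None means still the current belief


def _covers(start, end, t):
    """True if t falls in the half-open interval [start, end)."""
    return start <= t and (end is None or t < end)


def as_of(versions, symbol, valid_at, known_at):
    """Version of `symbol` that was valid at `valid_at`, as known at `known_at`."""
    matches = [
        v for v in versions
        if v.symbol == symbol
        and _covers(v.valid_from, v.valid_to, valid_at)
        and _covers(v.recorded_at, v.superseded_at, known_at)
    ]
    return max(matches, key=lambda v: v.recorded_at, default=None)


# Hypothetical history: the function was rewritten on Wednesday.
mon, wed, fri = datetime(2025, 1, 6), datetime(2025, 1, 8), datetime(2025, 1, 10)
history = [
    NodeVersion("getUserById", "return db.find(id)", mon, wed, recorded_at=mon),
    NodeVersion("getUserById", "return cache.get(id) or db.find(id)", wed, None, recorded_at=wed),
]

monday_view = as_of(history, "getUserById", valid_at=datetime(2025, 1, 6, 12), known_at=fri)
```

Keeping the two time axes separate is what lets "what did this function look like Monday" be answered from today's knowledge, while `known_at` can also be moved back to reproduce what the index believed at an earlier point.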
I spent years building a 103B-token Usenet corpus (1980–2013) and finally documented it [P]
For the past several years I've been quietly assembling and processing what I believe is one of the larger privately held pretraining corpora around: a complete Usenet archive spanning 1980 to 2013.

Here's what it ended up being:

- 103.1 billion tokens (cl100k_base)
- 408 million posts across 9 newsgroup hierarchies
- 18,347 newsgroups covered
- 33 years of continuous coverage

The processing pipeline included full deduplication, binary removal (alt.binaries.* excluded at the hierarchy level before record-level cleaning), quoted text handling, email address redaction via pattern matching and SHA-256 hashing of Message-IDs, and conversion from raw MBOX archives to gzip-compressed JSONL.

Language detection was run on every record using Meta's fasttext LID-176. The corpus is 96.6% English with meaningful representation from 100+ other languages; the soc.culture.* groups in particular have high non-English density.

The thing I find most interesting about this dataset from a training perspective is the temporal arc. Volume is sparse pre-1986, grows steadily through the early 90s, peaks around 1999–2000, then declines as Usenet gets displaced by forums and social media. That's a 33-year window of language evolution baked into a single coherent corpus: before SEO, before engagement optimization, before AI-generated content existed.

I've published a full data card, cleaning methodology, and representative samples (5K posts per hierarchy + combined sets) on Hugging Face: https://huggingface.co/datasets/OwnedByDanes/Usenet-Corpus-1980-2013

Happy to answer questions about the processing pipeline or the data itself.

submitted by /u/OwnerByDane
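The redaction and hashing steps mentioned in the post above could look roughly like this. It is a minimal sketch and not the author's pipeline: the email regex, the `[EMAIL]` placeholder, the record field names, and the sample post are all assumptions, and a production pattern would need to handle many more address forms.

```python
import hashlib
import re

# Deliberately simple pattern; real-world address matching is messier.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_emails(text, placeholder="[EMAIL]"):
    """Replace anything that looks like an email address with a placeholder."""
    return EMAIL_RE.sub(placeholder, text)


def hash_message_id(message_id):
    """Return the SHA-256 hex digest of a Message-ID, so it stays a stable key."""
    return hashlib.sha256(message_id.strip().encode("utf-8")).hexdigest()


def clean_post(post):
    """Apply both steps to one record (field names are hypothetical)."""
    return {
        "message_id": hash_message_id(post["message_id"]),
        "newsgroup": post["newsgroup"],
        "body": redact_emails(post["body"]),
    }


raw = {
    "message_id": "<1234@example.com>",
    "newsgroup": "comp.lang.c",
    "body": "Mail me at alice@example.com for the patch.",
}
cleaned = clean_post(raw)
```

Hashing rather than deleting the Message-ID keeps a deterministic identifier per post, which is useful for deduplication and threading, without exposing the address-like string inside it.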
View originalRepository Audit Available
Deep analysis of temporalio/temporal — architecture, costs, security, dependencies & more
Pricing found: $1,000, $100/mo, $500/mo, $30, $6,000
Key features include: Durable execution of workflows, Built-in error handling and retries, Scalable architecture for high reliability, Support for long-running processes, Versioning of workflows, Temporal Web UI for monitoring and debugging, Integration with existing codebases, Support for multiple programming languages.
Temporal is commonly used for: Orchestrating microservices, Managing complex workflows in cloud applications, Handling background jobs and tasks, Building reliable data pipelines, Automating business processes, Implementing event sourcing.
Temporal integrates with: AWS Lambda, Google Cloud Functions, Azure Functions, Kubernetes, Docker, PostgreSQL, MySQL, Redis, Kafka, Prometheus.
Temporal has a public GitHub repository with 19,256 stars.
Sam Rodriques
Co-founder and CEO at FutureHouse
2 mentions
Based on user reviews and social mentions, the most common pain points are: claude code cost.
Based on 71 social mentions analyzed, 17% of sentiment is positive, 76% neutral, and 7% negative.